Created on 06-01-2016 01:54 PM
I will add the NiFi flow information here tomorrow.
This is a rough draft, but I wanted to show what it can do, as it's pretty cool.
cat all.txt | jq --raw-output '.["text"]' | syntaxnet/demo.sh
From NiFi I collect a stream of Twitter data and write it to a file as JSON (all.txt). There are many ways to parse that, but I am a fan of jq, an awesome command-line tool for parsing JSON that is available for Mac OS X and Linux. So from the Twitter feed I just grab the tweet text and pipe it to Parsey.
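For illustration, the same extraction can be sketched in Python (my own helper, not part of the flow). It assumes all.txt holds one tweet JSON object per line; adjust if your NiFi flow writes a JSON array instead:

```python
import json

def tweet_texts(lines):
    """Yield the 'text' field from each JSON tweet line, skipping blanks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)["text"]

# Example with a made-up tweet record (not real Twitter data):
sample = ['{"id": 1, "text": "RT @Data_Tactical: The future of big data"}']
print(list(tweet_texts(sample)))
```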
Initially I was going to install TensorFlow SyntaxNet (Parsey McParseface) on the HDP 2.4 Sandbox, but CentOS 6 and TensorFlow do not play well together. So for now the easiest route is to install HDF on a Mac and build SyntaxNet on your Mac.
The install instructions are very detailed, but the build is very particular and very machine-intensive. It's best to kick off the build and go do something else, with everything else shut down (no Chrome, VMs, editors, ...).
After running McParseface, here are some results:
Input: RT @ Data_Tactical : Scary and fascinating : The future of big data https : //t.co/uwHoV8E49N # bigdata # datanews # datascience # datacenter https : ...
Parse:
Data_Tactical JJ ROOT
 +-- RT NNP nn
 +-- @ IN nn
 +-- : : punct
 +-- Scary JJ dep
 |   +-- and CC cc
 |   +-- fascinating JJ conj
 +-- future NN dep
 |   +-- The DT det
 |   +-- of IN prep
 |   +-- data NNS pobj
 |   +-- big JJ amod
 +-- https ADD dep
 +-- # $ dep
 |   +-- //t.co/uwHoV8E49N CD num
 |   +-- datanews NNS dep
 |   |   +-- bigdata NNP nn
 |   |   +-- # $ nn
 |   +-- # $ dep
 |   +-- datacenter NN dep
 |   |   +-- # NN nn
 |   |   +-- datascience NN nn
 |   +-- https ADD dep
 +-- ... . punct
INFO:tensorflow:Read 4 documents
Input: u_t=11x^2u_xx+ -LRB- 11x+2t -RRB- u_x+-1u https : //t.co/NHXcebT9XC # trading # bigdata https : //t.co/vOM8S5Ewwq
Parse:
u_t=11x^2u_xx+ LS ROOT
 +-- 11x+2t LS dep
 |   +-- -LRB- -LRB- punct
 |   +-- -RRB- -RRB- punct
 +-- u_x+-1u CD dep
 +-- https ADD dep
 +-- : : punct
 +-- //t.co/vOM8S5Ewwq CD dep
Input: RT @ weloveknowles : When Beyoncé thinks the song is over but the hive has other ideas https : //t.co/0noxKaYveO
Parse:
RT NNP ROOT
 +-- @ IN prep
 |   +-- weloveknowles NNS pobj
 +-- : : punct
 +-- thinks VBZ dep
 |   +-- When WRB advmod
 |   +-- Beyoncé NNP nsubj
 |   +-- is VBZ ccomp
 |   |   +-- song NN nsubj
 |   |   |   +-- the DT det
 |   |   +-- over RB advmod
 |   +-- but CC cc
 |   +-- has VBZ conj
 |   +-- hive NN nsubj
 |   |   +-- the DT det
 |   +-- ideas NNS dobj
 |   |   +-- other JJ amod
 |   +-- https ADD advmod
 +-- //t.co/0noxKaYveO ADD dep
Input: RT @ KirkDBorne : Enabling the # BigData Revolution -- An International # OpenData Roadmap : https : //t.co/e89xNNNkUe # Data4Good HT @ Devbd https : / ...
Parse:
RT NNP ROOT
 +-- @ IN prep
 |   +-- KirkDBorne NNP pobj
 +-- : : punct
 +-- Enabling VBG dep
 |   +-- Revolution NNP dobj
 |   +-- the DT det
 |   +-- # $ nn
 |   +-- BigData NNP nn
 |   +-- -- : punct
 |   +-- Roadmap NNP dep
 |   |   +-- An DT det
 |   |   +-- International NNP nn
 |   |   +-- OpenData NNP nn
 |   |   +-- # NN nn
 |   +-- : : punct
 |   +-- https ADD dep
 |   +-- //t.co/e89xNNNkUe LS dep
 |   +-- @ NN dep
 |   +-- Data4Good CD nn
 |   |   +-- # $ nn
 |   |   +-- HT FW nn
 |   +-- Devbd NNP dep
 |   +-- https ADD dep
 |   +-- : : punct
 +-- / NFP punct
 +-- ... . punct
Input: RT @ DanielleAlberti : It 's like 10 , 000 bees when all you need is a hive. https : //t.co/ElGLLbykN8
Parse:
RT NNP ROOT
 +-- @ IN prep
 |   +-- DanielleAlberti NNP pobj
 +-- : : punct
 +-- 's VBZ dep
 |   +-- It PRP nsubj
 |   +-- like IN prep
 |   |   +-- 10 CD pobj
 |   +-- , , punct
 |   +-- bees NNS appos
 |   +-- 000 CD num
 |   +-- https ADD rcmod
 |   +-- when WRB advmod
 |   +-- all DT nsubj
 |   |   +-- need VBP rcmod
 |   |   +-- you PRP nsubj
 |   +-- is VBZ cop
 |   +-- a DT det
 |   +-- hive. NN nn
 +-- //t.co/ElGLLbykN8 ADD dep
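For analysis in Zeppelin I will want these trees in a flatter shape. Here is a rough sketch (my own helper, not part of SyntaxNet) that pulls (word, POS tag, dependency label) triples out of demo.sh's asciitree output:

```python
import re

# Matches tree lines like " |   +-- future NN dep" as well as the
# root line "RT NNP ROOT"; the trailing three fields are word, tag, label.
NODE = re.compile(r'(?:\+-- )?(\S+) (\S+) (\S+)\s*$')

def parse_tree_lines(lines):
    """Flatten asciitree output into (word, tag, label) tuples."""
    rows = []
    for line in lines:
        m = NODE.search(line)
        if m:
            rows.append(m.groups())
    return rows

# A small excerpt of the output above:
tree = [
    "RT NNP ROOT",
    " +-- @ IN prep",
    " |   +-- weloveknowles NNS pobj",
]
print(parse_tree_lines(tree))
# -> [('RT', 'NNP', 'ROOT'), ('@', 'IN', 'prep'), ('weloveknowles', 'NNS', 'pobj')]
```

Rows in this shape are easy to write out as CSV or JSON for HDFS, though this sketch ignores the tree nesting itself.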
I am going to wire this up to NiFi to drop these results into HDFS for further data analysis in Zeppelin.
The main problem is that you need very specific versions of Python (2.7), Bazel (0.2.0 - 0.2.2b), NumPy, Protobuf, asciitree, and others. Some of these don't play well with older versions of CentOS. If you are on a clean Mac or Ubuntu, things should go smoothly.
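Before kicking off the long Bazel build, a quick pre-flight sketch like this (my own check, not from the SyntaxNet docs) can confirm the Python-side dependencies are importable; the module names checked are assumptions based on the requirements listed above:

```python
import sys

def have(module_name):
    """Return True if the module can be imported in this environment."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False

for mod in ("numpy", "asciitree", "google.protobuf"):
    print("%-16s %s" % (mod, "ok" if have(mod) else "MISSING"))

# SyntaxNet's build at this time expected Python 2.7.x
print("python 2.7: %s" % ("ok" if sys.version_info[:2] == (2, 7) else "MISSING"))
```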
My CentOS was missing a bunch of libraries, so I tried to install them:
sudo yum -y install swig
pip install -U protobuf==3.0.0b2
pip install asciitree
pip install numpy
pip install nose
wget https://github.com/bazelbuild/bazel/releases/download/0.2.2b/bazel-0.2.2b-installer-linux-x86_64.shs...
yum -y install libstdc++
./configure
sudo yum -y install pkg-config zip g++ zlib1g-dev unzip
cd ..
bazel test syntaxnet/... util/utf8/...
# On Mac, run the following:
bazel test --linkopt=-headerpad_max_install_names \
  syntaxnet/... util/utf8/...
cat /etc/redhat-release
CentOS release 6.7 (Final)
sudo yum -y install glibc
sudo yum -y install epel-release
sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel
pip install --upgrade virtualenv
virtualenv --system-site-packages ~/venvs/tensorflow
source ~/venvs/tensorflow/bin/activate
pip install --upgrade numpy scipy wheel cryptography # optional
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
# or below if you want GPU support, but CUDA and cuDNN are required; see docs for more install instructions
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
sudo yum -y install python-numpy swig python-dev
sudo yum -y upgrade
yum install python27
It's worth a try for the patient or people with newer CentOS. Your mileage may vary!