I will add the NiFi flow information here tomorrow.

For a rough draft, I wanted to show what it can do, as it's pretty cool.

cat all.txt | jq --raw-output '.["text"]' | syntaxnet/demo.sh

From NiFi I collect a stream of Twitter data and send it to a file as JSON (all.txt). There are many ways to parse that, but I am a fan of jq, a simple and awesome command-line tool for parsing JSON that is available for macOS and Linux. So from the Twitter feed I just grab the tweet text to parse with Parsey.
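
As a quick aside, jq makes it just as easy to pull out more than one field. For example, something like the line below (field names assumed to follow the standard Twitter status JSON) keeps the screen name next to the tweet text:

cat all.txt | jq --raw-output '[.user.screen_name, .text] | @tsv'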

Initially I was going to install TensorFlow SyntaxNet (Parsey McParseface) on the HDP 2.4 Sandbox, but CentOS 6 and TensorFlow do not play well together. So for now the easiest route is to install HDF on a Mac and build SyntaxNet there.

The install instructions are very detailed, but the build is very particular and very machine-intensive. It's best to let the build run and go do something else, with everything else shut down (no Chrome, VMs, editors, ...).

After running McParseface, here are some results:

Input: RT @ Data_Tactical : Scary and fascinating : The future of big data https : //t.co/uwHoV8E49N # bigdata # datanews # datascience # datacenter https : ...
Parse:
Data_Tactical JJ ROOT
 +-- RT NNP nn
 +-- @ IN nn
 +-- : : punct
 +-- Scary JJ dep
 |   +-- and CC cc
 |   +-- fascinating JJ conj
 +-- future NN dep
 |   +-- The DT det
 |   +-- of IN prep
 |       +-- data NNS pobj
 |           +-- big JJ amod
 +-- https ADD dep
 +-- # $ dep
 |   +-- //t.co/uwHoV8E49N CD num
 |   +-- datanews NNS dep
 |   |   +-- bigdata NNP nn
 |   |   +-- # $ nn
 |   +-- # $ dep
 |   +-- datacenter NN dep
 |   |   +-- # NN nn
 |   |       +-- datascience NN nn
 |   +-- https ADD dep
 +-- ... . punct
INFO:tensorflow:Read 4 documents
Input: u_t=11x^2u_xx+ -LRB- 11x+2t -RRB- u_x+-1u https : //t.co/NHXcebT9XC # trading # bigdata https : //t.co/vOM8S5Ewwq
Parse:
u_t=11x^2u_xx+ LS ROOT
 +-- 11x+2t LS dep
 |   +-- -LRB- -LRB- punct
 |   +-- -RRB- -RRB- punct
 +-- u_x+-1u CD dep
 +-- https ADD dep
     +-- : : punct
     +-- //t.co/vOM8S5Ewwq CD dep
Input: RT @ weloveknowles : When Beyoncé thinks the song is over but the hive has other ideas https : //t.co/0noxKaYveO
Parse:
RT NNP ROOT
 +-- @ IN prep
 |   +-- weloveknowles NNS pobj
 +-- : : punct
 +-- thinks VBZ dep
 |   +-- When WRB advmod
 |   +-- Beyoncé NNP nsubj
 |   +-- is VBZ ccomp
 |   |   +-- song NN nsubj
 |   |   |   +-- the DT det
 |   |   +-- over RB advmod
 |   +-- but CC cc
 |   +-- has VBZ conj
 |       +-- hive NN nsubj
 |       |   +-- the DT det
 |       +-- ideas NNS dobj
 |       |   +-- other JJ amod
 |       +-- https ADD advmod
 +-- //t.co/0noxKaYveO ADD dep
Input: RT @ KirkDBorne : Enabling the # BigData Revolution -- An International # OpenData Roadmap : https : //t.co/e89xNNNkUe # Data4Good HT @ Devbd https : / ...
Parse:
RT NNP ROOT
 +-- @ IN prep
 |   +-- KirkDBorne NNP pobj
 +-- : : punct
 +-- Enabling VBG dep
 |   +-- Revolution NNP dobj
 |       +-- the DT det
 |       +-- # $ nn
 |       +-- BigData NNP nn
 |       +-- -- : punct
 |       +-- Roadmap NNP dep
 |       |   +-- An DT det
 |       |   +-- International NNP nn
 |       |   +-- OpenData NNP nn
 |       |       +-- # NN nn
 |       +-- : : punct
 |       +-- https ADD dep
 |       +-- //t.co/e89xNNNkUe LS dep
 |           +-- @ NN dep
 |               +-- Data4Good CD nn
 |               |   +-- # $ nn
 |               +-- HT FW nn
 |               +-- Devbd NNP dep
 |               +-- https ADD dep
 |               +-- : : punct
 +-- / NFP punct
 +-- ... . punct
Input: RT @ DanielleAlberti : It 's like 10 , 000 bees when all you need is a hive. https : //t.co/ElGLLbykN8
Parse:
RT NNP ROOT
 +-- @ IN prep
 |   +-- DanielleAlberti NNP pobj
 +-- : : punct
 +-- 's VBZ dep
 |   +-- It PRP nsubj
 |   +-- like IN prep
 |   |   +-- 10 CD pobj
 |   +-- , , punct
 |   +-- bees NNS appos
 |       +-- 000 CD num
 |       +-- https ADD rcmod
 |           +-- when WRB advmod
 |           +-- all DT nsubj
 |           |   +-- need VBP rcmod
 |           |       +-- you PRP nsubj
 |           +-- is VBZ cop
 |           +-- a DT det
 |           +-- hive. NN nn
 +-- //t.co/ElGLLbykN8 ADD dep

I am going to wire this up to NiFi to drop these parses into HDFS for further data analysis in Zeppelin.
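
Until that flow is in place, the same output can be landed in HDFS by hand. This is just a sketch (the paths are made up for illustration):

cat all.txt | jq --raw-output '.["text"]' | syntaxnet/demo.sh > parsed.txt
hdfs dfs -mkdir -p /tmp/tweets_parsed
hdfs dfs -put parsed.txt /tmp/tweets_parsed/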

The main problem is that you need very specific versions of Python (2.7), Bazel (0.2.0 - 0.2.2b), NumPy, Protobuf, asciitree, and others. Some of these don't play well with older versions of CentOS. If you are on a clean Mac or Ubuntu, things should go smoothly.
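
Before kicking off the build, it's worth a quick sanity check of the version-sensitive pieces (assuming the tools are already on your PATH):

python --version      # wants 2.7.x
bazel version         # wants 0.2.0 - 0.2.2b
pip show protobuf numpy asciitree | grep -E 'Name|Version'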

My CentOS was missing a bunch of libraries, so I tried to install them:

sudo yum -y install swig
pip install -U protobuf==3.0.0b2
pip install asciitree
pip install numpy
pip install nose
wget https://github.com/bazelbuild/bazel/releases/download/0.2.2b/bazel-0.2.2b-installer-linux-x86_64.sh
yum -y install libstdc++
./configure
sudo yum -y install pkg-config zip g++ zlib1g-dev unzip
cd ..
bazel test syntaxnet/... util/utf8/...
# On Mac, run the following:
bazel test --linkopt=-headerpad_max_install_names \
  syntaxnet/... util/utf8/...
cat /etc/redhat-release
CentOS release 6.7 (Final)
sudo yum -y install glibc
sudo yum -y install epel-release
sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel
pip install --upgrade virtualenv
virtualenv --system-site-packages ~/venvs/tensorflow
source ~/venvs/tensorflow/bin/activate
pip install --upgrade numpy scipy wheel cryptography  # optional
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
# or the line below if you want GPU support, but CUDA and cuDNN are required; see the docs for more install instructions
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
sudo yum -y install python-numpy swig python-dev
sudo yum -y upgrade
yum install python27
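
If the build makes it through, the quick smoke test from the SyntaxNet README is a one-liner:

echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh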

It's worth a try for the patient, or for anyone on a newer CentOS. Your mileage may vary!

Comments

Hey Timothy

Great article, and I wanted to thank you for putting it together. Currently I am trying to create a corpus that I will later use to train an RNN article summarizer. I didn't have access to something like Gigaword, so I wrote an article scraper in JavaScript, and now I want to POS tag the title and body using Parsey McParseface. I have gotten to the point where I can pass in a single input file via the params passed to parser_eval, but my JS scraper currently outputs a JSON object in a .json file for each article, which contains the title, body, and some other info. What I want to do is see if there is a way to pass a folder to the params (such as the input field) and have it iterate over all the files in that folder, use Parsey McParseface to POS tag the title and body, and then output that as XML. I have pasted the main entry point below. I can't figure out how to modify the "documents".

I figured I would post to see if you have any recommendations on how to go about passing in the data from each of these files. I have been trying to find where in the pipeline I am able to inject / modify the sentences coming in but have not had success yet. If you have any tips or recommendations on how I might be able to accomplish this, send them my way. Otherwise, thanks again for the article! Time to jump back into the API docs 🙂

# (imports assumed -- this snippet builds on SyntaxNet's conll2tree.py demo;
#  FLAGS and formatXml are defined elsewhere in the poster's script)
import os
import json
from os.path import expanduser
import xml.etree.ElementTree as ET

import tensorflow as tf
from tensorflow.python.platform import tf_logging as logging
from syntaxnet import sentence_pb2
from syntaxnet.ops import gen_parser_ops

def main(unused_argv):
  logging.set_verbosity(logging.INFO)

  path_to_json = "%s/tf_files/dataset_raw/nprarticles" % expanduser("~")
  json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

  # we need both the json and an index number so use enumerate()
  for index, js in enumerate(json_files):
      with open(os.path.join(path_to_json, js)) as json_file:
          json_text = json.load(json_file)
          title = json_text['title']
          body = json_text['body']

  with tf.Session() as sess:
    src = gen_parser_ops.document_source(batch_size=32,
                                         corpus_name=FLAGS.corpus_name,
                                         task_context=FLAGS.task_context)
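    # NOTE: document_source pulls sentences from the corpus named in the
    # task_context file, so the title/body strings loaded from the JSON
    # files above never actually reach the parser here.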

    sentence = sentence_pb2.Sentence()
    l_root = ET.Element("root")
    l_headline = ET.SubElement(l_root, "headline").text = "title"
    l_text = ""
    l_text2 = ET.SubElement(l_root, "text") #sentence.text
    l_sentences = ET.SubElement(l_root, "sentences")
    
    l_numSentences = 0
    while True:
      documents, finished = sess.run(src)
      #logging.info('Read %d documents', len(documents))
      for d in documents:
        sentence.ParseFromString(d)
        l_sentence = ET.SubElement(l_sentences, "sentence", id="%s" % l_numSentences)
        l_tokens = ET.SubElement(l_sentence, "tokens")
        l_text = "%s %s" % (l_text, sentence.text)
       
        #print 'Formatting XML'
        formatXml(sentence, l_tokens)
        l_numSentences += 1