1973 Posts | 1225 Kudos Received | 124 Solutions

My Accepted Solutions
| Views | Posted |
|---|---|
| 1924 | 04-03-2024 06:39 AM |
| 3018 | 01-12-2024 08:19 AM |
| 1652 | 12-07-2023 01:49 PM |
| 2423 | 08-02-2023 07:30 AM |
| 3371 | 03-29-2023 01:22 PM |
02-06-2017
08:26 PM
5 Kudos
ExtractText NiFi Custom Processor Powered by Apache Tika
Apache Tika is amazing: it makes it very easy to analyze a file and then extract its text. Apache Tika uses other powerful Apache projects like Apache PDFBox and Apache POI.

Example Usage
- Feed in documents. I use my LinkProcessor, which grabs links from a website and returns a JSON list.
- Split the resulting JSON list into individual JSON rows with SplitJSON.
- EvaluateJSONPath to extract just the URLs.
- InvokeHTTP to do a GET on each parsed URL.
- RouteOnAttribute to only process the file types I am interested in, like Microsoft Word.
- The new ExtractTextProcessor to extract the text of the document.
- Then we save the text as a file in some data store, perhaps HDFS.
If you have a directory of files, you can just use GetFile to ingest them en masse.

LinkProcessor (https://github.com/tspannhw/linkextractorprocessor)
URL: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/index.html
This is an example of a URL that I want to grab all the documents from. You can point it at any URL that has links to documents (HTML, Word, Excel, PowerPoint, etc.).

RouteOnAttribute
I only want to process a few types of files, so I limit them here:
${filename:endsWith('.doc'):or(${filename:endsWith('.pdf')}):or(${filename:endsWith('.rtf')}):or(${filename:endsWith('.ppt')}):or(${filename:endsWith('.docx')}):or(${filename:endsWith('.pptx')}):or(${filename:endsWith('.html')}):or(${filename:endsWith('.htm')}):or(${filename:endsWith('.xls')}):or(${filename:endsWith('.xlsx')}):or(${filename:endsWith('.xml')}):or(${Content-Type:contains('text/html')}):or(${Content-Type:contains('application/pdf')}):or(${Content-Type:contains('application/msword')}):or(${Content-Type:contains('application/vnd')}):or(${Content-Type:contains('text/xml')})}

Release: https://github.com/tspannhw/nifi-extracttext-processor/releases/tag/1.0

Reference:
https://tika.apache.org/
https://tika.apache.org/1.14/formats.html
http://pdfbox.apache.org/
https://pdfbox.apache.org/1.8/cookbook/documentcreation.html
http://poi.apache.org/
https://community.hortonworks.com/repos/81693/nifi-custom-processor-for-extracting-text-from-doc.html?shortDescriptionMaxLength=140
https://dzone.com/articles/cool-projects-big-data-machine-learning-apache-nifi
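Before wiring this into NiFi, it can help to sanity-check what Tika extracts from a sample document. A minimal sketch using the tika-python bindings (the bindings, the local file path, and the printing choices are my assumptions, not part of the processor itself):

```python
# Quick Apache Tika smoke test outside NiFi.
# Assumes: pip install tika, a Java runtime on the path, and a sample document
# at the hypothetical path below.
from tika import parser

result = parser.from_file("/tmp/sample.docx")  # spins up a local Tika server on first use
print(result["metadata"].get("Content-Type"))  # detected MIME type
print((result["content"] or "")[:500])         # first 500 characters of extracted text
```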
02-03-2017
04:48 AM
5 Kudos
Sentiment CoreNLP Processor

Sample processor log output:
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator sentiment

Sample input file:
Header,Header2,Header3
Value,Value2,Value3
Value4,Value5,Value6

Attribute: {"names":"NEGATIVE"}

Service Source Code
JUnit Test for Processor

To add sentiment analysis to your NiFi data flow, just add the custom processor, CoreNLPProcessor. You can download a pre-built NAR from the GitHub repository listed below, add it to your NiFi/lib directory, and restart each node. The result of the run will be an attribute named sentiment. You can see how easy it is to add to your dataflows. If you would like to add more features to this processor, please fork the GitHub repository below. This is not an official NiFi processor, just one I wrote in a couple of hours for my own use and for testing.

There are several easy ways to add sentiment analysis to your Big Data pipelines: ExecuteScript with Python NLP scripts, calling my custom processor, making a REST call to a Stanford CoreNLP sentiment server, making a REST call to a public sentiment-as-a-service API, or sending a message via Kafka (or JMS) to Spark or Storm to run other JVM sentiment analysis tools.

Download a release: https://github.com/tspannhw/nifi-corenlp-processor/releases/tag/v1.0
sentimentanalysiscustomprocessor.xml

Reference:
http://stanfordnlp.github.io/CoreNLP
https://github.com/tspannhw/neural-sentiment
https://github.com/tspannhw/nlp-utilities
https://community.hortonworks.com/content/kbentry/81222/adding-stanford-corenlp-to-big-data-pipelines-apac.html
https://community.hortonworks.com/content/repo/81187/nifi-corenlp-processor-example-processor-for-doing.html
https://community.hortonworks.com/repos/79537/various-utilities-and-examples-for-working-with-va.html
https://community.hortonworks.com/articles/76935/using-sentiment-analysis-and-nlp-tools-with-hdp-25.html
https://community.hortonworks.com/questions/20791/sentiment-analysis-with-hdp.html
https://community.hortonworks.com/articles/30213/us-presidential-election-tweet-analysis-using-hdfn.html
https://community.hortonworks.com/articles/52415/processing-social-media-feeds-in-stream-with-apach.html
https://community.hortonworks.com/articles/81222/adding-stanford-corenlp-to-big-data-pipelines-apac.html
https://community.hortonworks.com/content/kbentry/67983/apache-hive-with-apache-hivemall.html
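As a rough illustration of the ExecuteScript option above, here is a minimal Jython sketch that reads the sentiment attribute the processor adds and flags negative FlowFiles; the is.negative attribute name and the prefix check are my own additions, not part of the processor:

```python
# ExecuteScript (Jython) body: inspect the 'sentiment' attribute written upstream
# by CoreNLPProcessor. Uses the standard bindings NiFi provides to the script
# (session, REL_SUCCESS).
flowFile = session.get()
if flowFile is not None:
    sentiment = flowFile.getAttribute('sentiment') or 'UNKNOWN'
    # add a simple flag that RouteOnAttribute can key off later
    is_negative = 'true' if sentiment.upper().startswith('NEG') else 'false'
    flowFile = session.putAttribute(flowFile, 'is.negative', is_negative)
    session.transfer(flowFile, REL_SUCCESS)
```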
02-02-2017
09:24 PM
3 Kudos
Using Stanford CoreNLP in Your Big Data Pipelines

CoreNLP Overview
The latest version of Stanford CoreNLP includes a server that you can run and access via a REST API. CoreNLP adds a lot of features, but the one most interesting to me is sentiment analysis.

Installation and Setup (http://stanfordnlp.github.io/CoreNLP/corenlp-server.html)
Download a recent full deployment (http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip). It is big: it has the models, all the JARs, and the server code.

Run the Server
Giving the JVM four gigabytes of RAM makes it run nicely, and port 9000 works for me.
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Running the Server
stanford-corenlp-full-2016-10-31 git:(master) ✗ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP - Threads: 8
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-6] INFO CoreNLP - [/0:0:0:0:0:0:0:1:59705] API call w/annotators tokenize,ssplit,parse,pos,sentiment
The quick brown fox jumped over the lazy dog.
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-6] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-6] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator sentiment
[pool-1-thread-8] INFO CoreNLP - [/0:0:0:0:0:0:0:1:59706] API call w/annotators tokenize,ssplit,pos,parse,sentiment
The quick brown fox jumped over the lazy dog.
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator sentiment
[pool-1-thread-2] INFO CoreNLP - [/0:0:0:0:0:0:0:1:59709] API call w/annotators tokenize,ssplit,pos,parse,sentiment
This is the worst way to test sentiment ever.
Testing Your Installation
You can call the Stanford server via wget and curl. I like these properties: tokenize, ssplit, parse, sentiment. I am running an instance of the server locally; you can run this on an edge node in your cluster.
curl --data 'This is greatest test ever.' 'http://localhost:9000/?properties={%22annotators%22%3A%22sentiment%22%2C%22outputFormat%22%3A%22json%22}' -o -
wget --post-data 'This is the worst way to test sentiment ever.' 'localhost:9000/?properties={"annotators":"sentiment","outputFormat":"json"}' -O -
--2017-02-02 12:13:51-- http://localhost:9000/?properties=%7B%22annotators%22:%22sentiment%22,%22outputFormat%22:%22json%22%7D
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:9000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4407 (4.3K) [application/json]
Saving to: 'STDOUT'
- 0%[ ] 0 --.-KB/s {"sentences":[{"index":0,"parse":"(ROOT\n (S\n (NP (DT This))\n (VP (VBZ is)\n (NP\n (NP (DT the) (JJS worst) (NN way))\n (PP (TO to)\n (NP (NN test) (NN sentiment))))\n (ADVP (RB ever)))\n (. .)))","basicDependencies":[{"dep":"ROOT","governor":0,"governorGloss":"ROOT","dependent":5,"dependentGloss":"way"},{"dep":"nsubj","governor":5,"governorGloss":"way","dependent":1,"dependentGloss":"This"},{"dep":"cop","governor":5,"governorGloss":"way","dependent":2,"dependentGloss":"is"},{"dep":"det","governor":5,"governorGloss":"way","dependent":3,"dependentGloss":"the"},{"dep":"amod","governor":5,"governorGloss":"way","dependent":4,"dependentGloss":"worst"},{"dep":"case","governor":8,"governorGloss":"sentiment","dependent":6,"dependentGloss":"to"},{"dep":"compound","governor":8,"governorGloss":"sentiment","dependent":7,"dependentGloss":"test"},{"dep":"nmod","governor":5,"governorGloss":"way","dependent":8,"dependentGloss":"sentiment"},{"dep":"advmod","governor":5,"governorGloss":"way","dependent":9,"dependentGloss":"ever"},{"dep":"punct","governor":5,"governorGloss":"way","dependent":10,"dependentGloss":"."}],"enhancedDependencies":[{"dep":"ROOT","governor":0,"governorGloss":"ROOT","dependent":5,"dependentGloss":"way"},{"dep":"nsubj","governor":5,"governorGloss":"way","dependent":1,"dependentGloss":"This"},{"dep":"cop","governor":5,"governorGloss":"way","dependent":2,"dependentGloss":"is"},{"dep":"det","governor":5,"governorGloss":"way","dependent":3,"dependentGloss":"the"},{"dep":"amod","governor":5,"governorGloss":"way","dependent":4,"dependentGloss":"worst"},{"dep":"case","governor":8,"governorGloss":"sentiment","dependent":6,"dependentGloss":"to"},{"dep":"compound","governor":8,"governorGloss":"sentiment","dependent":7,"dependentGloss":"test"},{"dep":"nmod:to","governor":5,"governorGloss":"way","dependent":8,"dependentGloss":"sentiment"},{"dep":"advmod","governor":5,"governorGloss":"way","dependent":9,"dependentGloss":"ever"},{"dep":"punct","governor":5,"governorGloss":"way","dependent":10,"dependentGloss":"."}],"enhancedPlusPlusDependencies":[{"dep":"ROOT","governor":0,"governorGloss":"ROOT","dependent":5,"dependentGloss":"way"},{"dep":"nsubj","governor":5,"governorGloss":"way","dependent":1,"dependentGloss":"This"},{"dep":"cop","governor":5,"governorGloss":"way","dependent":2,"dependentGloss":"is"},{"dep":"det","governor":5,"governorGloss":"way","dependent":3,"dependentGloss":"the"},{"dep":"amod","governor":5,"governorGloss":"way","dependent":4,"dependentGloss":"worst"},{"dep":"case","governor":8,"governorGloss":"sentiment","dependent":6,"dependentGloss":"to"},{"dep":"compound","governor":8,"governorGloss":"sentiment","dependent":7,"dependentGloss":"test"},{"dep":"nmod:to","governor":5,"governorGloss":"way","dependent":8,"dependentGloss":"sentiment"},{"dep":"advmod","governor":5,"governorGloss":"way","dependent":9,"dependentGloss":"ever"},{"dep":"punct","governor":5,"governorGloss":"way","dependent":10,"dependentGloss":"."}],"sentimentValue":"0","sentiment":"Verynegative","tokens":[{"index":1,"word":"This","originalText":"This","characterOffsetBegin":0,"characterOffsetEnd":4,"pos":"DT","before":"","after":" "},{"index":2,"word":"is","originalText":"is","characterOffsetBegin":5,"characterOffsetEnd":7,"pos":"VBZ","before":" ","after":" "},{"index":3,"word":"the","originalText":"the","characterOffsetBegin":8,"characterOffsetEnd":11,"pos":"DT","before":" ","after":" 
"},{"index":4,"word":"worst","originalText":"worst","characterOffsetBegin":12,"characterOffsetEnd":17,"pos":"JJS","before":" ","after":" "},{"index":5,"word":"way","originalText":"way","characterOffsetBegin":18,"characterOffsetEnd":21,"pos":"NN","before":" ","after":" "},{"index":6,"word":"to","originalText":"to","characterOffsetBegin":22,"characterOffsetEnd":24,"pos":"TO","before":" ","after":" "},{"index":7,"word":"test","originalText":"test","characterOffsetBegin":25,"characterOffsetEnd":29,"pos":"NN","before":" ","after":" "},{"index":8,"word":"sentiment","originalText":"sentiment","characterOffsetBegin":30,"characterOffsetEnd":39,"pos":"NN","before":" ","after":" "},{"index":9,"word":"ever","originalText":"ever","characterOffsetBegin":40,"characterOffsetEnd":44,"- 100%[==============================================================================================================>] 4.30K --.-KB/s in 0s
The tool gives you a ton of data on how it ran its NLP analysis, as well as giving you back your sentiment results. You can configure different properties for different language processing; this is well documented by Stanford.

Stanford CoreNLP Server UI
You not only get a REST API, you also get a nice front end.

Accessing From Apache NiFi
Step 1: Get some data (GetTwitter works nicely).
Step 2: Build a file with just one field to send (I extract the Twitter message and then convert that to a FlowFile with no JSON).
Step 3: Use InvokeHTTP to call the sentiment server: http://localhost:9000/?properties=%7B%22annotators%22%3A%22tokenize%2Cssplit%2Cparse%2Csentiment%22%2C%22outputFormat%22%3A%22json%22%7D Make sure you set Content-Type to application/json, Message Body to true, Always Output Response to true, Follow Redirects to true, and HTTP Method to POST.
Step 4: Use the JSON NLP results. The server can also return text and XML, but JSON is easy to work with.
{
"sentences" : [ {
"index" : 0,
"parse" : "(ROOT\n (FRAG\n (NP (NNP RT) (NNP @MikeTamir))\n (: :)\n (S\n (NP (NNP Google))\n (VP (VBG betting)\n (ADJP (JJ big)\n (PP (IN on)\n (S\n (VP (VBG #DeepLearning)\n (NP\n (NP (JJ #AI) (VBG #MachineLearning) (NN #DataScience))\n (: :)\n (NP (NNP Sundar) (NNP Pichai) (NNPS https://t.co/r5X4AnhXUo) (NNP https://t.co/c…)))))))))))",
"basicDependencies" : [ {
"dep" : "ROOT",
"governor" : 0,
"governorGloss" : "ROOT",
"dependent" : 2,
"dependentGloss" : "@MikeTamir"
}, {
"dep" : "compound",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 1,
"dependentGloss" : "RT"
}, {
"dep" : "punct",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 3,
"dependentGloss" : ":"
}, {
"dep" : "nsubj",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 4,
"dependentGloss" : "Google"
}, {
"dep" : "parataxis",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 5,
"dependentGloss" : "betting"
}, {
"dep" : "xcomp",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 6,
"dependentGloss" : "big"
}, {
"dep" : "mark",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 7,
"dependentGloss" : "on"
}, {
"dep" : "advcl",
"governor" : 6,
"governorGloss" : "big",
"dependent" : 8,
"dependentGloss" : "#DeepLearning"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 9,
"dependentGloss" : "#AI"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 10,
"dependentGloss" : "#MachineLearning"
}, {
"dep" : "dobj",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 11,
"dependentGloss" : "#DataScience"
}, {
"dep" : "punct",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 12,
"dependentGloss" : ":"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 13,
"dependentGloss" : "Sundar"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 14,
"dependentGloss" : "Pichai"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 15,
"dependentGloss" : "https://t.co/r5X4AnhXUo"
}, {
"dep" : "dep",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 16,
"dependentGloss" : "https://t.co/c…"
} ],
"enhancedDependencies" : [ {
"dep" : "ROOT",
"governor" : 0,
"governorGloss" : "ROOT",
"dependent" : 2,
"dependentGloss" : "@MikeTamir"
}, {
"dep" : "compound",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 1,
"dependentGloss" : "RT"
}, {
"dep" : "punct",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 3,
"dependentGloss" : ":"
}, {
"dep" : "nsubj",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 4,
"dependentGloss" : "Google"
}, {
"dep" : "parataxis",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 5,
"dependentGloss" : "betting"
}, {
"dep" : "xcomp",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 6,
"dependentGloss" : "big"
}, {
"dep" : "mark",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 7,
"dependentGloss" : "on"
}, {
"dep" : "advcl:on",
"governor" : 6,
"governorGloss" : "big",
"dependent" : 8,
"dependentGloss" : "#DeepLearning"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 9,
"dependentGloss" : "#AI"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 10,
"dependentGloss" : "#MachineLearning"
}, {
"dep" : "dobj",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 11,
"dependentGloss" : "#DataScience"
}, {
"dep" : "punct",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 12,
"dependentGloss" : ":"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 13,
"dependentGloss" : "Sundar"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 14,
"dependentGloss" : "Pichai"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 15,
"dependentGloss" : "https://t.co/r5X4AnhXUo"
}, {
"dep" : "dep",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 16,
"dependentGloss" : "https://t.co/c…"
} ],
"enhancedPlusPlusDependencies" : [ {
"dep" : "ROOT",
"governor" : 0,
"governorGloss" : "ROOT",
"dependent" : 2,
"dependentGloss" : "@MikeTamir"
}, {
"dep" : "compound",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 1,
"dependentGloss" : "RT"
}, {
"dep" : "punct",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 3,
"dependentGloss" : ":"
}, {
"dep" : "nsubj",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 4,
"dependentGloss" : "Google"
}, {
"dep" : "parataxis",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 5,
"dependentGloss" : "betting"
}, {
"dep" : "xcomp",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 6,
"dependentGloss" : "big"
}, {
"dep" : "mark",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 7,
"dependentGloss" : "on"
}, {
"dep" : "advcl:on",
"governor" : 6,
"governorGloss" : "big",
"dependent" : 8,
"dependentGloss" : "#DeepLearning"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 9,
"dependentGloss" : "#AI"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 10,
"dependentGloss" : "#MachineLearning"
}, {
"dep" : "dobj",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 11,
"dependentGloss" : "#DataScience"
}, {
"dep" : "punct",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 12,
"dependentGloss" : ":"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 13,
"dependentGloss" : "Sundar"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 14,
"dependentGloss" : "Pichai"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 15,
"dependentGloss" : "https://t.co/r5X4AnhXUo"
}, {
"dep" : "dep",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 16,
"dependentGloss" : "https://t.co/c…"
} ],
"sentimentValue" : "1",
"sentiment" : "Negative",
"tokens" : [ {
"index" : 1,
"word" : "RT",
"originalText" : "RT",
"characterOffsetBegin" : 0,
"characterOffsetEnd" : 2,
"pos" : "NN",
"before" : "",
"after" : " "
}, {
"index" : 2,
"word" : "@MikeTamir",
"originalText" : "@MikeTamir",
"characterOffsetBegin" : 3,
"characterOffsetEnd" : 13,
"pos" : "NN",
"before" : " ",
"after" : ""
}, {
"index" : 3,
"word" : ":",
"originalText" : ":",
"characterOffsetBegin" : 13,
"characterOffsetEnd" : 14,
"pos" : ":",
"before" : "",
"after" : " "
}, {
"index" : 4,
"word" : "Google",
"originalText" : "Google",
"characterOffsetBegin" : 15,
"characterOffsetEnd" : 21,
"pos" : "NNP",
"before" : " ",
"after" : " "
}, {
"index" : 5,
"word" : "betting",
"originalText" : "betting",
"characterOffsetBegin" : 22,
"characterOffsetEnd" : 29,
"pos" : "VBG",
"before" : " ",
"after" : " "
}, {
"index" : 6,
"word" : "big",
"originalText" : "big",
"characterOffsetBegin" : 30,
"characterOffsetEnd" : 33,
"pos" : "JJ",
"before" : " ",
"after" : " "
}, {
"index" : 7,
"word" : "on",
"originalText" : "on",
"characterOffsetBegin" : 34,
"characterOffsetEnd" : 36,
"pos" : "IN",
"before" : " ",
"after" : " "
}, {
"index" : 8,
"word" : "#DeepLearning",
"originalText" : "#DeepLearning",
"characterOffsetBegin" : 37,
"characterOffsetEnd" : 50,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 9,
"word" : "#AI",
"originalText" : "#AI",
"characterOffsetBegin" : 51,
"characterOffsetEnd" : 54,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 10,
"word" : "#MachineLearning",
"originalText" : "#MachineLearning",
"characterOffsetBegin" : 55,
"characterOffsetEnd" : 71,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 11,
"word" : "#DataScience",
"originalText" : "#DataScience",
"characterOffsetBegin" : 72,
"characterOffsetEnd" : 84,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 12,
"word" : ":",
"originalText" : ":",
"characterOffsetBegin" : 85,
"characterOffsetEnd" : 86,
"pos" : ":",
"before" : " ",
"after" : " "
}, {
"index" : 13,
"word" : "Sundar",
"originalText" : "Sundar",
"characterOffsetBegin" : 87,
"characterOffsetEnd" : 93,
"pos" : "NNP",
"before" : " ",
"after" : " "
}, {
"index" : 14,
"word" : "Pichai",
"originalText" : "Pichai",
"characterOffsetBegin" : 94,
"characterOffsetEnd" : 100,
"pos" : "NNP",
"before" : " ",
"after" : " "
}, {
"index" : 15,
"word" : "https://t.co/r5X4AnhXUo",
"originalText" : "https://t.co/r5X4AnhXUo",
"characterOffsetBegin" : 101,
"characterOffsetEnd" : 124,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 16,
"word" : "https://t.co/c…",
"originalText" : "https://t.co/c…",
"characterOffsetBegin" : 125,
"characterOffsetEnd" : 140,
"pos" : "NN",
"before" : " ",
"after" : ""
} ]
} ]
}
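Outside of NiFi, the same call is easy to make from Python; a minimal sketch, assuming the server above is listening on localhost:9000 and that the requests library is installed:

```python
# Post text to the CoreNLP server started earlier and print per-sentence sentiment.
import json
import requests

props = {"annotators": "tokenize,ssplit,parse,sentiment", "outputFormat": "json"}
resp = requests.post("http://localhost:9000/",
                     params={"properties": json.dumps(props)},
                     data="The quick brown fox jumped over the lazy dog.".encode("utf-8"))
resp.raise_for_status()

for sentence in resp.json()["sentences"]:
    # sentimentValue runs from 0 (very negative) to 4 (very positive)
    print(sentence["sentimentValue"], sentence["sentiment"])
```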
Reference:
Another simple option for sentiment analysis and NLP integration is to use Apache NiFi's ExecuteScript to call various Python libraries. This is well documented here: https://community.hortonworks.com/articles/76935/using-sentiment-analysis-and-nlp-tools-with-hdp-25.html
http://stanfordnlp.github.io/CoreNLP/
https://github.com/stanfordnlp/CoreNLP/
http://stanfordnlp.github.io/CoreNLP/download.html
02-02-2017
02:44 PM
1 Kudo
Dates are always tricky. You need to make sure that the conversion to JSON and then to SQL is producing the correct date format. You are getting a NumberFormatException (see https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java): either the table column has the wrong type or the value is being converted to a number. Check the logs and data provenance.
http://apache-nifi.1125220.n5.nabble.com/Failure-to-insert-update-into-SQL-integer-field-td13054.html
See:
https://community.hortonworks.com/questions/48905/date-problems-with-convertjsontosql-or-putsql-in-n.html
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data-provenance
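To illustrate the kind of normalization that avoids the exception, here is a small Python sketch that rewrites a date field into a JDBC-friendly format before the JSON reaches ConvertJSONToSQL/PutSQL; the field name and both formats are invented for the example (in a flow you would do the equivalent in ExecuteScript or with Expression Language):

```python
# Normalize a date field in a JSON record so PutSQL sees the format the target
# TIMESTAMP column expects. Field name 'created_at' and both formats are placeholders.
import json
from datetime import datetime

record = json.loads('{"id": 1, "created_at": "02/02/2017 14:44:00"}')
parsed = datetime.strptime(record["created_at"], "%m/%d/%Y %H:%M:%S")
record["created_at"] = parsed.strftime("%Y-%m-%d %H:%M:%S")
print(json.dumps(record))  # {"id": 1, "created_at": "2017-02-02 14:44:00"}
```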
01-30-2017
04:46 PM
2017-01-30 16:38:39,425 WARN [Timer-Driven Process Thread-9] o.a.n.c.t.ContinuallyRunProcessorTask
java.lang.IndexOutOfBoundsException: No group 1
at java.util.regex.Matcher.start(Matcher.java:375) ~[na:1.8.0_77]
at java.util.regex.Matcher.appendReplacement(Matcher.java:880) ~[na:1.8.0_77]
at java.util.regex.Matcher.replaceAll(Matcher.java:955) ~[na:1.8.0_77]
at java.lang.String.replaceAll(String.java:2223) ~[na:1.8.0_77]
at org.apache.nifi.processors.standard.ReplaceText$RegexReplace.replace(ReplaceText.java:518) ~[na:na]
at org.apache.nifi.processors.standard.ReplaceText.onTrigger(ReplaceText.java:263) ~[na:na]
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) ~[nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
01-30-2017
03:39 PM
https://zeppelin.apache.org/ is great for Python and Spark and is included in HDP 2.5. Here is a list of other options: https://wiki.python.org/moin/IntegratedDevelopmentEnvironments
- http://www.pydev.org/ on Eclipse is nice
- http://www.scintilla.org/
- https://github.com/spyder-ide/spyder
- http://ninja-ide.org/
- Windows users: http://pytools.codeplex.com/
General text editors like TextWrangler and Sublime Text are good for Python. Built-in: http://idlex.sourceforge.net/. I like using vi and Zeppelin for PySpark.
01-30-2017
06:08 AM
6 Kudos
Open NLP Example
Apache NiFi Processor
I wanted to be able to add NLP processing to my dataflow without calling out to Apache Spark jobs or other disconnected tools. A custom processor lets me write fast Java 8 microservices that process functionality in my stream in a concise way. So I wrote one. All the source code for this processor is available under the Apache license on GitHub. See the attached generated HTML documentation for the processor.

If you would like to use this processor:
git clone https://github.com/tspannhw/nifi-nlp-processor
mvn package
cp nifi-nlp-nar/target/nifi-nlp-nar-1.0.nar /usr/hdf/current/nifi/lib/

You can also download a prebuilt NAR from GitHub. Then restart NiFi via Ambari and you can start using it. This has been tested on HDF 2.x NiFi.

Add the NLP processor, then set its properties: you need to set the sentence that you want parsed, and you can use Expression Language to grab a field from an attribute, as I do to grab the Tweet. You also need to set Extra Resources to a directory where you have downloaded the Apache OpenNLP pre-built models referenced below. Send it a sentence, say from Twitter, and you will get results back.

Results
Two attributes get added to your flow. They contain JSON arrays of locations and names extracted from your sentence (or page of text).
Locations: {"locations":[{"location":"Sydney"}]}
Names: {"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}
Entities are extracted from the text using Apache OpenNLP via a custom NiFi processor.

The current version uses the Apache OpenNLP pre-built models v1.5: en-token.bin, en-ner-person.bin, en-ner-location.bin. You can add other languages and models as enhancements. If you would like to extend the processor, it includes a JUnit test for you to run and extend. It uses the NiFi TestRunner and will let you see the FlowFile, set inputs, and get outputs. Note: the current version supports English only; if you want to extend it, please fork the project and I will merge code in.

References:
Models to download and install to /usr/hdf/current/nifi/lib/: http://opennlp.sourceforge.net/models-1.5/
https://community.hortonworks.com/articles/76240/using-opennlp-for-identifying-names-from-text.html
twittertonlp.xml
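Downstream, the two attributes are plain JSON strings, so pulling out the entity values is simple; a minimal sketch, assuming the attribute values look exactly like the examples above (in a flow you would read them with flowFile.getAttribute in ExecuteScript):

```python
# Parse the JSON written into the 'names' and 'locations' attributes by the processor.
import json

names_attr = '{"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}'
locations_attr = '{"locations":[{"location":"Sydney"}]}'

names = [entry["name"] for entry in json.loads(names_attr)["names"]]
locations = [entry["location"] for entry in json.loads(locations_attr)["locations"]]
print(names)      # ['Tim Spann', 'Peter Smith']
print(locations)  # ['Sydney']
```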
01-30-2017
05:15 AM
4 Kudos
Working with Airbnb's Superset
This is a very cool open source analytics platform based on some cool Python. I installed this on a CentOS 7 edge node.
sudo yum upgrade python-setuptools
sudo yum install gcc libffi-devel python-devel python-pip python-wheel openssl-devel libsasl2-devel openldap-devel
pip install virtualenv
virtualenv venv
. ./venv/bin/activate
pip install --upgrade setuptools pip
pip install mysqlclient
pip install pyhive
pip install superset
fabmanager create-admin --app superset
2017-01-27 18:15:37,864:INFO:flask_appbuilder.security.sqla.manager:Created Permission View: menu access on Query Search
2017-01-27 18:15:37,885:INFO:flask_appbuilder.security.sqla.manager:Added Permission menu access on Query Search to role Admin
Recognized Database Authentications.
2017-01-27 18:15:37,907:INFO:flask_appbuilder.security.sqla.manager:Added user admin
Admin User admin created.
superset db upgrade
superset load_examples
superset init
superset runserver -p 8088

The main thing you will need is Python. Browse to http://yourservername:8088/ and start running queries and building charts and reports. It does a lot of what commercial reporting tools do, but fully open source. Superset + Zeppelin + CLI + ODBC + JDBC give me all the access to my Hadoop, Druid, SparkSQL, and MariaDB data that I need.

Log in as admin with the password you set in fabmanager create-admin. Browsing tables is easy in the web-based platform, and running a query shows the IntelliSense that suggests table names for you. A built-in example report shows how powerful and professional the reports you build with this tool can be. The SQL Lab is a great place to try out queries and examine data: it lets you run queries and explore the data, with quick access to your previous queries and their run status. A simple report can be autogenerated for you by picking a query on one table. Your home page shows the dashboards you have built and recent activity, in a very nice GitHub-style interface.

Reference:
http://airbnb.io/superset/installation.html https://pypi.python.org/pypi/PyHive http://airbnb.io/superset/ https://github.com/airbnb/superset http://druid.io
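Since PyHive is what lets Superset talk to Hive, it is worth confirming the connection from plain Python before adding the database in the UI; a minimal sketch, assuming HiveServer2 is reachable on the default port 10000 (the hostname and username are placeholders):

```python
# Smoke-test the PyHive connection Superset will use (pip install 'pyhive[hive]').
from pyhive import hive

conn = hive.Connection(host="your-edge-node", port=10000, username="hive")
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())
# The matching SQLAlchemy URI to register in Superset would be:
# hive://your-edge-node:10000/default
```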
01-28-2017
04:50 PM
3 Kudos
Preparing a Raspberry Pi to Run TensorFlow Image Recognition
I can easily have a Python script that polls my webcam (use the official Raspberry Pi webcam), calls TensorFlow, and then sends the results to NiFi via MQTT. You need to install the Python MQTT library (https://pypi.python.org/pypi/paho-mqtt/1.1). For setting up Python and a Raspberry Pi with a camera, see https://dzone.com/articles/picamera-ingest-real-time

Raspberry Pi 3 B+ preparation
Buy a good quality 16 GB SD card and, from OSX, run SD Formatter to overwrite-format the device as FAT (download here: https://www.sdcard.org/downloads/formatter_4/). Download the BerryBoot image from here, unzip it, and then copy it to your formatted SD card.

For examples of RPi TensorFlow you can run: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples/
You need to build TensorFlow for the Pi, which took me over 4 hours. See:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/makefile
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples/

Process:
wget https://github.com/tensorflow/tensorflow/archive/master.zip
apt-get install -y libjpeg-dev
cd tensorflow-master
tensorflow/contrib/makefile/download_dependencies.sh
sudo apt-get install -y autoconf automake libtool gcc-4.8 g++-4.8
cd tensorflow/contrib/makefile/downloads/protobuf/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig # refresh shared library cache
cd ../../../../..
make -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI \
 OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8
curl https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015_stripped.zip \
 -o /tmp/inception_dec_2015_stripped.zip
unzip /tmp/inception_dec_2015_stripped.zip \
 -d tensorflow/contrib/pi_examples/label_image/data/
make -f tensorflow/contrib/pi_examples/label_image/Makefile

root@raspberrypi:/opt/demo/tensorflow-master# tensorflow/contrib/pi_examples/label_image/gen/bin/label_image
2017-01-28 01:46:48: I tensorflow/contrib/pi_examples/label_image/label_image.cc:144] Loaded JPEG: 512x600x3
2017-01-28 01:46:50: W tensorflow/core/framework/op_def_util.cc:332] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:378] Running model succeeded!
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] military uniform (866): 0.624294
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] suit (794): 0.0473981
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] academic gown (896): 0.0280925
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] bolo tie (940): 0.0156955
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] bearskin (849): 0.0143348

It took over 4 hours to build, but only 4 seconds to run, and it gave good results for analyzing a picture of computer legend Grace Hopper.

root@raspberrypi:/opt/demo/tensorflow-master# tensorflow/contrib/pi_examples/label_image/gen/bin/label_image --help
2017-01-28 01:51:26: E tensorflow/contrib/pi_examples/label_image/label_image.cc:337]
usage: tensorflow/contrib/pi_examples/label_image/gen/bin/label_image
Flags:
--image="tensorflow/contrib/pi_examples/label_image/data/grace_hopper.jpg" string image to be processed
--graph="tensorflow/contrib/pi_examples/label_image/data/tensorflow_inception_stripped.pb" string graph to be executed
--labels="tensorflow/contrib/pi_examples/label_image/data/imagenet_comp_graph_label_strings.txt" string name of file containing labels
--input_width=299 int32 resize image to this width in pixels
--input_height=299 int32 resize image to this height in pixels
--input_mean=128 int32 scale pixel values to this mean
--input_std=128 int32 scale pixel values to this std deviation
--input_layer="Mul" string name of input layer
--output_layer="softmax" string name of output layer
--self_test=false bool run a self test
--root_dir="" string interpret image and graph file names relative to this directory
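To close the loop described at the top of this post (webcam, TensorFlow, then MQTT into NiFi), a minimal paho-mqtt publish might look like the following; the broker host, topic name, and the hard-coded label payload are assumptions for illustration, and NiFi would pick the message up with ConsumeMQTT:

```python
# Publish a TensorFlow label result over MQTT so a NiFi ConsumeMQTT processor can
# ingest it. Assumes paho-mqtt is installed (pip install paho-mqtt) and a broker
# is reachable at the hypothetical hostname below.
import json
import paho.mqtt.client as mqtt

result = {"label": "military uniform", "score": 0.624294, "source": "raspberrypi"}

client = mqtt.Client()
client.connect("nifi-broker.local", 1883, 60)
client.publish("tensorflow/labels", json.dumps(result), qos=1)
client.disconnect()
```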
01-27-2017
07:08 PM
What version of HDF are you using? Can you upgrade? There are no timestamps in Avro 1.7.7 (http://avro.apache.org/docs/1.7.7/spec.html#schema_primitive); I seem to remember timestamp support being added in 1.8. Here are some Avro notes I did last year for a meetup: https://github.com/airisdata/avroparquet
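Until you can move to Avro 1.8 and its timestamp logical types, the common workaround is to carry the timestamp as epoch milliseconds in a plain long field; a small sketch of that idea using fastavro (the library choice, schema, and field names are my own for illustration):

```python
# Avro 1.7.x-compatible workaround: store timestamps as epoch milliseconds (long).
# fastavro is only used here to show the write; pip install fastavro.
from datetime import datetime, timezone
from fastavro import parse_schema, writer

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "namespace": "example",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "event_time", "type": "long"},  # epoch millis; 1.7.7 has no timestamp type
    ],
})

records = [{
    "id": "abc-123",
    "event_time": int(datetime(2017, 1, 27, tzinfo=timezone.utc).timestamp() * 1000),
}]

with open("/tmp/events.avro", "wb") as out:
    writer(out, schema, records)
```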