02-03-2017
04:48 AM
5 Kudos
Sentiment CoreNLP Processor
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator sentiment
FILE:
Header,Header2,Header3
Value,Value2,Value3
Value4,Value5,Value6
Attribute: {"names":"NEGATIVE"}
Service Source Code
JUnit Test for Processor
To add Sentiment Analysis to your NiFi data flow, just add the custom processor, CoreNLPProcessor. You can download a pre-built NAR from the GitHub release listed below; add it to your NiFi/lib directory and restart each node. The result of the run is an attribute named sentiment. You can see how easy it is to add to your dataflows. If you would like to add more features to this processor, please fork the GitHub project below. This is not an official NiFi processor, just one I wrote in a couple of hours for my own use and for testing.
There are five easy ways to add Sentiment Analysis to your Big Data pipelines: use ExecuteScript with Python NLP scripts (a sketch of this option follows the links below), call my custom processor, make a REST call to a Stanford CoreNLP sentiment server, make a REST call to a public sentiment-as-a-service API, or send a message via Kafka (or JMS) to Spark or Storm to run other JVM sentiment analysis tools.
Download a release: https://github.com/tspannhw/nifi-corenlp-processor/releases/tag/v1.0
sentimentanalysiscustomprocessor.xml
http://stanfordnlp.github.io/CoreNLP
https://github.com/tspannhw/neural-sentiment
https://github.com/tspannhw/nlp-utilities
https://community.hortonworks.com/content/kbentry/81222/adding-stanford-corenlp-to-big-data-pipelines-apac.html
https://community.hortonworks.com/content/repo/81187/nifi-corenlp-processor-example-processor-for-doing.html
https://community.hortonworks.com/repos/79537/various-utilities-and-examples-for-working-with-va.html
https://community.hortonworks.com/articles/76935/using-sentiment-analysis-and-nlp-tools-with-hdp-25.html
https://community.hortonworks.com/questions/20791/sentiment-analysis-with-hdp.html
https://community.hortonworks.com/articles/30213/us-presidential-election-tweet-analysis-using-hdfn.html
https://community.hortonworks.com/articles/52415/processing-social-media-feeds-in-stream-with-apach.html
https://community.hortonworks.com/articles/81222/adding-stanford-corenlp-to-big-data-pipelines-apac.html
https://community.hortonworks.com/content/kbentry/67983/apache-hive-with-apache-hivemall.html
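The first of those options, ExecuteScript with a Python NLP library, can be as small as the hedged sketch below. This is not the CoreNLPProcessor's code; it is a sketch that assumes NLTK is installed and nltk.download('vader_lexicon') has been run.

# Minimal sketch of the ExecuteScript option: score text with NLTK's VADER.
# Assumes: pip install nltk, and nltk.download('vader_lexicon') has been run.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def score_sentiment(text):
    # polarity_scores returns neg/neu/pos/compound floats
    scores = SentimentIntensityAnalyzer().polarity_scores(text)
    # Map the compound score to a coarse label, mirroring a 'sentiment' attribute
    if scores['compound'] >= 0.05:
        return 'POSITIVE'
    if scores['compound'] <= -0.05:
        return 'NEGATIVE'
    return 'NEUTRAL'

print(score_sentiment('This is the greatest test ever.'))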
02-02-2017
09:24 PM
3 Kudos
Using Stanford CoreNLP in Your Big Data Pipelines
CoreNLP Overview
The latest version of Stanford CoreNLP includes a server that you can run and access via a REST API. CoreNLP offers a lot of features, but the one most interesting to me is Sentiment Analysis.
Installation and Setup (http://stanfordnlp.github.io/CoreNLP/corenlp-server.html)
Download a recent full deployment (http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip). It is big: it contains the models, all the JARs, and the server code.
Run the Server
Giving the JVM four gigabytes of RAM makes it run nicely; port 9000 works for me.
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Running the Server
stanford-corenlp-full-2016-10-31 git:(master) ✗ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP - Threads: 8
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-6] INFO CoreNLP - [/0:0:0:0:0:0:0:1:59705] API call w/annotators tokenize,ssplit,parse,pos,sentiment
The quick brown fox jumped over the lazy dog.
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-6] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-6] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
[pool-1-thread-6] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator sentiment
[pool-1-thread-8] INFO CoreNLP - [/0:0:0:0:0:0:0:1:59706] API call w/annotators tokenize,ssplit,pos,parse,sentiment
The quick brown fox jumped over the lazy dog.
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator sentiment
[pool-1-thread-2] INFO CoreNLP - [/0:0:0:0:0:0:0:1:59709] API call w/annotators tokenize,ssplit,pos,parse,sentiment
This is the worst way to test sentiment ever.
Testing Your Installation
You can call the Stanford server via wget or curl. I like these annotators: tokenize, ssplit, parse, sentiment.
curl --data 'This is greatest test ever.' 'http://localhost:9000/?properties={%22annotators%22%3A%22sentiment%22%2C%22outputFormat%22%3A%22json%22}' -o -
I am running an instance of the server locally; you can run this on an edge node in your cluster.
wget --post-data 'This is the worst way to test sentiment ever.' 'localhost:9000/?properties={"annotators":"sentiment","outputFormat":"json"}' -O -
--2017-02-02 12:13:51-- http://localhost:9000/?properties=%7B%22annotators%22:%22sentiment%22,%22outputFormat%22:%22json%22%7D
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:9000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4407 (4.3K) [application/json]
Saving to: 'STDOUT'
- 0%[ ] 0 --.-KB/s {"sentences":[{"index":0,"parse":"(ROOT\n (S\n (NP (DT This))\n (VP (VBZ is)\n (NP\n (NP (DT the) (JJS worst) (NN way))\n (PP (TO to)\n (NP (NN test) (NN sentiment))))\n (ADVP (RB ever)))\n (. .)))","basicDependencies":[{"dep":"ROOT","governor":0,"governorGloss":"ROOT","dependent":5,"dependentGloss":"way"},{"dep":"nsubj","governor":5,"governorGloss":"way","dependent":1,"dependentGloss":"This"},{"dep":"cop","governor":5,"governorGloss":"way","dependent":2,"dependentGloss":"is"},{"dep":"det","governor":5,"governorGloss":"way","dependent":3,"dependentGloss":"the"},{"dep":"amod","governor":5,"governorGloss":"way","dependent":4,"dependentGloss":"worst"},{"dep":"case","governor":8,"governorGloss":"sentiment","dependent":6,"dependentGloss":"to"},{"dep":"compound","governor":8,"governorGloss":"sentiment","dependent":7,"dependentGloss":"test"},{"dep":"nmod","governor":5,"governorGloss":"way","dependent":8,"dependentGloss":"sentiment"},{"dep":"advmod","governor":5,"governorGloss":"way","dependent":9,"dependentGloss":"ever"},{"dep":"punct","governor":5,"governorGloss":"way","dependent":10,"dependentGloss":"."}],"enhancedDependencies":[{"dep":"ROOT","governor":0,"governorGloss":"ROOT","dependent":5,"dependentGloss":"way"},{"dep":"nsubj","governor":5,"governorGloss":"way","dependent":1,"dependentGloss":"This"},{"dep":"cop","governor":5,"governorGloss":"way","dependent":2,"dependentGloss":"is"},{"dep":"det","governor":5,"governorGloss":"way","dependent":3,"dependentGloss":"the"},{"dep":"amod","governor":5,"governorGloss":"way","dependent":4,"dependentGloss":"worst"},{"dep":"case","governor":8,"governorGloss":"sentiment","dependent":6,"dependentGloss":"to"},{"dep":"compound","governor":8,"governorGloss":"sentiment","dependent":7,"dependentGloss":"test"},{"dep":"nmod:to","governor":5,"governorGloss":"way","dependent":8,"dependentGloss":"sentiment"},{"dep":"advmod","governor":5,"governorGloss":"way","dependent":9,"dependentGloss":"ever"},{"dep":"punct","governor":5,"governorGloss":"way","dependent":10,"dependentGloss":"."}],"enhancedPlusPlusDependencies":[{"dep":"ROOT","governor":0,"governorGloss":"ROOT","dependent":5,"dependentGloss":"way"},{"dep":"nsubj","governor":5,"governorGloss":"way","dependent":1,"dependentGloss":"This"},{"dep":"cop","governor":5,"governorGloss":"way","dependent":2,"dependentGloss":"is"},{"dep":"det","governor":5,"governorGloss":"way","dependent":3,"dependentGloss":"the"},{"dep":"amod","governor":5,"governorGloss":"way","dependent":4,"dependentGloss":"worst"},{"dep":"case","governor":8,"governorGloss":"sentiment","dependent":6,"dependentGloss":"to"},{"dep":"compound","governor":8,"governorGloss":"sentiment","dependent":7,"dependentGloss":"test"},{"dep":"nmod:to","governor":5,"governorGloss":"way","dependent":8,"dependentGloss":"sentiment"},{"dep":"advmod","governor":5,"governorGloss":"way","dependent":9,"dependentGloss":"ever"},{"dep":"punct","governor":5,"governorGloss":"way","dependent":10,"dependentGloss":"."}],"sentimentValue":"0","sentiment":"Verynegative","tokens":[{"index":1,"word":"This","originalText":"This","characterOffsetBegin":0,"characterOffsetEnd":4,"pos":"DT","before":"","after":" "},{"index":2,"word":"is","originalText":"is","characterOffsetBegin":5,"characterOffsetEnd":7,"pos":"VBZ","before":" ","after":" "},{"index":3,"word":"the","originalText":"the","characterOffsetBegin":8,"characterOffsetEnd":11,"pos":"DT","before":" ","after":" 
"},{"index":4,"word":"worst","originalText":"worst","characterOffsetBegin":12,"characterOffsetEnd":17,"pos":"JJS","before":" ","after":" "},{"index":5,"word":"way","originalText":"way","characterOffsetBegin":18,"characterOffsetEnd":21,"pos":"NN","before":" ","after":" "},{"index":6,"word":"to","originalText":"to","characterOffsetBegin":22,"characterOffsetEnd":24,"pos":"TO","before":" ","after":" "},{"index":7,"word":"test","originalText":"test","characterOffsetBegin":25,"characterOffsetEnd":29,"pos":"NN","before":" ","after":" "},{"index":8,"word":"sentiment","originalText":"sentiment","characterOffsetBegin":30,"characterOffsetEnd":39,"pos":"NN","before":" ","after":" "},{"index":9,"word":"ever","originalText":"ever","characterOffsetBegin":40,"characterOffsetEnd":44,"- 100%[==============================================================================================================>] 4.30K --.-KB/s in 0s
The tool gives you a ton of data on how it ran its NLP analysis, as well as giving you back your sentiment results. You can configure different properties for different language processing; this is well documented by Stanford.
Stanford CoreNLP Server UI
You not only get a REST API, you also get a nice front end.
Accessing From Apache NiFi
Step 1: Get some data (GetTwitter works nicely).
Step 2: Build a FlowFile with just the one field to send (I extract the Twitter message and then convert that to a FlowFile with no JSON).
Step 3: Use InvokeHTTP to call the sentiment server:
http://localhost:9000/?properties=%7B%22annotators%22%3A%22tokenize%2Cssplit%2Cparse%2Csentiment%22%2C%22outputFormat%22%3A%22json%22%7D
Make sure you set Content-Type to application/json, set Message Body to true, Always Output Response to true, Follow Redirects to true, and HTTP Method to POST.
Step 4: Use the JSON NLP results (an extraction sketch follows the sample response below). The server can also return text and XML, but JSON is easy to work with.
{
"sentences" : [ {
"index" : 0,
"parse" : "(ROOT\n (FRAG\n (NP (NNP RT) (NNP @MikeTamir))\n (: :)\n (S\n (NP (NNP Google))\n (VP (VBG betting)\n (ADJP (JJ big)\n (PP (IN on)\n (S\n (VP (VBG #DeepLearning)\n (NP\n (NP (JJ #AI) (VBG #MachineLearning) (NN #DataScience))\n (: :)\n (NP (NNP Sundar) (NNP Pichai) (NNPS https://t.co/r5X4AnhXUo) (NNP https://t.co/c…)))))))))))",
"basicDependencies" : [ {
"dep" : "ROOT",
"governor" : 0,
"governorGloss" : "ROOT",
"dependent" : 2,
"dependentGloss" : "@MikeTamir"
}, {
"dep" : "compound",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 1,
"dependentGloss" : "RT"
}, {
"dep" : "punct",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 3,
"dependentGloss" : ":"
}, {
"dep" : "nsubj",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 4,
"dependentGloss" : "Google"
}, {
"dep" : "parataxis",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 5,
"dependentGloss" : "betting"
}, {
"dep" : "xcomp",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 6,
"dependentGloss" : "big"
}, {
"dep" : "mark",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 7,
"dependentGloss" : "on"
}, {
"dep" : "advcl",
"governor" : 6,
"governorGloss" : "big",
"dependent" : 8,
"dependentGloss" : "#DeepLearning"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 9,
"dependentGloss" : "#AI"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 10,
"dependentGloss" : "#MachineLearning"
}, {
"dep" : "dobj",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 11,
"dependentGloss" : "#DataScience"
}, {
"dep" : "punct",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 12,
"dependentGloss" : ":"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 13,
"dependentGloss" : "Sundar"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 14,
"dependentGloss" : "Pichai"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 15,
"dependentGloss" : "https://t.co/r5X4AnhXUo"
}, {
"dep" : "dep",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 16,
"dependentGloss" : "https://t.co/c…"
} ],
"enhancedDependencies" : [ {
"dep" : "ROOT",
"governor" : 0,
"governorGloss" : "ROOT",
"dependent" : 2,
"dependentGloss" : "@MikeTamir"
}, {
"dep" : "compound",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 1,
"dependentGloss" : "RT"
}, {
"dep" : "punct",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 3,
"dependentGloss" : ":"
}, {
"dep" : "nsubj",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 4,
"dependentGloss" : "Google"
}, {
"dep" : "parataxis",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 5,
"dependentGloss" : "betting"
}, {
"dep" : "xcomp",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 6,
"dependentGloss" : "big"
}, {
"dep" : "mark",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 7,
"dependentGloss" : "on"
}, {
"dep" : "advcl:on",
"governor" : 6,
"governorGloss" : "big",
"dependent" : 8,
"dependentGloss" : "#DeepLearning"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 9,
"dependentGloss" : "#AI"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 10,
"dependentGloss" : "#MachineLearning"
}, {
"dep" : "dobj",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 11,
"dependentGloss" : "#DataScience"
}, {
"dep" : "punct",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 12,
"dependentGloss" : ":"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 13,
"dependentGloss" : "Sundar"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 14,
"dependentGloss" : "Pichai"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 15,
"dependentGloss" : "https://t.co/r5X4AnhXUo"
}, {
"dep" : "dep",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 16,
"dependentGloss" : "https://t.co/c…"
} ],
"enhancedPlusPlusDependencies" : [ {
"dep" : "ROOT",
"governor" : 0,
"governorGloss" : "ROOT",
"dependent" : 2,
"dependentGloss" : "@MikeTamir"
}, {
"dep" : "compound",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 1,
"dependentGloss" : "RT"
}, {
"dep" : "punct",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 3,
"dependentGloss" : ":"
}, {
"dep" : "nsubj",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 4,
"dependentGloss" : "Google"
}, {
"dep" : "parataxis",
"governor" : 2,
"governorGloss" : "@MikeTamir",
"dependent" : 5,
"dependentGloss" : "betting"
}, {
"dep" : "xcomp",
"governor" : 5,
"governorGloss" : "betting",
"dependent" : 6,
"dependentGloss" : "big"
}, {
"dep" : "mark",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 7,
"dependentGloss" : "on"
}, {
"dep" : "advcl:on",
"governor" : 6,
"governorGloss" : "big",
"dependent" : 8,
"dependentGloss" : "#DeepLearning"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 9,
"dependentGloss" : "#AI"
}, {
"dep" : "amod",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 10,
"dependentGloss" : "#MachineLearning"
}, {
"dep" : "dobj",
"governor" : 8,
"governorGloss" : "#DeepLearning",
"dependent" : 11,
"dependentGloss" : "#DataScience"
}, {
"dep" : "punct",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 12,
"dependentGloss" : ":"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 13,
"dependentGloss" : "Sundar"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 14,
"dependentGloss" : "Pichai"
}, {
"dep" : "compound",
"governor" : 16,
"governorGloss" : "https://t.co/c…",
"dependent" : 15,
"dependentGloss" : "https://t.co/r5X4AnhXUo"
}, {
"dep" : "dep",
"governor" : 11,
"governorGloss" : "#DataScience",
"dependent" : 16,
"dependentGloss" : "https://t.co/c…"
} ],
"sentimentValue" : "1",
"sentiment" : "Negative",
"tokens" : [ {
"index" : 1,
"word" : "RT",
"originalText" : "RT",
"characterOffsetBegin" : 0,
"characterOffsetEnd" : 2,
"pos" : "NN",
"before" : "",
"after" : " "
}, {
"index" : 2,
"word" : "@MikeTamir",
"originalText" : "@MikeTamir",
"characterOffsetBegin" : 3,
"characterOffsetEnd" : 13,
"pos" : "NN",
"before" : " ",
"after" : ""
}, {
"index" : 3,
"word" : ":",
"originalText" : ":",
"characterOffsetBegin" : 13,
"characterOffsetEnd" : 14,
"pos" : ":",
"before" : "",
"after" : " "
}, {
"index" : 4,
"word" : "Google",
"originalText" : "Google",
"characterOffsetBegin" : 15,
"characterOffsetEnd" : 21,
"pos" : "NNP",
"before" : " ",
"after" : " "
}, {
"index" : 5,
"word" : "betting",
"originalText" : "betting",
"characterOffsetBegin" : 22,
"characterOffsetEnd" : 29,
"pos" : "VBG",
"before" : " ",
"after" : " "
}, {
"index" : 6,
"word" : "big",
"originalText" : "big",
"characterOffsetBegin" : 30,
"characterOffsetEnd" : 33,
"pos" : "JJ",
"before" : " ",
"after" : " "
}, {
"index" : 7,
"word" : "on",
"originalText" : "on",
"characterOffsetBegin" : 34,
"characterOffsetEnd" : 36,
"pos" : "IN",
"before" : " ",
"after" : " "
}, {
"index" : 8,
"word" : "#DeepLearning",
"originalText" : "#DeepLearning",
"characterOffsetBegin" : 37,
"characterOffsetEnd" : 50,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 9,
"word" : "#AI",
"originalText" : "#AI",
"characterOffsetBegin" : 51,
"characterOffsetEnd" : 54,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 10,
"word" : "#MachineLearning",
"originalText" : "#MachineLearning",
"characterOffsetBegin" : 55,
"characterOffsetEnd" : 71,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 11,
"word" : "#DataScience",
"originalText" : "#DataScience",
"characterOffsetBegin" : 72,
"characterOffsetEnd" : 84,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 12,
"word" : ":",
"originalText" : ":",
"characterOffsetBegin" : 85,
"characterOffsetEnd" : 86,
"pos" : ":",
"before" : " ",
"after" : " "
}, {
"index" : 13,
"word" : "Sundar",
"originalText" : "Sundar",
"characterOffsetBegin" : 87,
"characterOffsetEnd" : 93,
"pos" : "NNP",
"before" : " ",
"after" : " "
}, {
"index" : 14,
"word" : "Pichai",
"originalText" : "Pichai",
"characterOffsetBegin" : 94,
"characterOffsetEnd" : 100,
"pos" : "NNP",
"before" : " ",
"after" : " "
}, {
"index" : 15,
"word" : "https://t.co/r5X4AnhXUo",
"originalText" : "https://t.co/r5X4AnhXUo",
"characterOffsetBegin" : 101,
"characterOffsetEnd" : 124,
"pos" : "NN",
"before" : " ",
"after" : " "
}, {
"index" : 16,
"word" : "https://t.co/c…",
"originalText" : "https://t.co/c…",
"characterOffsetBegin" : 125,
"characterOffsetEnd" : 140,
"pos" : "NN",
"before" : " ",
"after" : ""
} ]
} ]
}
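For Step 4, the fields you usually want are sentiment and sentimentValue on each element of the sentences array. A minimal extraction sketch, where response_text stands in for the raw JSON shown above:

# Pull per-sentence sentiment out of a parsed CoreNLP JSON response.
# 'response_text' is assumed to hold the raw JSON shown above.
import json

doc = json.loads(response_text)
for s in doc["sentences"]:
    print("sentence %s: %s (score %s)" % (s["index"], s["sentiment"], s["sentimentValue"]))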
Reference:
Another simple option for Sentiment Analysis and NLP integration is to use Apache NiFi's ExecuteScript to call various Python libraries. This is well documented here:
https://community.hortonworks.com/articles/76935/using-sentiment-analysis-and-nlp-tools-with-hdp-25.html
http://stanfordnlp.github.io/CoreNLP/
https://github.com/stanfordnlp/CoreNLP/
http://stanfordnlp.github.io/CoreNLP/download.html
01-30-2017
06:08 AM
6 Kudos
Open NLP Example
Apache NiFi Processor
I wanted to be able to add NLP processing to my dataflow without calling out to Apache Spark jobs or other disconnected systems. A custom processor lets me write fast Java 8 microservices to process functionality in my stream in a concise way. So I wrote one. All the source code for this processor is available on GitHub under the Apache license; see the attached generated HTML documentation for the processor. If you would like to use this processor:
git clone https://github.com/tspannhw/nifi-nlp-processor
mvn package
cp nifi-nlp-nar/target/nifi-nlp-nar-1.0.nar /usr/hdf/current/nifi/lib/
You can also download a prebuilt NAR from GitHub. Then restart NiFi via Ambari and you can start using it. This has been tested with HDF 2.x NiFi. Add the NLP Processor, then set its properties: you need to set the sentence that you want parsed, and you can use Expression Language to grab a field from an attribute, as I do to grab the Tweet. You also need to set Extra Resources to a directory where you have downloaded the prebuilt Apache OpenNLP models referenced below. Send it a sentence, say from Twitter, and you will get back results.
Results
Two attributes get added to your flow (a sketch of parsing them downstream follows the references below). They contain JSON arrays of locations and names extracted from your sentence (or page of text).
Locations
{"locations":[{"location":"Sydney"}]}
Names
{"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}
Entities are extracted from the text using Apache OpenNLP via a custom NiFi processor.
Current Version Uses (Apache OpenNLP Pre-built Models v1.5)
en-token.bin
en-ner-person.bin
en-ner-location.bin
You can add other languages and models as enhancements. If you would like to extend the processor, it includes a JUnit test for you to run and extend. It uses the NiFi TestRunner and will let you set inputs, get outputs, and inspect the FlowFile. Note: the current version supports English only; if you want to extend it, please fork the project and I will merge code in.
References:
Models to download and install to /usr/hdf/current/nifi/lib/:
http://opennlp.sourceforge.net/models-1.5/
https://community.hortonworks.com/articles/76240/using-opennlp-for-identifying-names-from-text.html
twittertonlp.xml
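As a small downstream illustration, here is a hedged Python sketch that parses the two attribute payloads shown above (the attribute values are copied from the examples; in a real flow they would come from the FlowFile attributes):

# Parse the JSON arrays the processor writes into the 'names' and 'locations' attributes.
import json

names_attr = '{"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}'
locations_attr = '{"locations":[{"location":"Sydney"}]}'

names = [n["name"] for n in json.loads(names_attr)["names"]]
locations = [l["location"] for l in json.loads(locations_attr)["locations"]]
print("names: %s, locations: %s" % (names, locations))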
01-30-2017
05:15 AM
4 Kudos
Working with airbnb's Superset
This is a very cool open source analytics platform based on some cool Python. I installed this on a CentOS 7 edge node.
sudo yum upgrade python-setuptools
sudo yum install gcc libffi-devel python-devel python-pip python-wheel openssl-devel libsasl2-devel openldap-devel
pip install virtualenv
virtualenv venv
. ./venv/bin/activate
pip install --upgrade setuptools pip
pip install mysqlclient
pip install pyhive
pip install superset
fabmanager create-admin --app superset
2017-01-27 18:15:37,864:INFO:flask_appbuilder.security.sqla.manager:Created Permission View: menu access on Query Search
2017-01-27 18:15:37,885:INFO:flask_appbuilder.security.sqla.manager:Added Permission menu access on Query Search to role Admin
Recognized Database Authentications.
2017-01-27 18:15:37,907:INFO:flask_appbuilder.security.sqla.manager:Added user admin
Admin User admin created.
superset db upgrade
superset load_examples
superset init
superset runserver -p 8088
The main thing you will need is Python. Browse to http://yourservername:8088/ and start running queries and building charts and reports. Superset does a lot of what commercial reporting tools do, but it is fully open source. Superset + Zeppelin + CLI + ODBC + JDBC give me all the access to my Hadoop, Druid, SparkSQL and MariaDB data that I need. Log in as admin with the password you set in fabmanager create-admin. Browsing tables is easy in the web-based platform, and the query editor has IntelliSense that suggests table names for you. SQL Lab is a great place to try out queries and explore your data, with quick access to your previous queries and their run status. Built-in example reports show how powerful and professional the reports you build with this tool can be, and a simple report can be autogenerated for you from a query on one table. Your home page shows the dashboards you have built and recent activity in a very nice GitHub-style interface. (A sketch for wiring up a Hive datasource follows below.)
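Since pyhive was installed above, a quick connectivity check before registering Hive in Superset can look like this sketch (the host, port and database are placeholders for your environment; in Superset you would then register a SQLAlchemy URI such as hive://user@your-hive-host:10000/default):

# Quick pyhive connectivity check before adding Hive as a Superset datasource.
# 'your-hive-host', the port and the database are placeholders.
from pyhive import hive

conn = hive.connect(host="your-hive-host", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for row in cursor.fetchall():
    print(row)
conn.close()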
Reference:
http://airbnb.io/superset/installation.html
https://pypi.python.org/pypi/PyHive
http://airbnb.io/superset/
https://github.com/airbnb/superset
http://druid.io
01-28-2017
04:50 PM
3 Kudos
Preparing a Raspberry PI to Run TensorFlow Image Recognition
I can easily have a Python script that polls my webcam (use the official Raspberry Pi webcam), calls TensorFlow, and then sends the results to NiFi via MQTT (a sketch of that script follows the output below). You need to install the Python MQTT library (https://pypi.python.org/pypi/paho-mqtt/1.1). For setting up Python and a Raspberry Pi with a camera, see https://dzone.com/articles/picamera-ingest-real-time
Raspberry Pi 3 B+ preparation
Buy a good quality 16 GB SD card. From OSX, run SD Formatter to overwrite-format the device as FAT; download it here: https://www.sdcard.org/downloads/formatter_4/. Download the BerryBoot image from here; unzip it and then copy it to your formatted SD card.
For examples of RPi TensorFlow you can run, see:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples/
You need to build TensorFlow for the Pi, which took me over 4 hours. See:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/makefile
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples/
Process:
wget https://github.com/tensorflow/tensorflow/archive/master.zip
apt-get install -y libjpeg-dev
cd tensorflow-master
tensorflow/contrib/makefile/download_dependencies.sh
sudo apt-get install -y autoconf automake libtool gcc-4.8 g++-4.8
cd tensorflow/contrib/makefile/downloads/protobuf/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig # refresh shared library cache
cd ../../../../..
make -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI \
 OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8
curl https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015_stripped.zip \
 -o /tmp/inception_dec_2015_stripped.zip
unzip /tmp/inception_dec_2015_stripped.zip \
 -d tensorflow/contrib/pi_examples/label_image/data/
make -f tensorflow/contrib/pi_examples/label_image/Makefile
root@raspberrypi:/opt/demo/tensorflow-master# tensorflow/contrib/pi_examples/label_image/gen/bin/label_image
2017-01-28 01:46:48: I tensorflow/contrib/pi_examples/label_image/label_image.cc:144] Loaded JPEG: 512x600x3
2017-01-28 01:46:50: W tensorflow/core/framework/op_def_util.cc:332] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:378] Running model succeeded!
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] military uniform (866): 0.624294
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] suit (794): 0.0473981
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] academic gown (896): 0.0280925
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] bolo tie (940): 0.0156955
2017-01-28 01:46:52: I tensorflow/contrib/pi_examples/label_image/label_image.cc:272] bearskin (849): 0.0143348
It took over 4 hours to build, but only 4 seconds to run, and it gave good results for analyzing a picture of computer legend Grace Hopper.
root@raspberrypi:/opt/demo/tensorflow-master# tensorflow/contrib/pi_examples/label_image/gen/bin/label_image --help
2017-01-28 01:51:26: E tensorflow/contrib/pi_examples/label_image/label_image.cc:337]
usage: tensorflow/contrib/pi_examples/label_image/gen/bin/label_image
Flags:
 --image="tensorflow/contrib/pi_examples/label_image/data/grace_hopper.jpg" string image to be processed
 --graph="tensorflow/contrib/pi_examples/label_image/data/tensorflow_inception_stripped.pb" string graph to be executed
 --labels="tensorflow/contrib/pi_examples/label_image/data/imagenet_comp_graph_label_strings.txt" string name of file containing labels
 --input_width=299 int32 resize image to this width in pixels
 --input_height=299 int32 resize image to this height in pixels
 --input_mean=128 int32 scale pixel values to this mean
 --input_std=128 int32 scale pixel values to this std deviation
 --input_layer="Mul" string name of input layer
 --output_layer="softmax" string name of output layer
 --self_test=false bool run a self test
 --root_dir="" string interpret image and graph file names relative to this directory
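As mentioned at the top of this article, the next step is a small Python script that runs the classifier and publishes the results to NiFi over MQTT. Here is a minimal sketch under stated assumptions: the broker host broker-host and the topic tensorflow/results are placeholder names, and the label_image binary is the one built above.

#!/usr/bin/python
# Sketch: run the label_image binary built above and publish its output over MQTT.
# 'broker-host' and the 'tensorflow/results' topic are placeholder names.
import subprocess
import paho.mqtt.publish as publish

BIN = "tensorflow/contrib/pi_examples/label_image/gen/bin/label_image"

# label_image writes its top-5 labels to the log stream, so capture stderr too
output = subprocess.check_output([BIN], stderr=subprocess.STDOUT)
publish.single("tensorflow/results", payload=output, hostname="broker-host", port=1883)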
01-25-2017
10:47 PM
To add more to the configuration, restart the daemon after editing with sudo service osqueryd restart. I turned on some extra packs in /etc/osquery/osquery.conf:
"packs": {
"osquery-monitoring": "/usr/share/osquery/packs/osquery-monitoring.conf",
"incident-response": "/usr/share/osquery/packs/incident-response.conf",
"it-compliance": "/usr/share/osquery/packs/it-compliance.conf",
// "osx-attacks": "/usr/share/osquery/packs/osx-attacks.conf",
// "vuln-management": "/usr/share/osquery/packs/vuln-management.conf",
"hardware-monitoring": "/usr/share/osquery/packs/hardware-monitoring.conf"
}
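To verify the packs actually loaded, you can query osquery's own osquery_packs table (it appears in the table list in the next article); a hedged sketch, assuming osqueryi is on the PATH:

# Check which packs osquery loaded by querying the osquery_packs table.
import json
import subprocess

out = subprocess.check_output(["osqueryi", "--json", "select * from osquery_packs"])
for pack in json.loads(out):
    print(pack)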
01-25-2017
10:07 PM
4 Kudos
OSQuery
OSQuery is a cool tool that lets you query your servers via SQL. It supports Windows, OSX and most Linux variants.
Installing osquery
wget https://osquery-packages.s3.amazonaws.com/centos7/osquery-2.2.1-1.el7.x86_64.rpm
rpm -ivh osquery-2.2.1-1.el7.x86_64.rpm
sudo cp /usr/share/osquery/osquery.example.conf /etc/osquery/osquery.conf
sudo service osqueryd start
Redirecting to /bin/systemctl start osqueryd.service
sudo service osqueryd status
Redirecting to /bin/systemctl status osqueryd.service
● osqueryd.service - The osquery Daemon
 Loaded: loaded (/usr/lib/systemd/system/osqueryd.service; disabled; vendor preset: disabled)
 Active: active (running) since Wed 2017-01-25 20:32:06 UTC; 4s ago
 Process: 21531 ExecStartPre=/bin/sh -c if [ ! -f $FLAG_FILE ]; then touch $FLAG_FILE; fi (code=exited, status=0/SUCCESS)
 Main PID: 21534 (osqueryd)
 CGroup: /system.slice/osqueryd.service
 ├─21534 /usr/bin/osqueryd --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
 └─21537 osqueryd: worker
Jan 25 20:32:06 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:06.240648 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:06 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:06.248845 21543 query.cpp:68] Storing initial results for new scheduled query: system_info
Jan 25 20:32:06 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:06.540765 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:06 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:06.836784 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:07 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:07.134472 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:07 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:07.414026 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:09 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:09.205369 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:09 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:09.495270 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:09 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:09.792325 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Jan 25 20:32:10 tspanndev10.field.hortonworks.com osqueryd[21534]: I0125 20:32:10.083355 21543 scheduler.cpp:63] Executing scheduled query: system_info: SELECT hostname, cpu_brand, physica...ystem_info;
Hint: Some lines were ellipsized, use -l to show in full.
[root@tspanndev10 demo]# osqueryi
Using a virtual database. Need help, type '.help'
osquery> .exit
[root@tspanndev10 demo]# osqueryi --json "select * from routes where destination = '::1'"
[
 {"destination":"::1","flags":"0","gateway":"","interface":"lo","metric":"0","mtu":"0","netmask":"0","source":"","type":"local"}
]
[root@tspanndev10 demo]# osqueryi --json ".tables"
=> acpi_tables => apt_sources => arp_cache => augeas => authorized_keys => block_devices => carbon_black_info => chrome_extensions => cpu_time => cpuid => crontab => deb_packages => device_file => device_hash => device_partitions => disk_encryption => dns_resolvers => etc_hosts => etc_protocols => etc_services => file => file_events => firefox_addons => groups
=> hardware_events => hash => interface_addresses => interface_details => iptables => kernel_info => kernel_integrity => kernel_modules => known_hosts => last => listening_ports => logged_in_users => magic => memory_info => memory_map => mounts => msr => opera_extensions => os_version => osquery_events => osquery_extensions => osquery_flags => osquery_info
=> osquery_packs => osquery_registry => osquery_schedule => pci_devices => platform_info => portage_keywords => portage_packages => portage_use => process_envs => process_events => process_memory_map => process_open_files => process_open_sockets => processes => routes => rpm_package_files => rpm_packages => shared_memory => shell_history => smbios_tables
=> socket_events => sudoers => suid_bin => syslog => system_controls => system_info => time => uptime => usb_devices => user_events => user_groups => user_ssh_keys => users => yara => yara_events
osqueryi --json "select * from system_info"
[ {"computer_name":"timserver.com","cpu_brand":"Intel Xeon E312xx (Sandy Bridge)","cpu_logical_cores":"8","cpu_physical_cores":"8","cpu_subtype":"42","cpu_type":"6","hardware_model":"OpenStack Nova","hardware_serial":"00000000-0000-0000-0000-0cc47ab4bfdc","hardware_vendor":"OpenStack Foundation","hardware_version":"13.1.1","hostname":"timserver.com","physical_memory":"15601471488","uuid":"0BDAB55A-3709-41BA-85A8-84CB628BACF2"}]
Logs are written to /var/log/osquery (osqueryd.INFO, osqueryd.results.log).
Result Through NiFi
[
 {"computer_name":"tspannserver","cpu_brand":"Intel Xeon E312xx (Sandy Bridge)","cpu_logical_cores":"8","cpu_physical_cores":"8","cpu_subtype":"42","cpu_type":"6","hardware_model":"","hardware_serial":"","hardware_vendor":"","hardware_version":"","hostname":"tspannserver","physical_memory":"15601471488","uuid":"e877cbb9-175e-48c8-a6d9-ff824791d204"}
]
JSON Path Extraction
$.[0].computer_name
Apache Phoenix Table
CREATE TABLE osquery (uuid varchar not null primary key, computer_name varchar, cpu_logical_cores varchar, filename varchar, cpu_physical_cores varchar, cpu_brand varchar, physical_memory varchar);
Phoenix Query
upsert into osquery (uuid, computer_name, cpu_logical_cores, filename, cpu_physical_cores, cpu_brand, physical_memory) values ('${'uuid'}','${'computer_name'}','${'cpu_logical_cores'}','${'filename'}','${'cpu_physical_cores'}','${'cpu_brand'}','${'physical_memory'}')
Caveat: if you have a type mismatch on an upsert, you will see an error like this:
21:54:45 UTC ERROR 30d6398f-310f-1cac-b1d6-39d48b542b1e server:port PutSQL[id=30d6398f-310f-1cac-b1d6-39d48b542b1e] Failed to update database for [StandardFlowFileRecord[uuid=0228c884-cff8-4468-a082-d24cf9df6c11,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1485381190973-6753, container=default, section=609], offset=162842, length=277],offset=0,name=2355976778239757,size=277]] due to org.apache.phoenix.exception.BatchUpdateExecution: ERROR 1106 (XCL06): Exception while executing batch.; it is possible that retrying the operation will succeed, so routing to retry: org.apache.phoenix.exception.BatchUpdateExecution: ERROR 1106 (XCL06): Exception while executing batch.
Reference
https://osquery.readthedocs.io/en/stable/introduction/using-osqueryi/
https://osquery.io/
https://osquery.io/docs/tables/
http://jsonpath.com/
01-20-2017
09:47 PM
3 Kudos
You can run the attack library for OSX or Linux from an edge node or from outside the cluster. I ran it from my OSX laptop against a cluster that I had network access to. You should try to scan from inside your network, from an edge node, and from a remote site on the Internet. You will need Python 2.7 or Python 3.x installed first.
git clone git@github.com:CERT-W/hadoop-attack-library.git
pip install requests lxml
You may need root or sudo access to install these on your machine. One of the scanners hits the WebHDFS endpoint that you may have seen a warning about (a quick stand-alone WebHDFS check is sketched after the listing below).
python hdfsbrowser.py timscluster
Beginning to test services accessibility using default ports ...
Testing service WebHDFS
[+] Service WebHDFS is available
Testing service HttpFS
[-] Exception during requesting the service
[+] Sucessfully retrieved 1 services
drwxrwxrwx hdfs:hdfs 2017-01-15T05:50:27+0000 /
drwxrwxrwx yarn:hadoop 2017-01-11T19:25:26+0000 app-logs /app-logs
drwxrwxrwx hdfs:hdfs 2016-12-21T23:12:40+0000 apps /apps
drwxrwxrwx yarn:hadoop 2016-09-15T21:02:30+0000 ats /ats
drwxrwxrwx root:hdfs 2016-12-21T23:08:34+0000 avroresults /avroresults
drwxrwxrwx hdfs:hdfs 2016-12-13T03:42:55+0000 banking /banking
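To sanity-check the same exposure by hand, here is a hedged Python sketch against the standard WebHDFS REST API (timscluster is the placeholder host used in this article, and 50070 is the default NameNode HTTP port):

# Quick manual probe of the WebHDFS endpoint the scanner above exercises.
import requests

resp = requests.get("http://timscluster:50070/webhdfs/v1/?op=LISTSTATUS", timeout=10)
print(resp.status_code)
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print("%s %s" % (entry["type"], entry["pathSuffix"]))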
To see how exposed your Hadoop configuration files are, you can use Hadoop Snooper. This is under: Tools\ Techniques\ and\ Procedures\ Getting\ the\ target\ environment\ configuration
python hadoopsnooper.py timscluster -o test
Specified destination path does not exist, do you want to create it ? [y/N]y
[+] Creating configuration directory
[+] core-site.xml successfully created
[+] mapred-site.xml successfully created
[+] yarn-site.xml successfully created
This downloads all those configuration files to a directory named test. These were not the full configuration files, but they pointed at the correct internal servers and give an attacker more information. Another scan worth running is sqlmap, a tool that lets you test the various SQL engines in the system. SQLMap requires Python 2.6 or 2.7.
➜ projects git clone https://github.com/sqlmapproject/sqlmap.git sqlmap-dev
Cloning into 'sqlmap-dev'...
remote: Counting objects: 55560, done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 55560 (delta 22), reused 0 (delta 0), pack-reused 55519
Receiving objects: 100% (55560/55560), 47.25 MiB | 2.28 MiB/s, done.
Resolving deltas: 100% (42960/42960), done.
Checking connectivity... done.
➜ projects python sqlmap.py --update
➜ projects cd sqlmap-dev
➜ sqlmap-dev git:(master) python sqlmap.py --update
___
__H__
___ ___[.]_____ ___ ___ {1.1.1.14#dev}
|_ -| . [)] | .'| . |
|___|_ [']_|_|_|__,| _|
|_|V |_| http://sqlmap.org
[!] legal disclaimer: Usage of sqlmap for attacking targets without prior mutual consent is illegal. It is the end user's responsibility to obey all applicable local, state and federal laws. Developers assume no liability and are not responsible for any misuse or damage caused by this program
[*] starting at 16:49:13
[16:49:13] [INFO] updating sqlmap to the latest development version from the GitHub repository
[16:49:13] [INFO] update in progress .
[16:49:14] [INFO] already at the latest revision 'f542e82'
[*] shutting down at 16:49:14
References:
http://sqlmap.org/
http://www.slideshare.net/bunkertor/hadoop-security-54483815
http://tools.kali.org/
https://github.com/savio-code/hexorbase
https://community.hortonworks.com/articles/73035/running-dns-and-domain-scanning-tools-from-apache.html
01-15-2017
05:42 PM
2 Kudos
Raspberry Pis and other small devices often have cameras or can have cameras attached. Raspberry Pis have cheap camera add-ons that can ingest still images and videos (https://www.raspberrypi.org/products/camera-module/). Using a simple Python script we can capture images and then ingest them into our central Hadoop data lake. This is a nice simple use case for Connected Data Platforms with both Data in Motion and Data at Rest. This data can be processed in-line with deep learning libraries like TensorFlow for image recognition and assessment. Using OpenCV and other tools we can process in motion and look for issues like security breaches, leaks and other events.
The most difficult part is the Python code, which reads from the camera, adds a watermark, converts the image to bytes, sends it to MQTT, and then FTPs it to an FTP server. I do both since networking is always tricky. If it fails to connect to either, you could also store to a directory on a mapped USB drive and send it out once the network returns; it would be easy to do that with MiniFi reading that directory. Once the file lands in the MQTT broker or FTP server, NiFi pulls it and brings it into the flow. I first store to HDFS for our Data @ Rest permanent storage for future deep learning processing. I also run three processors to extract image metadata and then call jp2a to convert the image into an ASCII picture.
(Screenshots in the original post: ExecuteStreamCommand for running jp2a, the output ASCII, the HDFS directory of uploaded files, metadata extracted from the image, an example imported image, and a converted JPG-to-ASCII.)
Running JP2A on Images Stored in HDFS via the WebHDFS REST API
/opt/demo/jp2a-master/src/jp2a "http://hdfsnode:50070/webhdfs/v1/images/$@?op=OPEN"
Python on RPI
#!/usr/bin/python
import os
import datetime
import ftplib
import traceback
import math
import random, string
import base64
import json
import paho.mqtt.client as mqtt
import picamera
from time import sleep
from time import gmtime, strftime
packet_size=3000
def randomword(length):
return ''.join(random.choice(string.lowercase) for i in range(length))
# Create unique image name
img_name = 'pi_image_{0}_{1}.jpg'.format(randomword(3),strftime("%Y%m%d%H%M%S",gmtime()))
# Capture Image from Pi Camera
try:
camera = picamera.PiCamera()
camera.annotate_text = " Stored with Apache NiFi "
camera.capture(img_name, resize=(500,281))
pass
finally:
camera.close()
# MQTT
client = mqtt.Client()
client.username_pw_set("CloudMqttUserName","!MakeSureYouHaveAV@5&L0N6Pa55W0$4!")
client.connect("cloudmqttiothoster", 14162, 60)
# Read the image as binary and base64-encode it so the MQTT payload is valid JSON
# (concatenating a raw bytearray into the string would fail and break the JSON)
f = open(img_name, 'rb')
fileContent = f.read()
f.close()
message = '{"image": {"bytearray":"' + base64.b64encode(fileContent) + '"}}'
print client.publish("image",payload=message,qos=1,retain=False)
client.disconnect()
# FTP
ftp = ftplib.FTP()
ftp.connect("ftpserver", 21)
try:
ftp.login("reallyLongUserName", "FTP PASSWORDS SHOULD BE HARD")
ftp.storbinary('STOR ' + img_name, open(img_name, 'rb'))
finally:
ftp.quit()
# clean up sent file
os.remove(img_name)
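On the receiving side (outside NiFi), a minimal hedged subscriber for the same topic could look like the sketch below; the broker host, port and credentials mirror the placeholders in the script above.

# Minimal subscriber sketch for the 'image' topic published by the script above.
# Broker name, port and credentials are placeholders, as in the publisher.
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Payloads from the publisher are JSON with a base64-encoded image inside
    print("%s: %d bytes received" % (msg.topic, len(msg.payload)))

client = mqtt.Client()
client.username_pw_set("CloudMqttUserName", "password-placeholder")
client.on_message = on_message
client.connect("cloudmqttiothoster", 14162, 60)
client.subscribe("image", qos=1)
client.loop_forever()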
References:
https://community.hortonworks.com/repos/77987/rpi-picamera-mqtt-nifi.html?shortDescriptionMaxLength=140
https://github.com/bikash/RTNiFiStreamProcessors
http://stackoverflow.com/questions/37499739/how-can-i-send-a-image-by-using-mosquitto
https://www.raspberrypi.org/learning/getting-started-with-picamera/worksheet/
https://www.cloudmqtt.com/
https://developer.ibm.com/recipes/tutorials/sending-and-receiving-pictures-from-a-raspberry-pi-via-mqtt/
https://developer.ibm.com/recipes/tutorials/displaying-image-from-raspberry-pi-in-nodered-ui-hosted-on-bluemix/
https://github.com/jpmens/twitter2mqtt
http://www.ev3dev.org/docs/tutorials/sending-and-receiving-messages-with-mqtt/
https://github.com/njh/mqtt-http-bridge
https://www.raspberrypi.org/learning/parent-detector/worksheet/
http://picamera.readthedocs.io/en/release-1.10/recipes1.html
http://picamera.readthedocs.io/en/release-1.10/faq.html
http://www.eclipse.org/paho/
http://picamera.readthedocs.io/en/release-1.10/recipes1.html#capturing-to-an-opencv-object
https://github.com/cslarsen/jp2a
https://www.raspberrypi.org/learning/getting-started-with-picamera/
https://www.raspberrypi.org/learning/tweeting-babbage/worksheet/
https://csl.name/jp2a/
01-12-2017
10:04 AM
4 Kudos
Some people say I must have a bot that reads and replies to email at all crazy hours of the day, an awesome email assistant. Well, I decided to prototype one.
This is the first piece. After this I will add some Spark machine learning to intelligently reply to emails from a list of pretrained responses. With supervised learning it will learn which emails to send to whom, based on subject, sender, body content, attachments, time of day, sender domain and many other variables.
For now, it just reads some emails and checks for a hard coded subject.
I could use this to trigger other processes, such as running a batch Spark job.
Since most people send and use HTML email (that is what Outlook, Outlook.com and Gmail do), I will send and receive HTML emails to make it look more legit.
I could also run my fortune script and return that as my email content, making me sound wise, or pull in a random selection of tweets about Hadoop or even recent news, making the email very current and fresh.
Snippet Example of a Mixed Content Email Message (Attachments Removed to Save Space)
Return-Path: <x@example.com>
Delivered-To: nifi@example.com
Received: from x.x.net
by x.x.net (Dovecot) with LMTP id +5RhOfCcB1jpZQAAf6S19A
for <nifi@example.com>; Wed, 19 Oct 2016 12:19:13 -0400
Return-path: <x@example.com>
Envelope-to: nifi@example.com
Delivery-date: Wed, 19 Oct 2016 12:19:13 -0400
Received: from [x.x.x.x] (helo=smtp.example.com)
by x.example.com with esmtp (Exim)
id 1bwtaC-0006dd-VQ
for nifi@example.com; Wed, 19 Oct 2016 12:19:12 -0400
Received: from x.x.net ([x.x.x.x])
by x with bizsmtp
id xUKB1t0063zlEh401UKCnK; Wed, 19 Oct 2016 12:19:12 -0400
X-EN-OrigIP: 64.78.52.185
X-EN-IMPSID: xUKB1t0063zlEh401UKCnK
Received: from x.x.net (localhost [127.0.0.1])
(using TLSv1 with cipher AES256-SHA (256/256 bits))
(No client certificate requested)
by emg-ca-1-1.localdomain (Postfix) with ESMTPS id BEE9453F81
for <nifi@example.com>; Wed, 19 Oct 2016 09:19:10 -0700 (PDT)
Subject: test
MIME-Version: 1.0
x-echoworx-msg-id: e50ca00a-edc5-4030-a127-f5474adf4802
x-echoworx-emg-received: Wed, 19 Oct 2016 09:19:10.713 -0700
x-echoworx-message-code-hashed: 5841d9083d16bded28a3c4d33bc505206b431f7f383f0eb3dbf1bd1917f763e8
x-echoworx-action: delivered
Received: from 10.254.155.15 ([10.254.155.15])
by emg-ca-1-1 (JAMES SMTP Server 2.3.2) with SMTP ID 503
for <nifi@example.com>;
Wed, 19 Oct 2016 09:19:10 -0700 (PDT)
Received: from x.x.net (unknown [x.x.x.x])
(using TLSv1 with cipher AES256-SHA (256/256 bits))
(No client certificate requested)
by emg-ca-1-1.localdomain (Postfix) with ESMTPS id 6693053F86
for <nifi@example.com>; Wed, 19 Oct 2016 09:19:10 -0700 (PDT)
Received: from x.x.net (x.x.x.x) by
x.x.net (x.x.x.x) with Microsoft SMTP
Server (TLS) id 15.0.1178.4; Wed, 19 Oct 2016 09:19:09 -0700
Received: from x.x.x.net ([x.x.x.x]) by
x.x.x.net ([x.x.x.x]) with mapi id
15.00.1178.000; Wed, 19 Oct 2016 09:19:09 -0700
From: x x<x@example.com>
To: "nifi@example.com" <nifi@example.com>
Thread-Topic: test
Thread-Index: AQHSKiSFTVqN9ugyLEirSGxkMiBNFg==
Date: Wed, 19 Oct 2016 16:19:09 +0000
Message-ID: <D49AD137-3765-4F9A-BF98-C4E36D11FFD8@hortonworks.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: yes
X-MS-TNEF-Correlator:
x-ms-exchange-messagesentrepresentingtype: 1
x-ms-exchange-transport-fromentityheader: Hosted
x-originating-ip: [71.168.178.39]
x-source-routing-agent: Processed
Content-Type: multipart/related;
boundary="_004_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_";
type="multipart/alternative"
--_004_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_
Content-Type: multipart/alternative;
boundary="_000_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_"
--_000_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Python Script to Parse Email Messages
#!/usr/bin/env python
"""Unpack a MIME message into a directory of files."""
import json
import os
import sys
import email
import errno
import mimetypes
from optparse import OptionParser
from email.parser import Parser

def main():
    parser = OptionParser(usage="""Unpack a MIME message into a directory of files.
Usage: %prog [options] msgfile
""")
    parser.add_option('-d', '--directory',
                      type='string', action='store',
                      help="""Unpack the MIME message into the named
directory, which will be created if it doesn't already
exist.""")
    opts, args = parser.parse_args()
    if not opts.directory:
        parser.print_help()
        sys.exit(1)
    try:
        os.mkdir(opts.directory)
    except OSError as e:
        # Ignore directory exists error
        if e.errno != errno.EEXIST:
            raise
    msgstring = ''.join(str(x) for x in sys.stdin.readlines())
    msg = email.message_from_string(msgstring)
    headers = Parser().parsestr(msgstring)
    response = {'To': headers['to'], 'From': headers['from'], 'Subject': headers['subject'], 'Received': headers['Received']}
    print json.dumps(response)
    counter = 1
    for part in msg.walk():
        # multipart/* are just containers
        if part.get_content_maintype() == 'multipart':
            continue
        # Applications should really sanitize the given filename so that an
        # email message can't be used to overwrite important files
        filename = part.get_filename()
        if not filename:
            ext = mimetypes.guess_extension(part.get_content_type())
            if not ext:
                # Use a generic bag-of-bits extension
                ext = '.bin'
            filename = 'part-%03d%s' % (counter, ext)
        counter += 1
        fp = open(os.path.join(opts.directory, filename), 'wb')
        fp.write(part.get_payload(decode=True))
        fp.close()

if __name__ == '__main__':
    main()
mailnifi.sh
python mailnifi.py -d /opt/demo/email/"$@"
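A hedged usage example (message.eml is a placeholder file name): pipe a raw message into the script and it prints the header JSON to stdout while unpacking the MIME parts into the target directory.
cat message.eml | python mailnifi.py -d /opt/demo/email/msg1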
Python needs the email component for parsing the message; you can install it via pip.
pip install email
I am using Python 2.7; you could use a newer Python 3.x.
Here is the flow:
For the final part of the flow, I read the files created by the parsing, load them to HDFS, and delete them from the file system using the standard GetFile.
Reference:
https://docs.python.org/2/library/email-examples.html
https://jsonpath.curiousconcept.com/
Files:
email-assistant-12-jan-2017.xml