08-27-2016
06:45 PM
4 Kudos
I am running a free MongoDB instance (DBaaS) at mlab.com. From NiFi, I can read from and write to MongoDB very easily. It is a great way to pull data out of a large collection of MongoDB databases; in some startups and enterprises a lot of little MEAN apps have been written, each with a small silo of data locked in MongoDB. These can be streamed into a data lake very easily with NiFi. Once stored in HDFS, the data can be accessed via Spark SQL, Hive, Zeppelin, and other tools. An example of a Twitter tweet as a JSON document, in a MongoDB collection, in a MongoDB database, stored in an online NoSQL store. The NiFi flow for storing to MongoDB is trivial: a simple flow reads MongoDB JSON records and lands them as JSON in HDFS. As an example of another source to store to HDFS or MongoDB, we use a GetHTTP processor to access an SSL-protected resource. There are a few options for storing to MongoDB. You need to format the MongoDB URI correctly: username:password@yoururl. Then set your database and collection name. Insert mode is most common, but you can do an upsert. There are also a few options for writing to MongoDB; choose a Write Concern such as acknowledged if you want confirmation that MongoDB applied the write (stronger levels wait for more nodes in the cluster). https://docs.mongodb.com/v3.0/reference/write-concern/
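Outside NiFi, the same URI format and write-concern idea can be exercised in a few lines of Python. This is only an illustrative sketch with the pymongo driver; the host, database, and collection names are made up:

from pymongo import MongoClient, WriteConcern

# username:password@host, then the database name, as in the NiFi Mongo URI property
client = MongoClient("mongodb://user:pass@ds12345.mlab.com:35555/tweetdb")
db = client["tweetdb"]

# an acknowledged write concern (w=1) confirms the write was applied
coll = db.get_collection("tweets", write_concern=WriteConcern(w=1))
coll.insert_one({"msg": "example tweet", "handle": "@example"})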
08-24-2016
09:40 PM
It's connected to the JDK/JRE used to run NiFi, e.g. /opt/jdk1.8.0_91/jre/lib/security/cacerts (see https://docs.oracle.com/cd/E19957-01/817-3331/6miuccqo3/index.html). The default password is changeit, and it's a JKS keystore. SSL requires this in any Java application; it's just how Java works. The browser does this for you automagically.
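For reference, importing a server certificate into that truststore looks something like the following one-liner (the alias and certificate file name are just examples):
/opt/jdk1.8.0_91/bin/keytool -import -alias mysite -file mysite.crt -keystore /opt/jdk1.8.0_91/jre/lib/security/cacerts -storepass changeit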
08-23-2016
04:27 PM
/opt/jdk1.8.0_91/bin/keytool -list -v -keystore /opt/jdk1.8.0_91/jre/lib/security/cacerts
Use the same JDK as NiFi for keytool; it will ask for the changeit password. https://www.sslshopper.com/article-most-common-java-keytool-keystore-commands.html
08-23-2016
04:24 PM
Make sure you are using the JVM that is running HDF/NiFi, and that the NiFi user has read and execute Linux permissions on /opt/jdk1.8.0_91/jre/lib/security/cacerts. Also, I am not sure you want just the Oracle JVM; perhaps you want the full JDK: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html Are you on Ubuntu? Run nifi.sh status or sudo service nifi status and see which JDK it is using; that is the one that needs the cert.
08-23-2016
03:56 PM
You can also easily save to ORC, Parquet, or another format. I recommend ORC so you can run fast queries from Hive, as in the sketch below.
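As a rough illustration (not from the original post), saving a DataFrame as ORC from PySpark is a one-liner; the DataFrame name and output path are assumptions:

# assumes `df` is an existing DataFrame in a Spark 1.6-era session
df.write.format("orc").mode("overwrite").save("/social/twitter_orc")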
08-23-2016
03:54 PM
Put in a full path. For me, running on Hadoop, it is saved to HDFS; see below. If you are running standalone (not compiled with Hadoop), it will store to the local file system, probably /<YOURCURRENTUSER>/something or /tmp. Check the Spark history UI.
08-23-2016
03:54 PM
Once you do a write, e.g. df1.write.format("orc").mode(org.apache.spark.sql.SaveMode.Overwrite).save("/mystuff/awesome"), it will save to an HDFS or local directory, depending on whether you have Hadoop enabled. It will create it under the running user, like /root/<your name>, or under a spark user directory if one was created. It doesn't hurt to put in a full path for the write, like /mystuff/awesome. To read an Avro file, use the avro-tools; see here: https://github.com/airisdata/avroparquet
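If you would rather inspect an Avro file from Python instead of avro-tools, a minimal sketch with the fastavro package works too (the file name is an example):

from fastavro import reader

# iterate the records in an Avro container file and print each one
with open("twitter.avro", "rb") as fo:
    for record in reader(fo):
        print(record)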
08-22-2016
09:19 PM
The password is changeit, unless you changed the JVM default. http://certificate.fyicenter.com/120_Java_VM_Password_for_cacerts__Java_System_Keystore.html
08-22-2016
08:48 PM
Works for me with a valid SSLContext. @Jayachander (https://community.hortonworks.com/users/12595/jayachanderit.html), do you have an SSLContext? Your URL is SSL; you need a working SSL connection and no firewall blocking it anywhere. See here: https://community.hortonworks.com/content/kbentry/47854/accessing-facebook-page-data-from-apache-nifi.html It works fine:
-rw-r--r-- 1 root root 17334 2016-08-22 21:25 21318561924230
[root@sandbox jdk1.8.0_91]# cat /opt/demo/data/21318561924230
{"channel":{"id":64519,"name":"Hessle IOT","description":"I am building many sensors using nodemcu and lua trying to connect everything together..\r\nHave a look at my other thingspeak pages:\r\nhttps://thingspeak.com/channels/145827\r\n\r\n\r\n\r\n","latitude":"53.723805","longitude":"-0.4319","field1":"Inside Temperature","field2":"Inside Humidity","field3":"Outside Light Level","field4":"Outside temp","field5":"Outside Humdity","field6":"Energy usage","field7":"Rain sensor","field8":"Inside gas level","created_at":"2015-11-07T14:22:02Z","updated_at":"2016-08-22T21:23:02Z","elevation":"25","last_entry_id":19221},"feeds":[{"created_at":"2016-08-22T19:43:02Z","entry_id":19122,"field1":"22","field2":"47","field3":"10.83","field4":"18.4","field5":"81.1","field6":"300","field7":"1","field8":"98"},{"created_at":"2016-08-22T19:44:02Z","entry_id":19123,"field1":"25","field2":"45","field3":"9.166","field4":"18.4","field5":"81.1","field6":"300","field7":"1","field8":"97"},{"created_at":"2016-08-22T19:45:03Z","entry_id":19124,"field1":"22","field2":"47","field3":"7.5","field4":"18.3","field5":"81.1","field6":"300","field7":"1","field8":"96"},{"created_at":"2016-08-22T19:46:02Z","entry_id":19125,"field1":"24","field2":"46","field3":"6.666","field4":"18.3","field5":"81.2","field6":"300","field7":"1","field8":"95"},{"created_at":"2016-08-22T19:47:03Z","entry_id":19126,"field1":"22","field2":"47","field3":"5.833","field4":"18.3","field5":"81.1","field6":"300","field7":"1","field8":"96"},{"created_at":"2016-08-22T19:48:02Z","entry_id":19127,"field1":"22","field2":"47","field3":"5","field4":"18.3","field5":"81.2","field6":"300","field7":"1","field8":"96"},{"created_at":"2016-08-22T19:49:02Z","entry_id":19128,"field1":"23","field2":"56","field3":"4.166","field4":"18.2","field5":"81.1","field6":"360","field7":"1","field8":"96"},{"created_at":"2016-08-22T19:50:02Z","entry_id":19129,"field1":"22","field2":"47","field3":"3.333","field4":"18.3","field5":"81.4","field6":"300","field7":"1","field8":"97"},{"created_at":"2016-08-22T19:51:02Z","entry_id":19130,"field1":"22","field2":"47","field3":"2.5","field4":"18.2","field5":"81.2","field6":"300","field7":"1","field8":"95"},{"created_at":"2016-08-22T19:52:02Z","entry_id":19131,"field1":"22","field2":"47","field3":"2.5","field4":"18.2","field5":"81.3","field6":"300","field7":"1","field8":"96"},{"created_at":"2016-08-22T19:53:02Z","entry_id":19132,"field1":"22","field2":"47","field3":"1.666","field4":"18.2","field5":"81.4","field6":"300","field7":"1","field8":"94"},{"created_at":"2016-08-22T19:54:03Z","entry_id":19133,"field1":"22","field2":"47","field3":"1.666","field4":"18.1","field5":"81.4","field6":"360","field7":"1","field8":"95"},{"created_at":"2016-08-22T19:55:02Z","entry_id":19134,"field1":"22","field2":"47","field3":"0.833","field4":"18.2","field5":"81.4","field6":"300","field7":"1","field8":"95"},{"created_at":"2016-08-22T19:56:03Z","entry_id":19135,"field1":"22","field2":"47","field3":"0.833","field4":"18.1","field5":"81.5","field6":"300","field7":"1","field8":"95"},{"created_at":"2016-08-22T19:57:02Z","entry_id":19136,"field1":"22","field2":"47","field3":"0.833","field4":"18.1","field5":"81.5","field6":"300","field7":"1","field8":"94"},{"created_at":"2016-08-22T19:58:03Z","entry_id":19137,"field1":"22","field2":"46","field3":"0.833","field4":"18.1","field5":"81.6","field6":"300","field7":"1","field8":"94"},{"created_at":"2016-08-22T19:59:02Z","entry_id":19138,"field1":"22","field2":"46","field3":"0","field4":"18","field5":"81.6","field6":"300","field7":"
1","field8":"93"},{"created_at":"2016-08-22T20:00:03Z","entry_id":19139,"field1":"22","field2":"46","field3":"0","field4":"18","field5":"81.7","field6":"360","field7":"1","field8":"92"},{"created_at":"2016-08-22T20:01:02Z","entry_id":19140,"field1":"22","field2":"46","field3":"0","field4":"18","field5":"81.8","field6":"300","field7":"1","field8":"92"},{"created_at":"2016-08-22T20:02:02Z","entry_id":1914
08-19-2016
12:19 AM
3 Kudos
Use Case: Process a Media Feed, Store Everything, Run Sentiment Analysis on the Stream, and Act on a Condition
I have my GetTwitter processor looking at my Twitter handle and the keyword Hadoop, something I tend to tweet frequently. I use an EvaluateJsonPath processor to pull out all the attributes I like (msg, user name, geo information, etc.), then an AttributesToJSON processor to build a new, smaller JSON file from just those attributes. I store the raw JSON data in HDFS as well, in a separate directory.
Sending HTML email is a bit tricky: make sure you don't include extra text, so put nothing in the message but raw HTML as seen below, and don't Attach Files or Include All Attributes in Message. Make sure you set the Content Type to text/html.
For sentiment analysis I wanted to run something easy, so I use an ExecuteStreamCommand processor to run a Python 2.7 script that uses the NLTK VADER SentimentIntensityAnalyzer. The NiFi part is easy: just a command that calls a shell script. The hard part is setting up Python and NLTK on the HDP 2.4 Sandbox; NLTK with the text corpora needed for proper analysis is almost 10 gigabytes of data. See: NLTK for Sentiment Analysis and How to Analyze Sentiment.
If you don't have Python 2.7 or Python 3.4 installed on your box (my VM had Python 2.6), you need to install Python 2.7 while keeping the existing Python 2.6 for existing applications. This is a bit tricky, so I have detailed the steps for CentOS 6.x below so you will be able to install and run this great ML tool.
sudo yum install -y centos-release-SCL
sudo yum install -y python27
sudo yum groupinstall "Development tools" -y
sudo yum install zlib-devel -y
sudo yum install bzip2-devel -y
sudo yum install openssl-devel -y
sudo yum install ncurses-devel -y
sudo yum install sqlite-devel -y
cd /opt
sudo wget --no-check-certificate https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
sudo tar xf Python-2.7.6.tar.xz
cd Python-2.7.6
sudo ./configure --prefix=/usr/local
sudo make && sudo make altinstall
Now we can use /usr/local/bin/python2.7. Next, download and install pip:
wget https://bootstrap.pypa.io/get-pip.py
sudo /usr/local/bin/python2.7 get-pip.py
sudo /usr/local/bin/pip2.7 install -U nltk
sudo /usr/local/bin/pip2.7 install -U numpy
sudo /usr/local/bin/python2.7 -m nltk.downloader -d /usr/local/share/nltk_data all
(almost 10 gig of data)
sudo /usr/local/bin/pip2.7 install vaderSentiment
sudo /usr/local/bin/pip2.7 install twython
The run.sh called from ExecuteStreamCommand. I use a Bash shell script since I want to make sure Python 2.7 is used; there are other ways, but this works for me:
/usr/local/bin/python2.7 /opt/demo/sentiment/sentiment.py "$@"
That script calls sentiment.py with the parameters passed from NiFi:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys

# score the text passed in as the first command-line argument
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])

# a single line of output for NiFi to capture from stdout
print('Compound {0} Negative {1} Neutral {2} Positive {3} '.format(ss['compound'], ss['neg'], ss['neu'], ss['pos']))
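For a quick manual test outside NiFi (the sample text is only an illustration):
/opt/demo/sentiment/run.sh "Hadoop is great"
This prints a single line of the form Compound ... Negative ... Neutral ... Positive ..., which ExecuteStreamCommand captures from stdout.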
Once the data is in Hadoop, I was also running a Scala Spark 1.6 batch job with Spark SQL to run Stanford CoreNLP sentiment analysis on it as well. I also tried running the same thing as a Scala Spark 1.6 Spark Streaming job that received the data from Kafka (it could also receive from NiFi site-to-site). Another option is to write a processor in Java or Scala that runs the analysis as part of the flow. With Apache NiFi you have a lot of options depending on your needs, and all of them get the features and benefits that only Apache NiFi provides.
Now we have a bunch of data, hooray: both raw and slimmed down. A select portion was converted to HTML and emailed out. Note: I have used Gmail and Outlook.com/Hotmail to send, but they tend to shut you down after a while over spam concerns. I use my own mail server (Dataflowdeveloper.com) since I have full control; you can use your corporate server as long as you have SMTP login and permissions. You may need to check with your administrators about firewalls, ports, and other security precautions.
What to do with an HDFS directory full of same-schema JSON files from Twitter? I also used a Spark batch job to produce an ORC Hive table with an extra column for Stanford sentiment. You can quickly run queries on that via beeline, DBVisualizer, or the Ambari Hive View:
beeline
!connect jdbc:hive2://localhost:10000/default;
!set showHeader true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.execution.engine=tez;
set hive.compute.query.using.stats=true;
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
show tables;
describe sparktwitterorc;
analyze table sparktwitterorc compute statistics;
analyze table sparktwitterorc compute statistics for columns;
I do a one-time compute statistics to enhance performance; that table is now ready for high-speed queries. You can also run a fast Hive query from the command line:
beeline -u jdbc:hive2://localhost:10000/default -e "SELECT * FROM rawtwitter where sentiment is not null and time like 'Thu Aug 18%' and lower(msg) like '%hadoop%' LIMIT 100;"
Spark SQL says my data looks like:
|-- coordinates: string (nullable = true)
|-- followers_count: string (nullable = true)
|-- friends_count: string (nullable = true)
|-- geo: string (nullable = true)
|-- handle: string (nullable = true)
|-- hashtags: string (nullable = true)
|-- language: string (nullable = true)
|-- location: string (nullable = true)
|-- msg: string (nullable = true)
|-- place: string (nullable = true)
|-- profile_image_url: string (nullable = true)
|-- retweet_count: string (nullable = true)
|-- sentiment: string (nullable = true)
|-- source: string (nullable = true)
|-- tag: string (nullable = true)
|-- time: string (nullable = true)
|-- time_zone: string (nullable = true)
|-- tweet_id: string (nullable = true)
|-- unixtime: string (nullable = true)
|-- user_name: string (nullable = true)
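For reference, a schema listing like the one above can be produced with a couple of lines of PySpark; the path matches the Hive table location below, and sqlContext is the Spark 1.6 shell's built-in context:

# read the directory of JSON tweets and dump the inferred schema
df = sqlContext.read.json("/social/twitter")
df.printSchema()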
My Hive table on this directory of tweets looks like:
create table rawtwitter(
handle string,
hashtags string,
msg string,
language string,
time string,
tweet_id string,
unixtime string,
user_name string,
geo string,
coordinates string,
location string,
time_zone string,
retweet_count string,
followers_count string,
friends_count string,
place string,
source string,
profile_image_url string,
tag string,
sentiment string,
stanfordSentiment string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/social/twitter';
Now you can create charts and graphs from your social data.