05-07-2018
08:22 PM
That is in com.dataflowdeveloper. It is a one-method class I wrote to hold the string.
05-07-2018
08:19 PM
2 Kudos
Flow

We can remove the Sentiment step if you don't want to install my custom processor. This is a simple version of the flow that just ingests tweets, runs sentiment analysis, and stores the results in a directory as clean JSON. You can drop the sentiment analysis and do it later, or run a Python script for it instead. We could make this even simpler with just GetTwitter followed by PutFile, which would store the raw Twitter JSON: a very sparse, deeply nested document. If you want the raw data, that is an option, but it's a pain to work with that format and it's not ideal for analytics. Instead, I flatten it and grab what I have seen as the core attributes; you can add more or drop some of them easily. This simple version could be used for art projects, personal projects, or by anyone who wants to store their own tweets and related items.

Get Your Twitter ID: https://tweeterid.com/
Documentation: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
Create Your Application: https://apps.twitter.com/ and https://apps.twitter.com/app/new

Application Settings You Need: Consumer Key (API Key) and Consumer Secret (API Secret). Access Tokens You Need: Access Token and Access Token Secret. Keep the secrets secure, as you don't want people tweeting in your name or reading your data. You will place these in the GetTwitter processor; click start once you add them. You can filter on languages such as en for English and es for Spanish.

We just save these JSON files to a directory for later use. We could also aggregate and compress them, send them to Amazon S3, email them, or whatever else you like. We could even retweet them, but now we are getting fancy, and we already wrote that article this morning.

Custom Processor: https://github.com/tspannhw/nifi-corenlp-processor

Example Tweet in JSON Stored:

{
"msg" : "RT @PaasDev Tim said @ApacheNiFi is awesome",
"unixtime" : "1525724645676",
"friends_count" : "5268",
"sentiment" : "POSITIVE",
"hashtags" : "[\"ApacheNiFi\"]",
"listed_count" : "25",
"tweet_id" : "993587294715203584",
"user_name" : "Tim Spann",
"favourites_count" : "5348",
"source" : "NiFiTweetBot",
"placename" : "",
"media_url" : "[]",
"retweet_count" : "0",
"user_mentions_name" : "[]",
"geo" : "",
"urls" : "[]",
"countryCode" : "",
"user_url" : "",
"place" : "",
"timestamp" : "1525724645676",
"coordinates" : "",
"handle" : "PaasDev",
"profile_image_url" : "http://pbs.twimg.com/profile_images/34343/34343.jpg",
"time_zone" : "Eastern Time (US & Canada)",
"ext_media" : "[]",
"statuses_count" : "5994",
"followers_count" : "1963",
"location" : "Princeton, NJ",
"time" : "Mon May 07 20:24:05 +0000 2018",
"user_mentions" : "[]",
"user_description" : "Tim NiFi Guy"
}

Download and Import This Template to Apache NiFi: simplenifitwitter.xml

Setup

Get Apache NiFi (https://www.apache.org/dyn/closer.lua?path=/nifi/1.6.0/nifi-1.6.0-bin.zip) and unzip it. On some Linux distributions you may need to apt-get install unzip or yum install unzip; you may need to be root, in which case do something like sudo su. You will need Java installed. For a low-cost small Linux server, you can use one of these two services, which also document how to install Java; there are many low-cost options. This application is small enough to also run on your laptop, an old desktop PC, or a small cloud instance.

https://www.digitalocean.com/community/tutorials/how-to-install-java-on-centos-and-fedora
https://www.linode.com/docs/development/java/install-java-on-centos/
https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-get-on-ubuntu-16-04

Generally something like this:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

or:

sudo yum install java-1.8.0-openjdk-devel

OpenJDK 8 or Oracle JDK 8 are perfect. You can also run some Docker containers if you like that sort of thing: https://github.com/minyk/nifi-sandbox You can also download one of the Hortonworks HDF 3.1 Sandboxes to run this: https://hortonworks.com/downloads/#sandbox Those have Apache NiFi and Java preinstalled! Here are some Docker instructions: https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/section/3/ and https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_installing-nifi/content/ch_nifi-installation.xml.html

Resources:

https://github.com/tspannhw?utf8=%E2%9C%93&tab=repositories&q=&type=source&language=
https://community.hortonworks.com/articles/81270/adding-stanford-corenlp-to-big-data-pipelines-apac-1.html
https://community.hortonworks.com/articles/80418/open-nlp-example-apache-nifi-processor.html
https://community.hortonworks.com/articles/177370/extracting-html-from-pdf-excel-and-word-documents.html
https://community.hortonworks.com/articles/76935/using-sentiment-analysis-and-nlp-tools-with-hdp-25.html
https://community.hortonworks.com/articles/163776/parsing-any-document-with-apache-nifi-15-with-apac.html
https://community.hortonworks.com/content/kbentry/189735/automating-social-media-sending-tweets-with-apache.html
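The flattening step described in the flow (raw, nested Twitter JSON reduced to the core attributes shown in the example tweet) can be sketched in Python. This is a hedged illustration, not the NiFi flow itself: `flatten_tweet` is a hypothetical helper, and the field names follow the stored JSON example above.

```python
import json

def flatten_tweet(raw):
    """Pull a handful of core attributes out of the raw, nested Twitter JSON."""
    user = raw.get("user", {})
    entities = raw.get("entities", {})
    return {
        "msg": raw.get("text", ""),
        "handle": user.get("screen_name", ""),
        "user_name": user.get("name", ""),
        "followers_count": str(user.get("followers_count", 0)),
        "friends_count": str(user.get("friends_count", 0)),
        "location": user.get("location", ""),
        # hashtags are stored as a JSON-encoded string of names, as in the example
        "hashtags": json.dumps([h.get("text", "") for h in entities.get("hashtags", [])]),
        "tweet_id": raw.get("id_str", ""),
        "time": raw.get("created_at", ""),
    }

# Sample input shaped like the raw Twitter payload (values from the example tweet).
raw = {
    "text": "RT @PaasDev Tim said @ApacheNiFi is awesome",
    "id_str": "993587294715203584",
    "created_at": "Mon May 07 20:24:05 +0000 2018",
    "user": {"screen_name": "PaasDev", "name": "Tim Spann",
             "followers_count": 1963, "friends_count": 5268,
             "location": "Princeton, NJ"},
    "entities": {"hashtags": [{"text": "ApacheNiFi"}]},
}
print(json.dumps(flatten_tweet(raw), indent=2))
```

In the actual flow this reshaping is done with NiFi processors rather than a script; the sketch just shows which parts of the raw document the flat record comes from.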
05-07-2018
01:39 PM
3 Kudos
Automating Social Media: Sending Tweets with Apache NiFi: Building a Smart Bot

This is the new processor, PutTwitterProcessor; it uses Twitter4J to send tweets from a parameter you set. You need to create a Twitter application and obtain the Consumer Key, Consumer Secret, Access Token, and Access Token Secret. If you have a latitude and longitude in your data you can send that; it's optional. The message property takes expression language and lets you build a message. The processor puts the tweet id and a datetime stamp in the result attribute. Most of the code uses the very easy-to-use Twitter4J library.

So what can I do with this? In my flow I am ingesting tweets, and with a streaming SQL query I can retweet ones where there's no existing retweet and the author has more than 1,000 followers. I can also feed all this information to a Hive table and then run some machine learning on it to figure out other parameters to filter on for my bot. This is a start. I put a scheduler on my PutTweet to only tweet every 10 minutes (600 seconds). Let's see how this goes.

SQL for QueryRecord:

SELECT * FROM FLOWFILE WHERE CAST(retweet_count AS DOUBLE) <= 0 AND CAST(followers_count AS DOUBLE) > 1000

Example Message Building:

${'user_name':append( ' '):append( ${'location'} ):append(' '):append( ${'hashtags'} ):append(' ' ):append(${'msg'})}

Source: https://github.com/tspannhw/nifi-puttwitter-processor
Download the NAR, install it in the Apache NiFi lib directory, and restart: https://github.com/tspannhw/nifi-puttwitter-processor/releases/tag/1.0
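The bot's filter and message construction can be sketched in Python as a hypothetical stand-in for the QueryRecord SQL and the expression language (field names follow the tweet JSON used in this series; this is an illustration, not the processor's code):

```python
def should_retweet(tweet):
    """Mirror of the QueryRecord filter: no existing retweets, author has >1000 followers."""
    return (float(tweet["retweet_count"]) <= 0
            and float(tweet["followers_count"]) > 1000)

def build_message(tweet):
    """Mirror of the expression-language append chain: name, location, hashtags, text."""
    return " ".join([tweet["user_name"], tweet["location"],
                     tweet["hashtags"], tweet["msg"]])

tweet = {"retweet_count": "0", "followers_count": "1963",
         "user_name": "Tim Spann", "location": "Princeton, NJ",
         "hashtags": '["ApacheNiFi"]',
         "msg": "RT @PaasDev Tim said @ApacheNiFi is awesome"}

print(should_retweet(tweet))   # prints True for this example record
print(build_message(tweet))
```

Note the casts to float: the record fields arrive as strings, which is exactly why the SQL version uses CAST(... AS DOUBLE).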
05-07-2018
01:30 PM
My custom processor is pretty easy to customize: https://github.com/tspannhw/nifi-extracttext-processor You can tweak it to extract just some things; Apache Tika is very powerful.
05-05-2018
01:51 PM
3 Kudos
Tracking Air Quality with HDP and HDF: Part 1 - Apache NiFi Ingest

Part 2: Plan Data Storage. Store to Apache Hive, Apache Druid and Apache HBase.
Part 3: Query and Visualize Data with Apache Zeppelin and Superset

There was an air quality alert a few days ago near me, and I was curious how I could keep track of this important environmental information. So, NiFi! This data is different from weather data, but for analytics it makes a lot of sense to add in data from weather, social media, and locally captured cameras. It's very easy to ingest this JSON and the camera images via Apache NiFi. In the next part we will analyze the datasets and determine how we can aggregate and accumulate massive quantities of this data to track air quality in various areas over time, and use it as a dimension alongside other relevant data like weather.

We are tracking contaminants and particles in the air, including:

pm25, pm10 - atmospheric particulate matter
so2 - sulfur dioxide
no2 - nitrogen dioxide
o3 - ozone
co - carbon monoxide

Photos Courtesy of HazeCam - Brigantine, NJ

Example Data

{"location":"ARB OER","city":"CA8 - ARB","country":"US","distance":3848728.319714322,"measurements":[{"parameter":"pm25","value":-4,"lastUpdated":"2016-08-08T16:00:00.000Z","unit":"µg/m³","sourceName":"AirNow","averagingPeriod":{"value":1,"unit":"hours"}}],"coordinates":{"latitude":38.568504,"longitude":-121.493256}}

{
"location" : "MONTG",
"parameter" : "o3",
"date" : {
"utc" : "2018-05-05T12:00:00.000Z",
"local" : "2018-05-05T06:00:00-06:00"
},
"value" : 0.004,
"unit" : "ppm",
"coordinates" : {
"latitude" : 32.4069,
"longitude" : -86.2564
},
"country" : "US",
"city" : "Montgomery"
}

Most of the data comes as arrays of JSON, so we can easily break it down into individual JSON records, derive an Avro schema from the data, and then process it however we want. We can join records together and then convert them into ORC files or HBase rows.

Data Feed Links

Haze Cam provides web camera images of potential haze:
http://hazecam.net/images/main/brigantine_right.jpg

OpenAQ (https://openaq.org/#/?_k=7mfsz6) provides open air quality data:
https://api.openaq.org/v1/latest?country=US
https://api.openaq.org/v1/measurements?country=US&date_from=2018-05-04

AirNow API (provides forecasts and current conditions):
http://www.airnowapi.org/aq/observation/zipCode/current/?format=application/json&zipCode=08520&distance=50&API_KEY=SIGNUPFORANAPIKEY
http://www.airnowapi.org/aq/forecast/zipCode/?format=application/json&zipCode=08520&date=2018-05-02&distance=25&API_KEY=SIGNUPFORANAPIKEY

EPA's Air Quality Notifications:
http://feeds.enviroflash.info/
https://www.airnow.gov/index.cfm?action=airnow.national
http://feeds.enviroflash.info/rss/realtime/445.xml

Other Sources:
http://feeds.enviroflash.info/cap/aggregate.xml
https://docs.openaq.org/
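The step of breaking arrays of JSON into individual records can be sketched in Python against a local sample (in the flow this is done with NiFi processors such as SplitJson; the sample's shape mirrors the OpenAQ latest-measurements example, and `explode` is a hypothetical helper):

```python
import json

# Sample shaped like the OpenAQ response above: one location carrying a
# "measurements" array that we explode into flat, per-measurement records.
sample = {
    "location": "ARB OER", "city": "CA8 - ARB", "country": "US",
    "measurements": [
        {"parameter": "pm25", "value": -4, "unit": "µg/m³"},
        {"parameter": "o3", "value": 0.004, "unit": "ppm"},
    ],
    "coordinates": {"latitude": 38.568504, "longitude": -121.493256},
}

def explode(record):
    """Yield one flat record per measurement, carrying the location fields along."""
    for m in record["measurements"]:
        yield {
            "location": record["location"],
            "city": record["city"],
            "country": record["country"],
            "parameter": m["parameter"],
            "value": m["value"],
            "unit": m["unit"],
            "latitude": record["coordinates"]["latitude"],
            "longitude": record["coordinates"]["longitude"],
        }

rows = list(explode(sample))
print(json.dumps(rows[0], ensure_ascii=False))
```

Once the records are flat like this, deriving an Avro schema and converting to ORC or HBase rows becomes straightforward.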
05-03-2018
04:00 PM
3 Kudos
Converting CSV Files to Apache Hive Tables with Apache ORC Files

I received some CSV files of data to load into Apache Hive. There are many ways to do this, but I wanted to see how easy it was to do in Apache NiFi with zero code. I read the CSV files from a directory, then convert the CSV to Avro directly with ConvertRecord. I will need a schema, so I use the settings below for InferAvroSchema. If every file is different, you will need to do this each time.

CSV Reader

I use the Jackson CSV parser, which works very well. The first line of the CSV is a header, and it can figure out the fields from the header. Once I have an Apache Avro file, it's easy to convert to Apache ORC and then store in HDFS.

Template: csvprocess.xml
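What InferAvroSchema derives from the header row can be sketched in Python. This is a deliberately simplified, hypothetical inference (a field is numeric only if every sampled value parses as a number, otherwise string), not the processor's actual algorithm:

```python
import csv
import io
import json

def infer_avro_schema(csv_text, name="record"):
    """Naive Avro-style schema inference: field names from the header row,
    field types from whether every value in the sample parses as a number."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        try:
            [float(v) for v in values]  # raises ValueError on any non-number
            ftype = "double"
        except ValueError:
            ftype = "string"
        fields.append({"name": col, "type": ftype})
    return {"type": "record", "name": name, "fields": fields}

sample = "id,city,temp\n1,Princeton,70.5\n2,Trenton,68.9\n"
print(json.dumps(infer_avro_schema(sample), indent=2))
```

As noted above, if every file has different columns you would need to re-infer the schema per file; with a stable layout you infer once and reuse the schema in ConvertRecord.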
04-27-2018
04:39 PM
2 Kudos
ETL With Lookups with Apache HBase and Apache NiFi (Microservices-Style ETL)

When we are ingesting tabular / record-oriented data, we often want to enrich the data by replacing ids with descriptions, or vice versa. There are many transformations that may need to happen before the data is in a happy state. When you are denormalizing your data in Hadoop and building very wide tables, you often want descriptions or other data to enhance its usability. Making only one call to get everything you need is nice, especially when you have 100 trillion records.

We are utilizing a lot of things built already (https://community.hortonworks.com/articles/146198/data-flow-enrichment-with-nifi-part-3-lookuprecord.html). Make sure you read Abdelkrim's first three lookup articles. I added some fields to his generated data for testing. I want to do my lookups against HBase, which is a great NoSQL store for lookup tables, and generate datasets. First I created an HBase table to use for lookups.

Create HBase Table For Lookups

create 'lookup_', 'family'

Table With Data

Most people would have a pre-populated table for lookups. I don't, and since we are using a generator to build the lookup ids, I am building the lookup descriptions with a REST call at the same time. We could also have a flow that adds a lookup when one isn't found, or another flow ingesting the lookup values and adding/updating them as needed.

REST API To Generate Product Descriptions

https://baconipsum.com/api/?type=meat&sentences=1&format=text

I found this cool API that returns a sentence of meat words. I use this as our description, because MEAT! Call the Bacon API, turn the plain text into a clean JSON document, then store it in HBase as the lookup table. You probably already have a lookup table; this is a demo, and I am filling it with my generator. This is not a best practice or a good design pattern; it is a lazy way to populate a table.
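The "turn plain text into a clean JSON document" step can be sketched in Python (a hypothetical helper; the `prod_desc` field name matches the records in this flow, and the HBase write itself is left to the flow):

```python
import json

def text_to_lookup_json(product_id, description):
    """Wrap a plain-text Bacon Ipsum sentence as the JSON document stored per lookup key."""
    return json.dumps({"id_product": product_id,
                       "prod_desc": description.strip()})

# Description text taken from the enriched record example in this article.
doc = text_to_lookup_json(
    430672,
    "Pork chop leberkas brisket chuck, filet mignon turducken hamburger.\n")
print(doc)
```

The strip() matters because the REST API returns raw text, which often carries trailing whitespace or newlines you don't want stored in the lookup value.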
Example Apache NiFi Flow (Using Apache NiFi 1.5)

Generate Some Test Data (https://community.hortonworks.com/articles/146198/data-flow-enrichment-with-nifi-part-3-lookuprecord.html)

Generate a JSON Document (note the empty prod_desc):

{
"ts" : "${now():format('yyyymmddHHMMSS')}",
"updated_dt" : "${now()}",
"id_store" : ${random():mod(5):toNumber():plus(1)},
"event_type" : "generated",
"uuid" : "${UUID()}",
"hostname" : "${hostname()}",
"ip" : "${ip()}",
"counter" : "${nextInt()}",
"id_transaction" : "${random():toString()}",
"id_product" : ${random():mod(500000):toNumber()},
"value_product" : ${now():toNumber()},
"prod_desc": ""
}
Lookup Your Record

This is the magic. We take in our records (in this case we are reading and writing JSON records, though we could choose CSV, Avro, or others), connect to the HBase Record Lookup Service, and replace the current prod_desc field in the record with whatever the lookup returns, using the id_product field as the lookup key. Nothing else is needed to change records in stream.

HBase Record Lookup Service

HBase Client Service (used by the HBase Record Lookup Service)

We can use UpdateRecord to clean up, transform, or modify any field in the records in stream.

Original File

{
"ts" : "201856271804499",
"updated_dt" : "Fri Apr 27 18:56:15 UTC 2018",
"id_store" : 1,
"event_type" : "generated",
"uuid" : "0d16967d-102d-4864-b55a-3f1cb224a0a6",
"hostname" : "princeton1",
"ip" : "172.26.217.170",
"counter" : "7463",
"id_transaction" : "5307056748245491959",
"id_product" : 430672,
"value_product" : 1524855375500,
"prod_desc": ""
}
}

Final File (note we have populated prod_desc with MEAT!):

[ {
"ts" : "201856271804499",
"prod_desc" : "Pork chop leberkas brisket chuck, filet mignon turducken hamburger.",
"updated_dt" : "Fri Apr 27 18:56:15 UTC 2018",
"id_store" : 1,
"event_type" : "generated",
"uuid" : "0d16967d-102d-4864-b55a-3f1cb224a0a6",
"hostname" : "princeton1",
"ip" : "172.26.217.170",
"counter" : "7463",
"id_transaction" : "5307056748245491959",
"id_product" : 430672,
"value_product" : 1524855375500
} ]
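The LookupRecord enrichment shown in the before/after files can be sketched in Python, with an in-memory dictionary as a hypothetical stand-in for the HBase lookup table (in the flow, the HBase Record Lookup Service does this):

```python
# In-memory stand-in for the HBase lookup table, keyed by id_product.
lookup_table = {
    430672: "Pork chop leberkas brisket chuck, filet mignon turducken hamburger.",
}

def enrich(record, lookups):
    """Replace the empty prod_desc with the looked-up description, like LookupRecord."""
    out = dict(record)
    desc = lookups.get(record["id_product"])
    if desc is not None:
        out["prod_desc"] = desc
    return out

record = {"id_product": 430672, "id_store": 1, "prod_desc": ""}
print(enrich(record, lookup_table)["prod_desc"])
```

When the key is missing, the record passes through unchanged; in NiFi you would route such records to a separate relationship for handling.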
References:
https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte.html
https://community.hortonworks.com/articles/155527/ingesting-golden-gate-records-from-apache-kafka-an.html
https://community.hortonworks.com/questions/174144/lookuprecord-and-simplecsvfilelookupservice-in-nif.html
https://community.hortonworks.com/articles/138632/data-flow-enrichment-with-nifi-lookuprecord-proces.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html

For those wishing to not include meat in their data, there are alternatives: https://www.vegguide.org/site/api-docs

Example Flow: etlv2.xml
04-25-2018
02:00 PM
You can use ExecuteSQL. You can use Sqoop for the initial export: https://community.hortonworks.com/articles/108718/ingesting-rdbms-data-as-new-tables-arrive-automagi.html
04-24-2018
06:44 PM
Vision Thing Part 3: Image Analytics: Open Source Computer Vision with TensorFlow, Apache MiniFi, Apache NiFi, OpenCV, Apache Tika and Python

In preparation for this talk, I am releasing some articles detailing how to work with images. In this one, for Linux machines, I recommend building OpenCV yourself and installing the Python connector.

sudo yum install -y https://centos7.iuscommunity.org/ius-release.rpm
sudo yum update -y
sudo yum groupinstall 'Development Tools' -y
sudo yum install cmake git pkgconfig -y
sudo yum install libpng-devel libjpeg-turbo-devel jasper-devel openexr-devel libtiff-devel libwebp-devel -y
sudo yum install libdc1394-devel libv4l-devel gstreamer-plugins-base-devel -y
sudo yum install gtk2-devel -y
sudo yum install tbb-devel eigen3-devel -y
sudo yum install python36u python36u-libs python36u-devel python36u-pip -y
pip3.6 install numpy
cd ~
git clone https://github.com/Itseez/opencv.git
cd opencv
git checkout 3.1.0
git clone https://github.com/Itseez/opencv_contrib.git
cd opencv_contrib
git checkout 3.1.0
cd ~/opencv
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_INSTALL_PREFIX=/usr/local \
-D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
-D INSTALL_C_EXAMPLES=OFF \
-D INSTALL_PYTHON_EXAMPLES=ON \
-D BUILD_EXAMPLES=ON \
-D BUILD_OPENCV_PYTHON2=ON -D BUILD_OPENCV_PYTHON3=ON ..
sudo make
sudo make install
sudo ldconfig
pip3.6 install opencv-python
04-23-2018
09:23 PM
1 Kudo
I would recommend trying the Apache NiFi ExecuteSparkInteractive processor: https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte.html