11-22-2017
03:39 PM
Sure, it can be from anywhere you want for REST: GET or POST.
08-19-2016
12:19 AM
3 Kudos
Use Case: Process a Media Feed, Store Everything, Run Sentiment Analysis on the Stream, and Act on a Condition

I have my GetTwitter processor looking at my Twitter handle and the keyword Hadoop, something I tend to tweet frequently. I use an EvaluateJsonPath processor to pull out all the attributes I like (msg, user name, geo information, etc.), then an AttributesToJSON processor to build a new, smaller JSON file from just those attributes. I store the raw JSON data in HDFS as well, in a separate directory.

Sending HTML email is a bit tricky: make sure you don't include extra text, so nothing in the message but raw HTML as seen below, and don't Attach Files or Include All Attributes in Message. Make sure you set the Content Type to text/html.

For sentiment analysis I wanted to run something easy, so I use an ExecuteStreamCommand processor to run a Python 2.7 script that uses the NLTK Vader SentimentIntensityAnalyzer. The NiFi part is easy: just a command that calls a shell script. The hard part is setting up Python and NLTK on the HDP 2.4 Sandbox; NLTK with the text corpus needed for proper analysis is almost 10 gigabytes of data. See NLTK for Sentiment Analysis and How to Analyze Sentiment.

If you don't have Python 2.7 or Python 3.4 installed on your box (my VM had Python 2.6), you need to install Python 2.7 while keeping the existing Python 2.6 for existing applications. This is a bit tricky, so I have detailed the steps so you will be able to install and run this great ML tool. Directions on how to install Python 2.7 on CentOS 6.x can be found here. More details can be found here.

sudo yum install -y centos-release-SCL
sudo yum install -y python27
sudo yum groupinstall "Development tools" -y
sudo yum install zlib-devel -y
sudo yum install bzip2-devel -y
sudo yum install openssl-devel -y
sudo yum install ncurses-devel -y
sudo yum install sqlite-devel -y
cd /opt
sudo wget --no-check-certificate https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
sudo tar xf Python-2.7.6.tar.xz
cd Python-2.7.6
sudo ./configure --prefix=/usr/local
sudo make && sudo make altinstall
Now we can use /usr/local/bin/python2.7. Next, download get-pip.py and install pip for it:
wget https://bootstrap.pypa.io/get-pip.py
sudo /usr/local/bin/python2.7 get-pip.py
sudo /usr/local/bin/pip2.7 install -U nltk
sudo /usr/local/bin/pip2.7 install -U numpy
sudo /usr/local/bin/python2.7 -m nltk.downloader -d /usr/local/share/nltk_data all
(almost 10 gig of data)
sudo /usr/local/bin/pip2.7 install vaderSentiment
sudo /usr/local/bin/pip2.7 install twython
The run.sh script called from ExecuteStreamCommand. I use a Bash shell script since I want to make sure Python 2.7 is used. There are other ways, but this works for me:

/usr/local/bin/python2.7 /opt/demo/sentiment/sentiment.py "$@"

That script calls sentiment.py with the parameters passed from NiFi:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
print('Compound {0} Negative {1} Neutral {2} Positive {3} '.format(ss['compound'],ss['neg'],ss['neu'],ss['pos']))
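Back in NiFi, the script's single output line can be split into attributes (for example with an ExtractText processor). A minimal sketch of that parsing step in Python; the regex and attribute names are my own illustration, not part of the original flow:

```python
import re

# Example of the single line sentiment.py prints to stdout
line = "Compound 0.6249 Negative 0.0 Neutral 0.5 Positive 0.5 "

# Pull each score out, the way an ExtractText processor would
# promote them to flow file attributes
pattern = r"Compound (\S+) Negative (\S+) Neutral (\S+) Positive (\S+)"
m = re.match(pattern, line)
scores = dict(zip(["compound", "neg", "neu", "pos"], map(float, m.groups())))

print(scores["compound"])  # 0.6249
```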
Once the data is in Hadoop, I was also running a Scala Spark 1.6 batch job with Spark SQL to run Stanford CoreNLP sentiment analysis on it as well. I also tried running that as a Scala Spark 1.6 Spark Streaming job that did the same thing but received the data from Kafka (it could also receive from NiFi Site-to-Site). Another option is to write a processor in Java or Scala that runs it as part of the flow. With Apache NiFi you have a lot of options depending on your needs, and all of them get the features and benefits that only Apache NiFi provides.

Now we have a bunch of data, both raw and slimmed down. A select portion was converted to HTML and emailed out. Note: I have used Gmail and Outlook.com/Hotmail to send, but they tend to shut you down after a while for spam concerns. I use my own mail server (Dataflowdeveloper.com) since I have full control; you can use your corporate server as long as you have an SMTP login and permissions. You may need to check with your administrators about firewalls, ports and other security precautions.

What to do with an HDFS directory full of same-schema JSON files from Twitter? I also used a Spark batch job to produce an ORC Hive table with an extra column for Stanford sentiment. You can quickly run queries on that via beeline, DbVisualizer or the Ambari Hive View:

beeline
!connect jdbc:hive2://localhost:10000/default;
!set showHeader true;
set hive.vectorized.execution.enabled=true;
set hive.execution.engine=tez;
set hive.vectorized.execution.reduce.enabled=true;
set hive.compute.query.using.stats=true;
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
show tables;
describe sparktwitterorc;
analyze table sparktwitterorc compute statistics;
analyze table sparktwitterorc compute statistics for columns;
I do a one-time compute statistics to enhance performance. That table is now ready for high-speed queries. You can also run a fast Hive query from the command line:

beeline -u jdbc:hive2://localhost:10000/default -e "SELECT * FROM rawtwitter where sentiment is not null and time like 'Thu Aug 18%' and lower(msg) like '%hadoop%' LIMIT 100;"
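For ad hoc runs, that query string can also be assembled programmatically before handing it to beeline. A small sketch; the helper name and its parameters are my own illustration:

```python
def build_tweet_query(keyword, day_prefix, limit=100):
    """Build a HiveQL string like the one passed to `beeline -e` above.

    Only simple LIKE filters; this is string formatting, so it is not
    safe against untrusted input.
    """
    return (
        "SELECT * FROM rawtwitter "
        "WHERE sentiment IS NOT NULL "
        "AND time LIKE '{day}%' "
        "AND lower(msg) LIKE '%{kw}%' "
        "LIMIT {limit};"
    ).format(day=day_prefix, kw=keyword.lower(), limit=limit)

print(build_tweet_query("Hadoop", "Thu Aug 18"))
```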
Spark SQL says my data looks like this:

|-- coordinates: string (nullable = true)
|-- followers_count: string (nullable = true)
|-- friends_count: string (nullable = true)
|-- geo: string (nullable = true)
|-- handle: string (nullable = true)
|-- hashtags: string (nullable = true)
|-- language: string (nullable = true)
|-- location: string (nullable = true)
|-- msg: string (nullable = true)
|-- place: string (nullable = true)
|-- profile_image_url: string (nullable = true)
|-- retweet_count: string (nullable = true)
|-- sentiment: string (nullable = true)
|-- source: string (nullable = true)
|-- tag: string (nullable = true)
|-- time: string (nullable = true)
|-- time_zone: string (nullable = true)
|-- tweet_id: string (nullable = true)
|-- unixtime: string (nullable = true)
|-- user_name: string (nullable = true)
My Hive table on this directory of tweets looks like:

create table rawtwitter(
handle string,
hashtags string,
msg string,
language string,
time string,
tweet_id string,
unixtime string,
user_name string,
geo string,
coordinates string,
location string,
time_zone string,
retweet_count string,
followers_count string,
friends_count string,
place string,
source string,
profile_image_url string,
tag string,
sentiment string,
stanfordSentiment string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/social/twitter';
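Since the JsonSerDe maps JSON keys to column names, each tweet file should use exactly the field names from the DDL above. A quick sanity-check sketch; the sample record is made up for illustration:

```python
import json

# Column names taken from the rawtwitter DDL above
COLUMNS = {
    "handle", "hashtags", "msg", "language", "time", "tweet_id",
    "unixtime", "user_name", "geo", "coordinates", "location",
    "time_zone", "retweet_count", "followers_count", "friends_count",
    "place", "source", "profile_image_url", "tag", "sentiment",
    "stanfordSentiment",
}

# A made-up slimmed tweet, as AttributesToJSON might emit it
record = '{"handle": "@example", "msg": "Learning Hadoop", "sentiment": "0.5"}'

# Check that every key in the record is a known column; keys the
# table does not declare are the usual source of query surprises
unknown = set(json.loads(record)) - COLUMNS
print(sorted(unknown))  # []
```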
Now you can create charts and graphs from your social data.
08-11-2016
01:02 PM
2 Kudos
With Apache NiFi 1.0 you can now act as a simple SMTP server (though it is recommended to sit behind a real SMTP MTA and just receive mail forwards). It makes for an easy way to ingest mail, headers and attachments. The first thing you will notice is the awesome new UI, which is much cleaner and a joy to use.

First add a ListenSMTP processor; this will be your mail gateway/SMTP server. As you can see, there are also processors for extracting attachments and headers from email. You need to make sure you set the Listening Port, SMTP hostname and Max. # of Connections.

The entire flow for mail processing is pretty simple and easy to follow. We listen for SMTP over a TCP port (I chose 2025, but with root access you could run on 25). I send the original flow file right to HDFS, extract the attachments and put them in a separate HDFS directory, and finally pull out the email headers and send them to an HDFS file as well. I have a little test flow at the bottom to read a file and send email to our ListenSMTP for testing.

If you are running this on an HDP 2.4 sandbox, you will need to install Java 8 and set it as an alternative JDK (see http://tecadmin.net/install-java-8-on-centos-rhel-and-fedora/):

alternatives --config java

Pick Java 8. I added Java 8 as an alternative and specified JAVA_HOME at the top of bin/nifi.sh so I could run with Java 8, which is now required.

To send a test SMTP message from the command line:

telnet localhost 2025
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 sandbox.hortonworks.com ESMTP Apache NiFi
ehlo sandbox
250-sandbox.hortonworks.com
250-8BITMIME
250-SIZE 67108864
250 Ok
MAIL FROM: <tim@sparkdeveloper.com>
250 Ok
RCPT TO: <tspann@hortonworks.com>
250 Ok
DATA
354 End data with <CR><LF>.<CR><LF>
hello
.
250 Ok
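The manual telnet session above can also be scripted with Python's standard smtplib. A sketch under the assumptions from this post (ListenSMTP on localhost:2025, the same addresses); it prints a note instead of failing if the sandbox isn't up:

```python
import smtplib
import socket
from email.mime.text import MIMEText

# Build the same tiny test message as the telnet session above;
# a raw-HTML body, matching the text/html advice given earlier
msg = MIMEText("<h1>hello</h1>", "html")
msg["From"] = "tim@sparkdeveloper.com"
msg["To"] = "tspann@hortonworks.com"
msg["Subject"] = "test"

try:
    # 2025 is the port the ListenSMTP processor was configured with
    smtp = smtplib.SMTP("localhost", 2025, timeout=5)
    smtp.sendmail(msg["From"], [msg["To"]], msg.as_string())
    smtp.quit()
except (socket.error, smtplib.SMTPException) as err:
    print("ListenSMTP not reachable: %s" % err)
```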
A better way to test SMTP is with SWAKS See: https://debian-administration.org/article/633/Testing_SMTP_servers_with_SWAKS From Mac: brew install swaks From Centos/RHEL: sudo yum -y install swaks Test Send Email: swaks --to tspann@hortonworks.com --server localhost:2025
Received: from hw13125.home (localhost [127.0.0.1])
by sandbox.hortonworks.com with SMTP (Apache NiFi) id IRPEF4WI
for tspann@hortonworks.com; Wed, 10 Aug 2016 17:19:12 -0400 (EDT)
Date: Wed, 10 Aug 2016 17:19:12 -0400
To: tspann@hortonworks.com
From: tspann@hw13125.home
Subject: test Wed, 10 Aug 2016 17:19:12 -0400
X-Mailer: swaks v20130209.0 jetmore.org/john/code/swaks/
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_MIME_BOUNDARY_000_98059"

------=_MIME_BOUNDARY_000_98059
Content-Type: text/plain
This is a test mailing
------=_MIME_BOUNDARY_000_98059
Content-Type: application/octet-stream
Content-Disposition: attachment
Content-Transfer-Encoding: BASE64
------=_MIME_BOUNDARY_000_98059--
It is very easy to configure sending an email message to our server; you just need to put in a hostname and port. Once you're done building your flow, make sure you create a template and save the XML off to version control. Creating a template has now moved to the Operate control. If you get lost on what you are working on, you can use the search feature at the top right. Remember this is a beta product, not yet ready for production. Wait for HDF 2.0 for supported production usage.
08-04-2016
01:24 PM
What jars are needed? Can you attach a NiFi template?
08-02-2016
12:08 AM
2 Kudos
Tips

Read each processor's directions carefully; inputs and outputs vary greatly between attributes, JSON, JSONPath statements and other entries. JSONPath supports a radically different syntax than NiFi Expression Language.

Know your Expression Language; there are a lot of useful functions like UUID() and nextInt().

When you are doing a lot of streaming and testing data, you will find that logs grow huge on your CentOS sandbox. I built up gigabytes of logs in Hive, Hadoop and NiFi in just a few days of operation. Check your /var/log every few days and watch the Ambari screens for data usage. When your data gets too large, things can start failing or slowing down due to a lack of temporary space, swap space and space for logs.

Delete test files for ever:

hdfs dfs -rm -f -skipTrash /twitter/*.json

Clean up junk files:

cd /
du -hsx * | sort -rh | head -10

PutHiveQL requires hiveql.args.1.value and hiveql.args.1.type for all fields. If you are using EvaluateJsonPath, you cannot set your types there; do that in an UpdateAttribute processor afterwards. Then you need to turn your code into SQL for Hive to execute.
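The hiveql.args numbering pattern can be sketched as a small helper that produces the attribute map an UpdateAttribute processor would set, one value/type pair per positional ? in the SQL. The helper itself is my own illustration; 12 is java.sql.Types.VARCHAR:

```python
# JDBC type code PutHiveQL understands (java.sql.Types): 12 = VARCHAR
VARCHAR = 12

def hiveql_args(values, jdbc_type=VARCHAR):
    """Return hiveql.args.N.value / hiveql.args.N.type attributes
    for each positional parameter, in order."""
    attrs = {}
    for i, value in enumerate(values, start=1):
        attrs["hiveql.args.%d.value" % i] = str(value)
        attrs["hiveql.args.%d.type" % i] = str(jdbc_type)
    return attrs

# e.g. for: INSERT INTO rawtwitter (handle, msg) VALUES (?, ?)
attrs = hiveql_args(["@example", "Learning Hadoop"])
print(attrs["hiveql.args.2.value"])  # Learning Hadoop
```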
07-28-2016
08:34 PM
5 Kudos
There are a lot of excellent talks from the summit.

Deep Learning
- Apache Spark with Machine Learning like TensorFlow
- Distributed Deep Learning on Hadoop Clusters (Yahoo)

Apache Spark
- Big Data Heterogeneous Mixture Learning on Spark
- Integrating Apache Spark and NiFi for Data Lakes (ThinkBig)

Operations
- Zero Downtime App Deployment Using Hadoop (Hortonworks)
- Debugging YARN Cluster in Production (Hortonworks)
- Yahoo's Experience Running Pig on Tez at Scale
- The DAP Where YARN, HBase, Kafka and Spark Go to Production (Cask)
- Extend Governance in Hadoop with Atlas Ecosystem (Hortonworks)
- Cost and Resource Tracking for Hadoop (Yahoo)
- Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
- Operating and Supporting Apache HBase: Best Practices and Improvements (Hortonworks)
- Scheduling Policies in YARN (Slides)

Future of Data
- Arun Murthy, Hortonworks - Hadoop Summit 2016 San Jose - #HS16SJ - #theCUBE
- The Future of Hadoop: An Enterprise View (Slides)

Streaming
- The Future of Storm (Hortonworks)
- Streaming ETL for All: Embeddable Data Transformation for Real Time Streams
- Real-time, Streaming Advanced Analytics, Approximations, and Recommendations using Apache Spark ML/GraphX, Kafka, Stanford CoreNLP, and Twitter Algebird (Chris Fregly, IBM)
- Fighting Fraud in Real Time by Processing 1M+ TPS Using Storm on Slider YARN (Rocketfuel)
- Lambda-less Stream Processing at Scale in LinkedIn
- Make Streaming Analytics Work For You: The Devil is in the Details (Hortonworks)
- Lego-Like Building Blocks of Storm and Spark Streaming Pipelines for Rapid IoT and Streaming (StreamAnalytix)
- Performance Comparison of Streaming Big Data Platforms

Machine Learning
- Prescient Keeps Travelers Safe with Natural Language Processing and Geospatial Analytics

IoAT / Internet of Things
- What about Data Storage (Hortonworks)

YAF (Yet Another Framework)
- Apache Beam: A Unified Model for Batch and Streaming Data Processing (Google)
- Turning the Stream Processor into a Database: Building Online Applications on Streams (Flink / DataArtisans)
- The Next Generation of Data Processing OSS (Google)
- Next Gen Big Data Analytics with Apache Apex

SQL and Friends
- How We Re-Engineered Phoenix with a Cost-Based Optimizer Based on Calcite (Intel and Hortonworks)
- Hive HBase Metastore: Improving Hive with a Big Data Metadata Storage (Hortonworks)
- Phoenix + HBase: An Enterprise-Grade Data Warehouse Appliance for Interactive Analytics (Hortonworks)
- Presto: What's New in SQL on Hadoop and Beyond (Facebook, Teradata)

DataFlow
- Scalable Optical Character Recognition with Apache NiFi and Tesseract (Hortonworks)
- Building a Smarter Home with NiFi and Spark

General
- It's Time: Launching Your Advanced Analytics Program for Success in a Mature Industry Like Oil and Gas (ConocoPhillips)
- Instilling Confidence and Trust: Big Data Security Governance (Mastercard)
- Hadoop in the Cloud: The What, Why and How from the Experts (Microsoft)
- War on Stealth Cyberattacks that Target Unknown Vulnerabilities
- Hadoop and Cloud Storage: Object Store Integration in Production (Hortonworks)
- There is a New Ranger in Town: End-to-End Security and Auditing in a Big Data as a Service Deployment
- Building a Scalable Data Science Platform with R (Microsoft)
- A Data Lake and a Data Lab to Optimize Operations and Safety Within a Nuclear Fleet
- Reliable and Scalable Data Ingestion at Airbnb
05-22-2018
12:55 PM
I'm collecting Facebook data using NiFi. Which processors (and configurations) should I use, and how do I modify the query to get more feeds from the Graph API response? I'm getting the first 100 posts and, after that, a link to the next 100 posts. How do I manage a dynamic process to continually get data from Facebook?
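The "link to the next 100 posts" pattern is cursor pagination, and the general loop looks like the sketch below. Here fetch_page is a stand-in for whatever HTTP processor or client actually calls the API, and the {"data": ..., "paging": {"next": ...}} shape is an assumption based on how the Graph API commonly structures responses:

```python
def collect_all_posts(fetch_page, first_url):
    """Follow `paging.next` links until there are none left.

    fetch_page(url) must return a dict like
    {"data": [...], "paging": {"next": "..."}}.
    """
    posts, url = [], first_url
    while url:
        page = fetch_page(url)
        posts.extend(page.get("data", []))
        # Missing "paging" or "next" means we reached the last page
        url = page.get("paging", {}).get("next")
    return posts

# Stubbed two-page feed to show the loop terminating
pages = {
    "page1": {"data": [1, 2], "paging": {"next": "page2"}},
    "page2": {"data": [3]},
}
print(collect_all_posts(pages.get, "page1"))  # [1, 2, 3]
```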
05-22-2017
01:26 PM
@Timothy Spann Were you able to configure TOAD with a kerberized cluster?
07-21-2016
10:13 PM
2 Kudos
Using the GetHTTP processor we grab random images from DigitalOcean's Unsplash.it free image site. I give each image a random file name so we can save it uniquely in HDFS. The entire data flow runs from GetHTTP to final HDFS storage of the image and its metadata as JSON, extracted with the ExtractMediaMetadata processor. The final results:

hdfs dfs -cat /mediametadata/random1469112881039.json
{
  "Number of Components": "3",
  "Resolution Units": "none",
  "Image Height": "200 pixels",
  "File Name": "apache-tika-3181704319795384377.tmp",
  "Data Precision": "8 bits",
  "File Modified Date": "Thu Jul 21 14:54:43 UTC 2016",
  "tiff:BitsPerSample": "8",
  "Compression Type": "Progressive, Huffman",
  "X-Parsed-By": "org.apache.tika.parser.DefaultParser, org.apache.tika.parser.jpeg.JpegParser",
  "Component 1": "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
  "Component 2": "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
  "tiff:ImageLength": "200",
  "mime.type": "image/jpeg",
  "gethttp.remote.source": "unsplash.it",
  "Component 3": "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
  "X Resolution": "1 dot",
  "File Size": "4701 bytes",
  "tiff:ImageWidth": "200",
  "path": "./",
  "filename": "random1469112881039.jpg",
  "Image Width": "200 pixels",
  "uuid": "8b7c4f9f-9436-4ccb-b06e-9a720c91f6e0",
  "Content-Type": "image/jpeg",
  "Y Resolution": "1 dot"
}
We have as many images as we want. Using the Unsplash.it parameters, I picked an image width of 200; you can customize that. Below is the image downloaded with the above metadata.
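Downstream, the stored metadata JSON can be inspected with a few lines of Python. The snippet below uses a trimmed, illustrative copy of the record shown above, not the full file:

```python
import json

# Trimmed sample of the ExtractMediaMetadata output stored in HDFS
raw = """{"tiff:ImageLength": "200", "mime.type": "image/jpeg",
"gethttp.remote.source": "unsplash.it",
"filename": "random1469112881039.jpg", "Image Width": "200 pixels"}"""

meta = json.loads(raw)
# Values are strings with units, e.g. "200 pixels" -> 200
width = int(meta["Image Width"].split()[0])
print(meta["mime.type"], width)  # image/jpeg 200
```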
11-27-2017
01:43 PM
1 Kudo
Hello, great post. One part needs a correction:

sudo wget http://download.opensuse.org/repositories/home:/oojah:/mqtt/CentOS_CentOS-6/home:oojah:mqtt.repo
sudo cp *.repo /etc/yum.repos.d/
sudo yum -y update
sudo yum -y install mosquitto

Steps 1 and 2 are fused. Regards