09-08-2016
06:30 AM
The more nodes in a ZK ensemble, the more read throughput you get, but the slower the writes. That's because a read can be served by any node, while a write is not complete until a majority (quorum) of the nodes has acknowledged it, and that majority grows with the ensemble. On top of that, early versions of Kafka (0.8.2 and older) keep Kafka offsets in ZK. Therefore, as already suggested by @mqureshi, it's best to start by creating a dedicated ZK ensemble for Kafka (I'd go for 3 nodes) and keep the 5-node ZK for everything else. Beefing up the number of ZK nodes to 7 or more is a resounding No.

Regarding the installation and management of the new Kafka ZK, it's pretty straightforward to install it manually: just follow the steps in one of the "Non-Ambari cluster installation guides" like this one. You can also try to create a cluster composed of only Kafka and ZK and manage it with its own Ambari instance.
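A rough way to see the sizing trade-off is to compute how many acknowledgements a write needs for a given ensemble size. This is only a sketch of the majority arithmetic; the function names are made up for illustration and it does not talk to ZooKeeper.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Nodes that can be down while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 5, 7):
    print("{0} nodes: write needs {1} acks, survives {2} failure(s)".format(
        n, quorum_size(n), tolerated_failures(n)))
```

Going from 5 to 7 nodes buys one more tolerated failure, at the price of one more acknowledgement on every write.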
05-01-2017
04:26 PM
Thanks a lot for this article. What are you using to run TF on Spark in this configuration?
11-22-2017
03:39 PM
Sure, can be from anywhere you want for REST. GET or POST.
08-23-2016
07:44 PM
Second option didn't work. I have installed jdk, need to check whether that might solve the issue.
06-28-2018
03:49 PM
In my experience, the connection error goes away if you remove "thrift://" from the URI.
08-19-2016
12:19 AM
3 Kudos
Use Case: Process a Media Feed, Store Everything, Run Sentiment Analysis on the Stream, and Act on a Condition

I have my GetTwitter processor looking at my Twitter handle and the keyword Hadoop, something I tend to tweet frequently. I use an EvaluateJsonPath to pull out all the attributes I like (msg, user name, geo information, etc.). I use an AttributesToJSON processor to make a new JSON file from just my attributes for a smaller tweet. I store the raw JSON data in HDFS as well, in a separate directory.

Sending HTML email is a bit tricky: you need to make sure you don't include extra text, so nothing in the message but raw HTML as seen below, and don't enable Attach Files or Include All Attributes in Message. Make sure you set the Content Type to text/html.

For sentiment analysis I wanted to run something easy, so I use an ExecuteStreamCommand to run a Python 2.7 script that uses the NLTK Vader SentimentIntensityAnalyzer. The NiFi part is easy: just a command that calls a shell script. The hard part is setting up Python and NLTK on the HDP 2.4 Sandbox. NLTK with the text corpora needed for proper analysis is almost 10 gigabytes of data.

NLTK for Sentiment Analysis
How to Analyze Sentiment

If you don't have Python 2.7 or Python 3.4 installed on your box (my VM had Python 2.6), you need to install Python 2.7 while keeping your existing Python 2.6 for existing applications. This is a bit tricky, so I have detailed these steps so you will be able to install and run this great ML tool. Directions on how to install Python 2.7 on CentOS 6.x can be found here. More details can be found here.

sudo yum install -y centos-release-SCL
sudo yum install -y python27
sudo yum groupinstall "Development tools" -y
sudo yum install zlib-devel -y
sudo yum install bzip2-devel -y
sudo yum install openssl-devel -y
sudo yum install ncurses-devel -y
sudo yum install sqlite-devel -y
cd /opt
sudo wget --no-check-certificate https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
sudo tar xf Python-2.7.6.tar.xz
cd Python-2.7.6
sudo ./configure --prefix=/usr/local
sudo make && sudo make altinstall

Now we can use /usr/local/bin/python2.7. Download get-pip.py first, then run it with the new interpreter:

wget https://bootstrap.pypa.io/get-pip.py
/usr/local/bin/python2.7 get-pip.py
sudo /usr/local/bin/pip2.7 install -U nltk
sudo /usr/local/bin/pip2.7 install -U numpy
sudo /usr/local/bin/python2.7 -m nltk.downloader -d /usr/local/share/nltk_data all
(almost 10 gig of data)
sudo /usr/local/bin/pip2.7 install vaderSentiment
sudo /usr/local/bin/pip2.7 install twython

run.sh is called from ExecuteStreamCommand. I use a Bash shell script since I want to make sure Python 2.7 is used. There are other ways, but this works for me.

/usr/local/bin/python2.7 /opt/demo/sentiment/sentiment.py "$@"

That script calls sentiment.py with the parameters passed from NiFi.

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
print('Compound {0} Negative {1} Neutral {2} Positive {3} '.format(ss['compound'],ss['neg'],ss['neu'],ss['pos']))
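
To act on a condition downstream, the line printed by sentiment.py has to be turned back into numbers. The following is a minimal sketch of one way to do that; the helper names are hypothetical, the sample line is made up, and the 0.05 cut-offs are the thresholds commonly suggested for VADER's compound score, so treat them as an assumption to tune.

```python
# Assumed cut-offs for the compound score; adjust for your data.
POSITIVE_THRESHOLD = 0.05
NEGATIVE_THRESHOLD = -0.05


def parse_scores(line):
    """Parse 'Compound x Negative y Neutral z Positive w' into a dict of floats."""
    tokens = line.split()
    # Tokens alternate between a label and its value.
    return {tokens[i].lower(): float(tokens[i + 1])
            for i in range(0, len(tokens) - 1, 2)}


def label(compound):
    """Map a compound score to a coarse sentiment label."""
    if compound >= POSITIVE_THRESHOLD:
        return "positive"
    if compound <= NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"


# Made-up sample in the format printed by sentiment.py.
sample = "Compound 0.6369 Negative 0.0 Neutral 0.323 Positive 0.677 "
scores = parse_scores(sample)
print(label(scores["compound"]))
```

The same thresholds could instead be applied inside NiFi after extracting the compound value into an attribute.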

Once the data is in Hadoop, I also ran a Scala Spark 1.6 batch job with Spark SQL that runs Stanford CoreNLP sentiment analysis on it. I also tried running that as a Scala Spark 1.6 Spark Streaming job that did the same thing but received the data from Kafka (it could also receive from NiFi Site-To-Site). Another option is to write a Processor in Java or Scala that can run that as part of the flow. With Apache NiFi, you have a lot of options depending on your needs, and all of them get the features and benefits that only Apache NiFi provides.

Now we have a bunch of data! Hooray, both raw and slimmed down. A select portion was converted to HTML and emailed out. Note, I have used Gmail and Outlook.com/Hotmail to send, but they tend to shut you down after a while for spam concerns. I use my own mail server (Dataflowdeveloper.com) since I have full control; you can use your corporate server as long as you have an SMTP login and permissions. You may need to check with your administrators about firewall, ports and other security precautions.

What to do with an HDFS directory full of same-schema JSON files from Twitter? I also used a Spark batch job to produce an ORC Hive table with an extra column for Stanford Sentiment. You can quickly run queries on that via beeline, DBVisualizer or Ambari Hive View.

beeline
!connect jdbc:hive2://localhost:10000/default;
!set showHeader true;
set hive.vectorized.execution.enabled=true;
set hive.execution.engine=tez;
set hive.vectorized.execution.reduce.enabled=true;
set hive.compute.query.using.stats=true;
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
show tables;
describe sparktwitterorc;
analyze table sparktwitterorc compute statistics;
analyze table sparktwitterorc compute statistics for columns;

I run compute statistics once to improve query performance. That table is now ready for high-speed queries. You can also run a fast Hive query from the command line:

beeline -u jdbc:hive2://localhost:10000/default -e "SELECT * FROM rawtwitter where sentiment is not null and time like 'Thu Aug 18%' and lower(msg) like '%hadoop%' LIMIT 100;"

Spark SQL says my data looks like:

|-- coordinates: string (nullable = true)
|-- followers_count: string (nullable = true)
|-- friends_count: string (nullable = true)
|-- geo: string (nullable = true)
|-- handle: string (nullable = true)
|-- hashtags: string (nullable = true)
|-- language: string (nullable = true)
|-- location: string (nullable = true)
|-- msg: string (nullable = true)
|-- place: string (nullable = true)
|-- profile_image_url: string (nullable = true)
|-- retweet_count: string (nullable = true)
|-- sentiment: string (nullable = true)
|-- source: string (nullable = true)
|-- tag: string (nullable = true)
|-- time: string (nullable = true)
|-- time_zone: string (nullable = true)
|-- tweet_id: string (nullable = true)
|-- unixtime: string (nullable = true)
|-- user_name: string (nullable = true)

My Hive table on this directory of tweets looks like:

create table rawtwitter(
handle string,
hashtags string,
msg string,
language string,
time string,
tweet_id string,
unixtime string,
user_name string,
geo string,
coordinates string,
location string,
time_zone string,
retweet_count string,
followers_count string,
friends_count string,
place string,
source string,
profile_image_url string,
tag string,
sentiment string,
stanfordSentiment string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/social/twitter';
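
For this table to read the files, each line in /social/twitter needs to be a flat JSON object whose keys match the column names. In the flow that is done by EvaluateJsonPath and AttributesToJSON; the sketch below only illustrates the same slimming step outside NiFi. The mapping from column names to raw tweet fields is hypothetical and covers only a few columns, and writing missing values as empty strings is a choice made here.

```python
import json

# Hypothetical mapping: Hive column -> path inside the raw tweet JSON.
FIELD_PATHS = {
    "handle": ("user", "screen_name"),
    "user_name": ("user", "name"),
    "msg": ("text",),
    "language": ("lang",),
    "time": ("created_at",),
    "tweet_id": ("id_str",),
    "location": ("user", "location"),
    "followers_count": ("user", "followers_count"),
}


def dig(obj, path):
    """Follow a key path through nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def slim_tweet(raw_json):
    """Return a flat JSON string with string-only values for the mapped columns."""
    tweet = json.loads(raw_json)
    slim = {}
    for column, path in FIELD_PATHS.items():
        value = dig(tweet, path)
        # Every column in the table is declared as string.
        slim[column] = u"" if value is None else u"{0}".format(value)
    return json.dumps(slim, sort_keys=True)
```

Calling slim_tweet on one raw tweet returns a single line that could be appended to a file in the table's directory.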
Now you can create charts and graphs from your social data.
08-17-2016
07:51 PM
That was my second problem after fixing the stringdecoder, so I raised my ulimit and that resolved the second issue. Thanks!
08-16-2016
07:19 PM
Out of 14,000+ files, 3 had the wrong, old schema, and Spark picked that schema up. I did a skipTrash delete on those and restarted the job. Now it works.
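
One way to hunt for such stragglers before a job runs is to sample a record from each file and group the files by their set of field names. This is a minimal sketch that assumes the files hold line-delimited JSON, which the post does not state; the function names and the (file name, sample line) input shape are made up for illustration.

```python
import json
from collections import defaultdict


def schema_of(record):
    """Use the sorted field names as a simple schema fingerprint."""
    return tuple(sorted(record.keys()))


def group_by_schema(samples):
    """samples: iterable of (file_name, one JSON line from that file)."""
    groups = defaultdict(list)
    for name, line in samples:
        groups[schema_of(json.loads(line))].append(name)
    return groups


def outliers(samples):
    """Return the files whose schema differs from the most common one."""
    groups = group_by_schema(samples)
    if not groups:
        return []
    majority = max(groups, key=lambda schema: len(groups[schema]))
    return sorted(name
                  for schema, names in groups.items() if schema != majority
                  for name in names)


samples = [
    ("part-0001.json", '{"msg": "a", "time": "t1"}'),
    ("part-0002.json", '{"msg": "b", "time": "t2"}'),
    ("part-0003.json", '{"message": "c"}'),
]
print(outliers(samples))
```

With thousands of files, reading only the first line of each keeps the scan cheap, at the risk of missing files that mix schemas internally.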