Hi, I tried the basic Flume assignment by creating a text file that eventually gets copied to HDFS. But now that I am trying to work on the Twitter example, I am facing problems, and I think it's because of these points:
1. I have used the Cloudera-provided JAR file to copy Twitter data. This should not make a difference?
2. I have placed the JAR file in the /usr/lib/flume/lib/ directory. Is this the correct place to put the JAR?
3. In FLUME_CLASSPATH, I have given the above library path.
4. I am starting the Flume agent using the following command:
$ bin/flume-ng agent -n $TwitterAgent -c conf -f conf/flume-conf
I have no clue where I am going wrong, or maybe I am missing the basics somewhere. It would be good if someone could provide some guidance. Thanks a lot.
If this is Cloudera-specific, we can't help much, as configurations differ across the two platforms. If I may, I'd recommend looking at Apache NiFi, which would save you tons of grief: https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
Thanks for your prompt response. Actually, I am using the Hortonworks Sandbox; I was just using the Cloudera-provided JAR, which acts as the source that copies the Twitter data to HDFS. If you can advise where I can get a JAR that is compatible with Hortonworks, or whether Hortonworks has any such JAR, that would help. I was aware of NiFi, but I want to do this assignment using Flume.
I'm not sure I understand; a Twitter firehose source is available in Flume itself. The only thing I'm not sure about is whether it is available in our version. It was released as part of Flume 1.6, and we are on 1.5.2. A lot of features were backported to our version, but I can't say for sure about this source. There is no JAR; you just plug in your Twitter application credentials and you are good to go.
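For reference, here is a rough sketch of what an agent definition using the built-in source might look like, written out from the shell. It assumes your Flume build actually ships org.apache.flume.source.twitter.TwitterSource (the class and property names follow the Apache Flume 1.6 user guide; I have not checked them against 1.5.2). The agent, channel and sink names, the scratch file location, and the HDFS path are placeholders, and the four credential values have to come from your own Twitter application.

```shell
# Scratch location for the sketch (placeholder); copy the result to wherever
# your agent reads its configuration from.
CONF_FILE="${TMPDIR:-/tmp}/twitter-agent.conf"

cat > "$CONF_FILE" <<'EOF'
# Agent name "TwitterAgent" is a placeholder; it must match the -n argument.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Built-in Twitter source (assumed to be present in this Flume build).
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET

# In-memory buffer between source and sink.
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

# HDFS sink; the target path is a placeholder.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
EOF

echo "wrote $CONF_FILE"
```

If the source class is missing from your version, the agent should fail at startup with a class-loading error in the Flume log, which would answer the backport question.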
I think I understand now. Here's an example: http://www.thecloudavenue.com/2013/03/analyse-tweets-using-flume-hadoop-and.html
It looks like you need to use their version of Flume and not rely on the Ambari-managed Flume. You can use their JAR and start it from the location you put it in. Do not confuse their JAR with our Flume version. I would put their JAR in your home directory or /usr/local, then start that agent pointing to the custom location you chose.
I will try this way and let you know if it worked. Thanks again for your help.
What's the error you got?
I doubt it is something Cloudera-specific. Flume is a very simple tool in terms of installation and configuration.
To adjust the classpath per agent, you can add the "--classpath" argument to the command.
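Something along these lines, as a sketch only. The agent name and the JAR location are placeholders, and the config paths are the ones from the question. The snippet just builds and prints the command so you can check it before starting anything.

```shell
# Placeholders: adjust these to your own layout.
AGENT_NAME="TwitterAgent"                           # must match the prefix used in the config file
EXTRA_JAR="$HOME/flume-extras/twitter-source.jar"   # hypothetical location of the extra JAR

# --classpath appends the given entries to the classpath of this agent only.
FLUME_CMD="bin/flume-ng agent -n $AGENT_NAME -c conf -f conf/flume-conf --classpath $EXTRA_JAR"

# Print the command; run the printed line from the Flume install directory.
echo "$FLUME_CMD"
```

Note that the agent name is passed literally here. With "-n $TwitterAgent", as typed in the question, the shell expands a variable called TwitterAgent, which is empty unless you have set it, so the agent name may never reach Flume.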
From my experience, if you need to use Flume with Twitter, it is better to remove everything related to the Twitter libs from Flume's default classpath, just to avoid issues with twitter4j dependencies.
I have no idea why they are included in the installation by default.
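To see what is there before removing anything, a small helper like the one below can list the candidates. It is only a sketch: the default directory is the one mentioned in the question, the function name is made up, and matching on "twitter" in the file name is my assumption about how those JARs are named.

```shell
# List JARs whose file name contains "twitter" or "Twitter" in a lib directory.
# Defaults to /usr/lib/flume/lib (the directory mentioned in the question).
list_twitter_jars() {
  lib_dir="${1:-/usr/lib/flume/lib}"
  for jar in "$lib_dir"/*[Tt]witter*.jar; do
    # When nothing matches, the glob stays unexpanded, so check the file exists.
    if [ -e "$jar" ]; then
      echo "$jar"
    fi
  done
}

# Usage: list_twitter_jars            (default directory)
#        list_twitter_jars /some/dir  (any other lib directory)
```

Anything it prints can then be moved out of the lib directory. I would move the files rather than delete them, so they are easy to put back.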
I have a 4-node cluster with HDP 22.214.171.124 and a compatible Ambari agent. Flume is running on the 4th node and the NameNode on the 1st node. Where should I run the Flume agent, and should I configure flume.conf on the machine or in the Ambari UI? I'm totally confused by this multi-node cluster and Ambari. Can you please guide me on how to successfully stream the Twitter data with Flume? Any useful link may help.
Thanks in advance.
Flume agents must be configured in the Ambari UI. I captured tweets with Flume, and the steps in the following link worked for me: http://blog.hubacek.uk/streaming-tweets-into-hadoop-part-ii/
Once you have set up the agents in the Ambari UI, restart Flume from Ambari and it should begin to capture your tweets.
If you get any errors, you can check the Flume log, for example on CentOS (/var/log/flume).
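A quick way to pull the recent problems out of that directory could look like this. It is a sketch: the function name is made up, the default path is the CentOS one above, and I am assuming that failures show up as lines containing "error" or "exception".

```shell
# Show the last 20 log lines mentioning an error or exception.
# Defaults to /var/log/flume; pass another directory as the first argument.
show_flume_errors() {
  log_dir="${1:-/var/log/flume}"
  if [ ! -d "$log_dir" ]; then
    echo "no log directory at $log_dir" >&2
    return 1
  fi
  # -r: search all files under the directory, -i: ignore case, -E: extended regex
  grep -riE 'error|exception' "$log_dir" | tail -n 20
}

# Usage: show_flume_errors
#        show_flume_errors /path/to/other/logs
```

Run it on the node where the agent is running, since each agent writes its log locally.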
Greetings and good luck