Support Questions

Find answers, ask questions, and share your expertise

Not able to flume Twitter data into HDFS

Rising Star

Hello team,

I'm a programming enthusiast. I have downloaded the Twitter stream before, but now I'm not able to do so. I'm using Apache Flume 1.4 with Hadoop 2.3.0 and CDH 5.0.0.

No matter how many times I've tried, it throws the same error:

hadoop@ubuntu:~/hadoop/apache-flume-1.4.0-cdh5.0.0-bin$ ./bin/flume-ng agent -n TwitterAgent -c conf -f /home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin/conf/local.conf -Dflume.root.logger=DEBUG,console


Info: Sourcing environment configuration script /home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/home/hadoop/hadoop/hadoop-2.3.0-cdh5.0.0/bin/hadoop) for HDFS access
Info: Excluding /home/hadoop/hadoop/hadoop-2.3.0-cdh5.0.0/share/hadoop/common/lib/slf4j-api-1.7.5.jar from classpath
Info: Excluding /home/hadoop/hadoop/hadoop-2.3.0-cdh5.0.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar from classpath
Info: Including HBASE libraries found via (/home/hadoop/hadoop/hbase-0.96.1.1-cdh5.0.0/bin/hbase) for HBASE access
Info: Excluding /home/hadoop/hadoop/hbase-0.96.1.1-cdh5.0.0/lib/slf4j-api-1.7.5.jar from classpath
Info: Excluding /home/hadoop/hadoop/hbase-0.96.1.1-cdh5.0.0/lib/slf4j-log4j12-1.7.5.jar from classpath
Info: Excluding /home/hadoop/hadoop/hadoop-2.3.0-cdh5.0.0/share/hadoop/common/lib/slf4j-api-1.7.5.jar from classpath
Info: Excluding /home/hadoop/hadoop/hadoop-2.3.0-cdh5.0.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar from classpath
+ exec /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms100m -Xmx200m -Dcom.sun.management.jmxremote -cp '/home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin/conf:/home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin/lib/*:/home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar:/home/hadoop/hadoop/hadoop-2.3.0-cdh5.0.0/etc/hadoop:/home/ha.....

And the .conf file is as follows:

TwitterAgent.sources= Twitter 
TwitterAgent.channels= MemChannel 
TwitterAgent.sinks=HDFS 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource 
TwitterAgent.sources.Twitter.channels=MemChannel 
 
TwitterAgent.sources.Twitter.consumerKey=Pw63cpjptT59uT6w
TwitterAgent.sources.Twitter.consumerSecret=n8awrhKf7S576DcILPk5Ddfp1LQUU
TwitterAgent.sources.Twitter.accessToken=163543326-s0Rqm5y4UC2WV7HPOuiOE9fPZZ56eWO95P
TwitterAgent.sources.Twitter.accessTokenSecret=CLwyJJ1jY4atf7iaiaR96Z1PmVvKF0iOXsP8E
 
TwitterAgent.sources.Twitter.keywords= hadoop,election,sports, cricket,Big data,Trump 
 
TwitterAgent.sinks.HDFS.channel=MemChannel 
TwitterAgent.sinks.HDFS.type=hdfs 
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/tweety 
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000 
TwitterAgent.sinks.HDFS.hdfs.rollSize=0 
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000 
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600 
TwitterAgent.channels.MemChannel.type=memory 
TwitterAgent.channels.MemChannel.capacity=10000 
TwitterAgent.channels.MemChannel.transactionCapacity=100
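
One thing worth double-checking in this config (an observation, not necessarily the root cause): the HDFS sink's hdfs.batchSize is 1000, but the memory channel's transactionCapacity is only 100. A sink batch is taken from the channel in a single transaction, so the transaction capacity should be at least as large as the batch size, e.g.:

```
# Hypothetical adjustment: allow a full sink batch (1000 events) per channel transaction
TwitterAgent.channels.MemChannel.transactionCapacity=1000
```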

And the flume-env.sh file is as follows:

# Environment variables can be set here.

JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

# Note that the Flume conf directory is always included in the classpath.
FLUME_CLASSPATH="/home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"

And the .bashrc file:

export FLUME_HOME="/home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin"
export PATH="$FLUME_HOME/bin:$PATH"
export FLUME_CLASSPATH="$CLASSPATH:/home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"
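
Since the TwitterSource class is loaded from flume-sources-1.0-SNAPSHOT.jar, a quick sanity check (a sketch; the paths are taken from the post above) is to confirm the jar actually exists where the classpath points before starting the agent:

```shell
#!/bin/sh
# Check that the custom Flume source jar referenced in FLUME_CLASSPATH exists.
FLUME_HOME="/home/hadoop/hadoop/apache-flume-1.4.0-cdh5.0.0-bin"
JAR="$FLUME_HOME/lib/flume-sources-1.0-SNAPSHOT.jar"
if [ -f "$JAR" ]; then
    echo "found: $JAR"
else
    echo "missing: $JAR (TwitterSource will fail to load)"
fi
```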

I'd like to know which part I'm doing wrong.

Any valuable suggestion is much appreciated.

Thanks in advance.

1 ACCEPTED SOLUTION

New Contributor

I was able to flume Twitter feeds in the sandbox after spending a lot of time on this.

The following steps helped resolve this:

1. Added the entry below to the /etc/hosts file:

199.59.148.138 stream.twitter.com

2. Updated the date/time in the sandbox:

sudo ntpdate ntp.ubuntu.com

3. Adjusted the HDFS path to point to port 8020:

TwitterAgent.sinks.HDFS.hdfs.path=hdfs://sandbox.hortonworks.com:8020/user/maria_dev/tweets/%Y/%m/%d/%H/
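
One note on the time escapes (%Y/%m/%d/%H) in that path: the HDFS sink resolves them from a timestamp header on each event. If the source in use doesn't set that header, the sink can be told to use the local time instead via the standard hdfs.useLocalTimeStamp property (a hedged suggestion; whether it's needed depends on the source):

```
# Only needed if events lack a timestamp header; otherwise the sink
# will throw when resolving %Y/%m/%d/%H in the path.
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
```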


4 REPLIES

Super Guru
@karthik sai

It looks like you are using the CDH distro, so I would recommend running the same test on an HDP cluster with Flume and letting us know if you still face any issue.

Rising Star

So, can Apache Flume 1.4 still pull the data, or should I upgrade my Flume to 1.6 or higher?

Super Guru

@karthik sai

Hi Karthik, I was saying that if you could install a Hortonworks Hadoop cluster, or perhaps a Sandbox machine, along with Flume, running the same Flume example there would help us understand your issue.

Here is the download link for the Sandbox:

http://hortonworks.com/downloads/#sandbox
