Streaming Tweets with FLUME, with HDFS sink



New Contributor



I followed a tutorial to stream Twitter tweets into HDFS using Flume.


The tweets started flowing and HDFS files were created. However, when I opened a file, the data was garbled and contained a lot of junk characters.


I went a step further and created a Hive table on the dataset, but a SELECT query against the table threw exceptions.


Things I need help with:


1. Why do the tweet files contain junk characters, and what needs to change so that they are clean enough to load into Hive tables?


Please help me resolve this, and let me know if any other supporting data or files need to be posted.


Twitter CONF file.


# Naming the components on the current agent.
TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS
# Describing/Configuring the source 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = XXXXXXXX
TwitterAgent.sources.Twitter.consumerSecret = XXXXXXXX 
TwitterAgent.sources.Twitter.accessToken = XXXXXXXX 
TwitterAgent.sources.Twitter.accessTokenSecret = XXXXXXXX
TwitterAgent.sources.Twitter.keywords = chennairain, chennairains, savechennai, ChennaiRainsHelp, ChennaiRains, ChennaiRain, PrayForChennai 
# Describing/Configuring the sink 

TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020//user/cloudera/twitter/
#TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.fileType = SequenceFile
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text 
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000 
# Describing/Configuring the channel 
TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 10000 
TwitterAgent.channels.MemChannel.transactionCapacity = 100
# Binding the source and sink to the channel 
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

Screenshots of the HDFS file with junk characters:

Flume _tweet_JUNK1.png

Flume _tweet_JUNK2.png

Re: Streaming Tweets with FLUME, with HDFS sink


According to your configuration, you are persisting the data in SequenceFile format:



#TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.fileType = SequenceFile


Try using the DataStream fileType for your HDFS files:


TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
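
To confirm what the sink actually wrote, one option is to look at the first few bytes of a file. The Python sketch below is only an illustration, not part of Flume: the function name sniff_format and the labels it returns are made up for this example. It relies on two facts: a Hadoop SequenceFile starts with the bytes "SEQ", and an Avro object container file starts with "Obj" followed by the byte 0x01. Anything else is treated as text if it decodes as UTF-8.

```python
import codecs
import sys

# Leading "magic" bytes of the two binary containers relevant here.
SEQUENCEFILE_MAGIC = b"SEQ"   # Hadoop SequenceFile header
AVRO_MAGIC = b"Obj\x01"       # Avro object container file header


def sniff_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file.

    Hypothetical helper for illustration; the returned labels are arbitrary.
    """
    if not header:
        return "empty"
    if header.startswith(SEQUENCEFILE_MAGIC):
        return "sequencefile"
    if header.startswith(AVRO_MAGIC):
        return "avro"
    # An incremental decoder tolerates a multi-byte character that was cut
    # off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(header, final=False)
    except UnicodeDecodeError:
        return "unknown-binary"
    return "text"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python sniff.py head.bin   (a local copy of the first bytes)
    with open(sys.argv[1], "rb") as f:
        print(sniff_format(f.read(64)))
```

The first bytes can be copied locally with something like `hdfs dfs -cat <file> | head -c 64 > head.bin`, where `<file>` stands for one of the files under your hdfs.path. A result of "sequencefile" matches the configuration above. A result of "avro" on a file written with DataStream would suggest the event bodies themselves are binary, which changing fileType alone would not alter; the Apache Flume documentation describes org.apache.flume.source.twitter.TwitterSource as converting tweets to Avro.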

Re: Streaming Tweets with FLUME, with HDFS sink

New Contributor



I am getting the same issue. I have checked the fileType property, and it is already set to DataStream:

TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream


Any ideas on how to resolve this issue?