New Contributor
Posts: 2
Registered: ‎08-01-2016

Streaming Tweets with FLUME, with HDFS sink

Hi 

 

I followed this tutorial to stream tweets from Twitter into HDFS using Flume.

 

https://www.tutorialspoint.com//apache_flume/fetching_twitter_data.htm

 

The tweets started flowing and files were created in HDFS. However, when I opened a file, the data looked garbled and contained a lot of junk characters.

 

I went a step further and created a Hive table on top of the dataset, but a SELECT query against the table threw exceptions.

 

Here is what I need help with:

1. Why does the tweet file contain junk characters, and what needs to change so that the data is clean enough to load into Hive tables?

 

Please help me resolve this, and let me know if any other supporting data or files need to be posted.

 

Twitter CONF file.

 

# Naming the components on the current agent.
TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS
  
# Describing/Configuring the source 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = XXXXXXXX
TwitterAgent.sources.Twitter.consumerSecret = XXXXXXXX 
TwitterAgent.sources.Twitter.accessToken = XXXXXXXX 
TwitterAgent.sources.Twitter.accessTokenSecret = XXXXXXXX
TwitterAgent.sources.Twitter.keywords = chennairain, chennairains, savechennai, ChennaiRainsHelp, ChennaiRains, ChennaiRain, PrayForChennai 
  
# Describing/Configuring the sink 

TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020//user/cloudera/twitter/
#TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.fileType = SequenceFile
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text 
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000 
 
# Describing/Configuring the channel 
TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 10000 
TwitterAgent.channels.MemChannel.transactionCapacity = 100
  
# Binding the source and sink to the channel 
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel 

Screenshots of the HDFS file with junk characters:

Flume _tweet_JUNK1.png

Flume _tweet_JUNK2.png

Champion
Posts: 777
Registered: ‎05-16-2016

Re: Streaming Tweets with FLUME, with HDFS sink

According to your configuration, you are persisting the data in SequenceFile format:

 

 

#TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.fileType = SequenceFile

 

Try using the DataStream fileType for your HDFS files instead:

 

TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
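
As a rough sketch of that change, assuming the source and channel definitions stay exactly as posted, the sink section might look like the following. With DataStream the sink writes the event bodies without the SequenceFile container and without compression. Whether the files then read as plain text still depends on what the source puts into each event body, so treat this as a starting point rather than a guaranteed fix.

```properties
# Sink section only; the other sink settings (batchSize, rollSize, rollCount) stay as posted.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020//user/cloudera/twitter/
# Write event bodies directly instead of wrapping them in a SequenceFile.
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
```

Files written before the change keep their old format, so it is probably easiest to test with a fresh, empty output directory after restarting the agent.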
New Contributor
Posts: 3
Registered: ‎03-30-2017

Re: Streaming Tweets with FLUME, with HDFS sink

Hi,

 

I am getting the same issue. I checked the fileType property, and it is already set to DataStream:

TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

 

Any ideas on how to resolve this issue?