Created 10-27-2016 03:54 PM
With the command and configuration below, what type of file is being produced: JSON or Avro?
flume-ng agent --conf ./conf/ -f conf/twitter-to-hdfs.properties --name TwitterAgent -Dflume.root.logger=WARN,console -Dtwitter4j.http.proxyHost=proxy.server.com -Dtwitter4j.http.proxyPort=8080

[flume@hadoop1 conf]$ pwd
/home/flume/conf
[flume@hadoop1 conf]$ more twitter-to-hdfs.properties
########################################################
# Twitter agent for collecting Twitter data to HDFS.
########################################################
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

########################################################
# Describing and configuring the sources
########################################################
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.Channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.Keywords = hadoop,Data Scientist,BigData,Trump,computing,flume,Nifi

########################################################
# Twitter configuring HDFS sink
########################################################
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.WriteFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

########################################################
# Twitter Channel
########################################################
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 20000
#TwitterAgent.channels.MemChannel.DataDirs =
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

########################################################
# Binding the Source and the Sink to the Channel
########################################################
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channels = MemChannel
[flume@hadoop1 conf]$
Created 10-27-2016 05:20 PM
Your output is Avro. I looked at your ZIP and that is an Avro file. Flume's org.apache.flume.source.twitter.TwitterSource converts the tweets to Avro events before they reach the sink.
https://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm
You can also ingest Twitter data into HDFS via Apache NiFi.
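As a sanity check, you can inspect the container header with nothing but the standard library: Avro object container files start with the magic bytes Obj followed by 0x01, and carry their writer schema as JSON in a metadata map. Below is a minimal sketch that builds a tiny header by hand and parses it back; a real check would open the FlumeData file instead, and the varint decoder here only handles the simple non-negative block counts you would see in practice.

```python
import io
import json

def read_long(buf):
    """Decode one zigzag varint-encoded long (Avro's integer encoding)."""
    shift, accum = 0, 0
    while True:
        b = buf.read(1)[0]
        accum |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)   # undo zigzag

def read_header(buf):
    """Parse an Avro object container header: magic, metadata map, sync marker."""
    if buf.read(4) != b'Obj\x01':
        raise ValueError('not an Avro object container file')
    meta = {}
    count = read_long(buf)
    while count != 0:                    # the map is written in blocks; 0 terminates
        for _ in range(count):
            key = buf.read(read_long(buf)).decode('utf-8')
            meta[key] = buf.read(read_long(buf))
        count = read_long(buf)
    sync = buf.read(16)                  # per-file sync marker
    return meta, sync

def write_long(n):
    """Encode a non-negative long as a zigzag varint (for the demo header)."""
    n = (n << 1) ^ (n >> 63)             # zigzag
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

# Build a tiny self-contained header instead of reading a real FlumeData file.
schema = b'"string"'
header = (b'Obj\x01'
          + write_long(1)                                    # one metadata entry
          + write_long(len(b'avro.schema')) + b'avro.schema'
          + write_long(len(schema)) + schema
          + write_long(0)                                    # end of metadata map
          + b'S' * 16)                                       # sync marker

meta, sync = read_header(io.BytesIO(header))
print(json.loads(meta['avro.schema']))   # -> string
```

Running the same read_header over a Flume-written file would print the full Twitter record schema embedded in its header.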
Created 10-27-2016 04:28 PM
Flume is simply moving your data from source to sink, in this case from Twitter to HDFS. I believe Twitter sends JSON records, which would mean the file being written is in JSON format. Flume is not altering your file format; it is only moving data.
Created 10-27-2016 05:08 PM
flumedata.zip If that is the case, then it's not matching the JSON format. Please see the attached file.
Created 10-27-2016 05:21 PM
It's Avro format.
Created 10-27-2016 05:22 PM
As you can see, I can't read it using JSON:
[hdfs@hadoop1 ~]$ more a.py
#!/usr/bin python
import json

with open('FlumeData.1477426267073') as f:
    data = f.read()

jsondata = json.loads(data)
print jsondata
[hdfs@hadoop1 ~]$ python a.py
Traceback (most recent call last):
  File "a.py", line 7, in <module>
    jsondata = json.loads(data)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
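The quickest way to see why json.loads fails here is to look at the first bytes of the file: Avro object container files always begin with the magic bytes Obj followed by 0x01, while a JSON dump would start with { or [. A small sketch (the real check would point at FlumeData.1477426267073 from the traceback above; the demo below uses temporary files instead):

```python
import os
import tempfile

def sniff_format(path):
    """Classify a file as Avro container, JSON, or unknown by its leading bytes."""
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(b'Obj\x01'):   # Avro object container magic bytes
        return 'avro'
    if head[:1] in (b'{', b'['):      # JSON object or array
        return 'json'
    return 'unknown'

# Demo on throwaway files rather than the real FlumeData file.
avro_like = tempfile.NamedTemporaryFile(delete=False)
avro_like.write(b'Obj\x01rest-of-header...')
avro_like.close()

json_like = tempfile.NamedTemporaryFile(delete=False)
json_like.write(b'{"text": "a tweet"}')
json_like.close()

kind_a = sniff_format(avro_like.name)
kind_b = sniff_format(json_like.name)
print(kind_a)   # -> avro
print(kind_b)   # -> json

os.unlink(avro_like.name)
os.unlink(json_like.name)
```

If sniff_format reports 'avro', an Avro reader (or `hdfs dfs -text`, which understands Avro containers) is the right tool, not json.loads.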