Support Questions


Which file is it producing, JSON or Avro?

Super Collaborator

With the commands below, what type of file is being produced: JSON or Avro?

flume-ng agent --conf ./conf/ -f conf/twitter-to-hdfs.properties --name TwitterAgent  -Dflume.root.logger=WARN,console -Dtwitter4j.http.proxyHost=proxy.server.com -Dtwitter4j.http.proxyPort=8080
[flume@hadoop1 conf]$ pwd
/home/flume/conf
[flume@hadoop1 conf]$
[flume@hadoop1 conf]$ more twitter-to-hdfs.properties
########################################################
# Twitter agent for collecting Twitter data to HDFS.
########################################################
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
########################################################
# Describing and configuring the sources
########################################################
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.Channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret =xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.Keywords = hadoop,Data Scientist,BigData,Trump,computing,flume,Nifi
#######################################################
# Twitter configuring  HDFS sink
########################################################
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.WriteFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
#######################################################
# Twitter Channel
########################################################
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 20000
#TwitterAgent.channels.MemChannel.DataDirs =
TwitterAgent.channels.MemChannel.transactionCapacity =1000
#######################################################
# Binding the Source and the Sink to the Channel
########################################################
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channels = MemChannel
[flume@hadoop1 conf]$

1 ACCEPTED SOLUTION

Master Guru

Your output is Avro.

I looked at your ZIP, and that is an Avro file.

Flume's TwitterSource outputs Avro from Twitter:

https://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

You can also ingest Twitter data to HDFS via Apache NiFi:

http://hortonworks.com/blog/hdf-2-0-flow-processing-real-time-tweets-strata-hadoop-slack-tensorflow-...
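Since the output is Avro, the file actually carries its own schema: an Avro object container stores the writer's schema as a JSON string in the file header. The sketch below pulls that schema out using only the standard library (the zig-zag varint decoding follows the Avro container-file spec; in practice the `avro` or `fastavro` packages do this for you):

```python
import json

def _read_long(f):
    # Avro longs are zig-zag encoded little-endian varints.
    b = ord(f.read(1))
    n = b & 0x7F
    shift = 7
    while b & 0x80:
        b = ord(f.read(1))
        n |= (b & 0x7F) << shift
        shift += 7
    return (n >> 1) ^ -(n & 1)

def read_avro_schema(path):
    """Return the writer's schema embedded in an Avro container file header."""
    with open(path, 'rb') as f:
        if f.read(4) != b'Obj\x01':
            raise ValueError('not an Avro object container file')
        meta = {}
        count = _read_long(f)        # the header metadata is a map<string, bytes>
        while count != 0:
            if count < 0:            # negative count: a block byte size follows
                count = -count
                _read_long(f)
            for _ in range(count):
                key = f.read(_read_long(f)).decode('utf-8')
                meta[key] = f.read(_read_long(f))
            count = _read_long(f)
        return json.loads(meta['avro.schema'])
```

Running this against a FlumeData file from the TwitterSource would print the tweet record schema as JSON.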


5 REPLIES

Super Guru

@Sami Ahmad

Flume simply moves your data from source to target, in this case from Twitter to HDFS. I believe Twitter sends JSON records, which means the file being written is in JSON format. Flume does not alter your file format; it only moves data.
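If the sink were indeed writing one JSON record per line (an assumption, not confirmed anywhere in this thread), reading the file back would look something like this sketch:

```python
import json

def read_json_lines(path):
    """Parse a file containing one JSON record per line.

    This only works if the source really emitted JSON text; if the
    file is an Avro container, json.loads() raises ValueError instead.
    """
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```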

Super Collaborator

flumedata.zip

If that is the case, then it is not matching the JSON format. Please see the attached file.

Master Guru

It's Avro format.

Super Collaborator

As you can see, I can't read it using JSON:

[hdfs@hadoop1 ~]$ more a.py
#!/usr/bin/env python
import json
with open('FlumeData.1477426267073') as f:
        data = f.read()
        jsondata = json.loads(data)
print jsondata
[hdfs@hadoop1 ~]$ python a.py
Traceback (most recent call last):
  File "a.py", line 7, in <module>
    jsondata = json.loads(data)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
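One quick way to settle what the file actually is, without any extra libraries: per the Avro specification, every Avro object container file begins with the 4-byte magic `Obj` followed by 0x01, whereas a JSON document would start with `{` or `[`. A minimal sniffer sketch:

```python
def sniff_format(path):
    """Guess whether a file is an Avro container or JSON from its first bytes."""
    with open(path, 'rb') as f:
        head = f.read(4)
    if head == b'Obj\x01':          # Avro object container file magic
        return 'avro'
    if head[:1] in (b'{', b'['):    # JSON object or array
        return 'json'
    return 'unknown'
```

Running `sniff_format('FlumeData.1477426267073')` on the attached file should report which format it is.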
