Which file is it producing, JSON or AVRO?

aliyesami — Thu, 27 Oct 2016 22:54:03 GMT

With the commands below, what type of file is being produced: JSON or AVRO?

flume-ng agent --conf ./conf/ -f conf/twitter-to-hdfs.properties --name TwitterAgent  -Dflume.root.logger=WARN,console -Dtwitter4j.http.proxyHost=proxy.server.com -Dtwitter4j.http.proxyPort=8080
[flume@hadoop1 conf]$ pwd
/home/flume/conf
[flume@hadoop1 conf]$
[flume@hadoop1 conf]$ more twitter-to-hdfs.properties
########################################################
# Twitter agent for collecting Twitter data to HDFS.
########################################################
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
########################################################
# Describing and configuring the sources
########################################################
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.Channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret =xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.Keywords = hadoop,Data Scientist,BigData,Trump,computing,flume,Nifi
#######################################################
# Twitter configuring  HDFS sink
########################################################
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.WriteFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
#######################################################
# Twitter Channel
########################################################
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 20000
#TwitterAgent.channels.MemChannel.DataDirs =
TwitterAgent.channels.MemChannel.transactionCapacity =1000
#######################################################
# Binding the Source and the Sink to the Channel
########################################################
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channels = MemChannel
[flume@hadoop1 conf]$

Re: which file its producing , JSON or AVRO ?

mqureshi — Thu, 27 Oct 2016 23:28:50 GMT

@Sami Ahmad

Flume simply moves your data from source to target, in this case from Twitter to HDFS. I believe Twitter sends JSON records, which means the file being written is in JSON format. Flume does not alter your file format; it only moves data.

Re: which file its producing , JSON or AVRO ?

aliyesami — Fri, 28 Oct 2016 00:08:59 GMT

If that is the case, then it's not matching the JSON format. Please see the attached file, flumedata.zip.

Re: which file its producing , JSON or AVRO ?

TimothySpann — Fri, 28 Oct 2016 00:20:58 GMT

Your output is AVRO.

I looked at your ZIP and that's an AVRO file.

Flume's Twitter source outputs AVRO.

https://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

You can also ingest Twitter to HDFS via Apache NiFi

http://hortonworks.com/blog/hdf-2-0-flow-processing-real-time-tweets-strata-hadoop-slack-tensorflow-phoenix-zeppelin/

Re: which file its producing , JSON or AVRO ?

TimothySpann — Fri, 28 Oct 2016 00:21:29 GMT

It's AVRO format.
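One quick way to check this claim without any Avro tooling is to look at the first bytes of the file: an Avro object container file starts with the four bytes `O`, `b`, `j`, `0x01`, while JSON text normally starts with `{` or `[`. The sketch below assumes Python 3; `sniff_format` is a hypothetical helper name used only for illustration, and the JSON check is a rough heuristic rather than a validator.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def sniff_format(path):
    """Guess 'avro', 'json', or 'unknown' from the first bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(AVRO_MAGIC):
        return "avro"
    # Heuristic only: JSON text usually opens with '{' or '[' after whitespace.
    if head.lstrip()[:1] in (b"{", b"["):
        return "json"
    return "unknown"
```

Calling `sniff_format('FlumeData.1477426267073')` on the file from this thread would be expected to return `'avro'` if the answer above is right.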

Re: which file its producing , JSON or AVRO ?

aliyesami — Fri, 28 Oct 2016 00:22:16 GMT

As you can see, I can't read it using JSON.

[hdfs@hadoop1 ~]$ more a.py
#!/usr/bin/env python
import json
# Read the whole Flume output file and try to parse it as one JSON document.
with open('FlumeData.1477426267073') as f:
    data = f.read()
    jsondata = json.loads(data)
print(jsondata)
[hdfs@hadoop1 ~]$ python a.py
Traceback (most recent call last):
  File "a.py", line 7, in <module>
    jsondata = json.loads(data)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
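The `ValueError` above is consistent with the file being an Avro container rather than JSON text: `json.loads` is handed binary data that begins with the Avro header. As a rough illustration, the sketch below reads only that header using the Python 3 standard library and returns the embedded schema, which Avro does store as JSON under the `avro.schema` metadata key. The function names are hypothetical, and this does not decode the tweet records themselves; for real work, an Avro library or the `avro-tools` jar is the better choice.

```python
import io
import json

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long: a variable-length, zigzag-encoded integer."""
    shift = 0
    value = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data while reading a long")
        b = byte[0]
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (value >> 1) ^ -(value & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode a length-prefixed byte string (Avro 'bytes' or 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data while reading bytes")
    return data


def read_avro_header(stream):
    """Return the file metadata map from an Avro container header."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block byte size follows; skip it
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def avro_schema(path):
    """Return the writer schema of an Avro file as a parsed JSON object."""
    with open(path, "rb") as f:
        meta = read_avro_header(f)
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

With this in place, `avro_schema('FlumeData.1477426267073')` should either return the record schema or raise `ValueError` if the file is not an Avro container after all.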