
bad HDFS sink property

Super Collaborator

events1476284674520.zip

I have come to the conclusion that the properties file is bad and is therefore producing the bad JSON file. Can someone point out how I can correct it? I am uploading the JSON file it is producing; can someone confirm it is bad?

flume-ng agent --conf-file twitter-to-hdfs.properties --name agent1  -Dflume.root.logger=WARN,console -Dtwitter4j.http.proxyHost=dotatofwproxy.tolls.dot.state.fl.us -Dtwitter4j.http.proxyPort=8080
[root@hadoop1 ~]# more twitter-to-hdfs.properties
agent1.sources =source1
agent1.sinks = sink1
agent1.channels = channel1

agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = org.apache.flume.source.twitter.TwitterSource
agent1.sources.source1.consumerKey = xxxxxxxxxxxxxxxxxxxxxxxxxTaz
agent1.sources.source1.consumerSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxCI9
agent1.sources.source1.accessToken = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxwov
agent1.sources.source1.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxY5H3
agent1.sources.source1.keywords = Clinton Trump
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/flume/tweets
agent1.sinks.sink1.hdfs.filePrefix = events
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.inUsePrefix = _
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.channels.channel1.type = file
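
The file channel above is running on its defaults, so the checkpoint and data directories fall back to Flume's built-in locations under the Flume user's home directory. Pinning them down explicitly would look something like this (the paths are examples only, not from my setup):

# Example only: explicit file-channel directories (paths illustrative)
agent1.channels.channel1.checkpointDir = /var/flume/checkpoint
agent1.channels.channel1.dataDirs = /var/flume/data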

1 ACCEPTED SOLUTION

7 REPLIES

Super Collaborator

Sami, I don't see keywords listed as a property for the TwitterSource.

https://flume.apache.org/FlumeUserGuide.html#twitter-1-firehose-source-experimental

However, your upload looks to be an Avro file, which is what the documentation says you will receive from the source. What is it about your result that you think is incorrect?
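
If you want to eyeball the records, avro-tools can dump an Avro container file as JSON, one record per line. The jar name and location vary by distribution, and the file name below is illustrative:

# Pull one rolled file out of HDFS (name illustrative), then dump it as JSON
hdfs dfs -get /user/flume/tweets/events.1476284674520.log .
java -jar avro-tools.jar tojson events.1476284674520.log | head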

Super Collaborator

Because I can't read it into Hive in the standard way, which I see many people using on the web where it works for them.

I have tried four different SerDes; all give errors.

CREATE EXTERNAL TABLE tweetdata3(created_at STRING,
text STRING,
  person STRUCT< 
     screen_name:STRING,
     name:STRING,
     locations:STRING,
     description:STRING,
     created_at:STRING,
     followers_count:INT,
     url:STRING>
) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'  location '/user/flume/tweets';

hive>
    >
    > select person.name,person.locations, person.created_at, text from tweetdata3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.ByteArrayInputStream@2bc779ed; line: 1, column: 2]
Time taken: 0.274 seconds
hive>

Master Mentor

@Sami Ahmad

Look at my code below. I did exactly what you wanted to do, and it works. Just copy it and substitute the values to correspond with your environment; it should work.

And this is how you launch it! Substitute the values to fit your setup:

/usr/bin/flume-ng agent -c /etc/flume-ng/conf -f /etc/flume-ng/conf/flume.conf -n agent
#######################################################
# This is a test configuration created on 31/07/2016
#    by Geoffrey Shelton Okot
#######################################################
# Twitter Agent
########################################################
# Twitter agent for collecting Twitter data to HDFS.
########################################################
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
########################################################
# Describing and configuring the sources
########################################################
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxx
TwitterAgent.sources.Twitter.keywords = hadoop,Data Scientist,BigData,Trump,computing,flume,Nifi
#######################################################
# Twitter configuring  HDFS sink
########################################################
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode.com:8020/user/flume
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
#######################################################
# Twitter Channel
########################################################
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 20000
#TwitterAgent.channels.MemChannel.DataDirs =
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
#######################################################
# Binding the Source and the Sink to the Channel
########################################################
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
########################################################
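
One caveat on the Hive side: the TwitterSource writes Avro container files regardless of how the sink is configured, so a JSON SerDe will not parse them. If that is the error you are seeing, reading them through the AvroSerDe is the usual route. A sketch only; the table name and schema path are examples, not from a tested setup:

-- Sketch: first pull the writer schema out of one rolled file and park it
-- on HDFS, outside the table directory so Hive does not try to read the
-- .avsc file as data:
--   java -jar avro-tools.jar getschema events.1476284674520.log > tweets.avsc
--   hdfs dfs -put tweets.avsc /user/flume/schemas/tweets.avsc
CREATE EXTERNAL TABLE tweets_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/tweets'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/flume/schemas/tweets.avsc');

The columns are taken from the Avro schema itself, so the fields will come out under whatever names the TwitterSource schema actually uses.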

Super Collaborator

flumedata.zip

Hi Geoffrey, I created new files based on your template, but I still get the same error when querying through Hive.

Did you try creating the table in Hive and querying it? Please try it, and also try it against my file; I am uploading it.

Thanks.

CREATE EXTERNAL TABLE tweetdata3(created_at STRING,
text STRING,
  person STRUCT< 
     screen_name:STRING,
     name:STRING,
     locations:STRING,
     description:STRING,
     created_at:STRING,
     followers_count:INT,
     url:STRING>
) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'  location '/user/flume/tweets';


hive> describe tweetdata3;
OK
created_at              string                  from deserializer
text                    string                  from deserializer
person                  struct<screen_name:string,name:string,locations:string,description:string,created_at:string,followers_count:int,url:string>     from deserializer
Time taken: 0.311 seconds, Fetched: 3 row(s)
hive> select person.name,person.locations, person.created_at, text from tweetdata3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.ByteArrayInputStream@4d5fa2a; line: 1, column: 2]
Time taken: 0.333 seconds
hive>
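
One thing I notice in the trace: the parser fails on an 'O' at the very start of the stream, and an Avro container file begins with the ASCII bytes "Obj" (then a 0x01 version byte), while plain JSON would start with '{' or '['. So possibly these files are still Avro, not JSON. A quick check (file name illustrative):

# An Avro container file starts with the ASCII magic "Obj";
# JSON output would start with '{' or '['.
hdfs dfs -cat /user/flume/tweets/events.1476284674520.log | head -c 3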

Super Collaborator

Can you give me your Hive CREATE TABLE statement please, and the SELECT command from the table?

Super Collaborator

I found the issue thanks to a post on Stack Overflow; please see below.

http://stackoverflow.com/questions/30657983/type-error-string-from-deserializer-instead-of-int-when-...

Explorer

@Sami Ahmad, did you finally get a solution? A year later I'm still getting the same error.