
Flume: Twitter data looks corrupt


I used Flume to fetch data from Twitter and am trying to query it in Hive, but I keep getting this error whenever I run a SELECT statement:

 

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

and I keep getting this unexpected-character error as well:

 

Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')

If I go to the Hue file browser and view the data file, it looks weird. I think my data is corrupt.

 

[Screenshots attached: Screen Shot 2016-11-28 at 7.12.40 PM.png, Screen Shot 2016-11-28 at 7.12.23 PM.png]

 

Here is my flumetwitter.conf file

 

TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS
  
# Describing/Configuring the source 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey=uX0TWqkx0okYEjjqLzxIx6mD6
TwitterAgent.sources.Twitter.consumerSecret=rzHIs3TMJnADbZNvdGU7LQUo0kPxPISq3RGSLfqcBip39X5END
TwitterAgent.sources.Twitter.accessToken=559516596-yDA9xqOljo4CV32wSnqsx2BXh4RBIRKFxZGSZrPC
TwitterAgent.sources.Twitter.accessTokenSecret=zDxePILZitS5tIWBhre0GWqps0FIj9OadX8RZb6w8ZCwz
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000
TwitterAgent.sources.Twitter.keywords=hadoop, bigdata, mapreduce, mahout, hbase, nosql
# Describing/Configuring the sink 

TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:8020/user/cloudera/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
# note: property name is case-sensitive -- writeFormat, not writeformat
TwitterAgent.sinks.HDFS.hdfs.writeFormat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600

TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=1000

TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

 

And here is the command I'm using to run it:

 

flume-ng agent -n TwitterAgent -f /usr/lib/flume-ng/conf/flumetwitter.conf
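As an aside, flume-ng is usually launched with an explicit --conf directory as well, so that flume-env.sh and log4j.properties are picked up; console logging makes source/sink errors visible while debugging. A sketch (the conf directory path is taken from the command above and may differ on your box):

```shell
# Same agent, but with an explicit config directory and console logging
# so that any source/sink exceptions show up in the terminal:
flume-ng agent --conf /usr/lib/flume-ng/conf \
  --conf-file /usr/lib/flume-ng/conf/flumetwitter.conf \
  --name TwitterAgent \
  -Dflume.root.logger=DEBUG,console
```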

Here is my full error log from querying in Hive. It's too big to paste in full:

Pastebin error log

 

Error when I try to run it from the terminal:

pastebin error log when trying to query from table

 

What could my issue be? Is it my flumetwitter.conf file?

 

2 REPLIES

Re: Flume: Twitter data looks corrupt

anybody?


Re: Flume: Twitter data looks corrupt

Super Collaborator
Unfortunately, the Apache Twitter source is a bit broken in its implementation: it attempts to append an Avro datum to the HDFS sink file for each batch.
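This also fits the Jackson error quoted above: Avro object-container files begin with the 4-byte magic 'O' 'b' 'j' 0x01, and 'O' is ASCII 79, exactly the unexpected character the JSON parser reports. A quick way to check (a sketch; the FlumeData file name below is hypothetical, so substitute a real file from the sink directory):

```shell
# On the cluster, dump the first bytes of one of the Flume output files
# (file name here is hypothetical):
#   hdfs dfs -cat /user/cloudera/flume/tweets/FlumeData.1480378360000 | head -c 4 | od -An -c
#
# Local demonstration of what the start of an Avro container looks like:
printf 'Obj\001' > /tmp/avro_magic   # the 4-byte Avro container magic
head -c 3 /tmp/avro_magic            # prints: Obj
```

If the file starts with those bytes, it is an Avro container rather than the raw JSON a Hive JSON SerDe expects, which would explain both the "corrupt"-looking data in the Hue file browser and the parse failure.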
It is recommended to compile the Cloudera version of the Twitter source described in this blog post:
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-...

From here:
https://github.com/cloudera/cdh-twitter-example

The Twitter sources are considered experimental and should not be relied on for production use, so please keep that in mind.

-pd