New Contributor
Posts: 4
Registered: ‎05-20-2014

CDH4 and CDH5: flume twitter data looks corrupted

The Flume data is full of unsupported characters:

 

http://i.imgur.com/kM5Cqir.png

 

Also, when I try to edit the file, I'm getting the following error message:

 

http://i.imgur.com/aOmQ6cN.png

 

File is not encoded in utf; cannot be edited: /user/flume/tweets/2014/07/08/12/FlumeData.1404847995652.

 

And when I try to import the Flume data into the Hive tables using the Metastore Manager, I'm getting the following errors:

 

java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader@4f7f1d92; line: 1, column: 2]

 

Can you help me solve this problem with my Twitter data?

 

Cloudera is now absolutely useless for me.

Posts: 1,673
Kudos: 329
Solutions: 263
Registered: ‎07-31-2013

Re: CDH4 and CDH5: flume twitter data looks corrupted

Your file appears to have gotten non-UTF-8 data in it. This may be because the incoming data was not requested in UTF-8 format.

Since Hue and Hive both require UTF-8, this is why you are seeing these errors.

You may also want to read http://stackoverflow.com/questions/8274972/official-encoding-used-by-twitter-streaming-api-is-it-utf... and tweak your Flume source so that it is guaranteed to produce data encoded as UTF-8.

Twitter as a service itself does not penalise users if they send tweets in differing formats.
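
If you want to confirm this, a rough check along these lines can tell you whether a given FlumeData file is valid UTF-8 at all. This is only a minimal sketch (not something shipped with CDH), and it assumes you first copy one file to the local filesystem with hdfs dfs -get:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: reports whether a local copy of a FlumeData file decodes as UTF-8.
public class Utf8Check {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        // REPORT makes the decoder throw on the first malformed byte instead of replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            System.out.println("File decodes cleanly as UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println("File is NOT valid UTF-8: " + e);
        }
    }
}

If the decoder fails on the very first bytes, the file was never UTF-8 text to begin with, which points at the source/sink output format rather than at Hue or Hive.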
New Contributor
Posts: 1
Registered: ‎06-23-2015

Re: CDH4 and CDH5: flume twitter data looks corrupted

Hi Jermo,

 

Just wanted to know, did you find a solution to this?

I'm facing the same issue...

 

org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Invalid UTF-8..

 

Kind regards,

Thierry

 

Contributor
Posts: 49
Registered: ‎07-26-2016

Re: CDH4 and CDH5: flume twitter data looks corrupted

Hi All,

 

I'm also facing the same issue while loading the Twitter data into a Hive table via the Metastore.

I placed the Hive SerDe jar in Hive/lib and added the jar in the Hive editor as follows:

ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

 

Below is the error I am facing in Cloudera QuickStart 5.7:

 

java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.StringReader@9cf3f4e; line: 1, column: 2]

 

Thanks,

Syam.

New Contributor
Posts: 3
Registered: ‎03-25-2018

Re: CDH4 and CDH5: flume twitter data looks corrupted

I am facing the same issue. Was anyone able to find a solution to this?

Error====

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader@83ac653; line: 1, column: 2]

Posts: 1,673
Kudos: 329
Solutions: 263
Registered: ‎07-31-2013

Re: CDH4 and CDH5: flume twitter data looks corrupted

Are you certain your Flume sink is configured to emit plaintext output, as opposed to a SequenceFile or an Avro DataFile? The character 'O' in particular may appear at the very start of an Avro DataFile, whose header begins with the bytes 'Obj'; for example:
http://avro.apache.org/docs/current/spec.html#Object+Container+Files

If you indeed want to use an Avro DataFile, then your Hive definitions
should be changed accordingly.
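
One quick way to tell what the sink actually wrote is to look at the first few bytes of a FlumeData file: an Avro object container starts with the bytes 'Obj' followed by 0x01, a SequenceFile starts with 'SEQ', and plain one-tweet-per-line JSON starts with '{'. Below is a minimal sketch of such a check (not part of CDH; it assumes the file has been copied to the local filesystem with hdfs dfs -get):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: guesses the container format of a local FlumeData file from its first bytes.
public class HeaderCheck {
    public static void main(String[] args) throws IOException {
        byte[] head = new byte[4];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(head);
        }
        if (n >= 4 && head[0] == 'O' && head[1] == 'b' && head[2] == 'j' && head[3] == 1) {
            System.out.println("Avro object container file (magic 'Obj' + 0x01)");
        } else if (n >= 3 && head[0] == 'S' && head[1] == 'E' && head[2] == 'Q') {
            System.out.println("Hadoop SequenceFile (magic 'SEQ')");
        } else if (n >= 1 && head[0] == '{') {
            System.out.println("Starts with '{' -- likely plain JSON text, one tweet per line");
        } else {
            System.out.println("Unrecognised header: " + new String(head, 0, Math.max(n, 0), StandardCharsets.ISO_8859_1));
        }
    }
}

If it reports the Avro magic, the Hive table needs an Avro-aware definition, or the Flume pipeline needs to be switched to plain text output.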
New Contributor
Posts: 3
Registered: ‎03-25-2018

Re: CDH4 and CDH5: flume twitter data looks corrupted

Hi Harsh,

 

Here is the conf file that I am using.

 

 
# Sources, channels, and sinks are defined per
# agent name, in this case 'TwitterAgent'.
 
 
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
 
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxx
TwitterAgent.sources.Twitter.accessToken = xxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxx
 
TwitterAgent.sources.Twitter.keywords = hadoop, big data
 
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /user/navnit/flume/twitter
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
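# Note: with fileType = DataStream the sink writes each event body to HDFS as-is
# (the default TEXT serializer just appends a newline), so the on-disk format is
# whatever the configured source put into the event body.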
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
 
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
 
Please let me know if this needs a change.