Wrong file format

New Contributor

I am new to Cloudera and have been trying to implement the well-known Twitter example with Flume. After some effort I can stream data from Twitter, and it is now being saved to a file. I now want to analyze that data, but I cannot get it into a table: I created the "tweets" table successfully, but loading the data into it fails. Below are my Twitter.conf file, the external table creation query, the data load query, the error message, and a chunk of the data I collected. Kindly guide me on where I am going wrong. Please note that I am writing the queries in the Hive editor.

 

Twitter.conf file

# Naming the components on the current agent. 
TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS

# Describing/Configuring the source 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = 95y0IPClnNPUTJ1AHSfvBLWes
TwitterAgent.sources.Twitter.consumerSecret = UmlNcFwiBIQIvuHF9J3M3xUv6UmJlQI3RZWT8ybF2KaKcDcAw5
TwitterAgent.sources.Twitter.accessToken = 994845066882699264-Yk0DNFQ4VJec9AaCQ7QTBlHldK5BSK1 
TwitterAgent.sources.Twitter.accessTokenSecret =  q1Am5G3QW4Ic7VBx6qJg0Iv7QXfk0rlDSrJi1qDjmY3mW
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing



# Describing/Configuring the channel 
TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 10000 
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel 
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel 

# Describing/Configuring the sink 

TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = /user/cloudera/latestdata/
TwitterAgent.sinks.flumeHDFS.hdfs.fileType = DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text 
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000 

External table creation query and data load query

 

CREATE EXTERNAL TABLE tweets (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
     text:STRING,
     user:STRUCT<screen_name:STRING,name:STRING>>,
   entities STRUCT<
     urls:ARRAY<STRUCT<expanded_url:STRING>>,
     user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
     hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
     screen_name:STRING,
     name:STRING,
     friends_count:INT,
     followers_count:INT,
     statuses_count:INT,
     verified:BOOLEAN,
     utc_offset:INT,
     time_zone:STRING>,
   in_reply_to_screen_name STRING
 ) 
 PARTITIONED BY (datehour INT)
 ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
 LOCATION '/user/cloudera/tweets';

LOAD DATA INPATH '/user/cloudera/latestdata/FlumeData.1540555155464'
INTO TABLE `default.tweets`
PARTITION (datehour='2013022516')

Error when I try to load data into the table

 

Error while processing statement: FAILED: Execution Error, return code 20013 from org.apache.hadoop.hive.ql.exec.MoveTask. Wrong file format. Please check the file's format.

Twitter data file I got

 

SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.Text� �����R�LX� }H�f�>(�H�Objavro.schema� {"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}�yږ���w����M߀J��&1055790978844540929����gracie 🔪owehimnothng(2018-10-26T04:59:19Z�GIRLS WE THROWING IT BACK FOR JOAN OF

It has been a week and I have not been able to figure out the solution. Please let me know if more information is needed and I will provide it here.

1 ACCEPTED SOLUTION

Re: Wrong file format

Rising Star

Hi zain52,

 

Please review this example:

 

https://github.com/cloudera/cdh-twitter-example

4 REPLIES 4

Re: Wrong file format

Rising Star

Hi,

 

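Flume's HDFSEventSink writes SequenceFiles by default. In your Flume configuration the hdfs.fileType property is set on a sink named flumeHDFS, but your sink is declared as HDFS, so the setting is ignored and the default SequenceFile format is used. Please change the line to this: 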

 

TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream 

 

Here is the documentation:


https://flume.apache.org/FlumeUserGuide.html#hdfs-sink

 

Best regards,

 

      Gabor
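
A mismatch like the one described above (a property set on flumeHDFS while the only declared sink is HDFS) is easy to miss by eye. The following is a minimal Python sketch, not part of Flume, that flags properties whose component name is not declared for the agent. The function name find_undeclared_components and the simple "key = value" parsing are assumptions made for illustration.

```python
from collections import defaultdict

KINDS = ("sources", "channels", "sinks")

def find_undeclared_components(conf_text):
    """Return (line_no, key) pairs whose component name is not declared.

    Hypothetical helper: it only understands plain `key = value` lines in the
    agent.<sources|channels|sinks>.<name>.<property> layout.
    """
    declared = defaultdict(set)   # (agent, kind) -> names listed for the agent
    props = []                    # (line_no, agent, kind, name, key)
    for line_no, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        parts = key.split(".")
        if len(parts) == 2 and parts[1] in KINDS:
            # Declaration line, e.g. "TwitterAgent.sinks = HDFS"
            declared[(parts[0], parts[1])].update(value.split())
        elif len(parts) >= 4 and parts[1] in KINDS:
            # Property line, e.g. "TwitterAgent.sinks.HDFS.hdfs.path = ..."
            props.append((line_no, parts[0], parts[1], parts[2], key))
    # Compare only after the whole file is read, so declaration order is irrelevant.
    return [(n, key) for n, agent, kind, name, key in props
            if name not in declared[(agent, kind)]]

sample = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.flumeHDFS.hdfs.fileType = DataStream
"""
for line_no, key in find_undeclared_components(sample):
    print(f"line {line_no}: {key} refers to an undeclared component")
```

Run against the Twitter.conf from the question, a check like this would point at the hdfs.fileType line, since flumeHDFS never appears in TwitterAgent.sinks.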

Re: Wrong file format

New Contributor
Gabor, the data now loads into the table, but when I execute a query I get this error:
Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='t\r\x08\xefM\xb1E\x08\x99\x88\x86\x8e]\xee\xcd\x01', guid='\xd6\xe0\xa7\x041\x10JE\x97\x1b63\x18\xdf\\\xd0')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage="java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: java.io.ByteArrayInputStream@3e1ad184; line: 1, column: 2]", sqlState=None, infoMessages=["*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: java.io.ByteArrayInputStream@3e1ad184; line: 1, column: 2]:25:24", 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:463', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:294', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:769', 'sun.reflect.GeneratedMethodAccessor28:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:415', 
'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1917', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy26:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:462', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:694', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1553', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1538', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615', 'java.lang.Thread:run:Thread.java:745', "*java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: java.io.ByteArrayInputStream@3e1ad184; line: 1, column: 2]:29:4", 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:508', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:415', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:140', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2069', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:458', "*org.apache.hadoop.hive.serde2.SerDeException:org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 
'true', 'false' or 'null')\n at [Source: java.io.ByteArrayInputStream@3e1ad184; line: 1, column: 2]:30:1", 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:174', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:489', "*org.codehaus.jackson.JsonParseException:Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: java.io.ByteArrayInputStream@3e1ad184; line: 1, column: 2]:36:6", 'org.codehaus.jackson.JsonParser:_constructError:JsonParser.java:1291', 'org.codehaus.jackson.impl.JsonParserMinimalBase:_reportError:JsonParserMinimalBase.java:385', 'org.codehaus.jackson.impl.JsonParserMinimalBase:_reportUnexpectedChar:JsonParserMinimalBase.java:306', 'org.codehaus.jackson.impl.Utf8StreamParser:_handleUnexpectedValue:Utf8StreamParser.java:1582', 'org.codehaus.jackson.impl.Utf8StreamParser:_nextTokenNotInObject:Utf8StreamParser.java:437', 'org.codehaus.jackson.impl.Utf8StreamParser:nextToken:Utf8StreamParser.java:323', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:163'], statusCode=3), results=None, hasMoreRows=None)
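
A note on the parser complaint: the unexpected character 'O' at the start of a record is consistent with data that begins with "Obj", the magic bytes of an Avro container file, and the data sample in the question also contains "Objavro.schema". That would suggest the events are Avro rather than JSON text, which JsonSerDe cannot parse. This is an inference from the output shown, not something confirmed in the thread. A minimal Python sketch for checking the leading bytes of a file (the function name sniff_format is an assumption for illustration):

```python
def sniff_format(header: bytes) -> str:
    """Guess a file format from its leading bytes (illustrative only)."""
    if header.startswith(b"SEQ"):
        # Hadoop SequenceFiles start with "SEQ" followed by a version byte.
        return "hadoop-sequencefile"
    if header.startswith(b"Obj\x01"):
        # Avro object container files start with "Obj" followed by 0x01.
        return "avro-container"
    if header.lstrip()[:1] in (b"{", b"["):
        # Line-delimited JSON text normally starts with an object or array.
        return "json-text"
    return "unknown"

print(sniff_format(b"SEQ\x06!org.apache.hadoop.io.LongWritable"))
print(sniff_format(b"Obj\x01\x04\x16avro.schema"))
print(sniff_format(b'{"id": 1}'))
```

To try it on real data, copy one Flume output file to the local disk (for example with hdfs dfs -get) and pass the result of open(path, "rb").read(16) to the function.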

Re: Wrong file format

New Contributor
@Croczei, can you help me understand what this error means, please?
