Explorer
Posts: 12
Registered: ‎03-16-2016

Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

Hi there,

 

I ran into a small problem while trying Cloudera 5.4.2. I was following this article:

 

Apache Flume - Fetching Twitter Data

 

It fetches tweets using Flume and Twitter streaming for data analysis. Everything went fine at first: I created the Twitter app, created the directory on HDFS, configured Flume, started fetching data, and created a schema on top of the tweets.

 

Then, here is the problem. The Twitter streaming source converts tweets to Avro format and sends the Avro events to the downstream HDFS sink. When a Hive table backed by Avro loads the data, I get the error message "Avro block size is invalid or too large".

 

So what is an Avro block, and what is the limit on its size? Can I change it? What does this message actually mean? Is the whole file at fault, or only some records? If the Twitter streaming source had hit bad data, I would expect it to crash. And if the tweets were converted to Avro without errors, then the Avro data should be readable again, right?

 

I also tried avro-tools-1.7.7.jar:

 

 

java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232

{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}
{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`・ω・´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more

It is the same problem. I have googled it a lot and found no answers at all.
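
The only clue I can offer is how a number like -40 can arise. As far as I understand the Avro specification, each data block in a container file starts with two zigzag-varint longs (the object count, then the size in bytes of the serialized objects), and the Java reader rejects a size that is negative or larger than it can handle. Below is a minimal Python sketch of that decoding. The sample bytes are hypothetical and are not taken from my file; the sketch only shows that a stray newline followed by the container magic "Obj\x01" would decode to a block size of -40. I have not confirmed that this is what actually happened here.

```python
def read_long(buf, pos=0):
    """Decode one Avro long (zigzag varint) from buf starting at pos.

    Returns (value, next_pos).
    """
    shift = 0
    raw = 0
    while True:
        b = buf[pos]
        pos += 1
        raw |= (b & 0x7F) << shift   # low 7 bits carry data
        if not (b & 0x80):           # high bit clear means last byte
            break
        shift += 7
    # Undo zigzag: even raw values are non-negative, odd ones are negative.
    value = (raw >> 1) ^ -(raw & 1)
    return value, pos


def read_block_header(buf, pos=0):
    """Read the two longs that start a data block: object count, byte size."""
    count, pos = read_long(buf, pos)
    size, pos = read_long(buf, pos)
    return count, size, pos


# Hypothetical bytes (an assumption, not from my file): a newline followed
# by the container magic, read as if they were a block header.
sample = b"\nObj\x01"
count, size, _ = read_block_header(sample)
print(count, size)  # prints "5 -40"
```

If that guess is right, the reader hit something that is not a block header at the point where it expected one, which would point at the file layout rather than at a particular tweet.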

 

Could anyone give me a solution if you have met this problem too? Or could somebody give me a clue, if you fully understand Avro or what Twitter streaming does underneath?

 

It is a really interesting problem. Think about it.

 

 

Explorer
Posts: 9
Registered: ‎05-12-2016

Re: Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

Is it fixed, or did you freshly ingest new tweets? My problem is that I can now switch to the Cloudera Flume source, but the earlier data will be lost. Kindly help.
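
In case it helps anyone decide whether the earlier files are worth trying to recover, here is a small diagnostic sketch in Python. It assumes you have a local copy of one FlumeData file, and it only lists the byte offsets where the Avro container magic "Obj\x01" appears. A well-formed container file should have the magic once, at offset 0; several hits would suggest that more than one container was written into the same file, although the four bytes could also occur inside tweet data by coincidence. The function name is my own and is not part of any Avro tool.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def find_magic_offsets(data):
    """Return every byte offset at which the Avro magic occurs in data."""
    offsets = []
    start = 0
    while True:
        i = data.find(AVRO_MAGIC, start)
        if i == -1:
            return offsets
        offsets.append(i)
        start = i + 1


def scan_file(path):
    """Read a local file fully and report where the magic appears."""
    with open(path, "rb") as f:
        return find_magic_offsets(f.read())


# Example call, assuming a local copy of the file from the original post:
# print(scan_file("FlumeData.1458090051232"))
```

This does not repair anything; it only shows whether the file looks like a single container or several stuck together.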
