
How to turn Twitter data into Hive tables

Hello everyone,

I know there are already several topics here about this question, but none of them resolved it.

I have the following situation.

I was able to get Twitter data via Flume.

The files are already on HDFS.

But I cannot turn them into Hive tables.

The path of my files is:

/usr/root/flume/test (file: FlumeData.1505877214590)

The header of my file in HDFS is as follows:

-------------------------------------------------------------------------------

{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}

-------------------------------------------------------------------------------

I've tried the script below, as well as other variants of it, and nothing works.

-------------------------------------------------------------------------------

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    userr:STRUCT<screen_name:STRING, name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  userr STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/root/flume/test';

ERROR:

org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.ByteArrayInputStream@1b1f2ed4; line: 1, column: 2]

-------------------------------------------------------------------------------
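One thing worth noting about this error: the JSON SerDe is reporting that the very first character it reads is an 'O', not the start of a JSON value. The header shown above is an Avro record schema, and Avro data files begin with the bytes "Obj", which would explain the unexpected 'O'; a JSON SerDe cannot read such files no matter how the columns are declared. Below is a minimal sketch of an Avro-backed external table, assuming the files under /user/root/flume/test really are Avro containers carrying the schema from the header; the table name tweets_avro is only illustrative, and the schema literal is copied verbatim from the header above.

-------------------------------------------------------------------------------

-- Sketch only: an external table over Avro data files, letting Hive derive the
-- columns from the Avro schema instead of declaring them by hand.
-- Assumptions: the files are Avro containers and the location below is correct.
CREATE EXTERNAL TABLE tweets_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/root/flume/test'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}');

-- Quick sanity check once the table exists:
-- SELECT id, user_screen_name, text FROM tweets_avro LIMIT 5;

-------------------------------------------------------------------------------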

And for the other script:

-------------------------------------------------------------------------------

CREATE EXTERNAL TABLE tweets1 (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    userr:STRUCT<screen_name:STRING, name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  userr STRUCT<
    time_zone:STRING,
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/root/flume/test';

ERROR:

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Field named time_zone already exists

-------------------------------------------------------------------------------

Can someone help me with this?

7 REPLIES

New Contributor

In the second script, you have time_zone mentioned twice within the userr STRUCT.
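This second error is purely a DDL-time validation failure caused by the duplicate field name and is unrelated to the data itself. A minimal reproduction is sketched below; the table name duplicate_field_demo is illustrative.

-------------------------------------------------------------------------------

-- Illustrative only: a STRUCT that declares the same field twice is rejected
-- when the DDL is validated, before any data is read. It fails with the same
-- "Field named time_zone already exists" error quoted above.
CREATE TABLE duplicate_field_demo (
  userr STRUCT<time_zone:STRING, time_zone:STRING>
);

-------------------------------------------------------------------------------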

Hi,

When I remove one of the time_zone fields, I get the error below, which is the same as with the previous script. How can I handle this? This data comes straight from Twitter:

org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.ByteArrayInputStream@6f5728d5; line: 1, column: 2]

@Elton Freitas

Check out this tutorial on how to SQLize the tweet data. It is a two-part tutorial and includes a link to the first part as well.

https://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-quer...

My problem is in constructing the Hive table to receive the data from the Twitter file. Given the header I posted, can someone help me build the script that receives the data? I'm still learning this part and having difficulty with it.

/usr/root/flume/test (file: FlumeData.1505877214590)

The header of my file in HDFS is as follows:

-------------------------------------------------------------------------------

Objavro.schemaä {"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}
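The "Objavro.schema" prefix here is the signature of an Avro data file, which points the same way as the note earlier in the thread: the table needs the Avro SerDe rather than a JSON SerDe. As a follow-up sketch, assuming an Avro-backed table like the tweets_avro example above has been created, the flat fields from this schema can be queried directly or copied into a regular Hive table; the table name tweets_flat below is only illustrative.

-------------------------------------------------------------------------------

-- Sketch only: materialize the Avro-backed table into an ORC table for easier
-- querying. Assumes the tweets_avro table sketched earlier in the thread;
-- the column names come from the Avro schema shown in the header.
CREATE TABLE tweets_flat STORED AS ORC AS
SELECT
  id,
  created_at,
  user_screen_name,
  user_name,
  user_location,
  user_followers_count,
  user_friends_count,
  text,
  retweet_count,
  retweeted,
  source
FROM tweets_avro;

-------------------------------------------------------------------------------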

@Elton Freitas

Could you attach a sample data file? I could try it on my machine.

Hi,

Here is a sample of my files in HDFS.

Thanks.

flumedata1505877214592tmp.zip

New Contributor

I am facing the exact same problem. Please let me know if you solve this issue. @Elton Freitas

