Created 09-20-2017 12:55 PM
Hello everyone,
I know that there are already several topics here about my doubt, the point is that, none resolved.
I have the following situation.
I was able to get twitter data via flume.
They are already on HDFS.
But I can not turn them into Hive tables.
the path of my files is in:
/usr/root/flume/test File: FlumeData.1505877214590
the header of my file in hdfs is as follows: Header:
-------------------------------------------------------------------------------
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}
-------------------------------------------------------------------------------
I've tried the script below, or other variants of it and nothing.
-------------------------------------------------------------------------------
CREATE EXTERNAL TABLE tweets ( id BIGINT, created_at STRING, source STRING, favorited BOOLEAN, retweet_count INT, retweeted_status STRUCT< text:STRING, userr:STRUCT<screen_name:STRING,name:STRING>>, entities STRUCT< urls:ARRAY<STRUCT<expanded_url:STRING>>, user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>, hashtags:ARRAY<STRUCT<text:STRING>>>, text STRING, userr STRUCT< screen_name:STRING, name:STRING, friends_count:INT, followers_count:INT, statuses_count:INT, verified:BOOLEAN, utc_offset:INT, time_zone:STRING>, in_reply_to_screen_name STRING ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/flume/test';
ERROR:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.ByteArrayInputStream@1b1f2ed4; line: 1, column: 2]
-------------------------------------------------------------------------------
or for the other script
-------------------------------------------------------------------------------
CREATE EXTERNAL TABLE tweets1 ( id BIGINT, created_at STRING, source STRING, favorited BOOLEAN, retweet_count INT, retweeted_status STRUCT< text:STRING, userr:STRUCT<screen_name:STRING,name:STRING>>, entities STRUCT< urls:ARRAY<STRUCT<expanded_url:STRING>>, user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>, hashtags:ARRAY<STRUCT<text:STRING>>>, text STRING, userr STRUCT< time_zone: STRING, screen_name:STRING, name:STRING, friends_count:INT, followers_count:INT, statuses_count:INT, verified:BOOLEAN, utc_offset:INT, time_zone:STRING>, in_reply_to_screen_name STRING ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/flume/test';
ERROR:
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Field named time_zone already exists
-------------------------------------------------------------------------------
Can someone help me with this?
Created 09-20-2017 01:26 PM
in the second method, you have the time_zone within the STRUCT construct mentioned twice.
Created 09-20-2017 01:47 PM
Hi,
When I remove one of the time_zone columns, I get the error below, which is the same as the previous script, how can I handle this, and this data comes straight from twitter:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.ByteArrayInputStream@6f5728d5; line: 1, column: 2]
Created 09-20-2017 01:53 PM
Created 09-20-2017 02:14 PM
My problem is in the construction of the table in HIVE to receive the data from the twitter file, someone seeing the header that I put can help me in building the script that receives the data? I'm still walking in that part and having that difficulty.
/usr/root/flume/test File: FlumeData.1505877214590
the header of my file in hdfs is as follows: Header:
-------------------------------------------------------------------------------
Objavro.schemaä {"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}
Created 09-20-2017 02:16 PM
Could you attach sample data file? I could try in my machine.
Created 09-20-2017 02:44 PM
Created 05-21-2018 09:13 AM
I am facing the exact same problem , please let me know if you solve this issue. @Elton Freitas