Created 04-21-2016 11:16 AM
Hello there,
I am creating a table for storing json twitter data. I see different ways of using org.apache.hcatalog.data.JsonSerDe for it, but what would be the simpler process to use org.apache.hcatalog.data.JsonSerDe for this purpose.
Where should I get org.apache.hcatalog.data.JsonSerDe, and how to integrate it into my HIVE instance?
Thanks!
Created 04-21-2016 03:42 PM
considering you have the following data
{ "user": { "userlocation": "California, Santa Clara", "id": 222222, "name": "Hortonworks", "screenname": "hortonworks", "geoenabled": true }, "tweetmessage": "Learn more about #Spark in #HDP 2.3 with @Hortonworks founder @acmurthy in this video overview http://bit.ly/1gOyr9w #hadoop", "createddate": "2015-07-24T16:30:33", "geolocation": null}
your schema would look like so with JsonSerDe
CREATE EXTERNAL TABLE tweets ( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/'; SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name = 'Hortonworks' ORDER BY createddate;
Created 04-21-2016 03:42 PM
considering you have the following data
{ "user": { "userlocation": "California, Santa Clara", "id": 222222, "name": "Hortonworks", "screenname": "hortonworks", "geoenabled": true }, "tweetmessage": "Learn more about #Spark in #HDP 2.3 with @Hortonworks founder @acmurthy in this video overview http://bit.ly/1gOyr9w #hadoop", "createddate": "2015-07-24T16:30:33", "geolocation": null}
your schema would look like so with JsonSerDe
CREATE EXTERNAL TABLE tweets ( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/'; SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name = 'Hortonworks' ORDER BY createddate;
Created 04-22-2016 12:49 AM
Thanks Arten! quick question what does the folder '/user/root/' refers to? In my example I am specifying the /tmp/ folder where I store my twitter sample file.
CREATE TABLE TwitterExample_0( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/tmp/';
I am getting this error when I try a simple SELECT * FROM TwitterExample_0 LIMIT 10;
trace":"org.apache.ambari.view.hive.client.HiveErrorStatusException: H170 Unable to fetch results. java.io.IOException: org.apache.hadoop.security.AccessControlException: Permission denied: user=admin, access=READ_EXECUTE, inode=\"/tmp/ambari-qa\":ambari-qa:hdfs:drwx------\n\tat....
----------------------------------------------------
[hdfs@ip-172-31-34-25 ~]$ hadoop fs -chmod 777 /tmp/
[hdfs@ip-172-31-34-25 ~]$ hadoop fs -ls /tmp/
drwx------ - ambari-qa hdfs 0 2016-04-16 17:47 /tmp/ambari-qa
drwxr-xr-x - hdfs hdfs 0 2016-04-16 17:43 /tmp/entity-file-history
drwx-wx-wx - hive hdfs 0 2016-04-16 20:10 /tmp/hive
-rwxr-xr-x 3 hdfs hdfs 1902 2016-04-16 17:45 /tmp/id1fac3f21_date451616
-rwxr-xr-x 3 ambari-qa hdfs 1902 2016-04-16 17:50 /tmp/idtest.ambari-qa.1460843437.95.in
-rwxr-xr-x 3 ambari-qa hdfs 957 2016-04-16 17:50 /tmp/idtest.ambari-qa.1460843437.95.pig
-rwxrwxrwx 3 hdfs hdfs 2755 2016-04-21 20:35 /tmp/sample_twitter_data.txt
drwxr-xr-x - ambari-qa hdfs 0 2016-04-16 17:48 /tmp/tezsmokeinput
drwxr-xr-x - ambari-qa hdfs 0 2016-04-16 17:48 /tmp/tezsmokeoutput
Any help appreciated.
Thanks-
Created 04-22-2016 01:15 AM
Hey Artem. I was able to get it working by doing:
CREATE TABLE TwitterTest( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/sample_twitter_data.txt' OVERWRITE INTO TABLE TwitterTest;
However, when I try to do the query you did I get the following error:
Error while compiling statement: FAILED: ParseException line 1:39 cannot recognize input near 'user' '.' 'name' in selection target [ERROR_STATUS]
When I look at the table structure by doing a simple SELECT * FROM TwitterTest LIMIT 10; I see that all the fields inside the user struct are inside the same column (twittertest.user) ... As for the other fields (createddate, geolocation, twittermessage) they have their own columns.... Is that normal?
Thanks-
Created 04-22-2016 02:06 AM
1. /user/root is my own user directory, you're doing it in /tmp and most likely you will have permissions issues, which you are. Make sure user running query has rwx on directory in schema
2. if you're copying my schema one to one, then you need one more thing as "user" is a keyword in Hive 1.2. My full examples are here https://github.com/dbist/workshops/tree/master/hive/JSON
run this before create schema and again before select on the table
SET hive.support.sql11.reserved.keywords=false;
3. That depends on your JSON, some columns are part of inner structures and some are on their own. It is normal.
Created 04-22-2016 02:41 AM
Thanks Artem. Point 2 solved it! My JSON is identical to yours.
Created 04-22-2016 07:46 AM
It's a workaround but in general I would avoid reserved words like 'user' in schema because you always have to set that property otherwise. Rename your data to 'usr'. If this solves your question, please accept the best answer to close this thread.
Created 04-22-2016 02:05 AM
You can use lateral view & explode or inline keyword to get at the data in the struct column.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
https://www.qubole.com/resources/cheatsheet/hive-function-cheat-sheet/
cheers,
Andrew
Created 04-22-2016 02:42 AM
Thanks for the info, Andrew!