- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Creating a HIVE table with org.apache.hcatalog.data.JsonSerDe
- Labels:
-
Apache Hive
Created ‎04-21-2016 11:16 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello there,
I am creating a table for storing json twitter data. I see different ways of using org.apache.hcatalog.data.JsonSerDe for it, but what would be the simpler process to use org.apache.hcatalog.data.JsonSerDe for this purpose.
Where should I get org.apache.hcatalog.data.JsonSerDe, and how to integrate it into my HIVE instance?
Thanks!
Created ‎04-21-2016 03:42 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
considering you have the following data
{ "user": { "userlocation": "California, Santa Clara", "id": 222222, "name": "Hortonworks", "screenname": "hortonworks", "geoenabled": true }, "tweetmessage": "Learn more about #Spark in #HDP 2.3 with @Hortonworks founder @acmurthy in this video overview http://bit.ly/1gOyr9w #hadoop", "createddate": "2015-07-24T16:30:33", "geolocation": null}
your schema would look like so with JsonSerDe
CREATE EXTERNAL TABLE tweets ( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/'; SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name = 'Hortonworks' ORDER BY createddate;
Created ‎04-21-2016 03:42 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
considering you have the following data
{ "user": { "userlocation": "California, Santa Clara", "id": 222222, "name": "Hortonworks", "screenname": "hortonworks", "geoenabled": true }, "tweetmessage": "Learn more about #Spark in #HDP 2.3 with @Hortonworks founder @acmurthy in this video overview http://bit.ly/1gOyr9w #hadoop", "createddate": "2015-07-24T16:30:33", "geolocation": null}
your schema would look like so with JsonSerDe
CREATE EXTERNAL TABLE tweets ( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/'; SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name = 'Hortonworks' ORDER BY createddate;
Created ‎04-22-2016 12:49 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Arten! quick question what does the folder '/user/root/' refers to? In my example I am specifying the /tmp/ folder where I store my twitter sample file.
CREATE TABLE TwitterExample_0( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/tmp/';
I am getting this error when I try a simple SELECT * FROM TwitterExample_0 LIMIT 10;
trace":"org.apache.ambari.view.hive.client.HiveErrorStatusException: H170 Unable to fetch results. java.io.IOException: org.apache.hadoop.security.AccessControlException: Permission denied: user=admin, access=READ_EXECUTE, inode=\"/tmp/ambari-qa\":ambari-qa:hdfs:drwx------\n\tat....
----------------------------------------------------
[hdfs@ip-172-31-34-25 ~]$ hadoop fs -chmod 777 /tmp/
[hdfs@ip-172-31-34-25 ~]$ hadoop fs -ls /tmp/
drwx------ - ambari-qa hdfs 0 2016-04-16 17:47 /tmp/ambari-qa
drwxr-xr-x - hdfs hdfs 0 2016-04-16 17:43 /tmp/entity-file-history
drwx-wx-wx - hive hdfs 0 2016-04-16 20:10 /tmp/hive
-rwxr-xr-x 3 hdfs hdfs 1902 2016-04-16 17:45 /tmp/id1fac3f21_date451616
-rwxr-xr-x 3 ambari-qa hdfs 1902 2016-04-16 17:50 /tmp/idtest.ambari-qa.1460843437.95.in
-rwxr-xr-x 3 ambari-qa hdfs 957 2016-04-16 17:50 /tmp/idtest.ambari-qa.1460843437.95.pig
-rwxrwxrwx 3 hdfs hdfs 2755 2016-04-21 20:35 /tmp/sample_twitter_data.txt
drwxr-xr-x - ambari-qa hdfs 0 2016-04-16 17:48 /tmp/tezsmokeinput
drwxr-xr-x - ambari-qa hdfs 0 2016-04-16 17:48 /tmp/tezsmokeoutput
Any help appreciated.
Thanks-
Created ‎04-22-2016 01:15 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hey Artem. I was able to get it working by doing:
CREATE TABLE TwitterTest( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/sample_twitter_data.txt' OVERWRITE INTO TABLE TwitterTest;
However, when I try to do the query you did I get the following error:
Error while compiling statement: FAILED: ParseException line 1:39 cannot recognize input near 'user' '.' 'name' in selection target [ERROR_STATUS]
When I look at the table structure by doing a simple SELECT * FROM TwitterTest LIMIT 10; I see that all the fields inside the user struct are inside the same column (twittertest.user) ... As for the other fields (createddate, geolocation, twittermessage) they have their own columns.... Is that normal?
Thanks-
Created ‎04-22-2016 02:06 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
1. /user/root is my own user directory, you're doing it in /tmp and most likely you will have permissions issues, which you are. Make sure user running query has rwx on directory in schema
2. if you're copying my schema one to one, then you need one more thing as "user" is a keyword in Hive 1.2. My full examples are here https://github.com/dbist/workshops/tree/master/hive/JSON
run this before create schema and again before select on the table
SET hive.support.sql11.reserved.keywords=false;
3. That depends on your JSON, some columns are part of inner structures and some are on their own. It is normal.
Created ‎04-22-2016 02:41 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Artem. Point 2 solved it! My JSON is identical to yours.
Created ‎04-22-2016 07:46 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It's a workaround but in general I would avoid reserved words like 'user' in schema because you always have to set that property otherwise. Rename your data to 'usr'. If this solves your question, please accept the best answer to close this thread.
Created ‎04-22-2016 02:05 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You can use lateral view & explode or inline keyword to get at the data in the struct column.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
https://www.qubole.com/resources/cheatsheet/hive-function-cheat-sheet/
cheers,
Andrew
Created ‎04-22-2016 02:42 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the info, Andrew!
