Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Creating a HIVE table with org.apache.hcatalog.data.JsonSerDe

avatar
Expert Contributor

Hello there,

I am creating a table for storing json twitter data. I see different ways of using org.apache.hcatalog.data.JsonSerDe for it, but what would be the simpler process to use org.apache.hcatalog.data.JsonSerDe for this purpose.

Where should I get org.apache.hcatalog.data.JsonSerDe, and how to integrate it into my HIVE instance?

Thanks!

1 ACCEPTED SOLUTION

avatar
Master Mentor

considering you have the following data

{  "user": {    "userlocation": "California, Santa Clara",    "id": 222222,    "name": "Hortonworks",    "screenname": "hortonworks",    "geoenabled": true  },  "tweetmessage": "Learn more about #Spark in #HDP 2.3 with @Hortonworks founder @acmurthy in this video overview http://bit.ly/1gOyr9w  #hadoop",  "createddate": "2015-07-24T16:30:33",  "geolocation": null}

your schema would look like so with JsonSerDe

CREATE EXTERNAL TABLE tweets (  createddate string,  geolocation string,  tweetmessage string,  `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
	ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/';

SELECT DISTINCT tweetmessage, user.name, createddate 
	FROM tweets WHERE user.name = 'Hortonworks'
		ORDER BY createddate;

View solution in original post

8 REPLIES 8

avatar
Master Mentor

considering you have the following data

{  "user": {    "userlocation": "California, Santa Clara",    "id": 222222,    "name": "Hortonworks",    "screenname": "hortonworks",    "geoenabled": true  },  "tweetmessage": "Learn more about #Spark in #HDP 2.3 with @Hortonworks founder @acmurthy in this video overview http://bit.ly/1gOyr9w  #hadoop",  "createddate": "2015-07-24T16:30:33",  "geolocation": null}

your schema would look like so with JsonSerDe

CREATE EXTERNAL TABLE tweets (  createddate string,  geolocation string,  tweetmessage string,  `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
	ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/root/';

SELECT DISTINCT tweetmessage, user.name, createddate 
	FROM tweets WHERE user.name = 'Hortonworks'
		ORDER BY createddate;

avatar
Expert Contributor

Thanks Arten! quick question what does the folder '/user/root/' refers to? In my example I am specifying the /tmp/ folder where I store my twitter sample file.

CREATE TABLE TwitterExample_0( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/tmp/';

I am getting this error when I try a simple SELECT * FROM TwitterExample_0 LIMIT 10;

trace":"org.apache.ambari.view.hive.client.HiveErrorStatusException: H170 Unable to fetch results. java.io.IOException: org.apache.hadoop.security.AccessControlException: Permission denied: user=admin, access=READ_EXECUTE, inode=\"/tmp/ambari-qa\":ambari-qa:hdfs:drwx------\n\tat....

----------------------------------------------------

[hdfs@ip-172-31-34-25 ~]$ hadoop fs -chmod 777 /tmp/

[hdfs@ip-172-31-34-25 ~]$ hadoop fs -ls /tmp/

drwx------ - ambari-qa hdfs 0 2016-04-16 17:47 /tmp/ambari-qa

drwxr-xr-x - hdfs hdfs 0 2016-04-16 17:43 /tmp/entity-file-history

drwx-wx-wx - hive hdfs 0 2016-04-16 20:10 /tmp/hive

-rwxr-xr-x 3 hdfs hdfs 1902 2016-04-16 17:45 /tmp/id1fac3f21_date451616

-rwxr-xr-x 3 ambari-qa hdfs 1902 2016-04-16 17:50 /tmp/idtest.ambari-qa.1460843437.95.in

-rwxr-xr-x 3 ambari-qa hdfs 957 2016-04-16 17:50 /tmp/idtest.ambari-qa.1460843437.95.pig

-rwxrwxrwx 3 hdfs hdfs 2755 2016-04-21 20:35 /tmp/sample_twitter_data.txt

drwxr-xr-x - ambari-qa hdfs 0 2016-04-16 17:48 /tmp/tezsmokeinput

drwxr-xr-x - ambari-qa hdfs 0 2016-04-16 17:48 /tmp/tezsmokeoutput

Any help appreciated.

Thanks-

avatar
Expert Contributor

Hey Artem. I was able to get it working by doing:

CREATE TABLE TwitterTest( createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)

ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/sample_twitter_data.txt' OVERWRITE INTO TABLE TwitterTest;

However, when I try to do the query you did I get the following error:

Error while compiling statement: FAILED: ParseException line 1:39 cannot recognize input near 'user' '.' 'name' in selection target [ERROR_STATUS]

When I look at the table structure by doing a simple SELECT * FROM TwitterTest LIMIT 10; I see that all the fields inside the user struct are inside the same column (twittertest.user) ... As for the other fields (createddate, geolocation, twittermessage) they have their own columns.... Is that normal?

Thanks-

avatar
Master Mentor

1. /user/root is my own user directory, you're doing it in /tmp and most likely you will have permissions issues, which you are. Make sure user running query has rwx on directory in schema

2. if you're copying my schema one to one, then you need one more thing as "user" is a keyword in Hive 1.2. My full examples are here https://github.com/dbist/workshops/tree/master/hive/JSON

run this before create schema and again before select on the table

SET hive.support.sql11.reserved.keywords=false;

3. That depends on your JSON, some columns are part of inner structures and some are on their own. It is normal.

avatar
Expert Contributor

Thanks Artem. Point 2 solved it! My JSON is identical to yours.

avatar
Master Mentor

It's a workaround but in general I would avoid reserved words like 'user' in schema because you always have to set that property otherwise. Rename your data to 'usr'. If this solves your question, please accept the best answer to close this thread.

avatar
Contributor

You can use lateral view & explode or inline keyword to get at the data in the struct column.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

https://www.qubole.com/resources/cheatsheet/hive-function-cheat-sheet/

cheers,

Andrew

avatar
Expert Contributor

Thanks for the info, Andrew!