Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar
Expert Contributor

Very well know way to handle JSON is to use JSON SerDe which originated from hcatalog. There is another interesting rather unconventional method to handle JSON data in HIVE. json_tuple and LATERAL VIEW. Table here only has one column that loads JSON data as a single string.

json_tuple() is a User defined Table Function ( UDTF ) introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function.In the words of ASF "A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias".

Consider below is sample data line:

{"user":{"location":"","id":1514008171,"name":"Auzzie Jet","screenname":"metalheadgrunge","geoenabled":false},"tweetmessage":"Anthrax - Black - Sunglasses hell yah\n http://t.co/qCNjba57Dm","createddate":"2013-06-20T12:08:44","geolocation":null}

Table creation:

CREATE EXTERNAL table tweets{  tweet STRING }
LOAD data local inpath /pathtotxtfile/tweets.txt into table tweets

My goal is to obtain the tweetmessage of the user: Auzzie Jet. Using json_tuple and lateral view, below will be the query on the json tweet data.

select t2.name, t1.tweetmessage from tweets t LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage' ) t1  

LATERAL VIEW json_tuple(t1.user, 'name', 'location') t2 where t2.name="Auzzie Jet";

4788-json-tree.png

Imagine the tweets being parsed as a JSON tree in a LATERAL VIEW using the utility json_tuple.The first instance gives us a virtual table with with two columns user and tweetmessage. Similar process is iterated or repeated to extract data from the next level of the JSON tree. This time it gives us another virtual table with the columns name and location. And then we query to ask for tweetmessage by the particular user.

"The function json_tuple explodes a JSON node and return the child node values. The first argument is the node to explode. The rest of the arguments are the child node names." by Apache

39,417 Views
Comments
avatar
New Contributor
select t2.name, t1.tweetmessage from tweets t LATERAL VIEW json_tuple(t.tweet,'user','tweetmessage') t1
  • here you have mentioned t2.name and t1.message and in diagram you have mentioned as t1 user and t2 tweetmessage, can you or someone explain a little about this ?
avatar
New Contributor

13232-capture.png

raw data

13234-capture.png

any suggestion recieving error?

avatar
Contributor

Correction for syntax using beeline :

Formated data for referecence

{
  "user": {
    "location": "",
    "id": 1514008171,
    "name": "Auzzie Jet",
    "screenname": "metalheadgrunge",
    "geoenabled": false
  },
  "tweetmessage": "Anthrax - Black - Sunglasses hell yah\n http://t.co/qCNjba57Dm",
  "createddate": "2013-06-20T12:08:44",
  "geolocation": null
}
<br>

From here on I refer to

  • `user`, `tweetmessage`, `createddate`, `geolocation` => Level 1 fields
  • `location`, `id`, `name`, `screenname`, `geoenabled` => Level 2 fields

Corrected select query :

SELECT t2.name, t1.tweetmessage 
FROM tweets t 
	LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage' ) t1 as `user`, `tweetmessage` 
	LATERAL VIEW json_tuple(t1.`user`, 'name', 'location') t2 as `name`, `location` 
where t2.`name`="Auzzie Jet"
;

Other examples :

Select level 1 : `user`, `tweetmessage`, `createddate`, `geolocation`

SELECT t1.`user`, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t 
	LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1 
	as `user`, `tweetmessage`, `createddate`, `geolocation`
;

Select level 1 and 2 => Flatten everything

SELECT t2.location, t2.id, t2.name, t2.screenname, t2.geoenabled, t1.tweetmessage, t1.createddate, t1.geolocation
FROM tweets t 
	LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1 
	as `user`, `tweetmessage`, `createddate`, `geolocation`
	LATERAL VIEW json_tuple(t1.`user`, 'location', 'id', 'name', 'screenname', 'geoenabled' ) t2 
	as `location`, `id`, `name`, `screenname`, `geoenabled` 
;