Created on 06-07-2016 03:33 AM - edited 08-17-2019 12:11 PM
Very well know way to handle JSON is to use JSON SerDe which originated from hcatalog. There is another interesting rather unconventional method to handle JSON data in HIVE. json_tuple and LATERAL VIEW. Table here only has one column that loads JSON data as a single string.
json_tuple() is a User defined Table Function ( UDTF ) introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function.In the words of ASF "A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias".
Consider below is sample data line:
{"user":{"location":"","id":1514008171,"name":"Auzzie Jet","screenname":"metalheadgrunge","geoenabled":false},"tweetmessage":"Anthrax - Black - Sunglasses hell yah\n http://t.co/qCNjba57Dm","createddate":"2013-06-20T12:08:44","geolocation":null}
Table creation:
CREATE EXTERNAL table tweets{ tweet STRING }
LOAD data local inpath /pathtotxtfile/tweets.txt into table tweets
My goal is to obtain the tweetmessage of the user: Auzzie Jet. Using json_tuple and lateral view, below will be the query on the json tweet data.
select t2.name, t1.tweetmessage from tweets t LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage' ) t1 LATERAL VIEW json_tuple(t1.user, 'name', 'location') t2 where t2.name="Auzzie Jet";
Imagine the tweets being parsed as a JSON tree in a LATERAL VIEW using the utility json_tuple.The first instance gives us a virtual table with with two columns user and tweetmessage. Similar process is iterated or repeated to extract data from the next level of the JSON tree. This time it gives us another virtual table with the columns name and location. And then we query to ask for tweetmessage by the particular user.
"The function json_tuple explodes a JSON node and return the child node values. The first argument is the node to explode. The rest of the arguments are the child node names." by Apache
Created on 03-04-2017 05:21 PM
select t2.name, t1.tweetmessage from tweets t LATERAL VIEW json_tuple(t.tweet,'user','tweetmessage') t1
Created on 03-04-2017 05:40 PM - edited 08-17-2019 12:10 PM
raw data
any suggestion recieving error?
Created on 08-14-2017 10:55 AM
Correction for syntax using beeline :
Formated data for referecence
{ "user": { "location": "", "id": 1514008171, "name": "Auzzie Jet", "screenname": "metalheadgrunge", "geoenabled": false }, "tweetmessage": "Anthrax - Black - Sunglasses hell yah\n http://t.co/qCNjba57Dm", "createddate": "2013-06-20T12:08:44", "geolocation": null } <br>
From here on I refer to
Corrected select query :
SELECT t2.name, t1.tweetmessage FROM tweets t LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage' ) t1 as `user`, `tweetmessage` LATERAL VIEW json_tuple(t1.`user`, 'name', 'location') t2 as `name`, `location` where t2.`name`="Auzzie Jet" ;
Other examples :
Select level 1 : `user`, `tweetmessage`, `createddate`, `geolocation`
SELECT t1.`user`, t1.tweetmessage, t1.createddate, t1.geolocation FROM tweets t LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1 as `user`, `tweetmessage`, `createddate`, `geolocation` ;
Select level 1 and 2 => Flatten everything
SELECT t2.location, t2.id, t2.name, t2.screenname, t2.geoenabled, t1.tweetmessage, t1.createddate, t1.geolocation FROM tweets t LATERAL VIEW json_tuple(t.tweet, 'user', 'tweetmessage', 'createddate', 'geolocation' ) t1 as `user`, `tweetmessage`, `createddate`, `geolocation` LATERAL VIEW json_tuple(t1.`user`, 'location', 'id', 'name', 'screenname', 'geoenabled' ) t2 as `location`, `id`, `name`, `screenname`, `geoenabled` ;