Support Questions
Find answers, ask questions, and share your expertise

Hive table creation with nested json

New Contributor

I'm running the 2.6.5 sandbox and attempting to create a table in Hive with the following nested json which contains non-utf8 characters and carriage returns (^M)

{"user":{"userlocation":"Cinderford, Gloucestershire","id":230231618,"name":"Aimee","screenname":"Aimee_Cottle","geoenabled":true},"tweetmessage":"Gastroenteritis has pretty much killed me this week 😞 off work for a few days whilst I recover!","createddate":"2013-06-20T12:08:14","geolocation":null} {"user":{"userlocation":"Garena ID : NuraBlazee","id":635239939,"name":"Axyraf.","screenname":"Asyraf_Fauzi","geoenabled":false},"tweetmessage":"RT @abhigyantweets: Can't stop a natural disaster but so many lives in U'khand wouldn't have been lost if there was disaster preparedness. É","createddate":"2013-06-20T12:08:16","geolocation":null}{"user":{"userlocation":"Gemert,Netherlands","id":21418083,"name":"Ad van Steenbruggen","screenname":"torment00","geoenabled":true},"tweetmessage":"? Listening to 'The Constant' by 'Anthrax' from 'Worship Music","createddate":"2013-06-20T12:08:20","geolocation":null}

I'm assuming the carriage returns needs to be dealt with prior to ingest and so they were removed prior to my attempts. These are the methods I've tried so far.....

create table twitter (
  user struct<userlocation:string, id:string, name:string, screenname:string, geoenabled:string, tweetmessage:string, createdate:string, geolocation:string>,
  tweetmessage string, createddate string, geolocation string 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES("serialization.encoding"='UTF-8');
load data inpath '/tmp/twitter.txt' into table twitter;


CREATE TABLE twitter ( json string );
LOAD DATA LOCAL INPATH  '/tmp/twitter.txt' INTO TABLE twitter;

select get_json_object(twitter.json, '$.user.userlocation') as userlocation, 
       get_json_object(twitter.json, '$.user.screenname') as screenname,
       get_json_object(twitter.json, '$.user.tweetmessage') as tweetmessage,
       get_json_object(twitter.json, '$.geolocation') as geolocation
from twitter;

In the first method, all data is written to the user column and twittermessage, createdate and geolocation are null. In the second method, the result of the select statement is:

userlocation	screenname	tweetmessage	geolocation

Cinderford, Gloucestershire	Aimee_Cottle	null	null
null	null	null	null
Garena ID : NuraBlazee	Asyraf_Fauzi	null	null
null	null	null	null

I'm assuming at this stage I'm missing some fundamental principle associated with Hive ingestion but I'm stumped as to what it is. Any help would be gratefully appreciated.