Support Questions

Find answers, ask questions, and share your expertise

Querying JSON data using Hive

avatar

I am using https://github.com/rcongiu/Hive-JSON-Serde to query the JSON data via hive.

As part of testing, I am using an external table to query the JSON plain text file in HDFS.

i am able to query the data from hive using select, However when i do select * from JSON_EXTERNAL_TABLE limit 1, the output is an Invalid JSON though the message in HDFS is a valid JSON. Is this an expected one ?

1 ACCEPTED SOLUTION

avatar
Expert Contributor

JSON Serde works on 1 line at a time, with each line being parsed independently. Is your JSON encoded to fit into 1 line for each record in your stream?

View solution in original post

16 REPLIES 16

avatar
Expert Contributor

JSON Serde works on 1 line at a time, with each line being parsed independently. Is your JSON encoded to fit into 1 line for each record in your stream?

avatar

@gopal, Yes. The input JSON is all in single line.

Enabled debugging, to see if the select * on hive console is using a different Serde than the one defined in the Table. however it looks thats not the case,

Enabled the debug option in hive cli

serialization.lib=org.openx.data.jsonserde.JsonSerDe,

avatar
New Contributor

What if my Json record is spread across multiple lines?

avatar
Maybe use a different SERDE?
  1. ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' or
  2. ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' from https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master

avatar

@Ancil McBarnett, i am using the 2nd one here. Which is quite popular and well tested as it appers

avatar
Master Mentor

load the JSON as string with column name "json"

SELECT get_json_object(json, '$.id') AS ID,get_json_object(json, '$.person.last_name') AS LastName,get_json_object(json, '$.person.first_name') AS FirstName,get_json_object(json, '$.person.email') AS email,get_json_object(json, '$.person.location.address') AS Address,get_json_object(json, '$.person.location.city') AS City,get_json_object(json, '$.person.location.state') AS State,get_json_object(json, '$.person.location.zipcode') AS Zip,get_json_object(json, '$.person.text') AS Text,get_json_object(json, '$.person.url') AS URLFROM HBASE_JSON_TABLE;ORSELECT id, lastName, firstName, email, city, state, text, url FROM hbase_json_table A LATERAL VIEW json_tuple(A.json, 'id', 'person') B AS id, personLATERAL VIEW json_tuple(person, 'last_name', 'first_name', 'email', 'text', 'url', 'location') C as lastName, firstName, email, text, url, locLATERAL VIEW json_tuple(loc, 'city', 'state') D AS city, state;

OR

CREATE EXTERNAL TABLE json_serde_table (
  id string,
  person struct<email:string, first_name:string, last_name:string, location:struct<address:string, city:string, state:string, zipcode:string>, text:string, url:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/tmp/json/';
SELECT id, person.first_name, person.last_name, person.email, 
person.location.address, person.location.city, person.location.state, 
person.location.zipcode, person.text, person.url
FROM json_serde_table LIMIT 5;

avatar

@Artem Ervits. Have used the last create external statement . If i do select * from json_serde_table ; i get a JSON output.but not in a valid format.

avatar
Master Mentor

can you paste the output

avatar
Super Collaborator

hi Artem

I used your method but I am getting error can you help please?

CREATE EXTERNAL TABLE tweetdata3(created_at STRING,
text STRING,
  person STRUCT< 
     screen_name:STRING,
     name:STRING,
     locations:STRING,
     description:STRING,
     created_at:STRING,
     followers_count:INT,
     url:STRING>
) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'  location '/user/flume/tweets';

hive>
    >
    > select person.name,person.locations, person.created_at, text from tweetdata3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.ByteArrayInputStream@2bc779ed; line: 1, column: 2]
Time taken: 0.274 seconds
hive>