Support Questions

Find answers, ask questions, and share your expertise

json and hive

Expert Contributor

I usually have deeply nested JSON data and I somehow do not find JSON serde very helpful in hive.

Having to create the table manually and defining columns looks cumbersome.

I somewhere read AVRO files can be used for storing JSON data in hive. I have tried changing JSON to data tables and saving as a file through R, but nested frames are converted to lists that way and that data can't be analysed in hive.

What is really the best way to analyse JSON data in R. All the ways so far look like hacks and nothing really appeals much for JSON and hive integration.help please

Could I please be guided with some ways to do it?

1 REPLY 1

Cloudera Employee

You may want to take a look at hive-json to extract the json schema (from: https://github.com/hortonworks/hive-json)

i've run:

bin/find-json-schema ~/projects/structor/modules/druid_overlord/files/TimeSeriesQuery.json

and creating the table, but set jsonserde rowformat:

create table tbl (
  aggregations array <struct <
      fieldName: string,
      name: string,
      type: string>>,
  dataSource string,
  granularity string,
  intervals array <string>,
  queryType string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
;

load the data:

LOAD DATA LOCAL INPATH "/home/kirk/projects/structor/modules/druid_overlord/files/TimeSeriesQuery.json" INTO TABLE tbl;

switch to orc format (this is a columnar format, it will be better this way):

create table tbl_orc (
  aggregations array <struct <
      fieldName: string,
      name: string,
      type: string>>,
  dataSource string,
  granularity string,
  intervals array <string>,
  queryType string
)
STORED AS ORC
;
insert overwrite table tbl_orc select * from tbl;
select count(*) from tbl_orc where queryType = 'timeseries';

start querying tbl_orc...

i'm only aware of it as of today...but it works like a charm 🙂