Expert Contributor
Posts: 139
Registered: ‎07-21-2014

Kite dataset for nested JSON document

I have a nested JSON document of the following form that I would like to transform and store in a dataset created by Kite:

 

{
 "uid": 29153333,
 "somefield": "somevalue",
 "options": [
   {
     "item1_lvl2": "a",
     "item2_lvl2": [
       {
         "item1_lvl3": "x1",
         "item2_lvl3": "y1"
       },
       {
         "item1_lvl3": "x2",
         "item2_lvl3": "y2"
       }
     ]
   }
 ]
}

How does one go about storing and querying these types of documents?

 

I'm planning to ingest using Flume's Kite dataset sink, relying on the extractJsonPaths and toAvro morphline commands to transform the JSON documents. Is creating a dataset from an Avro schema that uses complex types supported?

 

Thanks!

Explorer
Posts: 11
Registered: ‎06-05-2015

Re: Kite dataset for nested JSON document

buntu: did you ever come up with a solution for this?
Expert Contributor
Posts: 139
Registered: ‎07-21-2014

Re: Kite dataset for nested JSON document

Currently I'm relying on the Kite CLI to generate the Avro schema and then run a JSON import, providing the JSON file along with the generated schema:

   http://kitesdk.org/docs/current/cli-reference.html#json-schema

   http://kitesdk.org/docs/current/cli-reference.html#json-import
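
For the sample document in the first post, an Avro schema using nested complex types might look roughly like the following. This is a hand-written sketch, not actual output of the Kite CLI; the record names (Document, Option, OptionItem) are assumptions chosen for illustration, and uid is typed as long to be safe.

```json
{
  "type": "record",
  "name": "Document",
  "fields": [
    {"name": "uid", "type": "long"},
    {"name": "somefield", "type": "string"},
    {"name": "options", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Option",
        "fields": [
          {"name": "item1_lvl2", "type": "string"},
          {"name": "item2_lvl2", "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "OptionItem",
              "fields": [
                {"name": "item1_lvl3", "type": "string"},
                {"name": "item2_lvl3", "type": "string"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
```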

 

Let me know if there are alternative ways to handle ingestion and/or querying of the dataset.

 

 

Explorer
Posts: 11
Registered: ‎06-05-2015

Re: Kite dataset for nested JSON document

Specifically, we have JSON with nested records (and thus Avro schemas that reflect that nesting), and we can't figure out how to use readJson + extractJsonPaths + toAvro + writeAvroToByteArray to process this data, because toAvro appears NOT to support nested records.

 

Expert Contributor
Posts: 139
Registered: ‎07-21-2014

Re: Kite dataset for nested JSON document

Yes, it doesn't seem to support nested JSON. As a workaround, the ingestion process writes the raw JSON records to HDFS, and I schedule a periodic job to import those files into the Kite dataset.

 

A few other options:

- Read the JSON using Apache Spark, write it out as Parquet, and operate on that data

- Apache NiFi is another option that was suggested, but I haven't had a chance to play around with it

Explorer
Posts: 11
Registered: ‎06-05-2015

Re: Kite dataset for nested JSON document

Thanks again for the responses.
Tom
New Contributor
Posts: 2
Registered: ‎04-04-2018

Re: Kite dataset for nested JSON document

This is my solution. I hope it is helpful.

morphlines: [
  {
    id: convertJsonToAvro
    importCommands: [ "org.kitesdk.**" ]
    commands: [
      # read the JSON blob
      { readJson: {} }
	  
      # Parse the JSON body with Jackson and copy every top-level key into the
      # record; nested objects and arrays stay as Map/List values for toAvro.
      {
        java {
          imports : """
            import com.fasterxml.jackson.databind.ObjectMapper;
            import org.kitesdk.morphline.base.Fields;
            import java.io.IOException;
            import java.util.Map;
          """

          code : """
            String jsonStr = record.getFirstValue(Fields.ATTACHMENT_BODY).toString();
            ObjectMapper mapper = new ObjectMapper();
            Map<String, Object> map;
            try {
              map = (Map<String, Object>) mapper.readValue(jsonStr, Map.class);
            } catch (IOException e) {
              // Reject the record instead of hitting a NullPointerException below.
              logger.warn("Could not parse JSON body", e);
              return false;
            }
            for (Map.Entry<String, Object> entry : map.entrySet()) {
              record.put(entry.getKey(), entry.getValue());
            }
            return child.process(record);
          """
        }
      }
      
      # convert the extracted fields to an avro object
      # described by the schema in this field
      { toAvro {
        schemaFile: /etc/flume/conf/a1/like_user_event_realtime.avsc
      } }
      
      #{ logInfo { format : "loginfo: {}", args : ["@{}"] } }
  
      # serialize the object as avro
      { writeAvroToByteArray: {
        format: containerlessBinary
      } }
  
    ]
  }
]
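
To make the idea behind the java command easier to follow, here is a minimal standalone sketch of the same copy step in plain Java. It is an illustration under assumptions, not part of the morphline: the class TopLevelFieldCopy and both of its methods are hypothetical, the sample document is built by hand instead of being parsed with Jackson, and a plain Map stands in for the morphline Record (which can hold several values per field).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Kite or the morphline above.
class TopLevelFieldCopy {

    // Mimics the loop in the java command: every top-level key becomes a
    // field of the flat record, and its value is stored untouched.
    static Map<String, Object> copyTopLevelFields(Map<String, Object> parsed) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : parsed.entrySet()) {
            record.put(entry.getKey(), entry.getValue());
        }
        return record;
    }

    // Hand-built equivalent of the sample document from the first post,
    // standing in for what a JSON parser would return.
    static Map<String, Object> sampleDocument() {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("item1_lvl3", "x1");
        first.put("item2_lvl3", "y1");
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("item1_lvl3", "x2");
        second.put("item2_lvl3", "y2");

        List<Object> level3 = new ArrayList<>();
        level3.add(first);
        level3.add(second);

        Map<String, Object> option = new LinkedHashMap<>();
        option.put("item1_lvl2", "a");
        option.put("item2_lvl2", level3);

        List<Object> options = new ArrayList<>();
        options.add(option);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("uid", 29153333L);
        doc.put("somefield", "somevalue");
        doc.put("options", options);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> record = copyTopLevelFields(sampleDocument());
        System.out.println(record.keySet());                        // [uid, somefield, options]
        System.out.println(record.get("options") instanceof List);  // true
    }
}
```

The point of the sketch is that only the top level is flattened into fields; the nested "options" value is handed on as a List of Maps, which is the shape the toAvro step in the morphline then has to match against the nested schema.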