
Dynamically posting JSON documents to Solr


I am using morphline, Flume, and Solr (managed by Cloudera Manager) to run an ETL process. The Solr collection was created as schemaless, and it works well with plain text files.

 

However, when I post a JSON document to Solr, its fields are not extracted dynamically. The resulting document may look like this:

 

{
  "_attachment_mimetype": [
    "json/java+memory"
  ],
  "id": "90377be2-845b-4c68-b2fc-77e1242ab323",
  "timestamp": [
    1498735506307
  ],
  "_attachment_body": [
    "[]",
    "[[], []]",
    "[]",
    "[]",
    "[]",
    "[]",
    "[]",
    "[]",
    "[]"
  ],
  "_version_": 1571538085725864000
}

So I added a java command to the morphline conf file, followed by a removeFields command:

{
  java {
    imports : """
      import com.fasterxml.jackson.databind.JsonNode;
      import java.util.Map;
      import java.util.Iterator;
      import org.kitesdk.morphline.base.Fields;
    """
    code : """
      JsonNode rootNode = (JsonNode) record.getFirstValue(Fields.ATTACHMENT_BODY);
      Iterator<Map.Entry<String, JsonNode>> fields = rootNode.fields();
      while(fields.hasNext()) {
        Map.Entry<String, JsonNode> entry = fields.next();
        String key = entry.getKey();
        JsonNode value = entry.getValue();
        record.put(key, value);
      }
      logger.info("[Morphline][Info] Record: {}", record);
      return child.process(record);
    """
  }
}
{ removeFields { blacklist : ["regex:_attachment_.*"] } }

I can see the extracted key/value data in the log file:

 

[Morphline][Info] Record: {_attachment_body=[{"type":"drama","name":"something about nothing","comment1":"boring!"}], _attachment_mimetype=[json/java+memory], comment1=["boring!"], id=[948581e1-667b-4788-b3b2-1ab8c1ac2c00], name=["something about nothing"], type=["drama"]}

But the document posted to the Solr collection doesn't contain the extracted key/value data. It actually looks like this:

 

{
  "id": "948581e1-667b-4788-b3b2-1ab8c1ac2c00",
  "timestamp": [
    1498742262454
  ],
  "_version_": 1571545170021712000
}

Only the id and timestamp fields generated by morphline are posted. How can I post an arbitrary JSON document with its fields dynamically extracted? I need to post arbitrary JSON documents to the collection and search them, which is why I created the collection as schemaless.
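
To make the intent concrete, here is a minimal standalone sketch of the flattening I am after. It is only an illustration under assumptions: plain java.util maps stand in for Jackson's JsonNode and for the morphline Record, and the class and method names (FlattenSketch, flatten) are hypothetical, not part of the Kite SDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlattenSketch {

    // Assumption: a Map<String, List<Object>> stands in for a morphline Record,
    // where every field name maps to a list of values.
    static Map<String, List<Object>> flatten(Map<String, Object> json) {
        Map<String, List<Object>> record = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : json.entrySet()) {
            List<Object> values =
                record.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
            Object value = entry.getValue();
            if (value instanceof List) {
                // A JSON array becomes several values of the same field.
                values.addAll((List<?>) value);
            } else {
                // A scalar becomes a single plain value (no JSON quoting).
                values.add(value);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        // Same sample document as in the log line above.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("type", "drama");
        doc.put("name", "something about nothing");
        doc.put("comment1", "boring!");
        System.out.println(flatten(doc));
    }
}
```

In other words, every top-level key of the incoming document should end up as its own field in Solr, without my having to list the field names in advance.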

 

Thanks

 

Environment: Solr 4.10.3, morphline (Kite SDK 1.1.0), Cloudera Manager 5.11.0

Full morphline conf file content:

 

morphlines : [
  {

    id : morphline_json
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      { readJson: {} }
      { generateUUID { field : id } }
      {
        java {
          imports : """
            import com.fasterxml.jackson.databind.JsonNode;
            import java.util.Map;
            import java.util.Iterator;
            import org.kitesdk.morphline.base.Fields;
          """
          code : """
            JsonNode rootNode = (JsonNode) record.getFirstValue(Fields.ATTACHMENT_BODY);
            Iterator<Map.Entry<String, JsonNode>> fields = rootNode.fields();
            while(fields.hasNext()) {
              Map.Entry<String, JsonNode> entry = fields.next();
              String key = entry.getKey();
              JsonNode value = entry.getValue();
              record.put(key, value);
            }
            logger.info("[Morphline][Info] Record: {}", record);
            return child.process(record);
          """
        }
      }
      { addCurrentTime {
          field: timestamp
          preserveExisting: true
        }
      }
      { removeFields { blacklist : ["regex:_attachment_.*"] } }
      { logInfo { format : "[Morphline][Info] output record: {}", args : ["@{}"] } }
      { loadSolr { solrLocator {
          collection : mycollection
          zkHost : "127.0.0.1:2181/solr"
        }
      }}
    ]
  }
]

 

 
