Explorer
Posts: 25
Registered: ‎01-30-2014
Accepted Solution

index nested structure

I have an avro input source, going through a morphline into Solr. For example the following structure:

 

{
    "username" : "alex",
    "date" : "21-08-2014",
    "attachments" : [
        {
            "documents" : [
                {
                    "title" : "test",
                    "tags" : [ "a", "b", "c" ]
                },
                {
                    "optional1" : "test2",
                    "title" : "test2"
                }
            ],
            "source" : "school"
        }
    ]
}

 

I can extract fields with extractAvroPaths, like so:

 

...

{ extractAvroPaths {
    flatten : true
    paths : {
      /my_user : /username       # this works fine
      /my_attachments : "/attachments[]"
      /my_documents : "/attachments[]/documents[]"
    }
  }
}

.....

 

The problem is that /my_attachments and /my_documents now contain raw JSON/Avro structures instead of single field values. How would I go about 'unwrapping' these fields so that they all become part of one Solr document, while still retaining the context of the document they belong to?

 


Re: index nested structure

To answer my own question: no, this is not possible at this time. Solr has only supported nested documents since 4.5, and CDH 5.1 is at 4.4 right now. Even if this becomes available in a future release, the question remains whether it can be easily integrated and used with Kite's morphlines.
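
For anyone stuck on Solr 4.4 who can live without true nesting, one possible workaround is to denormalize before indexing: emit one flat record per nested document and copy the parent fields onto each. The sketch below only illustrates that idea in plain Python. It is not something used in this thread, and the function name flatten_record, the doc_ field prefix and the id scheme are all hypothetical. It also assumes attachments is an array of objects, as the "/attachments[]" path suggests.

```python
import json


def flatten_record(record):
    """Yield one flat dict per nested document, copying parent fields onto each."""
    for a_idx, attachment in enumerate(record.get("attachments", [])):
        for d_idx, doc in enumerate(attachment.get("documents", [])):
            flat = {
                # Hypothetical id scheme: parent key plus positions in the arrays.
                "id": f"{record['username']}-{a_idx}-{d_idx}",
                "my_user": record["username"],
                "date": record["date"],
                "source": attachment.get("source"),
            }
            # Prefix document fields so they cannot collide with parent fields.
            for key, value in doc.items():
                flat[f"doc_{key}"] = value
            yield flat


record = {
    "username": "alex",
    "date": "21-08-2014",
    "attachments": [
        {
            "documents": [
                {"title": "test", "tags": ["a", "b", "c"]},
                {"optional1": "test2", "title": "test2"},
            ],
            "source": "school",
        }
    ],
}

for flat in flatten_record(record):
    print(json.dumps(flat, sort_keys=True))
```

The trade-off is duplication: the parent fields are repeated on every flat record, and a query for the user returns one hit per document rather than one hit per user.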

 

To get the job done I had to switch to ElasticSearch, which does support nested documents, using Flume's ElasticSearchSink. Flume's official documentation on ElasticSearch and Avro is lacking, and I had to patch the Flume code to get it working with the UTF-8 charset and JSON, but it works nonetheless. I hope I can move this dataflow to the better-integrated SolrCloud in the future.

 
