New Contributor
Posts: 2
Registered: 02-18-2014

Morphlines, attempting to index avro-serialized records from HBase into solrcloud

Greetings,

 

I am using CDH4.4. I have an app currently running that serializes records into a single HBase column via Avro. I am in the process of moving my current Solr index of this table into SolrCloud, so I'm testing the MapReduceIndexerTool to do bulk indexing of the whole table. I have a very simple morphlines file which currently uses "extractHBaseCells" to read records from HBase.

 

I set this up as a tracer proof of concept, only indexing the rowkey => id and stuffing the Avro blob into another field, just to verify that I could get data from HBase over to my collection in SolrCloud, and that works. But I'd like to parse the Avro and put those values into their own fields on the SolrDocuments before submitting them to SolrCloud, and it would seem that the nature of "extractHBaseCells" prevents this. If there were an HBase reader command that emitted more general output that could then flow into the Avro commands in morphlines, I'm confident I could solve my own problem.

 

So are there any known workarounds for parsing Avro that has been stored in HBase, or perhaps additional morphline commands that could address this?

 

Thanks,

Rob

Cloudera Employee
Posts: 146
Registered: 08-21-2013

Re: Morphlines, attempting to index avro-serialized records from HBase into solrcloud

Here is an example: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Gu...

 

Simply uncomment the following commands in the example:

 

      # for avro use with type : "byte[]" in extractHBaseCells mapping above
      #{ readAvroContainer {} }
      #{
      #  extractAvroPaths {
      #    paths : {
      #      data : /user_name
      #    }
      #  }
      #}

 

Now the extractHBaseCells command pipes data into the readAvroContainer command, which in turn pipes into the extractAvroPaths command, which in turn pipes data back to the Lily HBase Indexer, which sends the data on to SolrCloud.
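
 

Uncommented and combined with an extractHBaseCells mapping, the whole pipeline looks roughly like this. This is a sketch based on the example above, not your exact config: the "data:item" column and the /user_name path come from the example's schema. One detail worth noting is that readAvroContainer parses the bytes it finds in the record's _attachment_body field, so the mapping writes the cell value there:

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              # column family:qualifier from the example, not your schema
              inputColumn : "data:item"
              # readAvroContainer consumes the bytes in _attachment_body
              outputField : "_attachment_body"
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      # parse the Avro container stored in the cell and emit one record per Avro object
      { readAvroContainer {} }

      # copy the named Avro fields into top-level record fields
      {
        extractAvroPaths {
          paths : {
            data : /user_name
          }
        }
      }
    ]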

New Contributor
Posts: 2
Registered: 02-18-2014

Re: Morphlines, attempting to index avro-serialized records from HBase into solrcloud

Thanks. I dug into the code; I had just missed the special case for byte streams.

 

However, I am still experiencing an issue. Each record fails processing:

 

WARN morphline.LocalMorphlineResultToSolrMapper: Morphline /home/rlong3/feature_indexing/feature_morphline.conf@null failed to process record: {feature_rowkey=[[B@53e6978d]}

I'm running MapReduceIndexerTool with the --dry-run option for now and seeing these warnings on stdout. I'm at a loss as to what this result means. When I have the Avro commands commented out, I am able to process every record; I just can't get the values I need out of the Avro.

 

Here's the relevant parts of my morphlines file:

 

commands : [
  {
    extractHBaseCells {
      mappings : [
        {
          inputColumn : "F:feature"
          outputField : "feature_rowkey"
          type : "byte[]"
          source : value
        }
      ]
    }
  }

  # Parse Avro container file and emit a record for each Avro object
  { readAvroContainer {} }

  {
    extractAvroPaths {
      flatten : false
      paths : {
        source : /source
        #text : /text
      }
    }
  }

 Where "source" is field in my solr schema and in my avro record.  I realize there's not a whole lot to go on here. If you see something obvious, please let me know.

New Contributor
Posts: 5
Registered: 09-20-2017

Re: Morphlines, attempting to index avro-serialized records from HBase into solrcloud

How do you create an Avro object in a variable? I know how to create the object as it's written to a file, but I can't figure out how to get an Avro object into a variable that I can then put into an HBase column.

 

Thanks
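
 

For what it's worth, here is one way to do that in plain Java with Avro's GenericRecord API. This is a sketch, not from this thread: the schema, field names, and column family/qualifier are made up. The key point is to write into a ByteArrayOutputStream instead of a file. Using DataFileWriter keeps the Avro container format (header plus embedded schema), which is what the readAvroContainer morphline command expects to parse later:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvroToHBase {

      // Hypothetical schema; substitute your own.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Feature\"," +
          "\"fields\":[{\"name\":\"source\",\"type\":\"string\"}]}");

      public static void main(String[] args) throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("source", "some value");

        // Serialize into an in-memory byte array instead of a file.
        // DataFileWriter writes the Avro container format, the same
        // thing you'd get by writing to disk and reading the file back.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(SCHEMA));
        writer.create(SCHEMA, out);
        writer.append(record);
        writer.close();
        byte[] avroBytes = out.toByteArray();

        // Put the bytes into an HBase cell (family/qualifier assumed).
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("F"), Bytes.toBytes("feature"), avroBytes);
        // table.put(put);  // using an org.apache.hadoop.hbase.client.HTable
      }
    }

If you serialize with a bare BinaryEncoder instead, you get a headerless datum that readAvroContainer cannot parse.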
