
Indexing Avro documents with Lily

ortizg — Fri, 16 Sep 2022 09:31:23 GMT

I'm trying to follow a tutorial from Cloudera (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_hbase_batch_indexer.html).

I have code that inserts Avro-format objects into HBase, and I want to index them into Solr, but nothing gets indexed.

I have been looking at the logs:

15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={0Name178721/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 00:45:00 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 0 adds and 0 deletes
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={1Name134339/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}

So I'm reading them, but I don't know why nothing is indexed in Solr. I guess that my morphline.conf is wrong.

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:avroUser"
              outputField : "_attachment_body"
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      # for Avro, use type : "byte[]" in the extractHBaseCells mapping above
      { readAvroContainer {} }
      {
        extractAvroPaths {
          flatten : true
          paths : {
            name : /name
          }
        }
      }
      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

I wasn't sure whether I needed an "_attachment_body" field in Solr, but it seems it isn't necessary, so I guess that readAvroContainer or extractAvroPaths is wrong. I have a "name" field in Solr, and my avroUser has a "name" field as well.

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

Re: Indexing Avro documents with Lily

whosch — Fri, 12 Jun 2015 10:50:06 GMT

Maybe readAvroContainer fails because your Avro data isn't stored in an Avro container, in which case use the readAvro command instead of readAvroContainer.
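
As a rough sketch of that swap, assuming the HBase cell holds a single binary-encoded Avro record written with the User schema from the question: readAvro cannot read the schema from the data the way readAvroContainer does, so the writer schema has to be supplied. The path /home/cloudera/user.avsc below is a made-up placeholder, and the parameter name should be checked against the morphlines reference guide for your Kite version.

```
# replaces { readAvroContainer {} } in the commands list
{
  readAvro {
    # hypothetical location of the User schema (.avsc) used when writing the cells
    writerSchemaFile : /home/cloudera/user.avsc
  }
}
```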

In any case, to automatically print diagnostic information such as the content of records as they pass through the morphline commands, consider enabling TRACE log level, for example by adding the following line to your log4j.properties file:

log4j.logger.org.kitesdk.morphline=TRACE

See http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#logTrace

This will also print which command failed where.

BTW, questions specific to Cloudera Search are best directed to search-user@cloudera.org via http://groups.google.com/a/cloudera.org/group/search-user

Wolfgang

Re: Indexing Avro documents with Lily

ortizg — Fri, 12 Jun 2015 12:17:31 GMT

Thanks. I checked how the Avro was generated and found a mistake: the Avro
objects were empty, containing just the schema. I fixed it, and that error
seems to be gone.

I used the TRACE level to see what is happening and reviewed the MapReduce
log again, and I got this error when it tries to index a document into Solr:

2015-06-12 05:06:49,843 INFO [IPC Server handler 10 on 45052] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1434101650719_0008_r_000000_3: Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:290)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=0Name115457] unknown field '_attachment_mimetype'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at .....

I have been reading
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
which mentions the _attachment_mimetype field. Why is it trying to index
this field into Solr?

I also ran the same configuration with --dry-run:

hadoop --config /etc/hadoop/conf jar \
  /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file /home/cloudera/morphline-hbase-mapper.xml \
  --zk-host 127.0.0.1/solr --collection hbase-collection1 \
  --dry-run --log4j /home/cloudera/log4j.properties

And it looks like it works fine:

dryRun: SolrInputDocument(fields: [id=4Name249228, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name249228", "favorite_number": 41, "favorite_color": "Red27"}], name=[Red27]])
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
16368 [main] DEBUG com.ngdata.hbaseindexer.indexer.Indexer$RowBasedIndexer - Indexer _default_ will send to Solr 1 adds and 0 deletes
15/06/12 05:12:07 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 1 adds and 0 deletes
dryRun: SolrInputDocument(fields: [id=4Name341784, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name341784", "favorite_number": 1, "favorite_color": "Red22"}], name=[Red22]])
15/06/12 05:12:07 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14de71e21730082

What I don't know is how to tell Solr to skip _attachment_mimetype and not
index that field.

I'll post my next problem about Solr and Lily in Cloudera Search. Thanks.

Re: Indexing Avro documents with Lily

whosch — Fri, 12 Jun 2015 12:54:06 GMT

Try using the sanitizeUnknownSolrFields command, per http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#sanitizeUnknownSolrFields
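
A minimal sketch of where that command might go, assuming the collection name and ZooKeeper address from the command line earlier in the thread (adjust both, and add the ZooKeeper port if your setup needs one). It is placed after extractAvroPaths so that fields missing from the Solr schema, such as _attachment_mimetype and _attachment_body, are removed before the document is sent:

```
# top of morphline.conf, outside the morphlines list
SOLR_LOCATOR : {
  collection : hbase-collection1   # assumed from --collection
  zkHost : "127.0.0.1/solr"        # assumed from --zk-host
}

# inside commands, after extractAvroPaths
{
  sanitizeUnknownSolrFields {
    # lets the command look up which fields the Solr schema defines
    solrLocator : ${SOLR_LOCATOR}
  }
}
```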

Wolfgang.