Indexing Avro documents with Lily

Explorer

I'm trying to follow a tutorial from Cloudera (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_hbase_batch_inde...).

I have code that inserts Avro-formatted objects into HBase, and I want to index them into Solr, but nothing gets indexed.

I have been looking at the logs:

15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={0Name178721/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 00:45:00 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 0 adds and 0 deletes
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={1Name134339/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}

So the rows are being read, but I don't know why nothing is indexed in Solr. I guess that my morphline.conf is wrong.

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:avroUser"
              outputField : "_attachment_body"
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      # For Avro, use type : "byte[]" in the extractHBaseCells mapping above
      { readAvroContainer {} }
      {
        extractAvroPaths {
          flatten : true
          paths : {
            name : /name
          }
        }
      }
      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

I wasn't sure whether Solr needs an "_attachment_body" field, but it seems that it isn't necessary, so I guess that readAvroContainer or extractAvroPaths is configured incorrectly. I have a "name" field in Solr, and my avroUser record has a "name" field as well:

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

Super Collaborator
Maybe readAvroContainer fails because your Avro data isn't stored in an Avro container, in which case use the readAvro command instead of readAvroContainer.
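
For reference, a minimal sketch of that swap, assuming the HBase cells hold raw binary-encoded Avro records (no container header) written with the User schema from the question. The schema file path is a hypothetical placeholder, and the readAvro parameters should be checked against the morphlines reference guide for your version:

```
# Sketch only: this command would take the place of { readAvroContainer {} }
# in the morphline's commands list.
{
  readAvro {
    # Raw Avro records carry no embedded schema, so the writer schema
    # has to be supplied. Hypothetical path; use your own .avsc file.
    writerSchemaFile : /home/cloudera/user.avsc
  }
}
```

The extractAvroPaths command that follows it would stay unchanged.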

In any case, to automatically print diagnostic information such as the content of records as they pass through the morphline commands, consider enabling TRACE log level, for example by adding the following line to your log4j.properties file:

log4j.logger.org.kitesdk.morphline=TRACE

See http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#logTrace

This will also print which command failed where.

BTW, questions specific to Cloudera Search are best directed to search-user@cloudera.org via http://groups.google.com/a/cloudera.org/group/search-user

Wolfgang

Explorer

Thanks. I checked the code that generates the Avro objects and found a bug: the objects were empty, containing just the schema. I fixed it, and that error seems to be gone.

I enabled the TRACE level to see what is happening and reviewed the MapReduce log again. I now get this error when the job tries to index a document into Solr:

2015-06-12 05:06:49,843 INFO [IPC Server handler 10 on 45052] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1434101650719_0008_r_000000_3: Error: java.io.IOException: Batch Write Failure
    at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
    at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
    at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:290)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=0Name115457] unknown field '_attachment_mimetype'
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at .....



I have been reading
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-ap...
which discusses the _attachment_mimetype field. Why is the job trying to index this field into Solr?

I also ran the same configuration with --dry-run:

hadoop --config /etc/hadoop/conf jar \
  /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file /home/cloudera/morphline-hbase-mapper.xml \
  --zk-host 127.0.0.1/solr \
  --collection hbase-collection1 \
  --dry-run \
  --log4j /home/cloudera/log4j.properties

In dry-run mode it appears to work fine:

dryRun: SolrInputDocument(fields: [id=4Name249228, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name249228", "favorite_number": 41, "favorite_color": "Red27"}], name=[Red27]])
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
16368 [main] DEBUG com.ngdata.hbaseindexer.indexer.Indexer$RowBasedIndexer - Indexer _default_ will send to Solr 1 adds and 0 deletes
15/06/12 05:12:07 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 1 adds and 0 deletes
dryRun: SolrInputDocument(fields: [id=4Name341784, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name341784", "favorite_number": 1, "favorite_color": "Red22"}], name=[Red22]])
15/06/12 05:12:07 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14de71e21730082



What I don't know is how to tell Solr to skip the _attachment_mimetype field and not index it.

I'll post my next question about Solr and Lily in Cloudera Search. Thanks.

Super Collaborator