Indexing Avro documents with Lily
Labels: Apache HBase, Apache Solr
Created on ‎06-12-2015 01:39 AM - edited ‎09-16-2022 02:31 AM
I'm trying to follow a tutorial from Cloudera (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_hbase_batch_inde...).
I have code that inserts objects in Avro format into HBase, and I want to index them into Solr, but nothing shows up there.
I have been looking at the logs:
    15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
    15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={0Name178721/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
    15/06/12 00:45:00 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 0 adds and 0 deletes
    15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
    15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={1Name134339/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
So the cells are being read, but I don't know why nothing is being indexed into Solr. I guess that my morphline.conf is wrong:
    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**", "org.apache.solr.**", "com.ngdata.**"]
        commands : [
          {
            extractHBaseCells {
              mappings : [
                {
                  inputColumn : "data:avroUser"
                  outputField : "_attachment_body"
                  type : "byte[]"
                  source : value
                }
              ]
            }
          }
          # for avro use with type : "byte[]" in extractHBaseCells mapping above
          { readAvroContainer {} }
          {
            extractAvroPaths {
              flatten : true
              paths : {
                name : /name
              }
            }
          }
          { logTrace { format : "output record: {}", args : ["@{}"] } }
        ]
      }
    ]
I wasn't sure whether I had to have an "_attachment_body" field in Solr, but it seems that it isn't necessary, so I guess that readAvroContainer or extractAvroPaths is wrong. I have a "name" field in Solr, and my avroUser has a "name" field as well:
    {
      "namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
      ]
    }
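One thing worth checking, as an assumption based on the Kite Morphlines command semantics rather than anything confirmed in this thread: readAvroContainer expects each cell's bytes to be a complete Avro container file, with the writer schema embedded in its header. If the cells instead hold raw serialized datums, the container parse produces no records, which would match the "0 adds and 0 deletes" in the log. For raw datums, the readAvro command with an explicit writer schema is the usual alternative; a sketch (the .avsc file path is hypothetical):

```
# Hypothetical alternative to readAvroContainer, for cells that hold raw
# Avro datums without a container-file header. The schema file path is
# illustrative only, not taken from this thread.
{
  readAvro {
    writerSchemaFile : /home/cloudera/user.avsc
  }
}
```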
Created ‎06-12-2015 05:54 AM
Wolfgang.
Created ‎06-12-2015 03:50 AM
In any case, to automatically print diagnostic information, such as the content of records as they pass through the morphline commands, consider enabling the TRACE log level, for example by adding the following line to your log4j.properties file:
log4j.logger.org.kitesdk.morphline=TRACE
See http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#logTrace
This will also print which command failed where.
BTW, questions specific to Cloudera Search are best directed to search-user@cloudera.org via http://groups.google.com/a/cloudera.org/group/search-user
Wolfgang
Created on ‎06-12-2015 05:14 AM - edited ‎06-12-2015 05:17 AM
Thanks, I was checking the generation of the Avro data and I had something wrong: the Avro objects were empty, just the schema. I fixed it and it seems I got past that error.
I have used the TRACE level to see what is happening and reviewed the log for the MapReduce job again, and I got this error when it tries to index a document into Solr:
    2015-06-12 05:06:49,843 INFO [IPC Server handler 10 on 45052] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1434101650719_0008_r_000000_3: Error: java.io.IOException: Batch Write Failure
        at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
        at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
        at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:290)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: org.apache.solr.common.SolrException: ERROR: [doc=0Name115457] unknown field '_attachment_mimetype'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at .....
I have been reading http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-ap... where the _attachment_mimetype field is discussed. Why is it trying to index this field into Solr?
I also ran the job with this configuration:

    hadoop --config /etc/hadoop/conf jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --conf /etc/hbase/conf/hbase-site.xml -D 'mapred.child.java.opts=-Xmx500m' --hbase-indexer-file /home/cloudera/morphline-hbase-mapper.xml --zk-host 127.0.0.1/solr --collection hbase-collection1 --dry-run --log4j /home/cloudera/log4j.properties
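For context, the --hbase-indexer-file argument points at a Lily HBase Indexer definition. A minimal definition that delegates row mapping to a morphline typically looks like the sketch below; the table name and morphline file path are assumptions for illustration, not values taken from this thread:

```xml
<?xml version="1.0"?>
<!-- Hypothetical indexer definition; the table name and morphline
     file path are illustrative, not confirmed in this thread. -->
<indexer table="record"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="/home/cloudera/morphline.conf"/>
</indexer>
```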
And it looks like it works fine:

    dryRun: SolrInputDocument(fields: [id=4Name249228, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name249228", "favorite_number": 41, "favorite_color": "Red27"}], name=[Red27]])
    16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeNotify: {lifecycle=[START_SESSION]}
    15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
    16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
    15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
    16368 [main] DEBUG com.ngdata.hbaseindexer.indexer.Indexer$RowBasedIndexer - Indexer _default_ will send to Solr 1 adds and 0 deletes
    15/06/12 05:12:07 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 1 adds and 0 deletes
    dryRun: SolrInputDocument(fields: [id=4Name341784, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name341784", "favorite_number": 1, "favorite_color": "Red22"}], name=[Red22]])
    15/06/12 05:12:07 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14de71e21730082
What I don't know is how to tell Solr to skip the _attachment_mimetype field and not index it.
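One approach worth trying, offered as a sketch based on the Kite Morphlines command library rather than a confirmed fix: add a sanitizeUnknownSolrFields command to the morphline, just before the record is loaded, so that any record field not present in the Solr schema (such as _attachment_mimetype and _attachment_body) is dropped. This assumes a SOLR_LOCATOR variable is defined elsewhere in the morphline file:

```
# Removes all record fields that do not exist in the target Solr schema.
# Assumes a SOLR_LOCATOR variable is defined in this morphline config.
{
  sanitizeUnknownSolrFields {
    solrLocator : ${SOLR_LOCATOR}
  }
}
```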
I'll post the next problem about Solr and Lily in Cloudera Search. Thanks.
