Created on 06-12-2015 01:39 AM - edited 09-16-2022 02:31 AM
I'm trying to follow a tutorial from Cloudera (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_hbase_batch_inde...).
I have code that inserts objects in Avro format into HBase, and I want to index them into Solr, but nothing gets indexed.
I have been taking a look at the logs:
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={0Name178721/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 00:45:00 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 0 adds and 0 deletes
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={1Name134339/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
So, I'm reading them, but I don't know why nothing is indexed in Solr. I guess that my morphline.conf is wrong.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:avroUser"
              outputField : "_attachment_body"
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      { readAvroContainer {} }
      {
        extractAvroPaths {
          flatten : true
          paths : {
            name : /name
          }
        }
      }

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
I wasn't sure if I had to have an "_attachment_body" field in Solr, but it seems that it isn't necessary, so I guess that readAvroContainer or extractAvroPaths is wrong. I have a "name" field in Solr and my avroUser has a "name" field as well. This is the Avro schema:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
Created 06-12-2015 05:54 AM
Created 06-12-2015 03:50 AM
Created on 06-12-2015 05:14 AM - edited 06-12-2015 05:17 AM
Thanks. I checked how the Avro objects were generated and found a mistake:
the Avro objects were empty, containing just the schema. I fixed it and it
seems that I got past that error.
I have used the TRACE level to see what is happening, reviewed the MapReduce
log again, and found this error when it tries to index a document into
Solr:
2015-06-12 05:06:49,843 INFO [IPC Server handler 10 on 45052] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1434101650719_0008_r_000000_3: Error: java.io.IOException: Batch Write Failure
    at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
    at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
    at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:290)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
*Caused by: org.apache.solr.common.SolrException: ERROR: [doc=0Name115457] unknown field '_attachment_mimetype'*
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at .....
I have been reading
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-ap...
where it talks about the field *_attachment_mimetype*. Why is it trying to
index this field to Solr?
I also ran the same configuration in dry-run mode with:
hadoop --config /etc/hadoop/conf jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --conf /etc/hbase/conf/hbase-site.xml -D 'mapred.child.java.opts=-Xmx500m' --hbase-indexer-file /home/cloudera/morphline-hbase-mapper.xml --zk-host 127.0.0.1/solr --collection hbase-collection1 --dry-run --log4j /home/cloudera/log4j.properties
And it looks like it works fine:
dryRun: SolrInputDocument(fields: [id=4Name249228, *_attachment_mimetype=[avro/java+memory]*, _attachment_body=[{"name": "4Name249228", "favorite_number": 41, "favorite_color": "Red27"}], name=[Red27]])
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
16368 [main] DEBUG com.ngdata.hbaseindexer.indexer.Indexer$RowBasedIndexer - Indexer _default_ will send to Solr 1 adds and 0 deletes
15/06/12 05:12:07 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 1 adds and 0 deletes
dryRun: SolrInputDocument(fields: [id=4Name341784, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name341784", "favorite_number": 1, "favorite_color": "Red22"}], name=[Red22]])
15/06/12 05:12:07 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14de71e21730082
What I don't know is how to tell Solr to skip the *_attachment_mimetype*
field and not index it.
I'll post my next problem about Solr and Lily in Cloudera Search separately. Thanks.
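For reference, one possible way to keep *_attachment_mimetype* out of the Solr document is to drop it inside the morphline before the record is handed to the indexer. The fragment below is only a sketch, not a confirmed fix for this setup: it assumes the installed Kite Morphlines version provides the removeFields command, and that the command is placed in the commands list after extractAvroPaths.

```
# Sketch only: add this entry to the "commands" list of morphline1,
# after extractAvroPaths and before logTrace.
# Assumption: the removeFields command is available in the Kite
# Morphlines version shipped with this installation.
{
  removeFields {
    # Drop the helper field that is not defined in the Solr schema.
    # If "_attachment_body" is also missing from the schema, it can be
    # listed here in the same way.
    blacklist : ["literal:_attachment_mimetype"]
  }
}
```

Kite also documents a sanitizeUnknownSolrFields command that removes every field not present in the Solr schema. It needs a solrLocator setting pointing at the collection, so whether it fits this batch indexer setup is an assumption to verify.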