<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Indexing Avro documents with Lily in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28487#M6270</link>
    <description>Maybe readAvroContainer fails because your avro data isn't contained in an avro container, in which case use readAvro command instead of readAvroContainer.&lt;BR /&gt;&lt;BR /&gt;In any case, to automatically print diagnostic information such as the content of records as they pass through the morphline commands, consider enabling TRACE log level, for example by adding the following line to your log4j.properties file:&lt;BR /&gt;&lt;BR /&gt;log4j.logger.org.kitesdk.morphline=TRACE&lt;BR /&gt;&lt;BR /&gt;See &lt;A target="_blank" href="http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#logTrace"&gt;http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#logTrace&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;This will also print which command failed where.&lt;BR /&gt;&lt;BR /&gt;BTW, questions specific to Cloudera Search are best directed to search-user@cloudera.org via &lt;A target="_blank" href="http://groups.google.com/a/cloudera.org/group/search-user"&gt;http://groups.google.com/a/cloudera.org/group/search-user&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Wolfgang&lt;BR /&gt;&lt;BR /&gt;</description>
    <pubDate>Fri, 12 Jun 2015 10:50:06 GMT</pubDate>
    <dc:creator>whosch</dc:creator>
    <dc:date>2015-06-12T10:50:06Z</dc:date>
    <item>
      <title>Indexing Avro documents with Lily</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28483#M6269</link>
      <description>&lt;P&gt;I'm trying to follow a tutorial from Cloudera (&lt;A href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_hbase_batch_indexer.html" target="_blank" rel="nofollow"&gt;http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_hbase_batch_indexer.html&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;I have code that inserts objects in Avro format into HBase, and I want to index them in Solr, but nothing gets indexed.&lt;/P&gt;&lt;P&gt;I have been looking at the logs:&lt;/P&gt;&lt;PRE&gt;15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={0Name178721/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 00:45:00 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 0 adds and 0 deletes
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={1Name134339/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}&lt;/PRE&gt;&lt;P&gt;So I'm reading the rows, but I don't know why nothing is indexed in Solr. I guess that my morphline.conf is wrong:&lt;/P&gt;&lt;PRE&gt;morphlines : [
{
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**", "com.ngdata.**"]
    commands : [
      {
         extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:avroUser"
              outputField : "_attachment_body"
              type : "byte[]"
              source : value
            }
         ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      { readAvroContainer {} }
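      # Sketch, assuming the HBase cell holds a bare Avro record rather than an
      # Avro container file: readAvro could replace readAvroContainer above, but
      # it needs the writer schema supplied explicitly. The schema path below is
      # hypothetical.
      # { readAvro { writerSchemaFile : /path/to/user.avsc } }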
      {
        extractAvroPaths {
          flatten : true
          paths : {
            name : /name
          }
        }
      }
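      # Sketch, assuming Solr rejects record fields that are missing from its
      # schema (such as _attachment_mimetype): sanitizeUnknownSolrFields drops
      # such fields before the record is sent to Solr. The solrLocator values are
      # placeholders taken from the indexer command line and must match the
      # actual deployment.
      # {
      #   sanitizeUnknownSolrFields {
      #     solrLocator : { collection : hbase-collection1, zkHost : "127.0.0.1/solr" }
      #   }
      # }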
      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
 }
]&lt;/PRE&gt;&lt;P&gt;I wasn't sure whether I needed an "_attachment_body" field in Solr, but it seems that it isn't necessary, so I guess that readAvroContainer or extractAvroPaths is misconfigured. I have a "name" field in Solr, and my avroUser has a "name" field as well.&lt;/P&gt;&lt;PRE&gt;{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}&lt;/PRE&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:31:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28483#M6269</guid>
      <dc:creator>ortizg</dc:creator>
      <dc:date>2022-09-16T09:31:23Z</dc:date>
    </item>
    <item>
      <title>Re: Indexing Avro documents with Lily</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28487#M6270</link>
      <description>Maybe readAvroContainer fails because your avro data isn't contained in an avro container, in which case use readAvro command instead of readAvroContainer.&lt;BR /&gt;&lt;BR /&gt;In any case, to automatically print diagnostic information such as the content of records as they pass through the morphline commands, consider enabling TRACE log level, for example by adding the following line to your log4j.properties file:&lt;BR /&gt;&lt;BR /&gt;log4j.logger.org.kitesdk.morphline=TRACE&lt;BR /&gt;&lt;BR /&gt;See &lt;A target="_blank" href="http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#logTrace"&gt;http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#logTrace&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;This will also print which command failed where.&lt;BR /&gt;&lt;BR /&gt;BTW, questions specific to Cloudera Search are best directed to search-user@cloudera.org via &lt;A target="_blank" href="http://groups.google.com/a/cloudera.org/group/search-user"&gt;http://groups.google.com/a/cloudera.org/group/search-user&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Wolfgang&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 12 Jun 2015 10:50:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28487#M6270</guid>
      <dc:creator>whosch</dc:creator>
      <dc:date>2015-06-12T10:50:06Z</dc:date>
    </item>
    <item>
      <title>Re: Indexing Avro documents with Lily</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28489#M6271</link>
      <description>&lt;P&gt;Thanks. I checked how the Avro objects were generated and found a mistake: the objects were empty and contained only the schema. I fixed it, and that error seems to be gone.&lt;BR /&gt;&lt;BR /&gt;I enabled the TRACE level to see what is happening and reviewed the MapReduce log again. I now get this error when it tries to index a document into Solr:&lt;/P&gt;&lt;PRE&gt;2015-06-12 05:06:49,843 INFO [IPC Server handler 10 on 45052]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1434101650719_0008_r_000000_3: Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:290)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=0Name115457] unknown field '_attachment_mimetype'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at .....&lt;/PRE&gt;&lt;P&gt;I have been reading &lt;A target="_blank" href="http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/"&gt;http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/&lt;/A&gt;, which mentions the _attachment_mimetype field. Why is it trying to index this field into Solr?&lt;BR /&gt;&lt;BR /&gt;I also ran the indexer with the --dry-run option:&lt;/P&gt;&lt;PRE&gt;hadoop --config /etc/hadoop/conf jar \
  /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file /home/cloudera/morphline-hbase-mapper.xml \
  --zk-host 127.0.0.1/solr \
  --collection hbase-collection1 \
  --dry-run \
  --log4j /home/cloudera/log4j.properties&lt;/PRE&gt;&lt;P&gt;The dry run appears to work fine:&lt;/P&gt;&lt;PRE&gt;dryRun: SolrInputDocument(fields: [id=4Name249228,
_attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name249228", "favorite_number": 41, "favorite_color": "Red27"}], name=[Red27]])
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
16366 [main] TRACE com.ngdata.hbaseindexer.morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells - beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 05:12:07 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={4Name341784/data:avroUser/1434105492587/Put/vlen=276/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
16368 [main] DEBUG com.ngdata.hbaseindexer.indexer.Indexer$RowBasedIndexer - Indexer _default_ will send to Solr 1 adds and 0 deletes
15/06/12 05:12:07 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 1 adds and 0 deletes
dryRun: SolrInputDocument(fields: [id=4Name341784, _attachment_mimetype=[avro/java+memory], _attachment_body=[{"name": "4Name341784", "favorite_number": 1, "favorite_color": "Red22"}], name=[Red22]])
15/06/12 05:12:07 INFO client.ConnectionManager$HConnectionImplementation:
Closing zookeeper sessionid=0x14de71e21730082&lt;/PRE&gt;&lt;P&gt;What I don't know is how to tell Solr to skip the _attachment_mimetype field and not index it.&lt;/P&gt;&lt;P&gt;I'll post my next question about Solr and Lily in Cloudera Search. Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 12:17:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28489#M6271</guid>
      <dc:creator>ortizg</dc:creator>
      <dc:date>2015-06-12T12:17:31Z</dc:date>
    </item>
    <item>
      <title>Re: Indexing Avro documents with Lily</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28491#M6272</link>
      <description>Try the sanitizeUnknownSolrFields command; see &lt;A target="_blank" href="http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#sanitizeUnknownSolrFields"&gt;http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html#sanitizeUnknownSolrFields&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Wolfgang.</description>
      <pubDate>Fri, 12 Jun 2015 12:54:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Indexing-Avro-documents-with-Lily/m-p/28491#M6272</guid>
      <dc:creator>whosch</dc:creator>
      <dc:date>2015-06-12T12:54:06Z</dc:date>
    </item>
  </channel>
</rss>

