Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

org.apache.solr.common.SolrException: ERROR: unknown field 'file_length'

avatar
Contributor

I am running on CDH5.10.

I am trying to index some csv files that reside on HDFS.  I am using the MapReduceIndexerTool and following

the tutorial MapReduce Batch Indexing with Cloudera Search.  As advised by an earlier post I created a small subset of my data(only 500 records) and as advised also, I used the --dry-run option to make sure all is ok before the actual indexing takes place.  My dry run runs successfully:

--------

5065 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Done. Indexing 1 files in dryrun mode took 0.21276237 secs
5065 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Success. Done. Program took 5.0973063 secs. Goodbye.

----------

Now when I switch to --go-live, I get the following exception:

1217 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1 files using 1 real mappers into 1 reducers
Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:275)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=1980-10-27 07:48:24.250937] unknown field 'file_length'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:940)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1095)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:701)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2135)
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.solr.hadoop.BatchWriter.runUpdate(BatchWriter.java:135)
at org.apache.solr.hadoop.BatchWriter$Batch.run(BatchWriter.java:90)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:180)
... 9 more

Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:275)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=1980-10-27 07:48:24.250937] unknown field 'file_length'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:940)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1095)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:701)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2135)
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.solr.hadoop.BatchWriter.runUpdate(BatchWriter.java:135)
at org.apache.solr.hadoop.BatchWriter$Batch.run(BatchWriter.java:90)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:180)
... 9 more

Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:275)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=1980-10-27 07:48:24.250937] unknown field 'file_length'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:940)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1095)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:701)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2135)
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.solr.hadoop.BatchWriter.runUpdate(BatchWriter.java:135)
at org.apache.solr.hadoop.BatchWriter$Batch.run(BatchWriter.java:90)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:180)
... 9 more

46523 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1491337377676_0015

Help is apprecited.

 

1 ACCEPTED SOLUTION

avatar
Contributor

After increasing the memory, I was able to successfully index my data.  Thanks.

View solution in original post

9 REPLIES 9

avatar
Super Collaborator

The error tells that your are trying to push a field named "file_length" into Solr.

And Solr doesn't know about that field.

 

Either you made an error in the schema of the collection and you need to fix the field name ?

Either this field really not exist into Solr and you should not push it (for that you can use the sanitizeUnknownSolrFields function of morphline if you are using morphline).

 

avatar
Contributor

 I double checked my schema, the does not contain 'file_length' in it.

I also added the following the statement below to my morphline scripts:

 

sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}

Still received the same error. 

 

It is very frustrating, what I am doing should be pretty common and the problem has been solved by many before; not sure what is happening.  I was able to use the same schema successfully on Solr 6.2 on my windows machine..

Help is appreciated.

avatar
Super Collaborator

Could you share the following things (if you can) ?
- collection name and the schema.xml of the collection
- indexer configuration used (indexer_def.xml ?)
- morphline configuration used
- the command line used for launching the batch indexation
- a small csv sample ?

This might enable me to pin-point the issue (or tell you that all seems fine for me).

regards,
mathieu

avatar
Contributor

1) collection name is : party_name

I don't have such a file:>>>> - indexer configuration used (indexer_def.xml ?),

2) schema.xml ( I removed the comments)

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="example" version="1.5">

   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="_root_" type="string" indexed="true" stored="false"/>
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
   <dynamicField name="*_is" type="int"    indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true" />
   <dynamicField name="*_ss" type="string"  indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
   <dynamicField name="*_ls" type="long"   indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_t"  type="text_general"    indexed="true"  stored="true"/>
   <dynamicField name="*_txt" type="text_general"   indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_en"  type="text_en"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true" stored="true"/>
   <dynamicField name="*_bs" type="boolean" indexed="true" stored="true"  multiValued="true"/>
   <dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
   <dynamicField name="*_fs" type="float"  indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
   <dynamicField name="*_ds" type="double" indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_coordinate"  type="tdouble" indexed="true"  stored="false" />

   <dynamicField name="*_dt"  type="date"    indexed="true"  stored="true"/>
   <dynamicField name="*_dts" type="date"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

   <!-- some trie-coded dynamic fields for faster range queries -->
   <dynamicField name="*_ti" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tl" type="tlong"   indexed="true"  stored="true"/>
   <dynamicField name="*_tf" type="tfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_td" type="tdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_tdt" type="tdate"  indexed="true"  stored="true"/>

   <dynamicField name="*_c"   type="currency" indexed="true"  stored="true"/>

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

   <dynamicField name="random_*" type="random" />

<uniqueKey>id</uniqueKey>
<field name="county" type="text_general" indexed="false" stored="true"/>
<field name="year" type="int" indexed="false" stored="true"/>
<field name="court_type" type="text_general" indexed="false" stored="true"/>
<field name="seq_num" type="int" indexed="false" stored="true"/>
<field name="party_role" type="text_general" indexed="false" stored="true"/>
<field name="party_num" type="int" indexed="false" stored="true"/>
<field name="party_status" type="text_general" indexed="false" stored="true"/>
<field name="biz_name" type="text_general" indexed="true" stored="true"/>
<field name="prefix" type="text_general" indexed="false" stored="true"/>
<field name="last_name" type="text_general" indexed="true" stored="true"/>
<field name="first_name" type="text_general" indexed="true" stored="true"/>
<field name="middle_name" type="text_general" indexed="true" stored="true"/>
<field name="suffix" type="text_general" indexed="false" stored="true"/>
<field name="in_regards_to" type="string" indexed="false" stored="true"/>
<field name="case_status" type="string" indexed="false" stored="true"/>
<field name="row_of_origin" type="string" indexed="false" stored="true"/>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>

    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>

    <!-- A Trie based date field for faster date range queries and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>


    <!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
    <fieldtype name="binary" class="solr.BinaryField"/>


    <fieldType name="random" class="solr.RandomSortField" indexed="true" />


    <!-- A text field that only splits on whitespace for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- A text type for English text where stopwords and synonyms are managed using the REST API -->
    <fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ManagedStopFilterFactory" managed="english" />
        <filter class="solr.ManagedSynonymFilterFactory" managed="english" />
      </analyzer>
    </fieldType>

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Less flexible matching, but less false matches.  Probably not ideal for product names,
         but may be good for SKUs.  Can insert dashes in the wrong place and still match. -->
    <fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
             possible with WordDelimiterFilter in conjuncton with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Just like text_general except it reverses the characters of
     each token, to enable more efficient leading wildcard queries. -->
    <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
           maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- The LowerCase TokenFilter does what you expect, which can be
             when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />
      </analyzer>
    </fieldType>
    
    <fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" >
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
      </analyzer>
    </fieldtype>

    <fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
      </analyzer>
    </fieldtype>

    <!-- lowercases the entire field value, keeping it as a single token.  -->
    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
    <fieldType name="descendent_path" class="solr.TextField">
      <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
      </analyzer>
      <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
      </analyzer>
    </fieldType>
    <!--
      Example of using PathHierarchyTokenizerFactory at query time, so
      queries for paths match documents at that path, or in ancestor paths
    -->
    <fieldType name="ancestor_path" class="solr.TextField">
      <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
      </analyzer>
      <analyzer type="query">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
      </analyzer>
    </fieldType>

    <!-- since fields of this type are by default not stored or indexed,
         any data added to them will be ignored outright.  -->
    <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

    <fieldType name="point" class="solr.PointType" dimension="2" subFieldSuffix="_d"/>

    <!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

    <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
        geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees" />

    <!-- Spatial rectangle (bounding box) field. It supports most spatial predicates, and has
     special relevancy modes: score=overlapRatio|area|area2D (local-param to the query).  DocValues is required for
     relevancy. -->
    <fieldType name="bbox" class="solr.BBoxField"
        geo="true" units="degrees" numberType="_bbox_coord" />
    <fieldType name="_bbox_coord" class="solr.TrieDoubleField" precisionStep="8" docValues="true" stored="false"/>

    <fieldType name="currency" class="solr.CurrencyField" precisionStep="8" defaultCurrency="USD" currencyConfig="currency.xml" />
</schema>

Morphline file

SOLR_LOCATOR : {
  # Name of solr collection
  collection : party_name
             
  # ZooKeeper ensemble
  zkHost : "dwh-mst-dev02.stor.nccourts.org:2181/solr"
 
  # The maximum number of documents to send to Solr per network batch (throughput knob)
  # batchSize : 100
}

sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    
    commands : [                    
      {
        readCSV {
          separator : ","
          columns : [id,county,year,court_type,seq_num,party_role,party_num,party_status,biz_name,prefix,last_name,first_name,middle_name,suffix,in_regards_to,case_status,row_of_origin]
          ignoreFirstLine : true
          trim : true
          charset : UTF-8
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }   
      
      # load the record into a Solr server or MapReduce Reducer.
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

Small csv:

-------------

id,county,year,court_type,seq_num,party_role,party_num,party_status,biz_name,prefix,last_name,first_name,middle_name,suffix,in_regards_to,case_status,row_of_origin
1989-04-11 05:24:16.910647,750,1989,CVM,653,P,1,DISPOSED,BALFOUR GULF,null,null,null,null,null,null,null,T48
2002-02-08 11:42:52.758392,910,2001,CR,119164,P,1,DISPOSED,NC STATE OF,null,null,null,null,null,null,null,T48
1991-09-20 04:21:23.013509,420,1991,CVM,1324,D,1,DISPOSED,null,null,COXUM,BILLY,RAY,null,null,null,T48
2000-03-30 12:19:33.110602,790,2000,CVD,851,P,1,DISPOSED,ROWAN CO DEPT OF SOCIAL SERVICES OBO,null,null,null,null,null,null,null,T48
2016-02-05 16:41:24.154447,100,2016,E,241,E,1,DISPOSED,null,null,MCCULLAGH,SARAH,MARGARET,null,null,null,T48
1993-03-25 23:06:40.379315,520,1993,CVM,321,P,1,DISPOSED,null,null,MARKS,JEFF,null,null,null,null,T48
1997-03-12 08:54:26.444068,250,1989,CRS,7429,P,1,DISPOSED,NC STATE OF,null,null,null,null,null,null,null,T48
2008-12-05 16:01:47.230841,870,2008,CR,164,D,1,DISPOSED,null,null,BURNETTE,NATHAN,BROOKS,null,null,null,T48
1999-06-15 08:37:54.195413,280,1999,CR,2343,P,1,DISPOSED,NC STATE OF,null,null,null,null,null,null,null,T48
1995-10-18 13:23:14.241761,630,1995,CVM,5599,P,1,DISPOSED,TIM'S AUTO SALES,null,null,null,null,null,null,null,T48
1980-10-27 07:48:24.250937,030,1980,CVD,216,P,1,DISPOSED,null,null,HORNE,JENNY,G,null,null,null,T48
1999-09-13 16:57:51.275323,220,1999,M,248,D,1,DISPOSED,null,null,JACKSON,WENDELL,ANTHONY,null,null,null,T48

-------------

The command used:

hadoop --config /etc/hadoop/conf.cloudera.yarn jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j ~/search/log4j.properties --morphline-file ~/search/readCSV.conf --output-dir hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/civil/solr/party-name --verbose --go-live --zk-host dwh-mst-dev02.stor.nccourts.org:2181/solr --collection party_name hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/test/party_search

 

Thanks for the help.

avatar
Super Collaborator

Ok, I could reproduce the issue quickly.

 

Two things :

- 1 : the MapreduceIndexerTool generates metadata about the file. These "fields" are presented to Solr

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/search_metadata.html

 

So if you need these fields, you need to add them to the schema of the collection.

 

- 2 : if you don't need these fields, then you have to delete them before presenting the row to Solr. For this, the function "sanitizeUnknownSolrFields" if the right way to go. But you missplaced it in your morhpline.

 

It should be inside the command [] and after the readCSV function (of course before the loadSolr function).

 

Something like this :

commands [
  { 
     readCSV {
     ...
     }
  }
  {
     sanitizeUnknowSolrFields {
     ...
    }
  }
  {
    loadSolr {
    ...
    }
  }
] 

 

avatar
Contributor

We passed the file_length error, however I got the following error:

1205 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 22 files using 22 real mappers into 6 reducers
Container [pid=4420,containerID=container_1491337377676_0040_01_000027] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 1.6 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1491337377676_0040_01_000027 :.....

.....

 

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

 

Any suggestions on how to get around this error.  Is there some config changes that I need to do. 

 

Thanks a lot for the help.

avatar
Super Collaborator

At least one of the containers was killed by your application master because it requested too much memory.

 

You need to identify if it was a mapper or a reducer.

And then, you will need to tweak yarn configuration a little to increase the memory available for mapper and/or reducer.

 

1Go is rather "small" in my opinion.

 

regards,

mathieu

avatar
Contributor

After increasing the memory, I was able to successfully index my data.  Thanks.

avatar
New Contributor
i am also facing same error , may i know where you increased the memory