Created 04-06-2017 11:57 AM
I am running on CDH 5.10.
I am trying to index some CSV files that reside on HDFS. I am using the MapReduceIndexerTool and following the tutorial "MapReduce Batch Indexing with Cloudera Search". As advised in an earlier post, I created a small subset of my data (only 500 records), and, also as advised, I used the --dry-run option to make sure all is OK before the actual indexing takes place. My dry run completes successfully:
--------
5065 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Done. Indexing 1 files in dryrun mode took 0.21276237 secs
5065 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Success. Done. Program took 5.0973063 secs. Goodbye.
----------
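For reference, the dry run is the same invocation as the real job, only with --dry-run in place of --go-live. Abbreviated here (the full command I use is posted further down the thread):
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file ~/search/readCSV.conf \
  --output-dir hdfs://.../party-name \
  --zk-host dwh-mst-dev02.stor.nccourts.org:2181/solr \
  --collection party_name \
  --dry-run \
  hdfs://.../party_search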
Now when I switch to --go-live, I get the following exception:
1217 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1 files using 1 real mappers into 1 reducers
Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:275)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=1980-10-27 07:48:24.250937] unknown field 'file_length'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:940)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1095)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:701)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2135)
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.solr.hadoop.BatchWriter.runUpdate(BatchWriter.java:135)
at org.apache.solr.hadoop.BatchWriter$Batch.run(BatchWriter.java:90)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:180)
... 9 more
(The same "Batch Write Failure" stack trace repeats twice more.)
46523 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1491337377676_0015
Help is appreciated.
Created 04-07-2017 01:44 AM
The error tells you that you are trying to push a field named "file_length" into Solr, and Solr doesn't know about that field.
Either you made an error in the schema of the collection and you need to fix the field name, or this field really does not exist in Solr and you should not push it (for that you can use the sanitizeUnknownSolrFields command, if you are using a morphline).
Created 04-07-2017 06:23 AM
I double-checked my schema; it does not contain 'file_length'.
I also added the statement below to my morphline script:
sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}
Still received the same error.
It is very frustrating; what I am doing should be pretty common, and the problem has surely been solved by many before, so I am not sure what is happening. I was able to use the same schema successfully on Solr 6.2 on my Windows machine.
Help is appreciated.
Created 04-07-2017 06:43 AM
Could you share the following things (if you can)?
- the collection name and the schema.xml of the collection
- the indexer configuration used (indexer_def.xml?)
- the morphline configuration used
- the command line used for launching the batch indexing
- a small CSV sample
This might enable me to pinpoint the issue (or tell you that all seems fine to me).
regards,
mathieu
Created 04-07-2017 07:13 AM
1) The collection name is party_name. (I don't have an indexer configuration file such as indexer_def.xml.)
2) schema.xml (I removed the comments):
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_ls" type="long" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_bs" type="boolean" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_fs" type="float" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="double" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false" />
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_dts" type="date" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>
<!-- some trie-coded dynamic fields for faster range queries -->
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*_c" type="currency" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="random_*" type="random" />
<uniqueKey>id</uniqueKey>
<field name="county" type="text_general" indexed="false" stored="true"/>
<field name="year" type="int" indexed="false" stored="true"/>
<field name="court_type" type="text_general" indexed="false" stored="true"/>
<field name="seq_num" type="int" indexed="false" stored="true"/>
<field name="party_role" type="text_general" indexed="false" stored="true"/>
<field name="party_num" type="int" indexed="false" stored="true"/>
<field name="party_status" type="text_general" indexed="false" stored="true"/>
<field name="biz_name" type="text_general" indexed="true" stored="true"/>
<field name="prefix" type="text_general" indexed="false" stored="true"/>
<field name="last_name" type="text_general" indexed="true" stored="true"/>
<field name="first_name" type="text_general" indexed="true" stored="true"/>
<field name="middle_name" type="text_general" indexed="true" stored="true"/>
<field name="suffix" type="text_general" indexed="false" stored="true"/>
<field name="in_regards_to" type="string" indexed="false" stored="true"/>
<field name="case_status" type="string" indexed="false" stored="true"/>
<field name="row_of_origin" type="string" indexed="false" stored="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<!-- A Trie based date field for faster date range queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
<fieldtype name="binary" class="solr.BinaryField"/>
<fieldType name="random" class="solr.RandomSortField" indexed="true" />
<!-- A text field that only splits on whitespace for exact matching of words -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<!-- A text type for English text where stopwords and synonyms are managed using the REST API -->
<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory" managed="english" />
<filter class="solr.ManagedSynonymFilterFactory" managed="english" />
</analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<!-- Less flexible matching, but less false matches. Probably not ideal for product names,
but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
possible with WordDelimiterFilter in conjuncton with stemming. -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- Just like text_general except it reverses the characters of
each token, to enable more efficient leading wildcard queries. -->
<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
<fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
</fieldtype>
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
</analyzer>
</fieldtype>
<!-- lowercases the entire field value, keeping it as a single token. -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="descendent_path" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>
</fieldType>
<!--
Example of using PathHierarchyTokenizerFactory at query time, so
queries for paths match documents at that path, or in ancestor paths
-->
<fieldType name="ancestor_path" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
</analyzer>
</fieldType>
<!-- since fields of this type are by default not stored or indexed,
any data added to them will be ignored outright. -->
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<fieldType name="point" class="solr.PointType" dimension="2" subFieldSuffix="_d"/>
<!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees" />
<!-- Spatial rectangle (bounding box) field. It supports most spatial predicates, and has
special relevancy modes: score=overlapRatio|area|area2D (local-param to the query). DocValues is required for
relevancy. -->
<fieldType name="bbox" class="solr.BBoxField"
geo="true" units="degrees" numberType="_bbox_coord" />
<fieldType name="_bbox_coord" class="solr.TrieDoubleField" precisionStep="8" docValues="true" stored="false"/>
<fieldType name="currency" class="solr.CurrencyField" precisionStep="8" defaultCurrency="USD" currencyConfig="currency.xml" />
</schema>
3) Morphline file:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : party_name
  # ZooKeeper ensemble
  zkHost : "dwh-mst-dev02.stor.nccourts.org:2181/solr"
  # The maximum number of documents to send to Solr per network batch (throughput knob)
  # batchSize : 100
}
sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [id,county,year,court_type,seq_num,party_role,party_num,party_status,biz_name,prefix,last_name,first_name,middle_name,suffix,in_regards_to,case_status,row_of_origin]
          ignoreFirstLine : true
          trim : true
          charset : UTF-8
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      # load the record into a Solr server or MapReduce Reducer.
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
4) Small CSV sample:
-------------
id,county,year,court_type,seq_num,party_role,party_num,party_status,biz_name,prefix,last_name,first_name,middle_name,suffix,in_regards_to,case_status,row_of_origin
1989-04-11 05:24:16.910647,750,1989,CVM,653,P,1,DISPOSED,BALFOUR GULF,null,null,null,null,null,null,null,T48
2002-02-08 11:42:52.758392,910,2001,CR,119164,P,1,DISPOSED,NC STATE OF,null,null,null,null,null,null,null,T48
1991-09-20 04:21:23.013509,420,1991,CVM,1324,D,1,DISPOSED,null,null,COXUM,BILLY,RAY,null,null,null,T48
2000-03-30 12:19:33.110602,790,2000,CVD,851,P,1,DISPOSED,ROWAN CO DEPT OF SOCIAL SERVICES OBO,null,null,null,null,null,null,null,T48
2016-02-05 16:41:24.154447,100,2016,E,241,E,1,DISPOSED,null,null,MCCULLAGH,SARAH,MARGARET,null,null,null,T48
1993-03-25 23:06:40.379315,520,1993,CVM,321,P,1,DISPOSED,null,null,MARKS,JEFF,null,null,null,null,T48
1997-03-12 08:54:26.444068,250,1989,CRS,7429,P,1,DISPOSED,NC STATE OF,null,null,null,null,null,null,null,T48
2008-12-05 16:01:47.230841,870,2008,CR,164,D,1,DISPOSED,null,null,BURNETTE,NATHAN,BROOKS,null,null,null,T48
1999-06-15 08:37:54.195413,280,1999,CR,2343,P,1,DISPOSED,NC STATE OF,null,null,null,null,null,null,null,T48
1995-10-18 13:23:14.241761,630,1995,CVM,5599,P,1,DISPOSED,TIM'S AUTO SALES,null,null,null,null,null,null,null,T48
1980-10-27 07:48:24.250937,030,1980,CVD,216,P,1,DISPOSED,null,null,HORNE,JENNY,G,null,null,null,T48
1999-09-13 16:57:51.275323,220,1999,M,248,D,1,DISPOSED,null,null,JACKSON,WENDELL,ANTHONY,null,null,null,T48
-------------
5) The command used:
hadoop --config /etc/hadoop/conf.cloudera.yarn \
  jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j ~/search/log4j.properties \
  --morphline-file ~/search/readCSV.conf \
  --output-dir hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/civil/solr/party-name \
  --verbose \
  --go-live \
  --zk-host dwh-mst-dev02.stor.nccourts.org:2181/solr \
  --collection party_name \
  hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/test/party_search
Thanks for the help.
Created on 04-07-2017 07:57 AM - edited 04-07-2017 07:58 AM
OK, I could reproduce the issue quickly.
Two things:
- 1: the MapReduceIndexerTool generates metadata about each input file, and these "fields" are presented to Solr:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/search_metadata.html
So if you need these fields, you need to add them to the schema of the collection (a minimal schema sketch follows the morphline example below).
- 2: if you don't need these fields, then you have to delete them before presenting the record to Solr. For this, the sanitizeUnknownSolrFields command is the right way to go, but you misplaced it in your morphline: it should be inside the commands [] array, after the readCSV command (and of course before the loadSolr command).
Something like this:
commands : [
  { readCSV { ... } }
  { sanitizeUnknownSolrFields { ... } }
  { loadSolr { ... } }
]
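And for option 1, if you instead want to keep the metadata, declare those fields in the schema.xml of the collection before indexing. A minimal sketch covering only the field named in your error (the full list of metadata field names is on the Cloudera page linked above):
<field name="file_length" type="long" indexed="true" stored="true"/>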
Created 04-07-2017 08:33 AM
That got me past the file_length error; however, I now get the following error:
1205 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 22 files using 22 real mappers into 6 reducers
Container [pid=4420,containerID=container_1491337377676_0040_01_000027] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 1.6 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1491337377676_0040_01_000027 :.....
.....
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Any suggestions on how to get around this error? Are there some config changes that I need to make?
Thanks a lot for the help.
Created on 04-07-2017 08:45 AM - edited 04-07-2017 08:46 AM
At least one of the containers was killed by your ApplicationMaster because it requested too much memory.
You need to identify whether it was a mapper or a reducer.
Then you will need to tweak the YARN configuration a little to increase the memory available to the mappers and/or reducers; a sketch follows below.
1 GB is rather small, in my opinion.
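A minimal sketch of one way to do it, passing per-job limits on the indexer command line; these are the standard MRv2 property names (note that the mapred.child.java.opts you pass in your command is the old MRv1 name), and the right values depend on your cluster:
hadoop --config /etc/hadoop/conf.cloudera.yarn \
  jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapreduce.map.memory.mb=2048' \
  -D 'mapreduce.map.java.opts=-Xmx1638m' \
  -D 'mapreduce.reduce.memory.mb=4096' \
  -D 'mapreduce.reduce.java.opts=-Xmx3276m' \
  ... (rest of your command unchanged)
The -Xmx heap is usually set to roughly 80% of the container size so that JVM overhead still fits inside the YARN limit.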
regards,
mathieu
Created 04-10-2017 08:29 AM
After increasing the memory, I was able to successfully index my data. Thanks.