Created 03-27-2017 08:26 AM
I followed the Cloudera Quick Start User Guide to create and index my data.
I was able to successfully execute the following steps:
# generate the instance configuration, then copy the schema
$ solrctl instancedir --generate $HOME/party_name_config
$ cp schema.xml $HOME/party_name_config/conf
# upload the configuration to ZooKeeper
$ solrctl instancedir --create party_name_config $HOME/party_name_config/
# create the new collection
$ solrctl collection --create party_name -c party_name_config
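(As a quick sanity check, not part of the Quick Start steps, the instancedir and collection can then be listed to confirm both exist:)
$ solrctl instancedir --list
$ solrctl collection --list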
Now when I run the following script:
hadoop --config /etc/hadoop/conf.cloudera.hdfs jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j ~/search/log4j.properties --morphline-file ~/search/readCSV.conf --output-dir hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/civil/solr/party-name --verbose --go-live --zk-host dwh-mst-dev02.stor.nccourts.org:2181/solr --collection party_name hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/civil/party_search
I am receiving the following exception:
2726 [Thread-18] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local542592213_0001
java.lang.Exception: org.kitesdk.morphline.api.MorphlineRuntimeException: org.apache.solr.core.SolrResourceNotFoundException: Can't find resource 'solrconfig.xml' in classpath or '/home/iapima/file:/tmp/hadoop-iapima/mapred/local/1490304732115/07193328-e9c3-454c-8523-4a782f9371e4.solr.zip/conf'
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549)
Caused by: org.kitesdk.morphline.api.MorphlineRuntimeException: org.apache.solr.core.SolrResourceNotFoundException: Can't find resource 'solrconfig.xml' in classpath or '/home/iapima/file:/tmp/hadoop-iapima/mapred/local/1490304732115/07193328-e9c3-454c-8523-4a782f9371e4.solr.zip/conf'
at org.kitesdk.morphline.solr.SolrLocator.getIndexSchema(SolrLocator.java:209)
at org.apache.solr.hadoop.morphline.MorphlineMapRunner.<init>(MorphlineMapRunner.java:141)
at org.apache.solr.hadoop.morphline.MorphlineMapper.setup(MorphlineMapper.java:75)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.core.SolrResourceNotFoundException: Can't find resource 'solrconfig.xml' in classpath or '/home/iapima/file:/tmp/hadoop-iapima/mapred/local/1490304732115/07193328-e9c3-454c-8523-4a782f9371e4.solr.zip/conf'
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:362)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:308)
at org.apache.solr.core.Config.<init>(Config.java:117)
at org.apache.solr.core.Config.<init>(Config.java:87)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:167)
at org.kitesdk.morphline.solr.SolrLocator.getIndexSchema(SolrLocator.java:201)
... 11 more
When I checked the party_name_config directory created in step 1, solrconfig.xml does exist under the conf sub-directory.
I am running on CDH 5.10.
Help is appreciated. Thanks
Created on 03-27-2017 06:50 PM - edited 03-27-2017 06:54 PM
Looks like the error is saying it is trying to find solrconfig.xml locally and is not able to find it.
1. I noticed you are passing hadoop --config /etc/hadoop/conf.cloudera.hdfs ->
try passing hadoop --config /etc/hadoop/conf.cloudera.yarn instead, as the MapReduceIndexerTool is a MapReduce job and needs that configuration. Make sure the YARN and Solr gateway roles are deployed on the node you are running this from.
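For example, the start of your command would become (everything after the jar argument unchanged):
hadoop --config /etc/hadoop/conf.cloudera.yarn jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool ...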
2. Can you check under ZooKeeper if you have all the configs placed?
Login to dwh-mst-dev02.stor.nccourts.org and do
zookeeper-client
ls /solr
ls /solr/configs
ls /solr/configs/party_name_config
ls /solr/configs/party_name_config/solrconfig.xml
Make sure all of this is present under /solr in ZooKeeper.
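If it is easier, you can also pull the uploaded config back out of ZooKeeper with solrctl and inspect it on disk (the /tmp path below is just an example destination):
solrctl instancedir --get party_name_config /tmp/party_name_config_check
Then confirm solrconfig.xml is present in the downloaded directory.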
3. Can you paste the content of ~/search/readCSV.conf? Make sure you have zkHost: dwh-mst-dev02.stor.nccourts.org:2181/solr set in your morphline config.
4. Do you have $HOME set up?
solrctl instancedir --create party_name_config $HOME/party_name_config/
5. Here is the Cloudera example for MRIT; please have a look at this.
Created 03-28-2017 07:32 AM
I made the suggested change to point to /etc/hadoop/conf.cloudera.yarn. That took care of the earlier error. When I reran the script, I got the error below.
-------------
Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:275)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=1966-05-19 10:36:59.373733] unknown field 'file_length'
-----------
It seems not to like the id field, which is a string representation of a timestamp.
Here is an excerpt of the schema that I am including:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<uniqueKey>id</uniqueKey>
<field name="county" type="text_general" indexed="false" stored="true"/>
<field name="year" type="int" indexed="false" stored="true"/>
<field name="court_type" type="text_general" indexed="false" stored="true"/>
<field name="seq_num" type="int" indexed="false" stored="true"/>
<field name="role" type="text_general" indexed="false" stored="true"/>
<field name="num" type="int" indexed="false" stored="true"/>
<field name="stat" type="text_general" indexed="false" stored="true"/>
<field name="biz_name" type="text_general" indexed="true" stored="true"/>
--------------------
And here is an excerpt of my files to be indexed:
id,county,year,court_type,seq_num,party_role,party_num,party_status,biz_name,prefix,last_name,first_name,middle_name,suffix,in_regards_to,case_status,row_of_origin
1994-11-03 12:15:32.12172,180,1994,CVM,558,P,1,DISPOSED,WINDSOR ARMS HOUSING LTD PTNSHP,null,null,null,null,null,null,null,T48
1999-04-16 14:28:37.009778,000,1999,CVD,862,P,1,null,null,null,CRITZER,KAREN,YVONNE,null,null,null,T46
-----------------------
Here is the readCSV.conf
SOLR_LOCATOR : {
  # Name of solr collection
  collection : party_name
  # ZooKeeper ensemble
  zkHost : "dwh-mst-dev02.stor.nccourts.org:2181/solr"
  # The maximum number of documents to send to Solr per network batch (throughput knob)
  # batchSize : 100
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [id,county,year,court_type,seq_num,party_role,party_num,party_status,biz_name,prefix,last_name,first_name,middle_name,suffix,in_regards_to,case_status,row_of_origin]
          ignoreFirstLine : true
          trim : true
          charset : UTF-8
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      # load the record into a Solr server or MapReduce Reducer.
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
--
Thanks
Created 03-28-2017 08:25 AM
Glad to hear the original error went away.
The new error is related to a mismatch between the schema and the morphline.conf fields.
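One suggestion (this command is not in your current morphline): Kite morphlines ship a sanitizeUnknownSolrFields command that drops any record fields not defined in the Solr schema, which would get rid of errors like unknown field 'file_length'. A minimal sketch, placed just before loadSolr:
{
  sanitizeUnknownSolrFields {
    # drop record fields that are not defined in the schema of this collection
    solrLocator : ${SOLR_LOCATOR}
  }
}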
If you want the id field as a timestamp, you have to use this:
http://kitesdk.org/docs/1.1.0/morphlines/morphlines-reference-guide.html#convertTimestamp
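A minimal sketch of that command, placed after readCSV (the input format below is just my guess from your sample data and will likely need adjusting):
{
  convertTimestamp {
    field : id
    inputFormats : ["yyyy-MM-dd HH:mm:ss.SSSSSS"]
    inputTimezone : UTC
    outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    outputTimezone : UTC
  }
}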
Otherwise, for now, do you want to test it with string field data and see if that works? Just try with 3-4 columns.
Also look at this for unique field definition
https://wiki.apache.org/solr/UniqueKey
Also, can you provide the full stack trace of the error?
Also, for testing purposes, you can pass the --dry-run option with your MRIT command, and once that succeeds you can try --go-live.
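For example (your command unchanged except --go-live swapped for --dry-run, which runs the morphline locally and prints documents to stdout instead of loading them into Solr):
hadoop --config /etc/hadoop/conf.cloudera.yarn jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j ~/search/log4j.properties --morphline-file ~/search/readCSV.conf --output-dir hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/civil/solr/party-name --verbose --dry-run --zk-host dwh-mst-dev02.stor.nccourts.org:2181/solr --collection party_name hdfs://dwh-mst-dev02.stor.nccourts.org:8020/hdfs/data-lake/civil/party_search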
Created 03-28-2017 09:15 AM
The key being a string is not an issue, as there will be no searches based on the timestamp. Is there a way in the morphline to specify that the field is indeed a string and not a timestamp?
Below is the full stack trace:
Error: java.io.IOException: Batch Write Failure
at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:181)
at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:275)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=1966-05-19 10:36:59.365118] unknown field 'file_length'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:940)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1095)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:701)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2135)
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.solr.hadoop.BatchWriter.runUpdate(BatchWriter.java:135)
at org.apache.solr.hadoop.BatchWriter$Batch.run(BatchWriter.java:90)
at org.apache.solr.hadoop.BatchWriter.queueBatch(BatchWriter.java:180)
... 9 more
98871 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1489673434857_0012
Created 03-29-2017 04:50 AM
Frankly, I am at a loss. There was some mismatch between my morphline fields and the schema, but I fixed that. There are no unaccounted-for columns. Here is my schema, which I confirmed by retrieving it from the Solr web interface:
<uniqueKey>id</uniqueKey>
<field name="county" type="text_general" indexed="false" stored="true"/>
<field name="year" type="int" indexed="false" stored="true"/>
<field name="court_type" type="text_general" indexed="false" stored="true"/>
<field name="seq_num" type="int" indexed="false" stored="true"/>
<field name="party_role" type="text_general" indexed="false" stored="true"/>
<field name="party_num" type="int" indexed="false" stored="true"/>
<field name="party_status" type="text_general" indexed="false" stored="true"/>
<field name="biz_name" type="text_general" indexed="true" stored="true"/>
<field name="prefix" type="text_general" indexed="false" stored="true"/>
<field name="last_name" type="text_general" indexed="true" stored="true"/>
<field name="first_name" type="text_general" indexed="true" stored="true"/>
<field name="middle_name" type="text_general" indexed="true" stored="true"/>
<field name="suffix" type="text_general" indexed="false" stored="true"/>
<field name="in_regards_to" type="string" indexed="false" stored="true"/>
<field name="case_status" type="string" indexed="false" stored="true"/>
<field name="row_of_origin" type="string" indexed="false" stored="true"/>
And here are the fields as defined in readCSV.conf:
columns : [id,county,year,court_type,seq_num,party_role,party_num,party_status,biz_name,prefix,last_name,first_name,middle_name,suffix,in_regards_to,case_status,row_of_origin]
They are identical. Still the same exception. Any other advice is appreciated.
Created 03-30-2017 06:51 AM
Is there an alternative way to index HDFS files for use by Solr, other than the MapReduceIndexerTool? Like a MapReduce Java program. Any samples that can be shared are welcome.
Created 03-31-2017 02:42 PM
Simple Solr indexing example using post:
http://www.solrtutorial.com/solr-in-5-minutes.html
Or you can use SolrJ:
http://www.solrtutorial.com/solrj-tutorial.html
You can even try indexing using Flume and the Morphline Solr Sink:
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/search_tutorial.html
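For a quick test outside MapReduce entirely, you can also post a CSV file straight to Solr's CSV update handler over HTTP (a sketch; 8983 is the default Solr port, and party_names.csv is just a stand-in for one of your files):
curl 'http://dwh-mst-dev02.stor.nccourts.org:8983/solr/party_name/update/csv?commit=true' -H 'Content-type:text/csv' --data-binary @party_names.csv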