<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: MapReduceIndexerTool problem in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/MapReduceIndexerTool-problem/m-p/66457#M77177</link>
    <description>&lt;P&gt;I use the same command and have no issues.&lt;/P&gt;&lt;P&gt;According to the logs:&lt;/P&gt;&lt;PRE&gt;Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1523546159827_0013_r_000000_0/map_0.out&lt;/PRE&gt;&lt;P&gt;So, I would guess that your CSV is too big, and when the reducer tries to load it there is not enough space in the local dirs of the YARN NodeManager.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you try setting more reducers by using:&lt;/P&gt;&lt;PRE&gt;--reducers 4&lt;/PRE&gt;&lt;P&gt;or more (based on your partitions and the CSV size). You can also set more mappers, but based on the log it is the reducer that is suffering.&lt;/P&gt;&lt;P&gt;More details:&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.cloudera.com/documentation/enterprise/5-13-x/topics/search_mapreduceindexertool.html#concept_pjs_3sd_3v" target="_self"&gt;https://www.cloudera.com/documentation/enterprise/5-13-x/topics/search_mapreduceindexertool.html#concept_pjs_3sd_3v&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 17 Apr 2018 15:27:12 GMT</pubDate>
    <dc:creator>GeKas</dc:creator>
    <dc:date>2018-04-17T15:27:12Z</dc:date>
    <item>
      <title>MapReduceIndexerTool problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/MapReduceIndexerTool-problem/m-p/66324#M77176</link>
      <description>&lt;P&gt;Hi&amp;nbsp;All,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We’ve installed Solr and now we are trying to index a CSV file with 250 million rows using the tool “MapReduceIndexerTool”.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;After 90 minutes of running, we receive the following error:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;18/04/13 13:44:06 INFO mapreduce.Job: map 67% reduce 0%&lt;BR /&gt;18/04/13 13:49:13 INFO mapreduce.Job: map 100% reduce 0%&lt;BR /&gt;18/04/13 13:49:19 INFO mapreduce.Job: Task Id : attempt_1523546159827_0013_r_000000_0, Status : FAILED&lt;BR /&gt;Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#10&lt;BR /&gt;at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)&lt;BR /&gt;at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)&lt;BR /&gt;at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)&lt;BR /&gt;at java.security.AccessController.doPrivileged(Native Method)&lt;BR /&gt;at javax.security.auth.Subject.doAs(Subject.java:422)&lt;BR /&gt;at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)&lt;BR /&gt;at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)&lt;BR /&gt;Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1523546159827_0013_r_000000_0/map_0.out&lt;BR /&gt;at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:441)&lt;BR /&gt;at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)&lt;BR /&gt;at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)&lt;BR /&gt;at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)&lt;BR /&gt;at org.apache.hadoop.mapreduce.task.reduce.OnDiskMapOutput.&amp;lt;init&amp;gt;(OnDiskMapOutput.java:65)&lt;BR /&gt;at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:269)&lt;BR /&gt;at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:539)&lt;BR /&gt;at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:348)&lt;BR /&gt;at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The command we execute is:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-*/jars/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --morphline-file /tmp/morphlines_pocsim.conf --output-dir hdfs://vitl000361:8020/user/hdfs/pocsims --verbose --go-live --collection pocsims --zk localhost:2181/solr hdfs:///data/pocsims/trim-exported.csv&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The morphline file used:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SOLR_LOCATOR :&lt;BR /&gt;{ collection : POC_Sims&lt;BR /&gt;zkHost : "vitl000367:2181/solr,vitl000368:2181/solr,vitl000369:2181/solr"&lt;BR /&gt;batchSize : 1000 # batchSize&lt;BR /&gt;}&lt;/P&gt;
&lt;P&gt;morphlines : [&lt;BR /&gt;{ id : morphline1&lt;BR /&gt;importCommands : ["org.kitesdk.**", "org.apache.solr.**"]&lt;BR /&gt;commands : [&lt;BR /&gt;{ readCSV {&lt;BR /&gt;separator : ","&lt;BR /&gt;columns : [ID,DT_CR,DT_FU,DT_AT,HBL,HBT,EC,ST,CUS,CSP,RAG,CSP_N,NGIN,SIP]&lt;BR /&gt;ignoreFirstLine : true&lt;BR /&gt;quoteChar : ""&lt;BR /&gt;trim : false&lt;BR /&gt;charset : UTF-8&lt;BR /&gt;}&lt;BR /&gt;}&lt;BR /&gt;{ generateUUID { field : id } }&lt;BR /&gt;{ logDebug { format : "output record: {}", args : ["@{}"] }}&lt;BR /&gt;{ sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} }}&lt;BR /&gt;{ loadSolr { solrLocator : ${SOLR_LOCATOR} } }&lt;BR /&gt;]&lt;BR /&gt;}&lt;BR /&gt;]&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Could you please help us with this problem?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Ricardo Matos&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 13:06:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/MapReduceIndexerTool-problem/m-p/66324#M77176</guid>
      <dc:creator>rribeirom</dc:creator>
      <dc:date>2022-09-16T13:06:05Z</dc:date>
    </item>
    <item>
      <title>Re: MapReduceIndexerTool problem</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/MapReduceIndexerTool-problem/m-p/66457#M77177</link>
      <description>&lt;P&gt;I use the same command and have no issues.&lt;/P&gt;&lt;P&gt;According to the logs:&lt;/P&gt;&lt;PRE&gt;Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1523546159827_0013_r_000000_0/map_0.out&lt;/PRE&gt;&lt;P&gt;So, I would guess that your CSV is too big, and when the reducer tries to load it there is not enough space in the local dirs of the YARN NodeManager.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you try setting more reducers by using:&lt;/P&gt;&lt;PRE&gt;--reducers 4&lt;/PRE&gt;&lt;P&gt;or more (based on your partitions and the CSV size). You can also set more mappers, but based on the log it is the reducer that is suffering.&lt;/P&gt;&lt;P&gt;More details:&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.cloudera.com/documentation/enterprise/5-13-x/topics/search_mapreduceindexertool.html#concept_pjs_3sd_3v" target="_self"&gt;https://www.cloudera.com/documentation/enterprise/5-13-x/topics/search_mapreduceindexertool.html#concept_pjs_3sd_3v&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Apr 2018 15:27:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/MapReduceIndexerTool-problem/m-p/66457#M77177</guid>
      <dc:creator>GeKas</dc:creator>
      <dc:date>2018-04-17T15:27:12Z</dc:date>
    </item>
  </channel>
</rss>

