Created on 04-13-2018 07:54 AM - edited 09-16-2022 06:06 AM
Hi All,
We’ve installed Solr, and now we are trying to index a CSV file with 250 million rows using the tool "MapReduceIndexerTool".
After 90 minutes of running, we receive the following error:
18/04/13 13:44:06 INFO mapreduce.Job: map 67% reduce 0%
18/04/13 13:49:13 INFO mapreduce.Job: map 100% reduce 0%
18/04/13 13:49:19 INFO mapreduce.Job: Task Id : attempt_1523546159827_0013_r_000000_0, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#10
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1523546159827_0013_r_000000_0/map_0.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:441)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
at org.apache.hadoop.mapreduce.task.reduce.OnDiskMapOutput.<init>(OnDiskMapOutput.java:65)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:269)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:539)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:348)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
The command we execute is:
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-*/jars/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --morphline-file /tmp/morphlines_pocsim.conf --output-dir hdfs://vitl000361:8020/user/hdfs/pocsims --verbose --go-live --collection pocsims --zk localhost:2181/solr hdfs:///data/pocsims/trim-exported.csv
The morphline file used is:
SOLR_LOCATOR : {
  collection : POC_Sims
  zkHost : "vitl000367:2181/solr,vitl000368:2181/solr,vitl000369:2181/solr"
  batchSize : 1000 # batchSize
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [ID,DT_CR,DT_FU,DT_AT,HBL,HBT,EC,ST,CUS,CSP,RAG,CSP_N,NGIN,SIP]
          ignoreFirstLine : true
          quoteChar : ""
          trim : false
          charset : UTF-8
        }
      }
      { generateUUID { field : id } }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
Could you please help us with this problem?
Regards,
Ricardo Matos
Created 04-17-2018 08:27 AM
I use the same command and have no issues.
According to the logs:
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1523546159827_0013_r_000000_0/map_0.out
So I would guess that your CSV is too big, and when the reducer tries to pull in the map outputs, there is not enough space in the local directories of the YARN NodeManager.
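To confirm that, you can check how much free space the configured NodeManager local directories have on the worker nodes while the job is running. A minimal sketch, assuming the default CDH client config location /etc/hadoop/conf (adjust the paths to your cluster):

# Show the configured value of yarn.nodemanager.local-dirs
grep -A1 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml
# Then check free space on the filesystem(s) behind those directories,
# for example (replace /yarn/nm with the directories printed above):
df -h /yarn/nm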
Can you try setting more reducers by using:
--reducers 4
or more (based on your partitions and the CSV size)? You can also set more mappers, but based on the log it is the reducer that is suffering. An example command is shown below.
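For example, this would be your original command with the flag added (just a sketch; pick a reducer count that fits your cluster):

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-*/jars/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file /tmp/morphlines_pocsim.conf \
  --output-dir hdfs://vitl000361:8020/user/hdfs/pocsims \
  --reducers 4 \
  --verbose --go-live --collection pocsims \
  --zk localhost:2181/solr \
  hdfs:///data/pocsims/trim-exported.csv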