
MapReduceIndexerTool problem

New Contributor

Hi All,

 

We’ve installed Solr and now we are trying to index a CSV file with 250 million rows using the tool “MapReduceIndexerTool”.

 

After running for 90 minutes, we receive the following error:

 

18/04/13 13:44:06 INFO mapreduce.Job: map 67% reduce 0%
18/04/13 13:49:13 INFO mapreduce.Job: map 100% reduce 0%
18/04/13 13:49:19 INFO mapreduce.Job: Task Id : attempt_1523546159827_0013_r_000000_0, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#10
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1523546159827_0013_r_000000_0/map_0.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:441)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
at org.apache.hadoop.mapreduce.task.reduce.OnDiskMapOutput.<init>(OnDiskMapOutput.java:65)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:269)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:539)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:348)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)

 

The command we execute is:

 

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-*/jars/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --morphline-file /tmp/morphlines_pocsim.conf --output-dir hdfs://vitl000361:8020/user/hdfs/pocsims --verbose --go-live --collection pocsims --zk localhost:2181/solr hdfs:///data/pocsims/trim-exported.csv

 

The morphline file used is:

 

SOLR_LOCATOR : {
  collection : POC_Sims
  zkHost : "vitl000367:2181/solr,vitl000368:2181/solr,vitl000369:2181/solr"
  batchSize : 1000   # batchSize
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [ID,DT_CR,DT_FU,DT_AT,HBL,HBT,EC,ST,CUS,CSP,RAG,CSP_N,NGIN,SIP]
          ignoreFirstLine : true
          quoteChar : ""
          trim : false
          charset : UTF-8
        }
      }
      { generateUUID { field : id } }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]

 

Could you please help us with this problem?

 

Regards,

Ricardo Matos

1 ACCEPTED SOLUTION

Super Collaborator

I use the same command and have no issues.

According to the logs:

Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1523546159827_0013_r_000000_0/map_0.out

So, I would guess that your CSV is too big, and when the reducer tries to load it there is not enough space in the local directories of the YARN NodeManager.
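As a quick sanity check (just a sketch, assuming a stock CDH layout where yarn-site.xml sits under /etc/hadoop/conf and the local dirs live under /yarn/nm; adjust both paths to your cluster), you could confirm which local directories the NodeManager uses and how much free space they have on each worker node:

# Show the configured NodeManager local directories (yarn.nodemanager.local-dirs)
grep -A1 "yarn.nodemanager.local-dirs" /etc/hadoop/conf/yarn-site.xml

# Check free space on the filesystem holding those directories
# (/yarn/nm is an assumed example path; use the directories printed above)
df -h /yarn/nm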

 

Can you try setting more reducers by using:

--reducers 4

or more (based on your partitions and the CSV size)? You can also set more mappers, but based on the log it is the reducer that is struggling.
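For illustration only, here is a sketch of how your original command would look with the extra flag added (the value 4 is just the example above; tune it to your cluster and data size):

# Same command as in the question, with --reducers added (value is illustrative)
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH-*/jars/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --morphline-file /tmp/morphlines_pocsim.conf --output-dir hdfs://vitl000361:8020/user/hdfs/pocsims --reducers 4 --verbose --go-live --collection pocsims --zk localhost:2181/solr hdfs:///data/pocsims/trim-exported.csv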

More details:

https://www.cloudera.com/documentation/enterprise/5-13-x/topics/search_mapreduceindexertool.html#con...

