04-10-2015 12:27 AM
I am using the following command to transfer data from HDFS to S3.
hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/
What I have noticed is that the mapper task that copies data to S3 first copies the data locally into the /tmp/hadoop-yarn/s3 directory on each individual node. This is causing disk space issues on the nodes, since the data being transferred is terabytes in size.
Is there a way to configure the temporary working directory for the mapper? Can it use HDFS disk space rather than local disk space?
Thanks in advance.
04-14-2015 06:51 AM
Alright! I figured out the fix for this.
The temporary buffer directory for S3 is configurable with the property "fs.s3.buffer.dir", which is defined in the core-default.xml config file.
The default config is as shown below.
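<property>
  <name>fs.s3.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3</value>
  <description>Determines where on the local filesystem the S3 filesystem
  should store files before sending them to S3
  (or after retrieving them from S3).
  </description>
</property>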
This doesn't require any service restarts, so it is an easy fix.
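As a rough sketch, the property can most likely also be overridden per job with a -D option on the distcp command line, the same way the original command already passes other properties. The buffer path /data/tmp/s3-buffer below is a hypothetical placeholder, not something from my setup. Per the property description it has to be a local filesystem directory, so it should point at a local volume with enough free space on every node. The snippet only assembles and prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical buffer location (an assumption): a local directory on a
# volume with enough free space on every node that runs a mapper.
BUFFER_DIR="/data/tmp/s3-buffer"

# Same distcp invocation as in the question, with fs.s3.buffer.dir
# overridden for this job only. ACCESS_ID, ACCESS_KEY and S3_BUCKET are
# the placeholders from the original command.
set -- hadoop distcp \
  "-Dfs.s3.buffer.dir=${BUFFER_DIR}" \
  -Dmapreduce.map.memory.mb=3096 \
  -Dmapred.task.timeout=60000000 \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro \
  "s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/"

# Print the assembled command for review.
echo "$@"

# Once it looks right, run it by uncommenting the next line:
# "$@"
```

Passing the override on the command line keeps the change scoped to a single job, which is handy for testing before changing anything cluster-wide.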
04-14-2015 08:35 AM
04-15-2015 02:55 AM
02-11-2018 08:18 PM - edited 02-11-2018 08:23 PM
Where does this property need to be set? There is no core-default.xml file in my deployment. I am using CDH 5.12.
Should it be set service-wide?