Created on 04-10-2015 12:27 AM - edited 09-16-2022 02:26 AM
Hi,
I am using the following command to transfer data from HDFS to S3:
hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/
What I have noticed is that the mapper task which copies data to S3 first copies it locally into the /tmp/hadoop-yarn/s3 directory on each node. This is causing disk space issues on the nodes, since the data being transferred is several TB in size.
Is there a way to configure the temporary working directory for the mapper? Can it use HDFS space rather than local disk space?
Thanks in advance.
Jagdish
Created 04-14-2015 06:51 AM
Alright! I figured out the fix for this.
The temporary buffer directory for S3 is configurable with the property "fs.s3.buffer.dir" in the core-default.xml config file.
The default configuration is shown below.
<property>
  <name>fs.s3.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3</value>
  <description>Determines where on the local filesystem the S3 filesystem
    should store files before sending them to S3
    (or after retrieving them from S3).
  </description>
</property>
This doesn't require any service restart, so it's an easy fix.
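For reference, the property can also be overridden per job on the distcp command line instead of editing config files. A minimal sketch, assuming /data/tmp/s3buf is a hypothetical local directory on a volume with enough free space:

hadoop distcp \
  -Dfs.s3.buffer.dir=/data/tmp/s3buf \
  -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/

Since distcp accepts generic -D options (as in the original command above), this limits the change to a single job rather than the whole cluster.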
Created on 02-11-2018 08:18 PM - edited 02-11-2018 08:23 PM
Where does this property need to be set? There is no core-default.xml file in my deployment; I am using CDH 5.12.
Should it be set service-wide?