Disk space issue on nodes for distcp data transfer from HDFS to S3


Hi,

 

I am using the following command to transfer data from HDFS to S3.

 

hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs  hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/

 

What I have noticed is that the mapper task that copies data to S3 first buffers the data locally in the /tmp/hadoop-yarn/s3 directory on each individual node. This is causing disk space issues on the nodes, since the data being transferred is in the TBs.

 

Is there a way to configure the temporary working directory for the mapper? Can it use HDFS disk space rather than local disk space?

 

Thanks in advance.

Jagdish

1 ACCEPTED SOLUTION


Alright! I figured out the fix for this.

 

The temp buffer directory for S3 is configurable with the property "fs.s3.buffer.dir" (its default is defined in the core-default.xml config file).

 

The default config is as shown below.

 

<property>
  <name>fs.s3.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3</value>
  <description>Determines where on the local filesystem the S3 filesystem
    should store files before sending them to S3
    (or after retrieving them from S3).
  </description>
</property>

 

This doesn't require any service restarts, so it is an easy fix.
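For example, a minimal sketch of applying this per job, passing the property as a generic -D option on the same distcp command (the path /data1/tmp/s3 is an assumed example of a local mount with more free space):

hadoop distcp \
  -Dfs.s3.buffer.dir=/data1/tmp/s3 \
  -Dmapreduce.map.memory.mb=3096 \
  -Dmapred.task.timeout=60000000 \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro \
  s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/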

 


4 REPLIES

Mentor
Thank you for following up with the solution you found here! It will benefit others looking for similar info.

We also recommend use of the S3A connector going forward, via the s3a:// scheme.
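For reference, a minimal sketch of the same copy over s3a (it assumes the hadoop-aws connector and its AWS SDK dependency are on the classpath; /data1/tmp/s3a is an assumed buffer path, and the credential properties shown are the standard S3A ones):

hadoop distcp \
  -Dfs.s3a.access.key=ACCESS_ID \
  -Dfs.s3a.secret.key=ACCESS_KEY \
  -Dfs.s3a.buffer.dir=/data1/tmp/s3a \
  hdfs:///test/data/export/file.avro \
  s3a://S3_BUCKET/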

Thanks Harsh.
Actually, I tried s3a; however, it throws a filesystem exception:
"java.io.IOException: No FileSystem for scheme: s3a"
It looks like a JAR conflict issue, though I haven't had a chance to look into it deeply.
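A common cause of that particular exception is that the S3A filesystem class is not registered, or the hadoop-aws jar is missing from the classpath. A minimal core-site.xml sketch of the registration entry (assuming the hadoop-aws jar is actually deployed):

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>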

New Contributor

Where does this property need to be set? There is no core-default.xml file in my deployment. I am using CDH 5.12.

Should it be set service-wide?