Support Questions

disk space issue on nodes for distcp data transfer from hdfs to s3

Solved

New Contributor

Hi,

 

I am using the following command to transfer data from HDFS to S3.

 

hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs  hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/

 

What I have noticed is that the mapper task which copies data to S3 first copies it locally into the /tmp/hadoop-yarn/s3 directory on each node. This is causing disk space issues on the nodes, since the size of the data being transferred is in TBs.

 

Is there a way to configure the temporary working directory for the mapper? Can it use HDFS space rather than local disk space?

 

Thanks in advance.

Jagdish

1 ACCEPTED SOLUTION

Accepted Solutions

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

New Contributor

Alright! I figured out the fix for this.

 

The temp buffer directory for S3 is configurable with the property "fs.s3.buffer.dir"; its default is defined in the core-default.xml config file.

 

The default config is as shown below.

 

<property>
<name>fs.s3.buffer.dir</name>
<value>${hadoop.tmp.dir}/s3</value>
<description>Determines where on the local filesystem the S3 filesystem
should store files before sending them to S3
(or after retrieving them from S3).
</description>
</property>

 

This doesn't require any service restarts, so it is an easy fix.
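For example (a sketch based on the command in the question; /data/tmp/s3 is a placeholder path, not a default), the property can also be overridden per job on the distcp command line rather than in a config file:

```bash
# Point the S3 buffer directory at a volume with enough free space.
# /data/tmp/s3 is a placeholder; choose a large local volume on each node.
hadoop distcp \
  -Dfs.s3.buffer.dir=/data/tmp/s3 \
  -Dmapreduce.map.memory.mb=3096 \
  -Dmapred.task.timeout=60000000 \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro \
  s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/
```

Since it is a job-level property, passing it with -D also avoids any service restart.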

 

4 REPLIES 4

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

Master Guru
Thank you for following up with the found solution here! It will benefit others looking for similar info.

We also recommend use of the S3A connector going forward, via the s3a:// scheme.
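For example (a hedged sketch; the bucket name and credentials are the placeholders from the original command), the same copy via S3A would pass credentials as Hadoop properties instead of embedding them in the URL:

```bash
hadoop distcp \
  -Dfs.s3a.access.key=ACCESS_ID \
  -Dfs.s3a.secret.key=ACCESS_KEY \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro \
  s3a://S3_BUCKET/
```

Note that S3A has its own local buffer setting, fs.s3a.buffer.dir, so the same disk-space consideration applies there.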

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

New Contributor
Thanks Harsh.
Actually I tried s3a; however, it throws a filesystem exception:
"java.io.IOException: No FileSystem for scheme: s3a"
Looks like some jar conflict issue, though I didn't get a chance to look deep enough.
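That particular exception usually means the S3A filesystem class is not on the classpath or not registered, rather than a version conflict. A hedged sketch of a possible workaround (the jar path below is a placeholder and depends on the distribution):

```bash
# Make sure the hadoop-aws jar, which contains S3AFileSystem, is on the
# classpath; the path below is a placeholder for your distribution's layout.
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws.jar"

# Explicitly register the S3A implementation for the job.
hadoop distcp \
  -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  hdfs:///test/data/export/file.avro \
  s3a://S3_BUCKET/
```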

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

New Contributor

Where does this property need to be set? There is no core-default.xml file in my deployment. I am using CDH 5.12.

Should it be set service-wide?
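For reference, core-default.xml ships inside the Hadoop jars, so site-level overrides typically go in core-site.xml instead (in CDH this can be applied service-wide through a Cloudera Manager safety valve for core-site.xml). A sketch of the override, where /data/tmp/s3 is a placeholder path:

```xml
<!-- core-site.xml override; /data/tmp/s3 is a placeholder path -->
<property>
  <name>fs.s3.buffer.dir</name>
  <value>/data/tmp/s3</value>
</property>
```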
