Support Questions

disk space issue on nodes for distcp data transfer from hdfs to s3

Solved

New Contributor

Hi,

 

I am using the following command to transfer data from HDFS to S3.

 

hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs  hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/

 

What I have noticed is that the mapper task which copies data to S3 first copies it locally into the /tmp/hadoop-yarn/s3 directory on each node. This is causing disk space issues on the nodes, since the size of the data being transferred is in TBs.

 

Is there a way to configure the temporary working directory for the mapper? Can it use HDFS space rather than local disk space?

 

Thanks in advance.

Jagdish

1 ACCEPTED SOLUTION

Accepted Solutions

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

New Contributor

Alright! I figured out the fix for this.

 

The temp buffer directory for S3 is configurable with the property "fs.s3.buffer.dir"; its default is defined in the core-default.xml config file.

 

The default config is as shown below.

 

<property>
<name>fs.s3.buffer.dir</name>
<value>${hadoop.tmp.dir}/s3</value>
<description>Determines where on the local filesystem the S3 filesystem
should store files before sending them to S3
(or after retrieving them from S3).
</description>
</property>

 

This doesn't require any service restarts, so it is an easy fix.
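For example (a sketch based on the command in the question; /data/tmp/s3 is a placeholder path, not a default), the property can also be overridden per job on the distcp command line rather than in a config file:

```bash
# Point the S3 buffer directory at a volume with enough free space.
# /data/tmp/s3 is a placeholder; choose a large local volume on each node.
hadoop distcp \
  -Dfs.s3.buffer.dir=/data/tmp/s3 \
  -Dmapreduce.map.memory.mb=3096 \
  -Dmapred.task.timeout=60000000 \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro \
  s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/
```

Since it is a job-level property, passing it with -D also avoids any service restart.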

 

4 REPLIES 4

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

Master Guru
Thank you for following up with the found solution here! It will benefit others looking for similar info.

We also recommend use of the S3A connector going forward, via the s3a:// scheme.
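For example (a hedged sketch; the bucket name and credentials are the placeholders from the original command), the same copy via S3A would pass credentials as Hadoop properties instead of embedding them in the URL:

```bash
hadoop distcp \
  -Dfs.s3a.access.key=ACCESS_ID \
  -Dfs.s3a.secret.key=ACCESS_KEY \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro \
  s3a://S3_BUCKET/
```

Note that S3A has its own local buffer setting, fs.s3a.buffer.dir, so the same disk-space consideration applies there.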

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

New Contributor
Thanks Harsh.
Actually I tried s3a; however, it throws a filesystem exception:
"java.io.IOException: No FileSystem for scheme: s3a"
Looks like some jar conflict issue, though I didn't get a chance to look deep enough.
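That particular exception usually means the S3A filesystem class is not on the classpath or not registered, rather than a version conflict. A hedged sketch of a possible workaround (the jar path below is a placeholder and depends on the distribution):

```bash
# Make sure the hadoop-aws jar, which contains S3AFileSystem, is on the
# classpath; the path below is a placeholder for your distribution's layout.
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws.jar"

# Explicitly register the S3A implementation for the job.
hadoop distcp \
  -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  hdfs:///test/data/export/file.avro \
  s3a://S3_BUCKET/
```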

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

New Contributor

Where does this property need to be set? There is no core-default.xml file in my deployment. I am using CDH 5.12.

Should it be set service-wide?
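For reference, core-default.xml ships inside the Hadoop jars, so site-level overrides typically go in core-site.xml instead (in CDH this can be applied service-wide through a Cloudera Manager safety valve for core-site.xml). A sketch of the override, where /data/tmp/s3 is a placeholder path:

```xml
<!-- core-site.xml override; /data/tmp/s3 is a placeholder path -->
<property>
  <name>fs.s3.buffer.dir</name>
  <value>/data/tmp/s3</value>
</property>
```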
