New Contributor
Posts: 4
Registered: ‎04-10-2015
Accepted Solution

disk space issue on nodes for distcp data transfer from hdfs to s3

Hi,

I am using the following command to transfer data from HDFS to S3.

hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs  hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/

 

What I have noticed is that each mapper task that copies data to S3 first copies the data locally into the /tmp/hadoop-yarn/s3 directory on its node. This causes disk space issues on the nodes, since the data being transferred is in the TB range.

Is there a way to configure the temporary working directory for the mapper? Can it use HDFS disk space rather than local disk space?

Thanks in advance.

Jagdish

New Contributor
Posts: 4
Registered: ‎04-10-2015

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

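Alright! I figured out the fix for this.

The temp buffer directory for S3 is configurable with the property "fs.s3.buffer.dir", which is defined in the core-default.xml config file. Since core-default.xml only holds the shipped defaults, the value is changed by overriding the property in core-site.xml or by passing it with -D on the command line.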

The default config is as shown below.

 

<property>
<name>fs.s3.buffer.dir</name>
<value>${hadoop.tmp.dir}/s3</value>
<description>Determines where on the local filesystem the S3 filesystem
should store files before sending them to S3
(or after retrieving them from S3).
</description>
</property>

 

This doesn't require restarting any services, so it is an easy fix.
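
For anyone who wants to try this per job, here is a rough sketch of the original command with the buffer directory overridden on the command line. The path /data/tmp/s3 is only a placeholder I made up (use a local directory on a volume with enough free space on every node), and I am assuming the -D override reaches the mappers the same way the other -D options in the command do. The snippet only builds and prints the command so it can be reviewed before running.

```shell
# Hypothetical local directory on a volume with enough free space on each node.
BUFFER_DIR="/data/tmp/s3"

# Same distcp command as in the first post, with fs.s3.buffer.dir overridden.
# -D options go right after "distcp", before the distcp-specific flags.
DISTCP_CMD="hadoop distcp -Dfs.s3.buffer.dir=${BUFFER_DIR} -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/"

# Print the command for review; once it looks right, run it with: eval "$DISTCP_CMD"
echo "$DISTCP_CMD"
```

Note that this still buffers on local disk; it only moves the buffer to a location you choose.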

 

Posts: 1,657
Kudos: 320
Solutions: 258
Registered: ‎07-31-2013

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

Thank you for following up here with the solution you found! It will
benefit others looking for similar info.

We also recommend use of the S3A connector going forward, via the s3a:// scheme.

New Contributor
Posts: 4
Registered: ‎04-10-2015

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

Thanks, Harsh.
Actually, I tried s3a; however, it throws a filesystem exception:
"java.io.IOException: No FileSystem for scheme: s3a"
It looks like a jar conflict issue, though I didn't get a chance to look into it deeply enough.
New Contributor
Posts: 2
Registered: ‎02-11-2018

Re: disk space issue on nodes for distcp data transfer from hdfs to s3

Where does this property need to be set? There is no core-default.xml file in my deployment. I am using CDH 5.12.

Should it be set service wide?
