disk space issue on nodes for distcp data transfer from hdfs to s3
- Labels: Apache Hadoop, Apache YARN, HDFS
Created on 04-10-2015 12:27 AM - edited 09-16-2022 02:26 AM
Hi,
I am using the following command to transfer data from HDFS to S3:
hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/
What I have noticed is that the mapper task which copies data to S3 first copies it locally into the /tmp/hadoop-yarn/s3 directory on each node. This is causing disk space issues on the nodes, since the data being transferred is several TBs in size.
Is there a way to configure the mapper's temporary working directory? Can it use HDFS disk space rather than local disk space?
Thanks in advance.
Jagdish
Created 04-14-2015 06:51 AM
Alright! I figured out the fix for this.
The temporary buffer directory for S3 is configurable with the property "fs.s3.buffer.dir"; its default is defined in core-default.xml.
The default config is as shown below.
<property>
  <name>fs.s3.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3</value>
  <description>Determines where on the local filesystem the S3 filesystem
    should store files before sending them to S3
    (or after retrieving them from S3).
  </description>
</property>
This doesn't require a service restart, so it's an easy fix.
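For example, the override can be passed per job on the distcp command line, so no config files need to be edited at all. A minimal sketch, assuming /data/disk1/s3tmp is a placeholder path on a volume with enough free space:

hadoop distcp \
  -Dfs.s3.buffer.dir=/data/disk1/s3tmp \
  -Dmapreduce.map.memory.mb=3096 \
  -Dmapred.task.timeout=60000000 \
  -i -log /tmp/export/logs \
  hdfs:///test/data/export/file.avro \
  s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/

Note that this buffer directory must be on the local filesystem; the connector stages each file there before uploading it, so it cannot point at HDFS.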
Created 04-14-2015 08:35 AM
Thanks for posting back the fix; it will benefit others looking for similar info.
We also recommend use of the S3A connector going forward, via the s3a:// scheme.
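For reference, an s3a version of the same copy might look like the following. A sketch, assuming a Hadoop build that bundles the S3A connector; the credential values and buffer path are placeholders:

hadoop distcp \
  -Dfs.s3a.access.key=ACCESS_ID \
  -Dfs.s3a.secret.key=ACCESS_KEY \
  -Dfs.s3a.buffer.dir=/data/disk1/s3tmp \
  hdfs:///test/data/export/file.avro \
  s3a://S3_BUCKET/

With S3A the credentials can be supplied as configuration properties instead of being embedded in the URI.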
Created 04-15-2015 02:55 AM
Actually, I tried s3a; however, it throws a filesystem exception:
"java.io.IOException: No FileSystem for scheme: s3a"
It looks like a jar conflict issue, though I didn't get a chance to look into it deeply.
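For anyone hitting the same error: "No FileSystem for scheme: s3a" usually means the S3A classes are not on the classpath (they live in the hadoop-aws jar, along with its AWS SDK dependency) or the implementation class is not registered. A sketch of one possible fix, with placeholder jar paths for your distribution:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/hadoop-aws.jar:/path/to/aws-java-sdk.jar
hadoop distcp \
  -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  hdfs:///test/data/export/file.avro \
  s3a://S3_BUCKET/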
Created on 02-11-2018 08:18 PM - edited 02-11-2018 08:23 PM
Where does this property need to be set? There is no core-default.xml file in my deployment. I am using CDH 5.12.
Should it be set service-wide?
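For anyone else landing here with the same question: core-default.xml ships inside the Hadoop jars and only carries the defaults; overrides belong in core-site.xml, or on the command line with -D as shown above. In a Cloudera Manager deployment such as CDH 5.12, one way to apply it cluster-wide is a core-site.xml advanced configuration snippet (safety valve), along these lines (a sketch; the path is a placeholder):

<property>
  <name>fs.s3.buffer.dir</name>
  <value>/data/disk1/s3tmp</value>
</property>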
