<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question disk space issue on nodes for distcp data transfer from hdfs to s3 in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26396#M5563</link>
    <description>&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Hi,&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I am using the following command to transfer data from HDFS to S3:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I have noticed that each mapper task first copies the data locally into the /tmp/hadoop-yarn/s3 directory on its node before sending it to S3. This is causing disk space issues on the nodes, since the data being transferred is in the TBs.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Is there a way to configure the temporary working directory for the mapper? Can it use HDFS disk space rather than local disk space?&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p3"&gt;&lt;SPAN class="s2"&gt;Thanks in advance.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p3"&gt;&lt;SPAN class="s2"&gt;Jagdish&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 09:26:23 GMT</pubDate>
    <dc:creator>JagdishKewat</dc:creator>
    <dc:date>2022-09-16T09:26:23Z</dc:date>
    <item>
      <title>disk space issue on nodes for distcp data transfer from hdfs to s3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26396#M5563</link>
      <description>&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Hi,&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I am using the following command to transfer data from HDFS to S3:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;hadoop distcp -Dmapreduce.map.memory.mb=3096 -Dmapred.task.timeout=60000000 -i -log /tmp/export/logs hdfs:///test/data/export/file.avro s3n://ACCESS_ID:ACCESS_KEY@S3_BUCKET/&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I have noticed that each mapper task first copies the data locally into the /tmp/hadoop-yarn/s3 directory on its node before sending it to S3. This is causing disk space issues on the nodes, since the data being transferred is in the TBs.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Is there a way to configure the temporary working directory for the mapper? Can it use HDFS disk space rather than local disk space?&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p3"&gt;&lt;SPAN class="s2"&gt;Thanks in advance.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="p3"&gt;&lt;SPAN class="s2"&gt;Jagdish&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:26:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26396#M5563</guid>
      <dc:creator>JagdishKewat</dc:creator>
      <dc:date>2022-09-16T09:26:23Z</dc:date>
    </item>
    <item>
      <title>Re: disk space issue on nodes for distcp data transfer from hdfs to s3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26475#M5564</link>
      <description>&lt;P&gt;Alright! I figured out the fix for this.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The temp buffer directory for S3 is configurable with the property "fs.s3.buffer.dir". To change it, override the property (for example in core-site.xml) and point it at a local directory with enough free space.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The default config, from the core-default.xml config file, is as shown below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;lt;property&amp;gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;lt;name&amp;gt;fs.s3.buffer.dir&amp;lt;/name&amp;gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;lt;value&amp;gt;${hadoop.tmp.dir}/s3&amp;lt;/value&amp;gt;&lt;/STRONG&gt;&lt;BR /&gt;&amp;lt;description&amp;gt;Determines where on the local filesystem the S3 filesystem&lt;BR /&gt;should store files before sending them to S3&lt;BR /&gt;(or after retrieving them from S3).&lt;BR /&gt;&amp;lt;/description&amp;gt;&lt;BR /&gt;&amp;lt;/property&amp;gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This doesn't require restarting any services, so it is an easy fix.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Apr 2015 13:51:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26475#M5564</guid>
      <dc:creator>JagdishKewat</dc:creator>
      <dc:date>2015-04-14T13:51:50Z</dc:date>
    </item>
    <item>
      <title>Re: disk space issue on nodes for distcp data transfer from hdfs to s3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26480#M5565</link>
      <description>Thank you for following up with the solution you found! It will benefit others looking for similar info.&lt;BR /&gt;&lt;BR /&gt;We also recommend using the S3A connector going forward, via the s3a:// scheme.</description>
      <pubDate>Tue, 14 Apr 2015 15:35:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26480#M5565</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2015-04-14T15:35:48Z</dc:date>
    </item>
    <item>
      <title>Re: disk space issue on nodes for distcp data transfer from hdfs to s3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26505#M5566</link>
      <description>Thanks Harsh.&lt;BR /&gt;I actually tried s3a, but it throws a filesystem exception:&lt;BR /&gt;"java.io.IOException: No FileSystem for scheme: s3a"&lt;BR /&gt;It looks like a jar conflict issue, though I haven't had a chance to look into it deeply enough.</description>
      <pubDate>Wed, 15 Apr 2015 09:55:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/26505#M5566</guid>
      <dc:creator>JagdishKewat</dc:creator>
      <dc:date>2015-04-15T09:55:59Z</dc:date>
    </item>
    <item>
      <title>Re: disk space issue on nodes for distcp data transfer from hdfs to s3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/64529#M5567</link>
      <description>&lt;P&gt;Where does this property need to be set? There is no core-default.xml file in my deployment. I am using CDH 5.12.&lt;/P&gt;&lt;P&gt;Should it be set service-wide?&lt;/P&gt;</description>
      <pubDate>Mon, 12 Feb 2018 04:23:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/disk-space-issue-on-nodes-for-distcp-data-transfer-from-hdfs/m-p/64529#M5567</guid>
      <dc:creator>priyam</dc:creator>
      <dc:date>2018-02-12T04:23:43Z</dc:date>
    </item>
  </channel>
</rss>

