<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Hadoop Distcp -update skips file in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Hadoop-Distcp-update-skips-file/m-p/360765#M238445</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using distcp to copy data from HDFS to S3. Below is a shortened form of the command I use:&lt;/P&gt;&lt;P&gt;hadoop distcp -pu -update -delete hdfs_path s3a://bucket&lt;/P&gt;&lt;P&gt;I recently ran into an issue in the following case.&lt;/P&gt;&lt;P&gt;I have a file in HDFS, temp_file, containing the data 1234567890, with a size of 27 KB. The first time I run distcp, it pushes the file to the S3 bucket without any issue.&lt;/P&gt;&lt;P&gt;I then update the same file, temp_file, with different content, abcdefghij, but with the same size of 27 KB. When I run distcp again, instead of comparing the checksums of the source and target, it skips the file and does not copy the updated file from HDFS to S3.&lt;/P&gt;&lt;P&gt;Am I missing any option in the distcp command to make this scenario work?&lt;/P&gt;</description>
    <pubDate>Tue, 21 Apr 2026 07:45:57 GMT</pubDate>
    <dc:creator>rajilion</dc:creator>
    <dc:date>2026-04-21T07:45:57Z</dc:date>
    <item>
      <title>Hadoop Distcp -update skips file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hadoop-Distcp-update-skips-file/m-p/360765#M238445</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using distcp to copy data from HDFS to S3. Below is a shortened form of the command I use:&lt;/P&gt;&lt;P&gt;hadoop distcp -pu -update -delete hdfs_path s3a://bucket&lt;/P&gt;&lt;P&gt;I recently ran into an issue in the following case.&lt;/P&gt;&lt;P&gt;I have a file in HDFS, temp_file, containing the data 1234567890, with a size of 27 KB. The first time I run distcp, it pushes the file to the S3 bucket without any issue.&lt;/P&gt;&lt;P&gt;I then update the same file, temp_file, with different content, abcdefghij, but with the same size of 27 KB. When I run distcp again, instead of comparing the checksums of the source and target, it skips the file and does not copy the updated file from HDFS to S3.&lt;/P&gt;&lt;P&gt;Am I missing any option in the distcp command to make this scenario work?&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 07:45:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hadoop-Distcp-update-skips-file/m-p/360765#M238445</guid>
      <dc:creator>rajilion</dc:creator>
      <dc:date>2026-04-21T07:45:57Z</dc:date>
    </item>
    <item>
      <title>Re: Hadoop Distcp -update skips file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hadoop-Distcp-update-skips-file/m-p/362089#M238711</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/94359"&gt;@rajilion&lt;/a&gt;, thanks for reaching out to the Cloudera community. Could you please test the update and overwrite behavior described in the article below and let us know how it goes?&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html" target="_blank"&gt;https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jan 2023 05:07:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hadoop-Distcp-update-skips-file/m-p/362089#M238711</guid>
      <dc:creator>arunek95</dc:creator>
      <dc:date>2023-01-24T05:07:40Z</dc:date>
    </item>
    <item>
      <title>Re: Hadoop Distcp -update skips file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hadoop-Distcp-update-skips-file/m-p/368305#M240121</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/94359"&gt;@rajilion&lt;/a&gt;&lt;/P&gt;&lt;P&gt;You are running distcp with the -update flag, which makes it skip any file that already exists at the destination and that it considers unchanged. This is the expected behavior of -update.&lt;/P&gt;&lt;P&gt;According to the DistCp documentation, -update overwrites a file only if the source and destination differ in size, block size, or checksum. HDFS and S3 checksums are not directly comparable, so for an HDFS-to-S3 copy the decision effectively comes down to file size (and, depending on the Hadoop version, modification time). In your case the content changed but the size stayed the same, so distcp skips the file.&lt;/P&gt;&lt;P&gt;To copy the updated file to S3, you can try replacing the -update flag with -overwrite. This forces distcp to copy all files from the source directory to the destination, whether or not they already exist there. Simply removing -update is not enough: the documentation states that -delete is applicable only together with -update or -overwrite, and without either flag distcp skips files that already exist at the destination.&lt;/P&gt;&lt;P&gt;Your updated command would look like this:&lt;/P&gt;&lt;PRE&gt;hadoop distcp -pu -overwrite -delete hdfs_path s3a://bucket&lt;/PRE&gt;&lt;P&gt;The -pu flag preserves the user ownership of the files during the copy; group ownership would additionally require the g option (-pug).&lt;/P&gt;&lt;P&gt;Please note that -overwrite makes distcp copy every file from the source directory to the destination, even files that have not been modified. This can be time-consuming and may incur unnecessary data transfer costs if you have a large number of files to copy.&lt;/P&gt;&lt;P&gt;If you only want to copy the files that have been modified, you can consider a different tool such as s3-dist-cp or aws s3 sync for incremental copies. Note that these tools do not necessarily compare checksums either: aws s3 sync, for example, decides by file size and last-modified time by default, so check each tool's documentation for its exact comparison criteria.&lt;/P&gt;&lt;P&gt;If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click&lt;STRONG&gt;&lt;FONT color="#FF6600"&gt; Accept as Solution&lt;/FONT&gt;&lt;/STRONG&gt; below each response that helped.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Apr 2023 08:48:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hadoop-Distcp-update-skips-file/m-p/368305#M240121</guid>
      <dc:creator>Kartik_Agarwal</dc:creator>
      <dc:date>2023-04-12T08:48:11Z</dc:date>
    </item>
  </channel>
</rss>

