It seems that you are using the -update flag with distcp command, which is causing the command to skip files that exist in the destination and have a modification time equal to or newer than the source file. This is the expected behavior of distcp when the -update flag is used.
In your case, even though the content of the file has changed, the size and modification time are still the same, which is causing distcp to skip the file during the copy process.
To copy the updated file to S3, you can try removing the -update flag from the distcp command. This will force distcp to copy all files from the source directory to the destination, regardless of whether they exist in the destination or not.
Your updated command would look like this:
hadoop distcp -pu -delete hdfs_path s3a:
The -pu flag is used to preserve the user and group ownership of the files during the copy process.
Please note that removing the -update flag can cause distcp to copy all files from the source directory to the destination, even if they haven't been modified. This can be time-consuming and may result in unnecessary data transfer costs if you have a large number of files to copy.
If you only want to copy specific files that have been modified, you can use a different tool such as s3-dist-cp or aws s3 sync that supports checksum-based incremental copies. These tools use checksums to determine which files have been modified and need to be copied, rather than relying on modification times or file sizes.
If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.