Created 01-06-2023 01:28 PM
Hi,
I am using DistCp to copy data from HDFS to S3. Below is shorthand for the command I use:
hadoop distcp -pu -update -delete hdfs_path s3a://bucket
I recently ran into an issue with the following case:
I have a file in HDFS, temp_file, with content 1234567890 and a size of 27 KB.
The first time I run DistCp, it pushes the file to the S3 bucket without any issue.
I then update temp_file with different content, abcdefghij, but the file stays the same size (27 KB).
When I run DistCp again, instead of comparing the checksums of source and target, it skips the file and does not copy the updated version from HDFS to S3.
Am I missing any options in the DistCp command to make this scenario work?
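For reference, the sequence that reproduces this is roughly the following (the /data path and bucket name are placeholders, not my actual paths):
hdfs dfs -put temp_file /data/temp_file
hadoop distcp -pu -update -delete /data s3a://bucket    # first run: temp_file is copied
hdfs dfs -put -f temp_file /data/temp_file               # replace with different content of the same size
hadoop distcp -pu -update -delete /data s3a://bucket    # second run: temp_file is skipped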
Created 04-12-2023 01:48 AM
It seems that you are using the -update flag with the distcp command. With -update, DistCp skips files that already exist at the destination when their size (and, where both filesystems expose comparable checksums, their checksum) matches the source. HDFS and S3A do not use compatible checksum algorithms, so DistCp cannot compare checksums between them and effectively falls back to comparing file sizes. This is the expected behavior of distcp when the -update flag is used.
In your case, even though the content of the file has changed, the size has not, so DistCp treats the file as up to date and skips it during the copy process.
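You can see the checksum mismatch between the two filesystems directly, for example (paths are placeholders, and the exact output depends on your Hadoop version and S3A settings):
hadoop fs -checksum hdfs:///data/temp_file
hadoop fs -checksum s3a://bucket/temp_file
On S3A the checksum is typically reported as NONE unless etag checksums are enabled, which is why DistCp cannot rely on checksums to detect this change.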
To copy the updated file to S3, you can try replacing the -update flag with -overwrite in the distcp command. This forces DistCp to copy every file from the source directory to the destination and overwrite whatever already exists there, regardless of whether the files appear unchanged. (Simply dropping -update is not enough, because the -delete option is only valid together with -update or -overwrite.)
Your updated command would look like this:
hadoop distcp -pu -overwrite -delete hdfs_path s3a://bucket
The -pu flag preserves the user (owner) of the files during the copy; to also preserve group ownership, use -pug.
Please note that -overwrite causes distcp to copy all files from the source directory to the destination, even if they haven't been modified. This can be time-consuming and may result in unnecessary data transfer costs if you have a large number of files to copy.
If you only want to copy specific files that have been modified, you can look at other tools such as s3-dist-cp (on EMR) or aws s3 sync for incremental copies to S3. Do check how each tool detects changes, though: aws s3 sync, for example, compares file size and last-modified time by default, so a content change that keeps the size identical can still be missed unless the timestamp differs.
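For illustration, on an EMR cluster an s3-dist-cp run looks roughly like this (the source path and bucket are placeholders):
s3-dist-cp --src hdfs:///data --dest s3://bucket/data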
If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.
Created 01-23-2023 09:07 PM
Hi @rajilion, thanks for reaching out to the Cloudera community. Can you please test the Update and Overwrite options described in the article below and let us know how it goes?
https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
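For quick reference, the two modes discussed in that section look roughly like this against the paths from the original question:
hadoop distcp -update -delete hdfs_path s3a://bucket      # copies only files whose size or checksum differs
hadoop distcp -overwrite -delete hdfs_path s3a://bucket   # unconditionally overwrites files that exist at the destination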