
Hadoop Distcp -update skips file

New Contributor

Hi,

I am using distcp to copy data from Hadoop HDFS to S3. Below is, in shorthand, the command I use:

 

hadoop distcp -pu -update -delete hdfs_path s3a://bucket

 

Recently I ran into an issue with the following case:

 

I have a file in HDFS, temp_file, with content 1234567890 and a size of 27 KB.

The first time I run distcp, it pushes the file to the S3 bucket without any issue.

 

The second time, I update the same file temp_file with different content, abcdefghij, but the same size of 27 KB.

Now when I run distcp, instead of checking the checksums of source and target, distcp skips the file outright and does not copy the updated file from HDFS to S3.

 

Am I missing any options in the distcp command to make this scenario work?

 

1 ACCEPTED SOLUTION

Master Collaborator

@rajilion 

You are using the -update flag with the distcp command, which tells distcp to skip files that already exist at the destination and that it considers unchanged. When copying between HDFS and S3, the two filesystems use incompatible checksum algorithms, so distcp cannot compare checksums and falls back to comparing file sizes. This is the expected behavior of distcp when the -update flag is used.

In your case, even though the content of the file has changed, the size is still the same 27 KB, so the size comparison reports the file as unchanged and distcp skips it during the copy process.
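You can see why the checksum comparison is unavailable by asking each filesystem for a checksum (a quick diagnostic; the exact output depends on your Hadoop version and s3a configuration):

# HDFS returns a composite checksum (e.g. MD5-of-MD5s over block CRCs)
hadoop fs -checksum hdfs_path/temp_file

# s3a returns no checksum by default, so distcp has nothing to compare
hadoop fs -checksum s3a://bucket/temp_file

If fs.s3a.etag.checksum.enabled is set, s3a reports the object's ETag instead, but that still never matches an HDFS checksum.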

To force the updated file to be copied to S3, replace the -update flag with -overwrite. This makes distcp copy every file from the source to the destination, regardless of whether it already exists there. (Simply removing -update would not work here, because -delete is only usable together with -update or -overwrite.)

Your updated command would look like this:

 
hadoop distcp -pu -overwrite -delete hdfs_path s3a://bucket

The -pu flag preserves the user (owner) of the files during the copy; use -pug to preserve group ownership as well. Note that S3 is an object store with no real notion of HDFS ownership, so these attributes have little effect on an s3a destination.

Please note that with -overwrite, distcp copies all files from the source to the destination even if they haven't been modified. This can be time-consuming and may result in unnecessary data-transfer costs if you have a large number of files to copy.
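If only part of the tree changes between runs, one way to limit that cost (a sketch, assuming the changed files live under a known subpath; the paths are illustrative) is to run the forced copy against just that subpath:

# Force-copy only the subtree that actually changed
hadoop distcp -pu -overwrite hdfs_path/changed_subdir s3a://bucket/changed_subdir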

If you only want to copy files that have actually changed, a different tool may help. For example, aws s3 sync compares file size and last-modified time, so a rewritten file with a new timestamp is picked up even when its size is unchanged; s3-dist-cp on Amazon EMR offers similar HDFS-to-S3 copies. Note that neither performs a true content-checksum comparison either, so verify their skip logic against your data.
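For example, on an EMR cluster a minimal s3-dist-cp invocation might look like this (the bucket and paths are illustrative):

# s3-dist-cp ships with Amazon EMR and copies HDFS data to S3
s3-dist-cp --src hdfs:///user/data --dest s3://bucket/data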

 



2 Replies

Expert Contributor

Hi @rajilion, thanks for reaching out to the Cloudera community. Can you please test the "Update and Overwrite" behavior described in the article below and let us know how it goes?

 

https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
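One subtlety from that documentation's "Update and Overwrite" section is worth noting: with -update or -overwrite, the contents of each source directory are copied into the target, not the source directory itself, so the same command with and without the flag can produce different layouts (paths below are illustrative):

# Without -update/-overwrite: creates s3a://bucket/target/dir1/...
hadoop distcp hdfs:///src/dir1 s3a://bucket/target

# With -overwrite (or -update): copies the contents of dir1
# directly into s3a://bucket/target/...
hadoop distcp -overwrite hdfs:///src/dir1 s3a://bucket/target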
