Hadoop Distcp -update skips file

Hi,

I am using DistCp to copy data from HDFS to S3. Below is a shortened version of the command I use:

 

hadoop distcp -pu -update -delete hdfs_path s3a://bucket

 

Recently I ran into an issue with the following case:

 

I have a file in HDFS, temp_file, containing the data 1234567890, with a size of 27 KB.

The first time I run distcp, it pushes the file to the S3 bucket without any issue.

 

Then I update the same file, temp_file, with different content, abcdefghij, but with the same size of 27 KB.

Now when I run distcp again, instead of comparing the checksums of source and target, it skips the file and doesn't copy the updated file from HDFS to S3.
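To illustrate why a same-size edit can slip through: as far as I understand the DistCp documentation, -update decides whether to copy a file mainly by comparing file sizes, and checksums are only compared when source and target expose comparable checksums, which HDFS and S3A generally do not. The sketch below reproduces that effect locally with standard POSIX tools. The two 10-byte files are illustrative stand-ins for the two versions of temp_file, and no Hadoop is involved.

```shell
#!/bin/sh
# Local stand-ins for the two versions of temp_file (file names are hypothetical).
workdir=$(mktemp -d)
printf '1234567890' > "$workdir/v1"
printf 'abcdefghij' > "$workdir/v2"

# Size in bytes of each version.
size_a=$(wc -c < "$workdir/v1" | tr -d ' ')
size_b=$(wc -c < "$workdir/v2" | tr -d ' ')

# CRC of each version (first field of cksum output).
sum_a=$(cksum < "$workdir/v1" | cut -d' ' -f1)
sum_b=$(cksum < "$workdir/v2" | cut -d' ' -f1)

# A size-only comparison sees no change and would skip the file.
if [ "$size_a" -eq "$size_b" ]; then
  echo "size check: identical ($size_a bytes), file would be skipped"
fi

# A checksum comparison sees the change and would copy the file.
if [ "$sum_a" != "$sum_b" ]; then
  echo "checksum check: different, file would be copied"
fi

rm -rf "$workdir"
```

If this matches what is happening here, the skipped file is the expected result of a size-based comparison rather than a fault in the command itself.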

 

Am I missing an option in the distcp command to make this scenario work?

 

1 REPLY

Hi @rajilion, thanks for reaching out to the Cloudera community. Can you please test the Update and Overwrite behavior described in the article below and let us know how it goes?

 

https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
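For testing the overwrite variant, a minimal sketch is shown here. It reuses the placeholder paths from the question (hdfs_path and s3a://bucket) and swaps -update for -overwrite, which copies files unconditionally instead of comparing them first. The DRY_RUN switch is a hypothetical convenience added for this example, not a DistCp feature, and it defaults to printing the command rather than running it.

```shell
#!/bin/sh
# Placeholder paths taken from the question; replace with real ones.
SRC="hdfs_path"
DST="s3a://bucket"

# Same command as in the question, with -update replaced by -overwrite.
# -delete is kept; DistCp accepts it together with -update or -overwrite.
cmd="hadoop distcp -pu -overwrite -delete $SRC $DST"

# DRY_RUN is a hypothetical switch for this sketch: 1 (default) only prints.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$cmd"
else
  # Relies on word splitting, so paths must not contain spaces.
  $cmd
fi
```

Note that -overwrite re-copies every file on each run, so it trades the skipped-file problem for extra transfer time and S3 requests. It may be worth limiting it to the directories where same-size rewrites can actually occur.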