Support Questions

Find answers, ask questions, and share your expertise

Does hdfs dfs -put verify that the transfer went OK?

New Contributor

Hello everybody!


Sometimes my computer crashes while data is transferring to Hadoop via the command

hdfs dfs -put myfile myfolder

My question is: does HDFS verify that the transfer went OK?


For instance, by automatically comparing the size of the data on the local drive with the amount of data received?

I am asking because I am transferring very large files (around 200 GB) and each transfer takes hours. It is not easy for me to check whether the files in HDFS are the correct ones or partial versions of my local files (due to an interrupted transfer).


Many thanks!

1 ACCEPTED SOLUTION

Mentor
The -put/-copyFromLocal commands follow a rename-on-complete approach.
While the file is being uploaded it is named "filename._COPYING_", and
on successful close it is renamed to "filename". This lets you identify
which files were not copied completely.

This feature is active by default but if undesirable, can be switched off
with the -d flag.

X-Ref:
https://github.com/cloudera/hadoop-common/blob/cdh5.7.0-release/hadoop-common-project/hadoop-common/...
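Building on the rename-on-complete behavior above, a quick way to spot interrupted uploads is to scan a directory listing for names still carrying the temporary suffix. The sketch below uses a small helper function and a simulated listing so it can run anywhere; in practice you would pipe real `hdfs dfs -ls` output into it (the directory and file names here are made up for illustration):

```shell
# list_incomplete reads an `hdfs dfs -ls`-style listing on stdin and
# prints only the paths that still have the temporary ._COPYING_ suffix,
# i.e. uploads that never reached a successful close.
list_incomplete() {
  awk '{ print $NF }' | grep '\._COPYING_$' || true
}

# Simulated listing; in practice: hdfs dfs -ls /myfolder | list_incomplete
printf '%s\n' \
  '-rw-r--r-- 3 user group 200 2024-01-01 12:00 /myfolder/myfile' \
  '-rw-r--r-- 3 user group  50 2024-01-01 12:05 /myfolder/other._COPYING_' \
  | list_incomplete
# prints /myfolder/other._COPYING_
```

Any path this prints is a leftover partial copy that can be safely deleted and re-uploaded.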


3 REPLIES 3


New Contributor
Excellent, very helpful, thanks! Just a follow-up on the verification: it seems to me that, in addition to that, HDFS compares the checksums (of the local vs. the HDFS copy) to assert that the transfer is finished. Is that correct?

Mentor
The HDFS client reads your input and sends packets of data (64 KB-128 KB
chunks at a time) along with their checksums over the network, and the
DataNodes involved in the write verify these continually as they receive
them, before writing them to disk. This way you won't suffer from network
corruption, and what is written to HDFS matches precisely what the client
intended to send.
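Since the in-flight checksumming described above is automatic, an additional client-side sanity check after a long upload is simply comparing byte counts: a size match does not by itself prove integrity, but a mismatch reliably flags a truncated transfer. A minimal sketch, assuming a reachable HDFS client and a GNU userland (the file and folder names are placeholders):

```shell
# same_size compares two byte counts and reports whether they agree.
same_size() {
  if [ "$1" -eq "$2" ]; then
    echo MATCH
  else
    echo MISMATCH
  fi
}

# In practice, against a real cluster:
#   local_bytes=$(stat -c %s myfile)                        # GNU stat; BSD: stat -f %z
#   hdfs_bytes=$(hdfs dfs -du myfolder/myfile | awk '{print $1}')
#   same_size "$local_bytes" "$hdfs_bytes"
same_size 200 200
# prints MATCH
```

Note that comparing `md5sum` of the local file against `hdfs dfs -checksum` output does not work directly, since HDFS reports a composite checksum over its internal block CRCs rather than a plain MD5 of the file contents.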