Created 06-29-2016 09:16 PM
Created 06-30-2016 04:54 AM
There is no direct utility in HDFS for finding files that have identical content.
Files with different names but the same content will have the same checksum, provided they were written with the same block size and bytes-per-checksum settings. We can verify this with the -checksum option of hdfs dfs.
For example:
# hdfs dfs -ls /tmp/tst
Found 6 items
-rw-r--r-- 3 hdfs hdfs 2044 2016-06-29 21:46 /tmp/tst/okay
-rw-r--r-- 3 hdfs hdfs 2044 2016-06-29 21:46 /tmp/tst/pass
-rw-r--r-- 3 hdfs hdfs 2044 2016-06-29 21:46 /tmp/tst/pass3
-rw-r--r-- 3 hdfs hdfs 1064 2016-06-29 21:46 /tmp/tst/pre
-rw-r--r-- 3 hdfs hdfs 1064 2016-06-29 21:46 /tmp/tst/pro
-rw-r--r-- 3 hdfs hdfs 2044 2016-06-29 21:46 /tmp/tst/word
# hdfs dfs -checksum /tmp/tst/okay
/tmp/tst/okay MD5-of-0MD5-of-512CRC32C 000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum /tmp/tst/pass
/tmp/tst/pass MD5-of-0MD5-of-512CRC32C 000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum /tmp/tst/pre
/tmp/tst/pre MD5-of-0MD5-of-512CRC32C 000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
# hdfs dfs -checksum /tmp/tst/pro
/tmp/tst/pro MD5-of-0MD5-of-512CRC32C 000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
In the output above, "/tmp/tst/okay" and "/tmp/tst/pass" have different filenames but the same checksum, so they hold the same content. The same is true of "/tmp/tst/pro" and "/tmp/tst/pre".
To get the checksums of all the files in a folder (in this case "/tmp/tst"), run:
# hdfs dfs -checksum /tmp/tst/*
/tmp/tst/okay MD5-of-0MD5-of-512CRC32C 000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass MD5-of-0MD5-of-512CRC32C 000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass3 MD5-of-0MD5-of-512CRC32C 000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pre MD5-of-0MD5-of-512CRC32C 000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/pro MD5-of-0MD5-of-512CRC32C 000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/word MD5-of-0MD5-of-512CRC32C 000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
You can also use "hdfs dfs -find" to search a larger directory tree:
# hdfs dfs -checksum `hdfs dfs -find /tmp -print`
The above command lists the checksum of every file found under /tmp.
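One caveat: "hdfs dfs -find ... -print" also prints directories, and -checksum reports an error for a directory path. A possible workaround is to take the file list from a recursive listing instead. The helper below is only a sketch: files_only is a hypothetical name, and it assumes the 8-column "hdfs dfs -ls" layout shown earlier and paths without spaces.

```shell
# files_only (hypothetical helper): reads "hdfs dfs -ls -R" output on stdin
# and prints only the paths of regular files.
# Assumes the 8-column listing format and paths that contain no whitespace.
files_only() {
  # Regular files have a permission string starting with "-";
  # directories start with "d", and "Found N items" headers match neither.
  awk '$1 ~ /^-/ && NF >= 8 { print $8 }'
}
```

It could then be used as: hdfs dfs -ls -R /tmp | files_only | xargs hdfs dfs -checksum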
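You can also pipe the output through "awk" and "sort" so that files with the same checksum are listed next to each other (the checksum is printed first, then the path):
# hdfs dfs -checksum `hdfs dfs -find /tmp -print` | awk '{print $3, $1}' | sort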
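To report only the files that actually share a checksum, the -checksum output can be post-processed with awk. This is a minimal sketch, not a built-in HDFS feature: group_dupes is a hypothetical helper name, and it assumes the three-column "path algorithm checksum" output shown above and paths without spaces.

```shell
# group_dupes (hypothetical helper): reads "hdfs dfs -checksum" output on stdin
# ("path algorithm checksum" per line) and prints one line per checksum that is
# shared by two or more files, followed by the paths that share it.
group_dupes() {
  awk '
    {
      # $1 = path, $3 = checksum; collect the paths seen for each checksum
      if (count[$3]++) {
        files[$3] = files[$3] " " $1
      } else {
        files[$3] = $1
      }
    }
    END {
      # print only checksums that occurred more than once
      for (c in count) {
        if (count[c] > 1) print c ": " files[c]
      }
    }
  ' | LC_ALL=C sort
}
```

For example: hdfs dfs -checksum /tmp/tst/* | group_dupes
Files whose checksum is unique are left out, so an empty result means no duplicates were found among the files that were checked.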