I would like to know whether there is any data duplication, i.e. multiple datasets with the same content (redundant copies).

1 ACCEPTED SOLUTION

Expert Contributor

@milind pandit

There is no direct utility to find this.

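Files with different names but the same content will have the same checksum, provided they were written with the same block size and bytes-per-checksum settings (the default HDFS file checksum depends on both). Using the -checksum option of hdfs dfs, we can verify this.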

For example:

# hdfs dfs -ls /tmp/tst
Found 6 items
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/okay
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/pass
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/pass3
-rw-r--r--  3 hdfs hdfs  1064 2016-06-29 21:46 /tmp/tst/pre
-rw-r--r--  3 hdfs hdfs  1064 2016-06-29 21:46 /tmp/tst/pro
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/word
# hdfs dfs -checksum /tmp/tst/okay
/tmp/tst/okay   MD5-of-0MD5-of-512CRC32C   000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum /tmp/tst/pass
/tmp/tst/pass   MD5-of-0MD5-of-512CRC32C   000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum  /tmp/tst/pre
/tmp/tst/pre   MD5-of-0MD5-of-512CRC32C   000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
# hdfs dfs -checksum  /tmp/tst/pro
/tmp/tst/pro   MD5-of-0MD5-of-512CRC32C   000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d

From the above, the files "/tmp/tst/okay" and "/tmp/tst/pass" hold the same content even though their names differ, which is why both have the same checksum. The same is true of "/tmp/tst/pro" and "/tmp/tst/pre".

To get the checksums of all files in a folder (in this case "/tmp/tst"), run:

# hdfs dfs -checksum /tmp/tst/*
/tmp/tst/okay    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass3    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pre    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/pro    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/word    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7

Also, you can use "hdfs dfs -find" to search a larger tree:

# hdfs dfs -checksum `hdfs dfs -find /tmp -print`

The above command lists the checksums of all files under /tmp.
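
Two things can get in the way on a large tree: -find also returns directories, for which -checksum reports an error rather than a checksum, and the backtick expansion may produce more paths than fit on one command line. A minimal sketch of one way around both, assuming the 8-column "hdfs dfs -ls -R" listing format shown earlier and paths without whitespace (the helper name list_files_only and the batch size of 100 are only illustrative):

```shell
# Hypothetical helper: read `hdfs dfs -ls -R` output on stdin and
# print the path column for regular files only.
# Assumption: 8 whitespace-separated columns, path is the last one,
# and paths contain no spaces.
list_files_only() {
    # File entries start with "-" in the permission column;
    # directories start with "d" and are skipped, as is any header line.
    awk '$1 ~ /^-/ { print $8 }'
}

# Usage on a cluster (not run here):
#   hdfs dfs -ls -R /tmp | list_files_only | xargs -n 100 hdfs dfs -checksum
```

xargs passes the paths to -checksum in batches, so the command line never grows beyond 100 paths at a time.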

You can also put the checksum first and sort on it, so that files with the same content end up on adjacent lines:

# hdfs dfs -checksum `hdfs dfs -find /tmp -print` | awk '{print $3,$1}' | sort
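
To report only the checksums that are shared by more than one file, the -checksum output can be post-processed with awk. This is a sketch rather than a finished tool: the function name report_duplicates is made up for illustration, and it assumes the three-column output format shown above (path, algorithm, checksum) and paths without whitespace.

```shell
# Hypothetical helper: read `hdfs dfs -checksum` output on stdin and
# print one line per checksum that is shared by two or more paths.
# Assumption: columns are path, algorithm, checksum; no spaces in paths.
report_duplicates() {
    awk '
        NF >= 3 {
            count[$3]++
            # Collect the paths seen for each checksum, in input order.
            if ($3 in paths)
                paths[$3] = paths[$3] " " $1
            else
                paths[$3] = $1
        }
        END {
            for (sum in count)
                if (count[sum] > 1)
                    print sum ": " paths[sum]
        }
    ' | sort    # awk iterates arrays in no fixed order, so sort the report
}

# Usage on a cluster (not run here):
#   hdfs dfs -checksum /tmp/tst/* | report_duplicates
```

Each output line is a checksum followed by all paths that share it, so files with unique content do not appear at all.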
