Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

I would like to know whether there is any data duplication, i.e. multiple datasets with the same content (redundant copies).

1 ACCEPTED SOLUTION

Expert Contributor

@milind pandit

There is no direct utility to find this.

Files with different names but the same content will have the same checksum. Using the checksum option of hdfs, we can verify this.

For example:

# hdfs dfs -ls /tmp/tst
Found 6 items
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/okay
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/pass
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/pass3
-rw-r--r--  3 hdfs hdfs  1064 2016-06-29 21:46 /tmp/tst/pre
-rw-r--r--  3 hdfs hdfs  1064 2016-06-29 21:46 /tmp/tst/pro
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/word
# hdfs dfs -checksum /tmp/tst/okay
/tmp/tst/okay   MD5-of-0MD5-of-512CRC32C   000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum /tmp/tst/pass
/tmp/tst/pass   MD5-of-0MD5-of-512CRC32C   000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum  /tmp/tst/pre
/tmp/tst/pre   MD5-of-0MD5-of-512CRC32C   000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
# hdfs dfs -checksum  /tmp/tst/pro
/tmp/tst/pro   MD5-of-0MD5-of-512CRC32C   000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d

From the above, the files "/tmp/tst/okay" and "/tmp/tst/pass" hold the same content under different filenames, which is why both have the same checksum. The same is true of "/tmp/tst/pro" and "/tmp/tst/pre".

To check the checksums of all files in a folder (in this case "/tmp/tst"), the following can be done:

# hdfs dfs -checksum /tmp/tst/*
/tmp/tst/okay    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass3    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pre    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/pro    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/word    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
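
The duplicate groups in a listing like the one above can also be picked out mechanically. Here is a minimal sketch, assuming the checksum output has three whitespace-separated columns (path, algorithm, checksum) and that paths contain no spaces. The function name group_dupes is made up for illustration and is not part of hdfs.

```shell
# group_dupes (hypothetical helper): read `hdfs dfs -checksum` output on stdin
# and print one line per checksum shared by two or more files, followed by
# the paths that share it. Assumes columns: path, algorithm, checksum.
group_dupes() {
  awk '
    {
      if ($3 in paths) paths[$3] = paths[$3] " " $1
      else             paths[$3] = $1
      count[$3]++
    }
    END {
      for (sum in count)
        if (count[sum] > 1) print sum ": " paths[sum]
    }
  ' | sort   # awk iteration order is unspecified, so sort for stable output
}

# Demo on the listing shown above (no cluster needed for this part).
group_dupes <<'EOF'
/tmp/tst/okay    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass3    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pre    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/pro    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/word    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
EOF
```

On a live cluster the same function could be fed directly, e.g. hdfs dfs -checksum /tmp/tst/* | group_dupes. For the sample data it reports two groups: "pre" and "pro" share one checksum, and "okay", "pass", "pass3" and "word" share the other.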

Also, you can use "hdfs dfs -find" to search a larger tree:

# hdfs dfs -checksum `hdfs dfs -find /tmp -print`

The above command lists the checksum of every file under /tmp.
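
Two practical limits of the backtick form are worth keeping in mind: every path ends up on a single command line, which can exceed the shell's argument-length limit on a large tree, and "-find ... -print" returns directories as well as files. One possible alternative, sketched here under the assumption that "hdfs dfs -ls -R" prints the same eight-column layout as the "-ls" output shown earlier and that paths contain no spaces, is to keep only regular files and pass them to "-checksum" in batches. The helper name files_only is invented for this example.

```shell
# files_only (hypothetical helper): read `hdfs dfs -ls -R` output on stdin and
# print only the paths of regular files, i.e. lines whose permission string
# starts with "-". Directory lines ("d...") and any header lines are skipped.
files_only() {
  awk '$1 ~ /^-/ { print $NF }'
}

# Demo on made-up listing lines (the directory entry is illustrative only).
files_only <<'EOF'
drwxr-xr-x   - hdfs hdfs     0 2016-06-29 21:46 /tmp/tst
-rw-r--r--   3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/okay
-rw-r--r--   3 hdfs hdfs  1064 2016-06-29 21:46 /tmp/tst/pre
EOF

# On a cluster, the filtered paths could then be checksummed in batches:
#   hdfs dfs -ls -R /tmp | files_only | xargs -n 100 hdfs dfs -checksum
```

The batch size of 100 is arbitrary. Each hdfs invocation starts a new JVM, so larger batches mean fewer start-ups, as long as the command line stays within the system limit.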

You can also put the checksum in the first column and sort on it, so that files with identical content end up on adjacent lines:

# hdfs dfs -checksum `hdfs dfs -find /tmp -print` | awk '{print $3,$1}' | sort
