<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: I would like to know if there are any data duplication? i.e. multiple datasets with same content (redundant copies ) ? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/I-would-like-to-know-if-there-are-any-data-duplication-i-e/m-p/105066#M33431</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/9842/mpandit.html" nodeid="9842"&gt;@milind pandit&lt;/A&gt;&lt;/P&gt;&lt;P&gt;There is no built-in utility that finds duplicate files in HDFS directly.&lt;/P&gt;&lt;P&gt;Files with different names but the same content have the same checksum, provided they were written with the same block size and bytes-per-checksum settings. You can verify this with the -checksum option of "hdfs dfs".&lt;/P&gt;&lt;P&gt;For example:&lt;/P&gt;&lt;PRE&gt;# hdfs dfs -ls /tmp/tst
Found 6 items
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/okay
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/pass
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/pass3
-rw-r--r--  3 hdfs hdfs  1064 2016-06-29 21:46 /tmp/tst/pre
-rw-r--r--  3 hdfs hdfs  1064 2016-06-29 21:46 /tmp/tst/pro
-rw-r--r--  3 hdfs hdfs  2044 2016-06-29 21:46 /tmp/tst/word&lt;/PRE&gt;
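&lt;P&gt;The sizes in the listing above (2044 and 1064 bytes) already hint at which files could be duplicates, since files of different sizes cannot hold the same content. As a rough sketch only, assuming the "hdfs dfs -ls" column layout shown above and paths without spaces, a hypothetical helper (the name same_size is made up for illustration) could group a listing by size before any checksums are computed:&lt;/P&gt;&lt;PRE&gt;# Hypothetical helper: reads "hdfs dfs -ls" output on stdin and prints
# each size shared by two or more regular files, followed by their paths.
# Assumes column 5 is the size, column 8 the path, and no spaces in paths.
same_size() {
  awk '$1 ~ /^-/ { n[$5]++; p[$5] = p[$5] " " $8 }
       END { for (s in n) if (n[s] != 1) print s ":" p[s] }'
}
# Usage: hdfs dfs -ls /tmp/tst | same_size&lt;/PRE&gt;&lt;P&gt;A matching size is only a hint; the checksum comparison is still needed to confirm that two files are identical.&lt;/P&gt;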
&lt;PRE&gt;# hdfs dfs -checksum /tmp/tst/okay
/tmp/tst/okay   MD5-of-0MD5-of-512CRC32C   000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum /tmp/tst/pass
/tmp/tst/pass   MD5-of-0MD5-of-512CRC32C   000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
# hdfs dfs -checksum  /tmp/tst/pre
/tmp/tst/pre   MD5-of-0MD5-of-512CRC32C   000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
# hdfs dfs -checksum  /tmp/tst/pro
/tmp/tst/pro   MD5-of-0MD5-of-512CRC32C   000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
&lt;/PRE&gt;&lt;P&gt;The files "/tmp/tst/okay" and "/tmp/tst/pass" have different names but the same checksum, so they hold the same content. The same is true of "/tmp/tst/pro" and "/tmp/tst/pre".&lt;/P&gt;&lt;P&gt;To get the checksums of all files in a directory (here "/tmp/tst"), use a wildcard:&lt;/P&gt;&lt;PRE&gt;# hdfs dfs -checksum /tmp/tst/*
/tmp/tst/okay    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pass3    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7
/tmp/tst/pre    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/pro    MD5-of-0MD5-of-512CRC32C    000002000000000000000000690e462cbf52c9c399fb7c0bcacef01d
/tmp/tst/word    MD5-of-0MD5-of-512CRC32C    000002000000000000000000b1be3e03929521974dc321f9e7f27cc7&lt;/PRE&gt;&lt;P&gt;To search a larger directory tree, combine this with "hdfs dfs -find":&lt;/P&gt;&lt;PRE&gt;# hdfs dfs -checksum `hdfs dfs -find /tmp -print`
&lt;/PRE&gt;&lt;P&gt;This command lists the checksum of every file under /tmp.&lt;/P&gt;&lt;P&gt;To bring files with the same checksum next to each other, print the checksum first and sort on it:&lt;/P&gt;&lt;PRE&gt;hdfs dfs -checksum `hdfs dfs -find /tmp -print` | awk '{print $3,$1}' | sort&lt;/PRE&gt;</description>
    <pubDate>Thu, 30 Jun 2016 11:54:48 GMT</pubDate>
    <dc:creator>PARTOMIA</dc:creator>
    <dc:date>2016-06-30T11:54:48Z</dc:date>
    <item>
      <title>I would like to know if there are any data duplication? i.e. multiple datasets with same content (redundant copies ) ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/I-would-like-to-know-if-there-are-any-data-duplication-i-e/m-p/105065#M33430</link>
      <description />
      <pubDate>Thu, 30 Jun 2016 04:16:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/I-would-like-to-know-if-there-are-any-data-duplication-i-e/m-p/105065#M33430</guid>
      <dc:creator>mpandit</dc:creator>
      <dc:date>2016-06-30T04:16:16Z</dc:date>
    </item>
    <item>
      <title>Re: I would like to know if there are any data duplication? i.e. multiple datasets with same content (redundant copies ) ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/I-would-like-to-know-if-there-are-any-data-duplication-i-e/m-p/105066#M33431</link>
      <description />
      <pubDate>Thu, 30 Jun 2016 11:54:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/I-would-like-to-know-if-there-are-any-data-duplication-i-e/m-p/105066#M33431</guid>
      <dc:creator>PARTOMIA</dc:creator>
      <dc:date>2016-06-30T11:54:48Z</dc:date>
    </item>
  </channel>
</rss>

