<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Check replication factor for a directory in hdfs in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295218#M217627</link>
    <description>&lt;P&gt;It was either written with fewer replicas by the client, or someone changed it after it was written. For example, I believe Solr tlogs are written with a replication factor of 1.&lt;/P&gt;&lt;P&gt;Each DFSClient can control the number of replicas. As said, Solr uses 1 for tlogs, and MR uses (or used to use) 10 for job files for a better chance of data locality. It’s a decision made by whoever creates the client. So it is expected that any file &lt;STRONG&gt;can&lt;/STRONG&gt; have a different replication factor, within the limits of dfs.namenode.replication.min and dfs.replication.max, which are enforced by the NameNode.&lt;/P&gt;</description>
    <pubDate>Thu, 30 Apr 2020 17:33:16 GMT</pubDate>
    <dc:creator>GangWar</dc:creator>
    <dc:date>2020-04-30T17:33:16Z</dc:date>
    <item>
      <title>Check replication factor for a directory in hdfs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295052#M217563</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Is there a way to check the replication factor of a particular folder in HDFS?&lt;/P&gt;&lt;P&gt;Although the default replication is set to 3 in CM, for some reason files uploaded to a particular folder show up with a replication factor of 1.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Wert&lt;/P&gt;</description>
      <pubDate>Wed, 29 Apr 2020 09:49:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295052#M217563</guid>
      <dc:creator>wert_1311</dc:creator>
      <dc:date>2020-04-29T09:49:06Z</dc:date>
    </item>
    <item>
      <title>Re: Check replication factor for a directory in hdfs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295060#M217571</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/29490"&gt;@wert_1311&lt;/a&gt;&amp;nbsp;You can use the HDFS command line to ls the file.&lt;/P&gt;&lt;P&gt;The second column of the output shows the replication factor of the file. A directory has no replication factor of its own (the column shows a dash), so list the folder to see the factor of each file in it.&lt;/P&gt;&lt;P&gt;For example,&lt;/P&gt;&lt;PRE&gt;$ hdfs dfs -ls /usr/GroupStorage/data1/out.txt
-rw-r--r--   3 hadoop test 11906625598 2020-04-29 17:31 /usr/GroupStorage/data1/out.txt&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Here the replication factor is 3.&lt;/STRONG&gt;&lt;/P&gt;</description>
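      <!--
The reply above reads the replication factor from the second column of the ls output.
A minimal sketch of applying that to a whole folder follows. The function name
low_replication is hypothetical, and the sketch assumes the usual eight-column
"hdfs dfs -ls" layout (permissions, replication, owner, group, size, date, time, path).

```shell
# Reads "hdfs dfs -ls -R" style lines on stdin and prints "<replication> <path>"
# for every file whose replication target is below the given threshold (default 3).
# Directories show a dash in the replication column and are skipped, as is the
# "Found N items" header, which has fewer than eight fields.
low_replication() {
  threshold="${1:-3}"
  awk -v t="$threshold" '
    NF >= 8 && $2 ~ /^[0-9]+$/ && $2 + 0 < t + 0 {
      # Re-join fields 8..NF so that paths containing spaces are kept whole
      # (runs of spaces inside a path collapse to a single space).
      path = $8
      for (i = 9; i <= NF; i++) path = path " " $i
      print $2, path
    }'
}

# Hypothetical usage against a live cluster (the path is a placeholder):
#   hdfs dfs -ls -R /path/to/folder | low_replication 3
```

This reports only the replication target recorded for each file, not the number of live replicas.
      -->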
      <pubDate>Wed, 29 Apr 2020 12:16:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295060#M217571</guid>
      <dc:creator>GangWar</dc:creator>
      <dc:date>2020-04-29T12:16:20Z</dc:date>
    </item>
    <item>
      <title>Re: Check replication factor for a directory in hdfs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295159#M217609</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/29629"&gt;@GangWar&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your reply. What I am trying to zero in on is why some recently put files are being created with RF 1 rather than RF 3. I have checked multiple sites but failed to find answers, and I am hitting a wall. I would appreciate any suggestions or pointers that I could check to fix this issue.&lt;/P&gt;&lt;P&gt;My issue is as follows:&lt;/P&gt;&lt;P&gt;/User 1/logs/User1Logs &amp;gt;&amp;gt;&amp;gt; files under this folder have a replication factor of 1&lt;/P&gt;&lt;P&gt;/User 2/logs/User2Logs &amp;gt;&amp;gt;&amp;gt; files under this folder have a replication factor of 1&lt;/P&gt;&lt;P&gt;/User 3/logs/User3Logs &amp;gt;&amp;gt;&amp;gt; files under this folder have a replication factor of 3&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Wert&lt;/P&gt;</description>
      <pubDate>Thu, 30 Apr 2020 06:35:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295159#M217609</guid>
      <dc:creator>wert_1311</dc:creator>
      <dc:date>2020-04-30T06:35:26Z</dc:date>
    </item>
    <item>
      <title>Re: Check replication factor for a directory in hdfs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295218#M217627</link>
      <description>&lt;P&gt;It was either written with fewer replicas by the client, or someone changed it after it was written. For example, I believe Solr tlogs are written with a replication factor of 1.&lt;/P&gt;&lt;P&gt;Each DFSClient can control the number of replicas. As said, Solr uses 1 for tlogs, and MR uses (or used to use) 10 for job files for a better chance of data locality. It’s a decision made by whoever creates the client. So it is expected that any file &lt;STRONG&gt;can&lt;/STRONG&gt; have a different replication factor, within the limits of dfs.namenode.replication.min and dfs.replication.max, which are enforced by the NameNode.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Apr 2020 17:33:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/295218#M217627</guid>
      <dc:creator>GangWar</dc:creator>
      <dc:date>2020-04-30T17:33:16Z</dc:date>
    </item>
    <item>
      <title>Re: Check replication factor for a directory in hdfs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/390491#M247273</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/29629"&gt;@GangWar&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/29490"&gt;@wert_1311&lt;/a&gt;&amp;nbsp;I have found HDFS files that are persistently under-replicated, despite being over a year old. They are rare, but a block with a single live replica can be lost with one disk failure.&lt;/P&gt;&lt;P&gt;To be clear, this shows the replication target, not the actual replica count:&lt;/P&gt;&lt;PRE&gt;hdfs dfs -ls filename&lt;/PRE&gt;&lt;P&gt;The actual count can be found with:&lt;/P&gt;&lt;PRE&gt;hdfs fsck filename -files -blocks&lt;/PRE&gt;&lt;P&gt;In theory, under-replication should be transient, but I have found some persistent cases. In the example below, a file is 3 blocks long and one of the blocks has only one replica.&lt;/P&gt;&lt;PRE&gt;# hdfs fsck -blocks -files /tmp/part-m-03752
/tmp/part-m-03752: Under replicated BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
/tmp/part-m-03752: Replica placement policy is violated for BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Block should be additionally replicated on 1 more rack(s).
0. BP-955733439-1.2.3.4-1395362440665:blk_1967769089_1100461809406 len=134217728 Live_repl=3
1. BP-955733439-1.2.3.4-1395362440665:blk_1967769276_1100461809593 len=134217728 Live_repl=3
2. BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792 len=40324081 Live_repl=1

Status: HEALTHY
Total size: 308759537 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 3 (avg. block size 102919845 B)
Minimally replicated blocks: 3 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 1 (33.333332 %)
Mis-replicated blocks: 1 (33.333332 %)
Default replication factor: 3
Average block replication: 2.3333333
Corrupt blocks: 0
Missing replicas: 2 (22.222221 %)
Number of data-nodes: 30
Number of racks: 3

The filesystem under path '/tmp/part-m-03752' is HEALTHY

# hadoop fs -ls /tmp/part-m-03752
-rw-r--r-- 3 wuser hadoop 308759537 2021-12-11 16:58 /tmp/part-m-03752&lt;/PRE&gt;&lt;P&gt;Presumably the file was under-replicated when it was written because of some failure, and the defaults for the dfs.client.block.write.replace-datanode-on-failure properties meant that no new DataNodes were obtained at write time to replace the ones that failed. The puzzling thing is why the block has not been re-replicated after all this time.&lt;/P&gt;</description>
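      <!--
The post above distinguishes the replication target (ls) from the live replica count (fsck).
A minimal sketch of pulling the under-replicated blocks out of fsck output follows. The
function name under_replicated_blocks is hypothetical, and the sketch assumes per-block
lines carry a Live_repl=N token in the format shown in the post; other Hadoop versions
may print this differently.

```shell
# Reads "hdfs fsck <path> -files -blocks" output on stdin and prints
# "<block id> <live replicas>" for every block whose live replica count
# is below the given target (default 3).
under_replicated_blocks() {
  target="${1:-3}"
  awk -v t="$target" '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^Live_repl=[0-9]+$/) {
          # "Live_repl=" is 10 characters long, so the count starts at position 11.
          n = substr($i, 11)
          # On per-block lines the second field is the block id.
          if (n + 0 < t + 0) print $2, n
        }
    }'
}

# Hypothetical usage against a live cluster (the path is a placeholder):
#   hdfs fsck /path/to/file -files -blocks | under_replicated_blocks 3
```

For the output quoted in the post, only the third block (Live_repl=1) would be reported.
      -->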
      <pubDate>Tue, 16 Jul 2024 22:25:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Check-replication-factor-for-a-directory-in-hdfs/m-p/390491#M247273</guid>
      <dc:creator>pbaclace</dc:creator>
      <dc:date>2024-07-16T22:25:04Z</dc:date>
    </item>
  </channel>
</rss>

