<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Why can't I partition a 1 gigabyte dataset into 300 partitions? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/m-p/86549#M55115</link>
    <description>&lt;P&gt;Hey guys, I have already asked this on multiple forums but never got a reply, so I thought I might get one here.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a dataset of about 1 GB with a "cityid" column that has 324 unique values, so after partitioning I should get 324 folders in HDFS. But whenever I partition, the write fails; you can see the exception messages here:&amp;nbsp;&lt;A href="https://community.hortonworks.com/questions/238893/notenoughreplicasexception-when-writing-into-a-par.html" target="_blank"&gt;https://community.hortonworks.com/questions/238893/notenoughreplicasexception-when-writing-into-a-par.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It's definitely an HDFS issue, because everything worked on MapR. What could possibly be the problem?&lt;BR /&gt;&lt;BR /&gt;Btw, I tried this on fresh installs of Hortonworks and Cloudera with default settings, so nothing was misconfigured.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you need any more details, please ask.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could this be a setup issue? Maybe I need to increase memory somewhere in HDFS?&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 14:09:57 GMT</pubDate>
    <dc:creator>paulpaul1076</dc:creator>
    <dc:date>2022-09-16T14:09:57Z</dc:date>
    <item>
      <title>Why can't I partition a 1 gigabyte dataset into 300 partitions?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/m-p/86549#M55115</link>
      <description>&lt;P&gt;Hey guys, I have already asked this on multiple forums but never got a reply, so I thought I might get one here.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a dataset of about 1 GB with a "cityid" column that has 324 unique values, so after partitioning I should get 324 folders in HDFS. But whenever I partition, the write fails; you can see the exception messages here:&amp;nbsp;&lt;A href="https://community.hortonworks.com/questions/238893/notenoughreplicasexception-when-writing-into-a-par.html" target="_blank"&gt;https://community.hortonworks.com/questions/238893/notenoughreplicasexception-when-writing-into-a-par.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It's definitely an HDFS issue, because everything worked on MapR. What could possibly be the problem?&lt;BR /&gt;&lt;BR /&gt;Btw, I tried this on fresh installs of Hortonworks and Cloudera with default settings, so nothing was misconfigured.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you need any more details, please ask.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could this be a setup issue? Maybe I need to increase memory somewhere in HDFS?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 14:09:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/m-p/86549#M55115</guid>
      <dc:creator>paulpaul1076</dc:creator>
      <dc:date>2022-09-16T14:09:57Z</dc:date>
    </item>
    <item>
      <title>Re: Why can't I partition a 1 gigabyte dataset into 300 partitions?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/m-p/86552#M55116</link>
      <description>If you are dealing with unordered partitioning from a data source, you can end up creating a lot of files in parallel as the partitioning is attempted.&lt;BR /&gt;&lt;BR /&gt;In HDFS, when a file (or more specifically, its block) is open, the DataNode performs a logical reservation of its target block size. So if your configured block size is 128 MiB, then every concurrently open block will deduct that value (logically) from the available remaining space the DataNode publishes to the NameNode.&lt;BR /&gt;&lt;BR /&gt;This reservation is done to help manage space and to guarantee a full block write to a client, so that a client that has begun writing its file never runs into an out-of-space exception midway.&lt;BR /&gt;&lt;BR /&gt;Note: when the file is closed, only the actual length is persisted, and the reservation calculation is adjusted to reflect the reality of used and available space. However, while the file's block remains open, it is always considered to be holding a full block size.&lt;BR /&gt;&lt;BR /&gt;Further, the NameNode will only select a DataNode for a write if that node can guarantee the full target block size. It will ignore any DataNodes it deems (based on their reported values and metrics) unfit for the requested write's parameters. Your error shows that the NameNode has stopped considering your only live DataNode when trying to allocate a new block request.&lt;BR /&gt;&lt;BR /&gt;As an example, 70 GiB of available space will prove insufficient if there are more than 560 concurrently open files (70 GiB divided into 128 MiB block sizes). So the DataNode will 'appear full' at the point of ~560 open files and will no longer serve as a valid target for further file requests.&lt;BR /&gt;&lt;BR /&gt;Per your description of the insert, this is likely what is happening: each of the 300 chunks of the dataset may still carry varied IDs, resulting in a lot of open files requested per parallel task, for insert into several different partitions.&lt;BR /&gt;&lt;BR /&gt;You could 'hack' your way around this by reducing the requested block size within the query (set dfs.blocksize to 8 MiB, for example), influencing the reservation calculation. However, this may not be a good idea for larger datasets as you scale, since it will drive up the file-to-block count and increase memory costs on the NameNode.&lt;BR /&gt;&lt;BR /&gt;A better approach is to perform a pre-partitioned insert (sort first by partition key, then insert in a partitioned manner). Hive, for example, provides this as an option: hive.optimize.sort.dynamic.partition [1], and if you use plain Spark or MapReduce then their default partitioning strategy does exactly this.&lt;BR /&gt;&lt;BR /&gt;[1] -&lt;BR /&gt;&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.sort.dynamic.partition" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.sort.dynamic.partition&lt;/A&gt;&lt;BR /&gt;</description>
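The reservation arithmetic in the reply above can be sketched as a quick back-of-the-envelope check. This is a minimal sketch: the 70 GiB of free space and the 128 MiB / 8 MiB block sizes are the figures from the reply, and the helper function name is hypothetical.

```python
# Sketch of the DataNode "logical reservation" arithmetic described above.
# While a block is open for write, the DataNode deducts a full block size
# from its advertised free space, regardless of how little has been written.

def max_concurrent_open_files(free_bytes: int, block_size_bytes: int) -> int:
    """How many blocks can be open at once before the node 'appears full'."""
    return free_bytes // block_size_bytes

GiB = 1024 ** 3
MiB = 1024 ** 2

# The reply's example: 70 GiB free, default 128 MiB blocks -> ~560 open files.
print(max_concurrent_open_files(70 * GiB, 128 * MiB))  # 560

# The suggested workaround: shrink dfs.blocksize to 8 MiB for the query,
# raising the open-file ceiling (at the cost of many more blocks to track).
print(max_concurrent_open_files(70 * GiB, 8 * MiB))    # 8960
```

This also makes the trade-off concrete: a 16x smaller block size buys a 16x higher concurrent-open-file ceiling, but every extra block adds NameNode metadata.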
      <pubDate>Sat, 16 Feb 2019 02:58:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/m-p/86552#M55116</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2019-02-16T02:58:42Z</dc:date>
    </item>
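The pre-partitioned (sorted) insert recommended in the reply can be illustrated with a small simulation of why sorting by partition key keeps the number of simultaneously open HDFS files low. This is an illustrative sketch only: the task layouts and "city" keys below are invented, not from the original thread, and the functions model open-file counts rather than performing real HDFS writes.

```python
# Why a sorted ("pre-partitioned") insert helps: with unordered data, each
# writer task keeps one output file open per distinct partition key it has
# seen; after sorting by partition key, a task writes one partition to
# completion before opening the next, so it holds at most one file open.

def peak_open_files_unordered(task_records):
    """Unordered insert: at the peak, every task holds one open file
    per distinct partition key in its chunk of the data."""
    return sum(len(set(records)) for records in task_records)

def peak_open_files_sorted(task_records):
    """Sorted insert: each non-empty task holds a single open file at
    a time, regardless of how many partitions its chunk spans."""
    return sum(1 for records in task_records if records)

# 3 parallel tasks, each holding rows for several cityid partitions.
tasks = [
    ["city1", "city2", "city3", "city1"],
    ["city2", "city4"],
    ["city1", "city5", "city6"],
]

print(peak_open_files_unordered(tasks))  # 3 + 2 + 3 = 8 open files at peak
print(peak_open_files_sorted(tasks))     # 3 (one per task)
```

With 300 parallel chunks each touching many of the 324 cityid values, the unordered count quickly exceeds the ~560-file reservation ceiling, while the sorted strategy stays bounded by the task count; this is the effect of `hive.optimize.sort.dynamic.partition` and of Spark's shuffle-based partitioning.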
    <item>
      <title>Re: Why can't I partition a 1 gigabyte dataset into 300 partitions?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/m-p/86554#M55117</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/213"&gt;@Harsh J&lt;/a&gt;&amp;nbsp;you are a genius! Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Sat, 16 Feb 2019 04:03:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/m-p/86554#M55117</guid>
      <dc:creator>paulpaul1076</dc:creator>
      <dc:date>2019-02-16T04:03:33Z</dc:date>
    </item>
  </channel>
</rss>

