<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Best practices for partition tables in Impala 3.2.0 in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376094#M242767</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/106852"&gt;@wcg_hdp_manager&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Please review the Impala partitioning best practices guide:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.cloudera.com/best-practices/latest/impala-partitioning/topics/bp-impala-partitioning-considerations.html" target="_blank"&gt;https://docs.cloudera.com/best-practices/latest/impala-partitioning/topics/bp-impala-partitioning-considerations.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;And the CDP 7.1.8 Impala partitioning guide:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/impala-reference/topics/impala-partition.html" target="_blank"&gt;https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/impala-reference/topics/impala-partition.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Do not partition your table unless you have a good reason to do so. The number of records (100 million) is not, by itself, necessarily a reason. You need to know what kind of queries will run against the table (do they always filter on one or more columns in the WHERE clause, so that Impala can take advantage of partition pruning? If not, the whole dataset might be scanned anyway) and how you ingest the data (do you load new partitions each day, or are there other factors?).&lt;/P&gt;&lt;P&gt;Creating too many partitions will likely produce many small files instead of fewer, larger files. Processing data spread across many data files is less efficient, and if that becomes a general trend it can put stress on the HDFS NameNode, which needs to keep track of every data file.&lt;/P&gt;&lt;P&gt;Hope this helps,&lt;/P&gt;&lt;P&gt;Miklos&lt;/P&gt;</description>
    <pubDate>Thu, 07 Sep 2023 08:08:34 GMT</pubDate>
    <dc:creator>mszurap</dc:creator>
    <dc:date>2023-09-07T08:08:34Z</dc:date>
    <item>
      <title>Best practices for partition tables in Impala 3.2.0</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376084#M242761</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We need to partition 5 Hive tables with more than 100 million records each. The tables have several candidate partition columns, holding integer and string data, each of which would yield fewer than 1000 partitions. Could we combine several of these columns to partition the tables? Would there be a restriction on the maximum number of combined partitions? Should the columns follow a specific order according to the number of partitions each one produces? For example, first partition by year (10 partitions), then by country (50 partitions), and then by city name (500 partitions)? Must the number of combined partitions stay below a particular limit?&lt;/P&gt;&lt;P&gt;Best Regards,&lt;/P&gt;&lt;P&gt;H.M&lt;/P&gt;</description>
      <pubDate>Wed, 06 Sep 2023 22:58:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376084#M242761</guid>
      <dc:creator>wcg_hdp_manager</dc:creator>
      <dc:date>2023-09-06T22:58:41Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for partition tables in Impala 3.2.0</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376085#M242762</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/106852"&gt;@wcg_hdp_manager&lt;/a&gt;&amp;nbsp;Welcome to the Cloudera Community!&lt;BR /&gt;&lt;BR /&gt;To help you get the best possible solution, I have tagged our Hive expert &lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/12885"&gt;@mszurap&lt;/a&gt;&amp;nbsp;who may be able to assist you further.&lt;BR /&gt;&lt;BR /&gt;Please keep us updated on your post, and we hope you find a satisfactory solution to your query.&lt;/P&gt;</description>
      <pubDate>Wed, 06 Sep 2023 23:18:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376085#M242762</guid>
      <dc:creator>DianaTorres</dc:creator>
      <dc:date>2023-09-06T23:18:35Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for partition tables in Impala 3.2.0</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376094#M242767</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/106852"&gt;@wcg_hdp_manager&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Please review the Impala partitioning best practices guide:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.cloudera.com/best-practices/latest/impala-partitioning/topics/bp-impala-partitioning-considerations.html" target="_blank"&gt;https://docs.cloudera.com/best-practices/latest/impala-partitioning/topics/bp-impala-partitioning-considerations.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;And the CDP 7.1.8 Impala partitioning guide:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/impala-reference/topics/impala-partition.html" target="_blank"&gt;https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/impala-reference/topics/impala-partition.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Do not partition your table unless you have a good reason to do so. The number of records (100 million) is not, by itself, necessarily a reason. You need to know what kind of queries will run against the table (do they always filter on one or more columns in the WHERE clause, so that Impala can take advantage of partition pruning? If not, the whole dataset might be scanned anyway) and how you ingest the data (do you load new partitions each day, or are there other factors?).&lt;/P&gt;&lt;P&gt;Creating too many partitions will likely produce many small files instead of fewer, larger files. Processing data spread across many data files is less efficient, and if that becomes a general trend it can put stress on the HDFS NameNode, which needs to keep track of every data file.&lt;/P&gt;&lt;P&gt;Hope this helps,&lt;/P&gt;&lt;P&gt;Miklos&lt;/P&gt;</description>
      <pubDate>Thu, 07 Sep 2023 08:08:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376094#M242767</guid>
      <dc:creator>mszurap</dc:creator>
      <dc:date>2023-09-07T08:08:34Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices for partition tables in Impala 3.2.0</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376472#M242929</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/106852"&gt;@wcg_hdp_manager&lt;/a&gt;&amp;nbsp;Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.&lt;/P&gt;</description>
      <pubDate>Mon, 18 Sep 2023 21:39:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practices-for-partition-tables-in-Impala-3-2-0/m-p/376472#M242929</guid>
      <dc:creator>DianaTorres</dc:creator>
      <dc:date>2023-09-18T21:39:03Z</dc:date>
    </item>
  </channel>
</rss>

