<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Hive query plan not using generated statistics in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384834#M245507</link>
    <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="screenshot.PNG" style="width: 828px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40035i65B61A4368DECC20/image-size/large?v=v2&amp;amp;px=999" role="button" title="screenshot.PNG" alt="screenshot.PNG" /&gt;&lt;/span&gt;I have a parquet table named 'cats' holding 2 columns&amp;nbsp;(day STRING, number INT). Statistics are generated both for the table and the 2 columns (e.g. max, min, distinct can be seen in DESCRIBE FORMATTED default.cats number).&lt;/P&gt;&lt;P&gt;Still, when running an explain command for a select max, it looks like the statistics aren't being used. Why is this? How can Hive be configured to use the stats?&lt;/P&gt;&lt;P&gt;All the following settings are true:&lt;BR /&gt;set hive.cbo.enable;&lt;BR /&gt;set hive.stats.autogather;&lt;BR /&gt;set hive.compute.query.using.stats;&lt;BR /&gt;set hive.stats.fetch.column.stats;&lt;BR /&gt;set hive.stats.fetch.partition.stats;&lt;/P&gt;&lt;P&gt;explain select max(number) from default.cats;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 11 Mar 2024 13:22:43 GMT</pubDate>
    <dc:creator>Leopold</dc:creator>
    <dc:date>2024-03-11T13:22:43Z</dc:date>
    <item>
      <title>Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384834#M245507</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="screenshot.PNG" style="width: 828px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/40035i65B61A4368DECC20/image-size/large?v=v2&amp;amp;px=999" role="button" title="screenshot.PNG" alt="screenshot.PNG" /&gt;&lt;/span&gt;I have a parquet table named 'cats' holding 2 columns&amp;nbsp;(day STRING, number INT). Statistics are generated both for the table and the 2 columns (e.g. max, min, distinct can be seen in DESCRIBE FORMATTED default.cats number).&lt;/P&gt;&lt;P&gt;Still, when running an explain command for a select max, it looks like the statistics aren't being used. Why is this? How can Hive be configured to use the stats?&lt;/P&gt;&lt;P&gt;All the following settings are true:&lt;BR /&gt;set hive.cbo.enable;&lt;BR /&gt;set hive.stats.autogather;&lt;BR /&gt;set hive.compute.query.using.stats;&lt;BR /&gt;set hive.stats.fetch.column.stats;&lt;BR /&gt;set hive.stats.fetch.partition.stats;&lt;/P&gt;&lt;P&gt;explain select max(number) from default.cats;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Mar 2024 13:22:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384834#M245507</guid>
      <dc:creator>Leopold</dc:creator>
      <dc:date>2024-03-11T13:22:43Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384860#M245524</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/109565"&gt;@Leopold&lt;/a&gt;&amp;nbsp;Welcome to the Cloudera Community!&lt;BR /&gt;&lt;BR /&gt;To help you get the best possible solution, I have tagged our Hive experts&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/12885"&gt;@mszurap&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/38161"&gt;@cravani&lt;/a&gt;&amp;nbsp; who may be able to assist you further.&lt;BR /&gt;&lt;BR /&gt;Please keep us updated on your post, and we hope you find a satisfactory solution to your query.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Mar 2024 20:54:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384860#M245524</guid>
      <dc:creator>DianaTorres</dc:creator>
      <dc:date>2024-03-11T20:54:07Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384885#M245539</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/109565"&gt;@Leopold&lt;/a&gt;&amp;nbsp;provided we have column stats available, Hive could use a fetch task to perform a simple aggregation task such as max(), instead of launching a Map job.&lt;/P&gt;&lt;P&gt;Try hive.fetch.task.aggr=true .&lt;/P&gt;&lt;P&gt;This property is disabled by default.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 09:38:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384885#M245539</guid>
      <dc:creator>smruti</dc:creator>
      <dc:date>2024-03-12T09:38:43Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384890#M245544</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/82698"&gt;@smruti&lt;/a&gt;&amp;nbsp;Column stats are available indeed: &lt;SPAN&gt;{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"number\":\"true\"}}.&lt;BR /&gt;I extended the table size to 1gig+ (since hive.fetch.task.conversion.threshold=1073741824), and tried&amp;nbsp;hive.fetch.task.aggr=true before doing the explain select for the max. It is unfortunately still doing a full TableScan as seen in the screenshot above. I would like to remark as well that we are talking about an external table here. When I put the same data in a managed table, the query plan makes use of stats as expected.&lt;BR /&gt;So in summary, I don't understand why stats arent being used for this external table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 10:18:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384890#M245544</guid>
      <dc:creator>Leopold</dc:creator>
      <dc:date>2024-03-12T10:18:58Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384896#M245547</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/109565"&gt;@Leopold&lt;/a&gt;&amp;nbsp;I just checked. Your observation is correct. For external tables, it does not use a fetch task. In the logs, I see the following message:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;2024-03-12 10:48:37,247 INFO  org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.&lt;/LI-CODE&gt;&lt;P&gt;But enabling&amp;nbsp; "&lt;SPAN&gt;hive.fetch.task.aggr=true" will help avoid the Reducer phase that is used for final aggregation. It will be a Map-only job.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 11:00:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384896#M245547</guid>
      <dc:creator>smruti</dc:creator>
      <dc:date>2024-03-12T11:00:12Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384897#M245548</link>
      <description>&lt;P&gt;Thank you&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/82698"&gt;@smruti&lt;/a&gt;. Is there no way to enable the use of statistics for external tables in Hive?&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 11:29:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384897#M245548</guid>
      <dc:creator>Leopold</dc:creator>
      <dc:date>2024-03-12T11:29:35Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384903#M245554</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/109565"&gt;@Leopold&lt;/a&gt;&amp;nbsp;It is disabled for external tables as data in HDFS can change without Hive knowing about it.&amp;nbsp; Unfortunately I do not see a way to enforce fetch task for a query with an aggregate function.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 14:46:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384903#M245554</guid>
      <dc:creator>smruti</dc:creator>
      <dc:date>2024-03-12T14:46:42Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384951#M245576</link>
      <description>&lt;P&gt;It looks indeed that Hive&amp;nbsp;&lt;SPAN&gt;StatsOptimizer seems to exclude external tables from getting their queries answered entirely by stats. So this rules out usage of fetch stats for simple aggregation queries such as min, max, etc.&lt;BR /&gt;However, the optimizer can also use stats for query plan optimizing (e.g. join reordering) in complex queries. Are external tables excluded from these optimizations as well?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 13 Mar 2024 13:49:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/384951#M245576</guid>
      <dc:creator>Leopold</dc:creator>
      <dc:date>2024-03-13T13:49:53Z</dc:date>
    </item>
    <item>
      <title>Re: Hive query plan not using generated statistics</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/385035#M245597</link>
      <description>&lt;P&gt;Hive does use stats from an external table in preparing query plan. When stats are accurate, it could&lt;SPAN&gt;&amp;nbsp;estimate the size of intermediate data sets and select efficient join strategies. The only thing I noticed is the fetch task is not working.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 15 Mar 2024 06:58:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-query-plan-not-using-generated-statistics/m-p/385035#M245597</guid>
      <dc:creator>smruti</dc:creator>
      <dc:date>2024-03-15T06:58:14Z</dc:date>
    </item>
  </channel>
</rss>

