<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Adding a partition to an external Hive table will cause COUNT returns 0 when property hive.stats.autogather set to true in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206894#M81324</link>
    <description>&lt;P&gt;Yes sure I'm doing a "MSCK REPAIR TABLE ..." just after creating my table to initiate existing partitions.&lt;BR /&gt;But I have to add new partitions after new data ingestion in the external table (ie: daily partition).&lt;BR /&gt;My incremental partition adds is done by "ALTER TABLE ADD PARTITION".&lt;/P&gt;&lt;P&gt;Precision : using "MSCK REPAIR TABLE" no stats are initialized even with setting "hive.stats.autogather=true".&lt;/P&gt;</description>
    <pubDate>Mon, 30 Jul 2018 17:19:47 GMT</pubDate>
    <dc:creator>gmanche</dc:creator>
    <dc:date>2018-07-30T17:19:47Z</dc:date>
    <item>
      <title>Adding a partition to an external Hive table will cause COUNT returns 0 when property hive.stats.autogather set to true</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206892#M81322</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I've encountered a problem using a partitioned Hive external table when statistics are gathered automatically (&lt;CODE&gt;hive.stats.autogather&lt;/CODE&gt; is set to &lt;CODE&gt;true&lt;/CODE&gt;).&lt;/P&gt;&lt;P&gt;When adding partition on an external Hive table when setting &lt;CODE&gt;hive.stats.autogather=true&lt;/CODE&gt; the partition param &lt;CODE&gt;numRows&lt;/CODE&gt; is initialized to &lt;CODE&gt;0&lt;/CODE&gt;. Due to this a simple &lt;CODE&gt;COUNT&lt;/CODE&gt; returns alway &lt;CODE&gt;0&lt;/CODE&gt; from this statistic and no more from a MapReduce job even with &lt;CODE&gt;hive.fetch.task.conversion=none&lt;/CODE&gt;.&lt;/P&gt;&lt;P&gt;Is there a solution to let user make a &lt;CODE&gt;ADD PARTITION&lt;/CODE&gt; without initializing statistics on external tables even if &lt;CODE&gt;hive.stats.autogather&lt;/CODE&gt; is set to &lt;CODE&gt;true&lt;/CODE&gt; ?&lt;/P&gt;&lt;P&gt;Reproductive with :&lt;/P&gt;&lt;P&gt;HDFS files :&lt;/P&gt;&lt;PRE&gt;$ hdfs dfs -ls -R /user/gmanche/data/
drwx------  - gmanche hdfs  0 2018-07-27 17:18 /user/gmanche/data/part=1
-rw-------  3 gmanche hdfs  59 2018-07-27 17:18 /user/gmanche/data/part=1/data.csv
drwx------  - gmanche hdfs  0 2018-07-27 17:19 /user/gmanche/data/part=2
-rw-------  3 gmanche hdfs  39 2018-07-27 17:19 /user/gmanche/data/part=2/data.csv&lt;/PRE&gt;&lt;P&gt;HDFS files content :&lt;/P&gt;&lt;PRE&gt;$ hdfs dfs -text /user/gmanche/data/part=1/data.csv
row1-val1,row1-val2
row2-val1,row2-val2
row3-val1,row3-val2
$ hdfs dfs -text /user/gmanche/data/part=2/data.csv
row4-val1,row4-val2
row5-val1,row5-val2&lt;/PRE&gt;&lt;P&gt;Hive SQL scripts :&lt;/P&gt;&lt;P&gt;- Initial situation : creating table, hive.stats.autogather=true, row count and select data&lt;/P&gt;&lt;PRE&gt;&amp;gt; DROP TABLE IF EXISTS data;
No rows affected (0.923 seconds)
&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; CREATE EXTERNAL TABLE data(col1 string, col2 string)
&amp;gt; PARTITIONED BY (part string)
&amp;gt; ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
&amp;gt; STORED AS TEXTFILE
&amp;gt; LOCATION '/user/gmanche/data';
No rows affected (0.174 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SET hive.fetch.task.conversion;
+----------------------------------+--+
|               set                |
+----------------------------------+--+
| hive.fetch.task.conversion=none  |
+----------------------------------+--+
1 row selected (0.028 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SET hive.stats.autogather=true;
No rows affected (0.014 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
+------------+--+
No rows selected (0.225 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.096 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
+-------+------+--+
| part  | _c1  |
+-------+------+--+
+-------+------+--+
No rows selected (0.706 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
+------------+------------+------------+--+
No rows selected (0.266 seconds)&lt;/PRE&gt;&lt;P&gt;- Adding first partition : add partition one, show partitions, describe partition one, row count and select data&lt;/P&gt;&lt;PRE&gt;&amp;gt; ALTER TABLE data ADD PARTITION (part='1');
No rows affected (0.194 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
| part=1     |
+------------+--+
1 row selected (0.211 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; DESCRIBE EXTENDED data PARTITION (part='1');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707145}
...
10 rows selected (0.336 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.091 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+-------+------+--+
| part  | _c1  |
+-------+------+--+
| 1     | 3    |
+-------+------+--+
1 row selected (2.791 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
| row1-val1  | row1-val2  | 1          |
| row2-val1  | row2-val2  | 1          |
| row3-val1  | row3-val2  | 1          |
+------------+------------+------------+--+
3 rows selected (1.842 seconds)

&lt;/PRE&gt;
&lt;BR /&gt;
&lt;P&gt;- Adding second partition : add partition two, show partitions, describe partition two, row count and select data&lt;/P&gt;&lt;PRE&gt;&amp;gt; ALTER TABLE data ADD PARTITION (part='2');
No rows affected (0.261 seconds)
&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
| part=1     |
| part=2     |
+------------+--+
2 rows selected (0.345 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; DESCRIBE EXTENDED data PARTITION (part='2');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707151}
...
10 rows selected (0.314 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.097 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+-------+------+--+
| part  | _c1  |
+-------+------+--+
| 1     | 3    |
| 2     | 2    |
+-------+------+--+
2 rows selected (4.814 seconds)&lt;/PRE&gt;
&lt;PRE&gt;&amp;gt; SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
| row1-val1  | row1-val2  | 1          |
| row2-val1  | row2-val2  | 1          |
| row3-val1  | row3-val2  | 1          |
| row4-val1  | row4-val2  | 2          |
| row5-val1  | row5-val2  | 2          |
+------------+------------+------------+--+
5 rows selected (5.414 seconds)

&lt;/PRE&gt;&lt;P&gt;We can see that after creating partition, the partition param &lt;CODE&gt;numRows&lt;/CODE&gt; is initialized to &lt;CODE&gt;0&lt;/CODE&gt;, the SQL command &lt;CODE&gt;COUNT&lt;/CODE&gt; returns &lt;CODE&gt;0&lt;/CODE&gt;.&lt;/P&gt;I had &lt;CODE&gt;COUNT&lt;/CODE&gt; to &lt;CODE&gt;0&lt;/CODE&gt; even if I set &lt;CODE&gt;hive.fetch.task.conversion=none&lt;/CODE&gt;.&lt;PRE&gt;&amp;gt; SET hive.fetch.task.conversion=none;
No rows affected (0.057 seconds)
&amp;gt; SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.153 seconds)&lt;/PRE&gt;&lt;P&gt;Counting with a &lt;CODE&gt;GROUP BY&lt;/CODE&gt; involves to not use statistics, but use the engine TEZ so get the good count.&lt;/P&gt;&lt;P&gt;Note : with &lt;CODE&gt;hive.stats.autogather=false&lt;/CODE&gt; I do not have statistics initialized, so no problem.&lt;/P&gt;&lt;P&gt;Gabriel&lt;/P&gt;</description>
      <pubDate>Sat, 28 Jul 2018 03:33:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206892#M81322</guid>
      <dc:creator>gmanche</dc:creator>
      <dc:date>2018-07-28T03:33:24Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a partition to an external Hive table will cause COUNT returns 0 when property hive.stats.autogather set to true</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206893#M81323</link>
      <description>&lt;P&gt;After create external table with location, can you run "msck repair table data" ?&lt;/P&gt;&lt;P&gt;It should automatically update partition information from folder path to hive metadata.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jul 2018 12:37:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206893#M81323</guid>
      <dc:creator>nramanaiah</dc:creator>
      <dc:date>2018-07-30T12:37:26Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a partition to an external Hive table will cause COUNT returns 0 when property hive.stats.autogather set to true</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206894#M81324</link>
      <description>&lt;P&gt;Yes sure I'm doing a "MSCK REPAIR TABLE ..." just after creating my table to initiate existing partitions.&lt;BR /&gt;But I have to add new partitions after new data ingestion in the external table (ie: daily partition).&lt;BR /&gt;My incremental partition adds is done by "ALTER TABLE ADD PARTITION".&lt;/P&gt;&lt;P&gt;Precision : using "MSCK REPAIR TABLE" no stats are initialized even with setting "hive.stats.autogather=true".&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jul 2018 17:19:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206894#M81324</guid>
      <dc:creator>gmanche</dc:creator>
      <dc:date>2018-07-30T17:19:47Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a partition to an external Hive table will cause COUNT returns 0 when property hive.stats.autogather set to true</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206895#M81325</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/90172/gabrielmanche.html" nodeid="90172"&gt;@Gabriel MANCHE&lt;/A&gt; This is a known issue with Hive tracked in Apache JIRA &lt;A href="https://issues.apache.org/jira/browse/HIVE-11266"&gt;HIVE-11266&lt;/A&gt; , the workaround is to set hive.stats.autogather=false before querying external tables which involves basic stats.&lt;/P&gt;&lt;P&gt;Also, the Bug is fixed in HDP 2.6.5.&lt;/P&gt;&lt;P&gt;PS: Please accept the answer if you feel it is correct.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jul 2018 19:38:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206895#M81325</guid>
      <dc:creator>cravani</dc:creator>
      <dc:date>2018-07-30T19:38:18Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a partition to an external Hive table will cause COUNT returns 0 when property hive.stats.autogather set to true</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206896#M81326</link>
      <description>&lt;P&gt;Thanks for update &lt;A rel="user" href="https://community.cloudera.com/users/15788/cravani.html" nodeid="15788"&gt;@cravani&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;We will keep this reference when we will update to HDP 2.6.5.&lt;/P&gt;&lt;P&gt;We are applying the workaround as described to avoid bad stats on our external tables.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jul 2018 19:43:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Adding-a-partition-to-an-external-Hive-table-will-cause/m-p/206896#M81326</guid>
      <dc:creator>gmanche</dc:creator>
      <dc:date>2018-07-30T19:43:54Z</dc:date>
    </item>
  </channel>
</rss>

