Support Questions


Adding a partition to an external Hive table causes COUNT to return 0 when hive.stats.autogather is set to true

New Contributor

Hello,

I've encountered a problem using a partitioned Hive external table when statistics are gathered automatically (hive.stats.autogather is set to true).

When adding a partition to an external Hive table with hive.stats.autogather=true, the partition parameter numRows is initialized to 0. Because of this, a simple COUNT(*) always returns 0 taken from that statistic instead of running a MapReduce job, even with hive.fetch.task.conversion=none.

Is there a way to let users run ADD PARTITION on external tables without initializing statistics, even when hive.stats.autogather is set to true?

Steps to reproduce:

HDFS files:

$ hdfs dfs -ls -R /user/gmanche/data/
drwx------  - gmanche hdfs  0 2018-07-27 17:18 /user/gmanche/data/part=1
-rw-------  3 gmanche hdfs  59 2018-07-27 17:18 /user/gmanche/data/part=1/data.csv
drwx------  - gmanche hdfs  0 2018-07-27 17:19 /user/gmanche/data/part=2
-rw-------  3 gmanche hdfs  39 2018-07-27 17:19 /user/gmanche/data/part=2/data.csv

HDFS file contents:

$ hdfs dfs -text /user/gmanche/data/part=1/data.csv
row1-val1,row1-val2
row2-val1,row2-val2
row3-val1,row3-val2
$ hdfs dfs -text /user/gmanche/data/part=2/data.csv
row4-val1,row4-val2
row5-val1,row5-val2

Hive SQL scripts:

- Initial situation: create the table with hive.stats.autogather=true, count rows, and select data

> DROP TABLE IF EXISTS data;
No rows affected (0.923 seconds)
> CREATE EXTERNAL TABLE data(col1 string, col2 string)
> PARTITIONED BY (part string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/user/gmanche/data';
No rows affected (0.174 seconds)
> SET hive.fetch.task.conversion;
+----------------------------------+--+
|               set                |
+----------------------------------+--+
| hive.fetch.task.conversion=none  |
+----------------------------------+--+
1 row selected (0.028 seconds)
> SET hive.stats.autogather=true;
No rows affected (0.014 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
+------------+--+
No rows selected (0.225 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.096 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
+-------+------+--+
| part  | _c1  |
+-------+------+--+
+-------+------+--+
No rows selected (0.706 seconds)
> SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
+------------+------------+------------+--+
No rows selected (0.266 seconds)

- Adding the first partition: add partition 1, show partitions, describe partition 1, count rows, and select data

> ALTER TABLE data ADD PARTITION (part='1');
No rows affected (0.194 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
| part=1     |
+------------+--+
1 row selected (0.211 seconds)
> DESCRIBE EXTENDED data PARTITION (part='1');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707145}
...
10 rows selected (0.336 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.091 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+-------+------+--+
| part  | _c1  |
+-------+------+--+
| 1     | 3    |
+-------+------+--+
1 row selected (2.791 seconds)
> SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
| row1-val1  | row1-val2  | 1          |
| row2-val1  | row2-val2  | 1          |
| row3-val1  | row3-val2  | 1          |
+------------+------------+------------+--+
3 rows selected (1.842 seconds)


- Adding the second partition: add partition 2, show partitions, describe partition 2, count rows, and select data

> ALTER TABLE data ADD PARTITION (part='2');
No rows affected (0.261 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
| part=1     |
| part=2     |
+------------+--+
2 rows selected (0.345 seconds)
> DESCRIBE EXTENDED data PARTITION (part='2');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707151}
...
10 rows selected (0.314 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.097 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+-------+------+--+
| part  | _c1  |
+-------+------+--+
| 1     | 3    |
| 2     | 2    |
+-------+------+--+
2 rows selected (4.814 seconds)
> SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
| row1-val1  | row1-val2  | 1          |
| row2-val1  | row2-val2  | 1          |
| row3-val1  | row3-val2  | 1          |
| row4-val1  | row4-val2  | 2          |
| row5-val1  | row5-val2  | 2          |
+------------+------------+------------+--+
5 rows selected (5.414 seconds)

We can see that after adding a partition, the partition parameter numRows is initialized to 0, so the SQL COUNT(*) returns 0.

COUNT(*) still returned 0 even after setting hive.fetch.task.conversion=none:
> SET hive.fetch.task.conversion=none;
No rows affected (0.057 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.153 seconds)

Counting with a GROUP BY does not use the statistics; it runs on the Tez engine and therefore returns the correct count.
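As an aside, not mentioned in this thread: a possible way to correct the stored statistic after the fact might be to recompute basic stats for the affected partition explicitly, e.g.:

```sql
-- Hypothetical repair step: recompute basic stats for the newly added
-- partition so that numRows matches the data actually on HDFS.
ANALYZE TABLE data PARTITION (part='1') COMPUTE STATISTICS;
```

After this, DESCRIBE EXTENDED should show the real row count for the partition, and COUNT(*) would answer from the corrected statistic.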

Note: with hive.stats.autogather=false, no statistics are initialized, so the problem does not occur.

Gabriel

1 ACCEPTED SOLUTION

Super Collaborator

@Gabriel MANCHE This is a known issue in Hive, tracked in Apache JIRA as HIVE-11266. The workaround is to set hive.stats.autogather=false before querying external tables with queries that involve basic stats.

Also, the bug is fixed in HDP 2.6.5.
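A minimal sketch of the workaround against the table from the question (the part='3' value is hypothetical, just to illustrate a new partition):

```sql
-- Disable automatic stats gathering for this session before ADD PARTITION,
-- so numRows is not initialized to 0 on the new partition.
SET hive.stats.autogather=false;
ALTER TABLE data ADD PARTITION (part='3');

-- COUNT(*) now runs a real job instead of answering from numRows=0.
SELECT COUNT(*) FROM data;
```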

PS: Please accept the answer if you feel it is correct.


4 REPLIES

Expert Contributor

After creating the external table with a location, can you run "MSCK REPAIR TABLE data"?

It should automatically update the Hive metastore with partition information derived from the folder paths.
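For reference, a minimal sketch of that suggestion against the table from the question:

```sql
-- Scan the table's HDFS location and register any partition directories
-- (part=1, part=2, ...) that are missing from the metastore.
MSCK REPAIR TABLE data;

-- Verify that the partitions were picked up.
SHOW PARTITIONS data;
```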

New Contributor

Yes, I already run "MSCK REPAIR TABLE ..." right after creating the table to register the existing partitions.
But I also have to add new partitions after each new data ingestion into the external table (e.g. a daily partition).
These incremental partition additions are done with "ALTER TABLE ADD PARTITION".

Note: with "MSCK REPAIR TABLE", no stats are initialized, even with "hive.stats.autogather=true" set.

Super Collaborator

@Gabriel MANCHE This is a known issue in Hive, tracked in Apache JIRA as HIVE-11266. The workaround is to set hive.stats.autogather=false before querying external tables with queries that involve basic stats.

Also, the bug is fixed in HDP 2.6.5.

PS: Please accept the answer if you feel it is correct.

New Contributor

Thanks for the update @cravani.

We will keep this reference for when we upgrade to HDP 2.6.5.

In the meantime, we are applying the workaround as described to avoid bad stats on our external tables.