Support Questions

gmanche · ‎07-27-2018

Hello,

I've encountered a problem using a partitioned Hive external table when statistics are gathered automatically (hive.stats.autogather is set to true).

When adding partition on an external Hive table when setting hive.stats.autogather=true the partition param numRows is initialized to 0. Due to this a simple COUNT returns alway 0 from this statistic and no more from a MapReduce job even with hive.fetch.task.conversion=none.

Is there a solution to let user make a ADD PARTITION without initializing statistics on external tables even if hive.stats.autogather is set to true ?

Reproductive with :

HDFS files :

$ hdfs dfs -ls -R /user/gmanche/data/
drwx------  - gmanche hdfs  0 2018-07-27 17:18 /user/gmanche/data/part=1
-rw-------  3 gmanche hdfs  59 2018-07-27 17:18 /user/gmanche/data/part=1/data.csv
drwx------  - gmanche hdfs  0 2018-07-27 17:19 /user/gmanche/data/part=2
-rw-------  3 gmanche hdfs  39 2018-07-27 17:19 /user/gmanche/data/part=2/data.csv

HDFS files content :

$ hdfs dfs -text /user/gmanche/data/part=1/data.csv
row1-val1,row1-val2
row2-val1,row2-val2
row3-val1,row3-val2
$ hdfs dfs -text /user/gmanche/data/part=2/data.csv
row4-val1,row4-val2
row5-val1,row5-val2

Hive SQL scripts :

- Initial situation : creating table, hive.stats.autogather=true, row count and select data

> DROP TABLE IF EXISTS data;
No rows affected (0.923 seconds)

> CREATE EXTERNAL TABLE data(col1 string, col2 string)
> PARTITIONED BY (part string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/user/gmanche/data';
No rows affected (0.174 seconds)

> SET hive.fetch.task.conversion;
+----------------------------------+--+
|               set                |
+----------------------------------+--+
| hive.fetch.task.conversion=none  |
+----------------------------------+--+
1 row selected (0.028 seconds)

> SET hive.stats.autogather=true;
No rows affected (0.014 seconds)

> SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
+------------+--+
No rows selected (0.225 seconds)

> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.096 seconds)

> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
+-------+------+--+
| part  | _c1  |
+-------+------+--+
+-------+------+--+
No rows selected (0.706 seconds)

> SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
+------------+------------+------------+--+
No rows selected (0.266 seconds)

- Adding first partition : add partition one, show partitions, describe partition one, row count and select data

> ALTER TABLE data ADD PARTITION (part='1');
No rows affected (0.194 seconds)

> SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
| part=1     |
+------------+--+
1 row selected (0.211 seconds)

> DESCRIBE EXTENDED data PARTITION (part='1');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707145}
...
10 rows selected (0.336 seconds)

> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.091 seconds)

> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+-------+------+--+
| part  | _c1  |
+-------+------+--+
| 1     | 3    |
+-------+------+--+
1 row selected (2.791 seconds)

> SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
| row1-val1  | row1-val2  | 1          |
| row2-val1  | row2-val2  | 1          |
| row3-val1  | row3-val2  | 1          |
+------------+------------+------------+--+
3 rows selected (1.842 seconds)

- Adding second partition : add partition two, show partitions, describe partition two, row count and select data

> ALTER TABLE data ADD PARTITION (part='2');
No rows affected (0.261 seconds)

> SHOW PARTITIONS data;
+------------+--+
| partition  |
+------------+--+
| part=1     |
| part=2     |
+------------+--+
2 rows selected (0.345 seconds)

> DESCRIBE EXTENDED data PARTITION (part='2');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707151}
...
10 rows selected (0.314 seconds)

> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.097 seconds)

> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO  : Session is already open
INFO  : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)

+-------+------+--+
| part  | _c1  |
+-------+------+--+
| 1     | 3    |
| 2     | 2    |
+-------+------+--+
2 rows selected (4.814 seconds)

> SELECT * FROM data;
INFO  : Session is already open
INFO  : Dag name: SELECT * FROM data(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+------------+------------+------------+--+
| data.col1  | data.col2  | data.part  |
+------------+------------+------------+--+
| row1-val1  | row1-val2  | 1          |
| row2-val1  | row2-val2  | 1          |
| row3-val1  | row3-val2  | 1          |
| row4-val1  | row4-val2  | 2          |
| row5-val1  | row5-val2  | 2          |
+------------+------------+------------+--+
5 rows selected (5.414 seconds)

We can see that after creating partition, the partition param numRows is initialized to 0, the SQL command COUNT returns 0.

I had COUNT to 0 even if I set hive.fetch.task.conversion=none.

> SET hive.fetch.task.conversion=none;
No rows affected (0.057 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
1 row selected (0.153 seconds)

Counting with a GROUP BY involves to not use statistics, but use the engine TEZ so get the good count.

Note : with hive.stats.autogather=false I do not have statistics initialized, so no problem.

Gabriel

cravani · ‎07-30-2018

@Gabriel MANCHE This is a known issue with Hive tracked in Apache JIRA HIVE-11266 , the workaround is to set hive.stats.autogather=false before querying external tables which involves basic stats.

Also, the Bug is fixed in HDP 2.6.5.

PS: Please accept the answer if you feel it is correct.

View solution in original post

nramanaiah · ‎07-30-2018

After create external table with location, can you run "msck repair table data" ?

It should automatically update partition information from folder path to hive metadata.

gmanche · ‎07-30-2018

Yes sure I'm doing a "MSCK REPAIR TABLE ..." just after creating my table to initiate existing partitions.
But I have to add new partitions after new data ingestion in the external table (ie: daily partition).
My incremental partition adds is done by "ALTER TABLE ADD PARTITION".

Precision : using "MSCK REPAIR TABLE" no stats are initialized even with setting "hive.stats.autogather=true".

cravani · ‎07-30-2018

@Gabriel MANCHE This is a known issue with Hive tracked in Apache JIRA HIVE-11266 , the workaround is to set hive.stats.autogather=false before querying external tables which involves basic stats.

Also, the Bug is fixed in HDP 2.6.5.

PS: Please accept the answer if you feel it is correct.

gmanche · ‎07-30-2018

Thanks for update @cravani.

We will keep this reference when we will update to HDP 2.6.5.

We are applying the workaround as described to avoid bad stats on our external tables.

Cloudera Community

Support Questions

Adding a partition to an external Hive table will cause COUNT returns 0 when property hive.stats.autogather set to true