Created 07-27-2018 08:33 PM
Hello,
I've encountered a problem using a partitioned Hive external table when statistics are gathered automatically (hive.stats.autogather
is set to true
).
When adding partition on an external Hive table when setting hive.stats.autogather=true
the partition param numRows
is initialized to 0
. Due to this a simple COUNT
returns alway 0
from this statistic and no more from a MapReduce job even with hive.fetch.task.conversion=none
.
Is there a solution to let user make a ADD PARTITION
without initializing statistics on external tables even if hive.stats.autogather
is set to true
?
Reproductive with :
HDFS files :
$ hdfs dfs -ls -R /user/gmanche/data/ drwx------ - gmanche hdfs 0 2018-07-27 17:18 /user/gmanche/data/part=1 -rw------- 3 gmanche hdfs 59 2018-07-27 17:18 /user/gmanche/data/part=1/data.csv drwx------ - gmanche hdfs 0 2018-07-27 17:19 /user/gmanche/data/part=2 -rw------- 3 gmanche hdfs 39 2018-07-27 17:19 /user/gmanche/data/part=2/data.csv
HDFS files content :
$ hdfs dfs -text /user/gmanche/data/part=1/data.csv row1-val1,row1-val2 row2-val1,row2-val2 row3-val1,row3-val2 $ hdfs dfs -text /user/gmanche/data/part=2/data.csv row4-val1,row4-val2 row5-val1,row5-val2
Hive SQL scripts :
- Initial situation : creating table, hive.stats.autogather=true, row count and select data
> DROP TABLE IF EXISTS data; No rows affected (0.923 seconds)
> CREATE EXTERNAL TABLE data(col1 string, col2 string) > PARTITIONED BY (part string) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > STORED AS TEXTFILE > LOCATION '/user/gmanche/data'; No rows affected (0.174 seconds)
> SET hive.fetch.task.conversion; +----------------------------------+--+ | set | +----------------------------------+--+ | hive.fetch.task.conversion=none | +----------------------------------+--+ 1 row selected (0.028 seconds)
> SET hive.stats.autogather=true; No rows affected (0.014 seconds)
> SHOW PARTITIONS data; +------------+--+ | partition | +------------+--+ +------------+--+ No rows selected (0.225 seconds)
> SELECT COUNT(*) FROM data; +------+--+ | _c0 | +------+--+ | 0 | +------+--+ 1 row selected (0.096 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part; INFO : Session is already open INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1) +-------+------+--+ | part | _c1 | +-------+------+--+ +-------+------+--+ No rows selected (0.706 seconds)
> SELECT * FROM data; INFO : Session is already open INFO : Dag name: SELECT * FROM data(Stage-1) +------------+------------+------------+--+ | data.col1 | data.col2 | data.part | +------------+------------+------------+--+ +------------+------------+------------+--+ No rows selected (0.266 seconds)
- Adding first partition : add partition one, show partitions, describe partition one, row count and select data
> ALTER TABLE data ADD PARTITION (part='1'); No rows affected (0.194 seconds)
> SHOW PARTITIONS data; +------------+--+ | partition | +------------+--+ | part=1 | +------------+--+ 1 row selected (0.211 seconds)
> DESCRIBE EXTENDED data PARTITION (part='1'); ... parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707145} ... 10 rows selected (0.336 seconds)
> SELECT COUNT(*) FROM data; +------+--+ | _c0 | +------+--+ | 0 | +------+--+ 1 row selected (0.091 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part; INFO : Session is already open INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009) +-------+------+--+ | part | _c1 | +-------+------+--+ | 1 | 3 | +-------+------+--+ 1 row selected (2.791 seconds)
> SELECT * FROM data; INFO : Session is already open INFO : Dag name: SELECT * FROM data(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009) +------------+------------+------------+--+ | data.col1 | data.col2 | data.part | +------------+------------+------------+--+ | row1-val1 | row1-val2 | 1 | | row2-val1 | row2-val2 | 1 | | row3-val1 | row3-val2 | 1 | +------------+------------+------------+--+ 3 rows selected (1.842 seconds)
- Adding second partition : add partition two, show partitions, describe partition two, row count and select data
> ALTER TABLE data ADD PARTITION (part='2'); No rows affected (0.261 seconds)
> SHOW PARTITIONS data; +------------+--+ | partition | +------------+--+ | part=1 | | part=2 | +------------+--+ 2 rows selected (0.345 seconds)
> DESCRIBE EXTENDED data PARTITION (part='2'); ... parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707151} ... 10 rows selected (0.314 seconds)
> SELECT COUNT(*) FROM data; +------+--+ | _c0 | +------+--+ | 0 | +------+--+ 1 row selected (0.097 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part; INFO : Session is already open INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009) +-------+------+--+ | part | _c1 | +-------+------+--+ | 1 | 3 | | 2 | 2 | +-------+------+--+ 2 rows selected (4.814 seconds)
> SELECT * FROM data; INFO : Session is already open INFO : Dag name: SELECT * FROM data(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009) +------------+------------+------------+--+ | data.col1 | data.col2 | data.part | +------------+------------+------------+--+ | row1-val1 | row1-val2 | 1 | | row2-val1 | row2-val2 | 1 | | row3-val1 | row3-val2 | 1 | | row4-val1 | row4-val2 | 2 | | row5-val1 | row5-val2 | 2 | +------------+------------+------------+--+ 5 rows selected (5.414 seconds)
We can see that after creating partition, the partition param numRows
is initialized to 0
, the SQL command COUNT
returns 0
.
COUNT
to 0
even if I set hive.fetch.task.conversion=none
.> SET hive.fetch.task.conversion=none; No rows affected (0.057 seconds) > SELECT COUNT(*) FROM data; +------+--+ | _c0 | +------+--+ | 0 | +------+--+ 1 row selected (0.153 seconds)
Counting with a GROUP BY
involves to not use statistics, but use the engine TEZ so get the good count.
Note : with hive.stats.autogather=false
I do not have statistics initialized, so no problem.
Gabriel
Created 07-30-2018 12:38 PM
@Gabriel MANCHE This is a known issue with Hive tracked in Apache JIRA HIVE-11266 , the workaround is to set hive.stats.autogather=false before querying external tables which involves basic stats.
Also, the Bug is fixed in HDP 2.6.5.
PS: Please accept the answer if you feel it is correct.
Created 07-30-2018 05:37 AM
After create external table with location, can you run "msck repair table data" ?
It should automatically update partition information from folder path to hive metadata.
Created 07-30-2018 10:19 AM
Yes sure I'm doing a "MSCK REPAIR TABLE ..." just after creating my table to initiate existing partitions.
But I have to add new partitions after new data ingestion in the external table (ie: daily partition).
My incremental partition adds is done by "ALTER TABLE ADD PARTITION".
Precision : using "MSCK REPAIR TABLE" no stats are initialized even with setting "hive.stats.autogather=true".
Created 07-30-2018 12:38 PM
@Gabriel MANCHE This is a known issue with Hive tracked in Apache JIRA HIVE-11266 , the workaround is to set hive.stats.autogather=false before querying external tables which involves basic stats.
Also, the Bug is fixed in HDP 2.6.5.
PS: Please accept the answer if you feel it is correct.
Created 07-30-2018 12:43 PM
Thanks for update @cravani.
We will keep this reference when we will update to HDP 2.6.5.
We are applying the workaround as described to avoid bad stats on our external tables.