07-30-2018 12:43 PM
Thanks for the update @cravani. We will keep this reference for when we upgrade to HDP 2.6.5. In the meantime we are applying the workaround as described to avoid bad stats on our external tables.
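For anyone landing here later: a common way to repair the bad stats after the fact (a sketch, not necessarily the exact workaround referenced above; the partition spec is illustrative) is to recompute the statistics on the affected partition:

```sql
-- Recompute basic stats for a partition whose numRows was initialized to 0.
-- The partition value '1' is illustrative; substitute your own.
ANALYZE TABLE data PARTITION (part='1') COMPUTE STATISTICS;
```

After this, numRows should reflect the actual row count, so stats-based answers to COUNT(*) become correct again.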
07-30-2018 10:19 AM
Yes, I do run "MSCK REPAIR TABLE ..." right after creating the table, to register the partitions that already exist. But I also have to add new partitions after each data ingestion into the external table (e.g. a daily partition), and those incremental additions are done with "ALTER TABLE ... ADD PARTITION". Note: with "MSCK REPAIR TABLE" no stats are initialized, even when "hive.stats.autogather=true" is set.
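To be concrete, the flow looks roughly like this (a sketch; the table and partition column come from my example below, and the daily partition value is illustrative):

```sql
-- One time, right after CREATE EXTERNAL TABLE:
-- register the partition directories already present on HDFS.
MSCK REPAIR TABLE data;

-- Then daily, after new files land under the new partition directory:
ALTER TABLE data ADD PARTITION (part='2018-07-30');
```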
07-27-2018 08:33 PM
Hello, I've run into a problem with a partitioned Hive external table when statistics are gathered automatically (hive.stats.autogather set to true). When a partition is added to the external table with hive.stats.autogather=true, the partition parameter numRows is initialized to 0. Because of this, a simple COUNT(*) always returns 0, answered from that statistic rather than by a MapReduce job, even with hive.fetch.task.conversion=none. Is there a way to let users run ADD PARTITION without initializing statistics on external tables, even when hive.stats.autogather is set to true?

Reproducible with:

HDFS files:

$ hdfs dfs -ls -R /user/gmanche/data/
drwx------ - gmanche hdfs 0 2018-07-27 17:18 /user/gmanche/data/part=1
-rw------- 3 gmanche hdfs 59 2018-07-27 17:18 /user/gmanche/data/part=1/data.csv
drwx------ - gmanche hdfs 0 2018-07-27 17:19 /user/gmanche/data/part=2
-rw------- 3 gmanche hdfs 39 2018-07-27 17:19 /user/gmanche/data/part=2/data.csv

HDFS file contents:

$ hdfs dfs -text /user/gmanche/data/part=1/data.csv
row1-val1,row1-val2
row2-val1,row2-val2
row3-val1,row3-val2
$ hdfs dfs -text /user/gmanche/data/part=2/data.csv
row4-val1,row4-val2
row5-val1,row5-val2

Hive SQL scripts:

- Initial situation: create the table, set hive.stats.autogather=true, count rows and select data

> DROP TABLE IF EXISTS data;
No rows affected (0.923 seconds)
> CREATE EXTERNAL TABLE data(col1 string, col2 string)
> PARTITIONED BY (part string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/user/gmanche/data';
No rows affected (0.174 seconds)
> SET hive.fetch.task.conversion;
+----------------------------------+--+
| set |
+----------------------------------+--+
| hive.fetch.task.conversion=none |
+----------------------------------+--+
1 row selected (0.028 seconds)
> SET hive.stats.autogather=true;
No rows affected (0.014 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition |
+------------+--+
+------------+--+
No rows selected (0.225 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.096 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO : Session is already open
INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
+-------+------+--+
| part | _c1 |
+-------+------+--+
+-------+------+--+
No rows selected (0.706 seconds)
> SELECT * FROM data;
INFO : Session is already open
INFO : Dag name: SELECT * FROM data(Stage-1)
+------------+------------+------------+--+
| data.col1 | data.col2 | data.part |
+------------+------------+------------+--+
+------------+------------+------------+--+
No rows selected (0.266 seconds)

- Adding the first partition: add partition one, show partitions, describe partition one, count rows and select data

> ALTER TABLE data ADD PARTITION (part='1');
No rows affected (0.194 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition |
+------------+--+
| part=1 |
+------------+--+
1 row selected (0.211 seconds)
> DESCRIBE EXTENDED data PARTITION (part='1');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707145}
...
10 rows selected (0.336 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.091 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO : Session is already open
INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+-------+------+--+
| part | _c1 |
+-------+------+--+
| 1 | 3 |
+-------+------+--+
1 row selected (2.791 seconds)
> SELECT * FROM data;
INFO : Session is already open
INFO : Dag name: SELECT * FROM data(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+------------+------------+------------+--+
| data.col1 | data.col2 | data.part |
+------------+------------+------------+--+
| row1-val1 | row1-val2 | 1 |
| row2-val1 | row2-val2 | 1 |
| row3-val1 | row3-val2 | 1 |
+------------+------------+------------+--+
3 rows selected (1.842 seconds)
- Adding the second partition: add partition two, show partitions, describe partition two, count rows and select data

> ALTER TABLE data ADD PARTITION (part='2');
No rows affected (0.261 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition |
+------------+--+
| part=1 |
| part=2 |
+------------+--+
2 rows selected (0.345 seconds)
> DESCRIBE EXTENDED data PARTITION (part='2');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707151}
...
10 rows selected (0.314 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.097 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO : Session is already open
INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+-------+------+--+
| part | _c1 |
+-------+------+--+
| 1 | 3 |
| 2 | 2 |
+-------+------+--+
2 rows selected (4.814 seconds)
> SELECT * FROM data;
INFO : Session is already open
INFO : Dag name: SELECT * FROM data(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+------------+------------+------------+--+
| data.col1 | data.col2 | data.part |
+------------+------------+------------+--+
| row1-val1 | row1-val2 | 1 |
| row2-val1 | row2-val2 | 1 |
| row3-val1 | row3-val2 | 1 |
| row4-val1 | row4-val2 | 2 |
| row5-val1 | row5-val2 | 2 |
+------------+------------+------------+--+
5 rows selected (5.414 seconds)
We can see that after adding a partition, the partition parameter numRows is initialized to 0 and the SQL COUNT(*) returns 0. I get COUNT(*) = 0 even after setting hive.fetch.task.conversion=none.

> SET hive.fetch.task.conversion=none;
No rows affected (0.057 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.153 seconds)

Counting with a GROUP BY does not use the statistics but runs on the Tez engine, so it returns the correct count. Note: with hive.stats.autogather=false no statistics are initialized, so there is no problem.

Gabriel
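Since the wrong COUNT(*) is answered from the metastore statistics rather than by scanning the data, a session-level way to force the count to run as a real job (my assumption here is that hive.compute.query.using.stats controls whether simple aggregates are answered directly from stats) would be:

```sql
-- Stop answering COUNT/MIN/MAX directly from metastore stats,
-- forcing the query to be executed by the engine (Tez/MR) instead.
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM data;
```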
Labels:
- Apache Hive