07-30-2018 12:43 PM
Thanks for the update @cravani. We will keep this reference for when we upgrade to HDP 2.6.5. In the meantime we are applying the workaround as described to avoid bad stats on our external tables.
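For anyone landing here later: a common way to repair the bad stats after the fact (a sketch, not necessarily the exact workaround referenced above; the partition spec is illustrative) is to recompute the statistics on the affected partition:

```sql
-- Recompute basic stats for a partition whose numRows was initialized to 0.
-- The partition value '1' is illustrative; substitute your own.
ANALYZE TABLE data PARTITION (part='1') COMPUTE STATISTICS;
```

After this, numRows should reflect the actual row count, so stats-based answers to COUNT(*) become correct again.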
07-30-2018 10:19 AM
Yes, I do run "MSCK REPAIR TABLE ..." right after creating the table, to register the partitions that already exist. But I also have to add new partitions after each data ingestion into the external table (e.g. a daily partition), and those incremental additions are done with "ALTER TABLE ... ADD PARTITION". Note: with "MSCK REPAIR TABLE" no stats are initialized, even when "hive.stats.autogather=true" is set.
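To be concrete, the flow looks roughly like this (a sketch; the table and partition column come from my example below, and the daily partition value is illustrative):

```sql
-- One time, right after CREATE EXTERNAL TABLE:
-- register the partition directories already present on HDFS.
MSCK REPAIR TABLE data;

-- Then daily, after new files land under the new partition directory:
ALTER TABLE data ADD PARTITION (part='2018-07-30');
```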
07-27-2018 08:33 PM
Hello, I've run into a problem with a partitioned Hive external table when statistics are gathered automatically (hive.stats.autogather set to true). When a partition is added to the external table with hive.stats.autogather=true, the partition parameter numRows is initialized to 0. Because of this, a simple COUNT(*) always returns 0, answered from that statistic rather than by a MapReduce job, even with hive.fetch.task.conversion=none. Is there a way to let users run ADD PARTITION without initializing statistics on external tables, even when hive.stats.autogather is set to true?

Reproducible with:

HDFS files:

$ hdfs dfs -ls -R /user/gmanche/data/
drwx------ - gmanche hdfs 0 2018-07-27 17:18 /user/gmanche/data/part=1
-rw------- 3 gmanche hdfs 59 2018-07-27 17:18 /user/gmanche/data/part=1/data.csv
drwx------ - gmanche hdfs 0 2018-07-27 17:19 /user/gmanche/data/part=2
-rw------- 3 gmanche hdfs 39 2018-07-27 17:19 /user/gmanche/data/part=2/data.csv

HDFS file contents:

$ hdfs dfs -text /user/gmanche/data/part=1/data.csv
row1-val1,row1-val2
row2-val1,row2-val2
row3-val1,row3-val2
$ hdfs dfs -text /user/gmanche/data/part=2/data.csv
row4-val1,row4-val2
row5-val1,row5-val2

Hive SQL scripts:

- Initial situation: create the table, set hive.stats.autogather=true, count rows and select data

> DROP TABLE IF EXISTS data;
No rows affected (0.923 seconds)
> CREATE EXTERNAL TABLE data(col1 string, col2 string)
> PARTITIONED BY (part string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/user/gmanche/data';
No rows affected (0.174 seconds)
> SET hive.fetch.task.conversion;
+----------------------------------+--+
| set |
+----------------------------------+--+
| hive.fetch.task.conversion=none |
+----------------------------------+--+
1 row selected (0.028 seconds)
> SET hive.stats.autogather=true;
No rows affected (0.014 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition |
+------------+--+
+------------+--+
No rows selected (0.225 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.096 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO : Session is already open
INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
+-------+------+--+
| part | _c1 |
+-------+------+--+
+-------+------+--+
No rows selected (0.706 seconds)
> SELECT * FROM data;
INFO : Session is already open
INFO : Dag name: SELECT * FROM data(Stage-1)
+------------+------------+------------+--+
| data.col1 | data.col2 | data.part |
+------------+------------+------------+--+
+------------+------------+------------+--+
No rows selected (0.266 seconds)

- Adding the first partition: add partition one, show partitions, describe partition one, count rows and select data

> ALTER TABLE data ADD PARTITION (part='1');
No rows affected (0.194 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition |
+------------+--+
| part=1 |
+------------+--+
1 row selected (0.211 seconds)
> DESCRIBE EXTENDED data PARTITION (part='1');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707145}
...
10 rows selected (0.336 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.091 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO : Session is already open
INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+-------+------+--+
| part | _c1 |
+-------+------+--+
| 1 | 3 |
+-------+------+--+
1 row selected (2.791 seconds)
> SELECT * FROM data;
INFO : Session is already open
INFO : Dag name: SELECT * FROM data(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+------------+------------+------------+--+
| data.col1 | data.col2 | data.part |
+------------+------------+------------+--+
| row1-val1 | row1-val2 | 1 |
| row2-val1 | row2-val2 | 1 |
| row3-val1 | row3-val2 | 1 |
+------------+------------+------------+--+
3 rows selected (1.842 seconds)
- Adding the second partition: add partition two, show partitions, describe partition two, count rows and select data

> ALTER TABLE data ADD PARTITION (part='2');
No rows affected (0.261 seconds)
> SHOW PARTITIONS data;
+------------+--+
| partition |
+------------+--+
| part=1 |
| part=2 |
+------------+--+
2 rows selected (0.345 seconds)
> DESCRIBE EXTENDED data PARTITION (part='2');
...
parameters:{totalSize=0, rawDataSize=0, numRows=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1532707151}
...
10 rows selected (0.314 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.097 seconds)
> SELECT part, COUNT(*) FROM data GROUP BY part;
INFO : Session is already open
INFO : Dag name: SELECT part, COUNT(*) FROM data GROUP...part(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+-------+------+--+
| part | _c1 |
+-------+------+--+
| 1 | 3 |
| 2 | 2 |
+-------+------+--+
2 rows selected (4.814 seconds)
> SELECT * FROM data;
INFO : Session is already open
INFO : Dag name: SELECT * FROM data(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1532695374365_1009)
+------------+------------+------------+--+
| data.col1 | data.col2 | data.part |
+------------+------------+------------+--+
| row1-val1 | row1-val2 | 1 |
| row2-val1 | row2-val2 | 1 |
| row3-val1 | row3-val2 | 1 |
| row4-val1 | row4-val2 | 2 |
| row5-val1 | row5-val2 | 2 |
+------------+------------+------------+--+
5 rows selected (5.414 seconds)
We can see that after adding a partition, the partition parameter numRows is initialized to 0 and the SQL COUNT(*) returns 0. I get COUNT(*) = 0 even after setting hive.fetch.task.conversion=none.

> SET hive.fetch.task.conversion=none;
No rows affected (0.057 seconds)
> SELECT COUNT(*) FROM data;
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (0.153 seconds)

Counting with a GROUP BY does not use the statistics but runs on the Tez engine, so it returns the correct count. Note: with hive.stats.autogather=false no statistics are initialized, so there is no problem.

Gabriel
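Since the wrong COUNT(*) is answered from the metastore statistics rather than by scanning the data, a session-level way to force the count to run as a real job (my assumption here is that hive.compute.query.using.stats controls whether simple aggregates are answered directly from stats) would be:

```sql
-- Stop answering COUNT/MIN/MAX directly from metastore stats,
-- forcing the query to be executed by the engine (Tez/MR) instead.
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM data;
```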
Labels:
- Apache Hive