Created on 07-11-2015 05:21 AM - edited 09-16-2022 02:33 AM
Using Cloudera Express 5.4.1, I have created a Snappy compressed Avro Hive table partitioned by year, month, day, hour:
set hive.exec.compress.output=true; set avro.output.codec=snappy; CREATE EXTERNAL TABLE my_table PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs://localhost:8020/my-folder/' TBLPROPERTIES ('avro.schema.url'='hdfs://localhost/avro/my_table.avsc'); ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=1, hour=0); INVALIDATE METADATA;
Then using the impala-shell, I compute the stats for a partition:
REFRESH my_table; COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=1, hour=0);
The summary returns as follows:
+------------------------------------------+ | summary | +------------------------------------------+ | Updated 0 partition(s) and 46 column(s). | +------------------------------------------+
Then I run SHOW TABLE STATS my_table; I get the following:
+-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+ | year | month | day | hour | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | +-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+ | 2015 | 7 | 1 | 0 | -1 | 259 | 15.58GB | NOT CACHED | NOT CACHED | AVRO | false |
This is showing that the stats are not updated. Furthermore, when I run an Impala query and look at the profile I see the following:
WARNING: The following tables are missing relevant table and/or column statistics. default.my_table
I am not sure this is because my Avro files are SNAPPY compressed and Impala is unable to COMPUTE STATS on Hive tables with compressed Avro files.
Any help would be greatly appreciated - thanks!
Created 07-17-2015 12:11 AM
Thanks for the update. I can reproduce the issue, but only when the target partition is empty. As soon as I add some data, compute incremental stats works as expected.
So I'm still thinking you are hitting an edge case with an empty partition?
Created 07-13-2015 06:53 PM
I cannot say for sure what is going wrong here, but I suspect that you are hitting an edge case for an empty table/partition.
Have you tried the same thing with a non-empty table/partition?
Created 07-14-2015 06:39 AM
Created 07-14-2015 10:56 AM
Impala is able to handle snappy compressed Avro files, so I don't think that's the problem.
You may be hitting https://issues.apache.org/jira/browse/HIVE-6308 since you created the tables through Hive without column defiitions.
You could try to create the tables through Impala, or create them with Hive but with column definitions.
Created on 07-15-2015 11:39 AM - edited 07-15-2015 06:48 PM
The best solution I found was to run:
compute INCREMENTAL STATS my_table;
without specifying the partitions.
Created 07-16-2015 12:05 AM
My apologies, but I am losing track of the steps you followed to produce a good/bad outcome with compute stats and compute incremental stats.
Would you be able to list a series of steps that can reproduce the problem (on a non-empty partition)?
Created 07-16-2015 05:29 AM
Here are the steps:
Using beeline:
set hive.exec.compress.output=true; set avro.output.codec=snappy; CREATE EXTERNAL TABLE my_table PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint) STORED AS AVRO LOCATION 'hdfs://localhost:8020/my_data' TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/my_table.avsc');
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=15, hour=0);
Then using impala-shell:
INVALIDATE METADATA my_table; REFRESH my_table; COMPUTE INCREMENTAL STATS my_table;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 1 partition(s) and 46 column(s). |
+------------------------------------------+
If I try to compute incremental stats directly on the partition it does not update the partition stats:
DROP INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0);
COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0);
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 0 partition(s) and 46 column(s). |
+------------------------------------------+
Created 07-17-2015 12:11 AM
Thanks for the update. I can reproduce the issue, but only when the target partition is empty. As soon as I add some data, compute incremental stats works as expected.
So I'm still thinking you are hitting an edge case with an empty partition?
Created 07-15-2015 09:02 PM
just in case, did you try to run:
invalidate metadata;
statement?