Support Questions

Find answers, ask questions, and share your expertise

Impala won't update stats on Hive Avro table

avatar
Explorer

Using Cloudera Express 5.4.1, I have created a Snappy compressed Avro Hive table partitioned by year, month, day, hour:

 

set hive.exec.compress.output=true;
set avro.output.codec=snappy;
CREATE EXTERNAL TABLE my_table
PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://localhost:8020/my-folder/'
TBLPROPERTIES ('avro.schema.url'='hdfs://localhost/avro/my_table.avsc');
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=1, hour=0);
INVALIDATE METADATA;

Then using the impala-shell, I compute the stats for a partition:

 

REFRESH my_table;
COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=1, hour=0);

The summary returns as follows:

+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 0 partition(s) and 46 column(s). |
+------------------------------------------+

Then I run SHOW TABLE STATS my_table; I get the following:

 

+-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+
| year  | month | day | hour | #Rows    | #Files | Size     | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+
| 2015  | 7     | 1   | 0    | -1       | 259    | 15.58GB  | NOT CACHED   | NOT CACHED        | AVRO   | false             |

This is showing that the stats are not updated. Furthermore, when I run an Impala query and look at the profile I see the following:

 

WARNING: The following tables are missing relevant table and/or column statistics. default.my_table

 

I am not sure this is because my Avro files are SNAPPY compressed and Impala is unable to COMPUTE STATS on Hive tables with compressed Avro files.

 

Any help would be greatly appreciated - thanks!

1 ACCEPTED SOLUTION

avatar

Thanks for the update. I can reproduce the issue, but only when the target partition is empty. As soon as I add some data, compute incremental stats works as expected.

So I'm still thinking you are hitting an edge case with an empty partition?

 

View solution in original post

8 REPLIES 8

avatar

I cannot say for sure what is going wrong here, but I suspect that you are hitting an edge case for an empty table/partition.

Have you tried the same thing with a non-empty table/partition?

avatar
Explorer
The partitions I tested had several non-empty AVRO files in them. Do you think it could be because they are SNAPPY compressed?

avatar

Impala is able to handle snappy compressed Avro files, so I don't think that's the problem.

You may be hitting https://issues.apache.org/jira/browse/HIVE-6308 since you created the tables through Hive without column defiitions.

You could try to create the tables through Impala, or create them with Hive but with column definitions.

avatar
Explorer

The best solution I found was to run:

compute INCREMENTAL STATS my_table;

without specifying the partitions. 

avatar

My apologies, but I am losing track of the steps you followed to produce a good/bad outcome with compute stats and compute incremental stats.

Would you be able to list a series of steps that can reproduce the problem (on a non-empty partition)?

avatar
Explorer

Here are the steps:

 

Using beeline:

 

set hive.exec.compress.output=true;
set avro.output.codec=snappy;

CREATE EXTERNAL TABLE my_table
PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint)
STORED AS AVRO
LOCATION 'hdfs://localhost:8020/my_data'
TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/my_table.avsc');

ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=15, hour=0);

 

Then using impala-shell:

 

INVALIDATE METADATA my_table;
REFRESH my_table;

COMPUTE INCREMENTAL STATS my_table;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 1 partition(s) and 46 column(s). |
+------------------------------------------+

If I try to compute incremental stats directly on the partition it does not update the partition stats:

DROP INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0);
COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0);
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 0 partition(s) and 46 column(s). |
+------------------------------------------+

avatar

Thanks for the update. I can reproduce the issue, but only when the target partition is empty. As soon as I add some data, compute incremental stats works as expected.

So I'm still thinking you are hitting an edge case with an empty partition?

 

avatar
Rising Star

just in case, did you try to run:

invalidate metadata;

statement?