Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Impala won't update stats on Hive Avro table

Solved Go to solution
Highlighted

Impala won't update stats on Hive Avro table

New Contributor

Using Cloudera Express 5.4.1, I have created a Snappy compressed Avro Hive table partitioned by year, month, day, hour:

 

set hive.exec.compress.output=true;
set avro.output.codec=snappy;
CREATE EXTERNAL TABLE my_table
PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://localhost:8020/my-folder/'
TBLPROPERTIES ('avro.schema.url'='hdfs://localhost/avro/my_table.avsc');
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=1, hour=0);
INVALIDATE METADATA;

Then using the impala-shell, I compute the stats for a partition:

 

REFRESH my_table;
COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=1, hour=0);

The summary returns as follows:

+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 0 partition(s) and 46 column(s). |
+------------------------------------------+

Then I run SHOW TABLE STATS my_table; I get the following:

 

+-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+
| year  | month | day | hour | #Rows    | #Files | Size     | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+
| 2015  | 7     | 1   | 0    | -1       | 259    | 15.58GB  | NOT CACHED   | NOT CACHED        | AVRO   | false             |

This is showing that the stats are not updated. Furthermore, when I run an Impala query and look at the profile I see the following:

 

WARNING: The following tables are missing relevant table and/or column statistics. default.my_table

 

I am not sure this is because my Avro files are SNAPPY compressed and Impala is unable to COMPUTE STATS on Hive tables with compressed Avro files.

 

Any help would be greatly appreciated - thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: Impala won't update stats on Hive Avro table

Master Collaborator

Thanks for the update. I can reproduce the issue, but only when the target partition is empty. As soon as I add some data, compute incremental stats works as expected.

So I'm still thinking you are hitting an edge case with an empty partition?

 

View solution in original post

8 REPLIES 8
Highlighted

Re: Impala won't update stats on Hive Avro table

Master Collaborator

I cannot say for sure what is going wrong here, but I suspect that you are hitting an edge case for an empty table/partition.

Have you tried the same thing with a non-empty table/partition?

Highlighted

Re: Impala won't update stats on Hive Avro table

New Contributor
The partitions I tested had several non-empty AVRO files in them. Do you think it could be because they are SNAPPY compressed?
Highlighted

Re: Impala won't update stats on Hive Avro table

Master Collaborator

Impala is able to handle snappy compressed Avro files, so I don't think that's the problem.

You may be hitting https://issues.apache.org/jira/browse/HIVE-6308 since you created the tables through Hive without column defiitions.

You could try to create the tables through Impala, or create them with Hive but with column definitions.

Highlighted

Re: Impala won't update stats on Hive Avro table

New Contributor

The best solution I found was to run:

compute INCREMENTAL STATS my_table;

without specifying the partitions. 

Highlighted

Re: Impala won't update stats on Hive Avro table

Master Collaborator

My apologies, but I am losing track of the steps you followed to produce a good/bad outcome with compute stats and compute incremental stats.

Would you be able to list a series of steps that can reproduce the problem (on a non-empty partition)?

Highlighted

Re: Impala won't update stats on Hive Avro table

New Contributor

Here are the steps:

 

Using beeline:

 

set hive.exec.compress.output=true;
set avro.output.codec=snappy;

CREATE EXTERNAL TABLE my_table
PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint)
STORED AS AVRO
LOCATION 'hdfs://localhost:8020/my_data'
TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/my_table.avsc');

ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=15, hour=0);

 

Then using impala-shell:

 

INVALIDATE METADATA my_table;
REFRESH my_table;

COMPUTE INCREMENTAL STATS my_table;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 1 partition(s) and 46 column(s). |
+------------------------------------------+

If I try to compute incremental stats directly on the partition it does not update the partition stats:

DROP INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0);
COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0);
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 0 partition(s) and 46 column(s). |
+------------------------------------------+
Highlighted

Re: Impala won't update stats on Hive Avro table

Master Collaborator

Thanks for the update. I can reproduce the issue, but only when the target partition is empty. As soon as I add some data, compute incremental stats works as expected.

So I'm still thinking you are hitting an edge case with an empty partition?

 

View solution in original post

Re: Impala won't update stats on Hive Avro table

Rising Star

just in case, did you try to run:

invalidate metadata;

statement?

Don't have an account?
Coming from Hortonworks? Activate your account here