About kylebush

kylebush · ‎07-16-2015

Here are the steps: Using beeline: set hive.exec.compress.output=true; set avro.output.codec=snappy; CREATE EXTERNAL TABLE my_table PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint) STORED AS AVRO LOCATION 'hdfs://localhost:8020/my_data' TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/my_table.avsc'); ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=15, hour=0); Then using impala-shell: INVALIDATE METADATA my_table; REFRESH my_table; COMPUTE INCREMENTAL STATS my_table; +------------------------------------------+ | summary | +------------------------------------------+ | Updated 1 partition(s) and 46 column(s). | +------------------------------------------+ If I try to compute incremental stats directly on the partition it does not update the partition stats: DROP INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0); COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0); +------------------------------------------+ | summary | +------------------------------------------+ | Updated 0 partition(s) and 46 column(s). | +------------------------------------------+

kylebush · ‎07-15-2015

The best solution I found was to run: compute INCREMENTAL STATS my_table; without specifying the partitions.

kylebush · ‎07-14-2015

The partitions I tested had several non-empty AVRO files in them. Do you think it could be because they are SNAPPY compressed?

kylebush · ‎07-11-2015

Using Cloudera Express 5.4.1, I have created a Snappy compressed Avro Hive table partitioned by year, month, day, hour: set hive.exec.compress.output=true; set avro.output.codec=snappy; CREATE EXTERNAL TABLE my_table PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs://localhost:8020/my-folder/' TBLPROPERTIES ('avro.schema.url'='hdfs://localhost/avro/my_table.avsc'); ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=1, hour=0); INVALIDATE METADATA; Then using the impala-shell, I compute the stats for a partition: REFRESH my_table; COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=1, hour=0); The summary returns as follows: +------------------------------------------+ | summary | +------------------------------------------+ | Updated 0 partition(s) and 46 column(s). | +------------------------------------------+ Then I run SHOW TABLE STATS my_table; I get the following: +-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+ | year | month | day | hour | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | +-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+ | 2015 | 7 | 1 | 0 | -1 | 259 | 15.58GB | NOT CACHED | NOT CACHED | AVRO | false | This is showing that the stats are not updated. Furthermore, when I run an Impala query and look at the profile I see the following: WARNING: The following tables are missing relevant table and/or column statistics. default.my_table I am not sure this is because my Avro files are SNAPPY compressed and Impala is unable to COMPUTE STATS on Hive tables with compressed Avro files. Any help would be greatly appreciated - thanks!

Online	Offline
Last Visited	‎12-20-2016 06:26 PM

Member Since	‎07-11-2015 05:18 AM
Last Visited	‎12-20-2016 06:26 PM
Posts	5

Cloudera Community

Re: Impala won't update stats on Hive Avro table

Re: Impala won't update stats on Hive Avro table

Re: Impala won't update stats on Hive Avro table

Impala won't update stats on Hive Avro table