Member since
07-11-2015
5
Posts
0
Kudos Received
0
Solutions
07-16-2015
05:29 AM
Here are the steps: Using beeline: set hive.exec.compress.output=true;
set avro.output.codec=snappy;
CREATE EXTERNAL TABLE my_table
PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint)
STORED AS AVRO
LOCATION 'hdfs://localhost:8020/my_data'
TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/my_table.avsc'); ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=15, hour=0); Then using impala-shell: INVALIDATE METADATA my_table;
REFRESH my_table;
COMPUTE INCREMENTAL STATS my_table; +------------------------------------------+ | summary | +------------------------------------------+ | Updated 1 partition(s) and 46 column(s). | +------------------------------------------+ If I try to compute incremental stats directly on the partition it does not update the partition stats: DROP INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0); COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=15, hour=0); +------------------------------------------+ | summary | +------------------------------------------+ | Updated 0 partition(s) and 46 column(s). | +------------------------------------------+
... View more
07-15-2015
11:39 AM
The best solution I found was to run: compute INCREMENTAL STATS my_table; without specifying the partitions.
... View more
07-14-2015
06:39 AM
The partitions I tested had several non-empty AVRO files in them. Do you think it could be because they are SNAPPY compressed?
... View more
07-11-2015
05:21 AM
Using Cloudera Express 5.4.1, I have created a Snappy compressed Avro Hive table partitioned by year, month, day, hour: set hive.exec.compress.output=true;
set avro.output.codec=snappy;
CREATE EXTERNAL TABLE my_table
PARTITIONED BY (year smallint, month tinyint, day tinyint, hour tinyint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://localhost:8020/my-folder/'
TBLPROPERTIES ('avro.schema.url'='hdfs://localhost/avro/my_table.avsc');
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2015, month=7, day=1, hour=0);
INVALIDATE METADATA; Then using the impala-shell, I compute the stats for a partition: REFRESH my_table;
COMPUTE INCREMENTAL STATS my_table PARTITION (year=2015, month=7, day=1, hour=0); The summary returns as follows: +------------------------------------------+
| summary |
+------------------------------------------+
| Updated 0 partition(s) and 46 column(s). |
+------------------------------------------+ Then I run SHOW TABLE STATS my_table; I get the following: +-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+
| year | month | day | hour | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+-----+------+----------+--------+----------+--------------+-------------------+--------+-------------------+
| 2015 | 7 | 1 | 0 | -1 | 259 | 15.58GB | NOT CACHED | NOT CACHED | AVRO | false | This is showing that the stats are not updated. Furthermore, when I run an Impala query and look at the profile I see the following: WARNING: The following tables are missing relevant table and/or column statistics. default.my_table I am not sure this is because my Avro files are SNAPPY compressed and Impala is unable to COMPUTE STATS on Hive tables with compressed Avro files. Any help would be greatly appreciated - thanks!
... View more
Labels:
- Labels:
-
Apache Hadoop
-
Apache Hive
-
Apache Impala
-
HDFS