In my work, I import AVRO files into impala tables by copy the files into HDFS then execute "refresh <table>" in impala.
But when I want to do it with compressed files, it didn't work. Here is what I do:
1. Enable Hive compress in hive shell:
hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=bzip2;
2. Create a table:
CREATE TABLE `log_bzip2`(
`timestamp` bigint COMMENT 'from deserializer',
`appid` string COMMENT 'from deserializer',
`clientid` string COMMENT 'from deserializer',
`statkey` string COMMENT 'from deserializer',
`expid` string COMMENT 'from deserializer',
`modid` string COMMENT 'from deserializer',
`value` double COMMENT 'from deserializer',
`summary` string COMMENT 'from deserializer',
`custom` string COMMENT 'from deserializer')
PARTITIONED BY (
`day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs://szq2.appadhoc.com:8020/user/hive/log.avsc');
3. Load the compressed AVRO file into HDFS:
hdfs dfs -put log.2016-03-07.1457184357726.avro.bz2 /user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07
4. Add partition and refresh in Impala shell:
alter table log_bzip2 add partition (day="2016-03-07") location '/user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07/';
refresh log_bzip2;
5. Query it but not work:
select * from log_bzip2 limit 10;
Query: select * from log_bzip2 limit 10
WARNINGS: Invalid AVRO_VERSION_HEADER: '42 5a 68 39 '
How can I do it right? Thanks!