Created on 03-26-2018 07:21 PM - edited 09-16-2022 06:02 AM
When I create an impala/hive table over an hbase table, the columns in Impala appear in alphabetical order instead of as defined. Not a blocker, but really annoying and might become an issue down the road.
Anyone know what could be happening? We're on CDH 5.13, thanks.
Defined in hbase with:
create 'irdw_sandbox:date_dim', {NAME => 'mcf', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'true', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'PREFIX_TREE', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
Defined in beehive with:
CREATE EXTERNAL TABLE irdw_sandbox.hbase_date_dim ( key STRING, id INT, sqldate TIMESTAMP, year INT, quarter_of_year INT, month_of_year INT, week_of_year INT, day_of_year INT, day_name STRING, day_of_week INT, day_of_month INT, day_type STRING, month_name STRING, week_of_year_mtos INT ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos" ) TBLPROPERTIES("hbase.table.name" = "irdw_sandbox:date_dim")
and a show create table in hive looks fine:
CREATE EXTERNAL TABLE `hbase_date_dim`( `key` string COMMENT 'from deserializer', `id` int COMMENT 'from deserializer', `sqldate` timestamp COMMENT 'from deserializer', `year` int COMMENT 'from deserializer', `quarter_of_year` int COMMENT 'from deserializer', `month_of_year` int COMMENT 'from deserializer', `week_of_year` int COMMENT 'from deserializer', `day_of_year` int COMMENT 'from deserializer', `day_name` string COMMENT 'from deserializer', `day_of_week` int COMMENT 'from deserializer', `day_of_month` int COMMENT 'from deserializer', `day_type` string COMMENT 'from deserializer', `month_name` string COMMENT 'from deserializer', `week_of_year_mtos` int COMMENT 'from deserializer') ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos', 'serialization.format'='1') TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'hbase.table.name'='irdw_sandbox:date_dim', 'numFiles'='0', 'numRows'='-1', 'rawDataSize'='-1', 'totalSize'='0',
but a show create table in impala (after invalidate metadata to recognize the new table) doesn't:
CREATE EXTERNAL TABLE irdw_sandbox.hbase_date_dim ( key STRING, day_name STRING, day_of_month INT, day_of_week INT, day_of_year INT, day_type STRING, id INT, month_name STRING, month_of_year INT, quarter_of_year INT, sqldate TIMESTAMP, week_of_year INT, week_of_year_mtos INT, year INT ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping'=':key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos', 'serialization.format'='1') TBLPROPERTIES ('COLUMN_STATS_ACCURATE'='false', 'hbase.table.name'='irdw_sandbox:date_dim', 'numFiles'='0', 'numRows'='-1', 'rawDataSize'='-1', 'storage_handler'='org.apache.hadoop.hive.hbase.HBaseStorageHandler', 'totalSize'='0')
Created 03-27-2018 10:58 AM
I think this is an area where the mismatch between HBase's data model and the traditional relational model cause some weirdness - from what I understand, HBase doesn't have a idea of an arbitrary order of columns in a table - I think Impala is not preserving the creation order when reloading the table.
There are some known issues in this area, maybe we could do better: https://issues.apache.org/jira/browse/IMPALA-886
Created 03-27-2018 10:58 AM
I think this is an area where the mismatch between HBase's data model and the traditional relational model cause some weirdness - from what I understand, HBase doesn't have a idea of an arbitrary order of columns in a table - I think Impala is not preserving the creation order when reloading the table.
There are some known issues in this area, maybe we could do better: https://issues.apache.org/jira/browse/IMPALA-886
Created 03-27-2018 01:57 PM