Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Impala-over-hbase table has columns in alphabetical order instead of as specified

avatar
Rising Star

When I create an impala/hive table over an hbase table, the columns in Impala appear in alphabetical order instead of as defined.  Not a blocker, but really annoying and might become an issue down the road. 

 

Anyone know what could be happening? We're on CDH 5.13, thanks.

 

 

Defined in hbase with:

 

create 'irdw_sandbox:date_dim', {NAME => 'mcf', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'true', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'PREFIX_TREE', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}

Defined in beehive with:

 

CREATE EXTERNAL TABLE irdw_sandbox.hbase_date_dim (
key STRING,
id INT,
sqldate TIMESTAMP,
year INT,
quarter_of_year INT,
month_of_year INT,
week_of_year INT,
day_of_year INT,
day_name STRING,
day_of_week INT,
day_of_month INT,
day_type STRING,
month_name STRING,
week_of_year_mtos INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos"
)
TBLPROPERTIES("hbase.table.name" = "irdw_sandbox:date_dim")

 

and a show create table in hive looks fine:

 

CREATE EXTERNAL TABLE `hbase_date_dim`(
  `key` string COMMENT 'from deserializer', 
  `id` int COMMENT 'from deserializer', 
  `sqldate` timestamp COMMENT 'from deserializer', 
  `year` int COMMENT 'from deserializer', 
  `quarter_of_year` int COMMENT 'from deserializer', 
  `month_of_year` int COMMENT 'from deserializer', 
  `week_of_year` int COMMENT 'from deserializer', 
  `day_of_year` int COMMENT 'from deserializer', 
  `day_name` string COMMENT 'from deserializer', 
  `day_of_week` int COMMENT 'from deserializer', 
  `day_of_month` int COMMENT 'from deserializer', 
  `day_type` string COMMENT 'from deserializer', 
  `month_name` string COMMENT 'from deserializer', 
  `week_of_year_mtos` int COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos', 
  'serialization.format'='1')
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'hbase.table.name'='irdw_sandbox:date_dim', 
  'numFiles'='0', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='0', 

but a show create table in impala (after invalidate metadata to recognize the new table) doesn't:

 

CREATE EXTERNAL TABLE irdw_sandbox.hbase_date_dim (
   key STRING,   
   day_name STRING,   
   day_of_month INT,   
   day_of_week INT,   
   day_of_year INT,   
   day_type STRING,   
   id INT,   
   month_name STRING,   
   month_of_year INT,   
   quarter_of_year INT,   
   sqldate TIMESTAMP,   
   week_of_year INT,   
   week_of_year_mtos INT,   
   year INT 
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ('hbase.columns.mapping'=':key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos', 'serialization.format'='1') 
TBLPROPERTIES ('COLUMN_STATS_ACCURATE'='false', 'hbase.table.name'='irdw_sandbox:date_dim', 'numFiles'='0', 'numRows'='-1', 'rawDataSize'='-1', 'storage_handler'='org.apache.hadoop.hive.hbase.HBaseStorageHandler', 'totalSize'='0')

 

1 ACCEPTED SOLUTION

avatar

I think this is an area where the mismatch between HBase's data model and the traditional relational model cause some weirdness - from what I understand, HBase doesn't have a idea of an arbitrary order of columns in a table - I think Impala is not preserving the creation order when reloading the table.

 

There are some known issues in this area, maybe we could do better: https://issues.apache.org/jira/browse/IMPALA-886

View solution in original post

2 REPLIES 2

avatar

I think this is an area where the mismatch between HBase's data model and the traditional relational model cause some weirdness - from what I understand, HBase doesn't have a idea of an arbitrary order of columns in a table - I think Impala is not preserving the creation order when reloading the table.

 

There are some known issues in this area, maybe we could do better: https://issues.apache.org/jira/browse/IMPALA-886

avatar
Rising Star
Thanks very much Tim for looking up the JIRA. Yikes, been open since 2014. As John pointed out there, column order info must be in the metastore since hive show create table displays fine, so seems like this should be a simple change to how impala reads that info. Upvoted the JIRA.