Created on 01-08-2016 11:54 AM - edited 09-16-2022 02:56 AM
We have parquet files with relatively deeply nested structures (up to 4-5 levels) and map them to external tables in hive/impala.
The default hive behaviour is to reference fields by their position (index) in the table definition; it doesn't matter how you name a column. If a column is first in your table, hive maps it to the first field in your parquet file. This is bad for schema evolution and for cases where you want to skip some parquet fields.
By setting parquet.column.index.access=false you can make hive reference fields by name instead. This supports schema evolution (for example, adding new fields to your parquet files somewhere other than at the end of the schema) and other neat tricks.
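As a minimal sketch of the difference between the two modes (the table names and the /data/example path are hypothetical):

-- assume a parquet file at /data/example written with two fields: a, b (hypothetical)

-- default (by position): the first table column reads the first parquet field,
-- so "renamed" returns the values of parquet field "a" regardless of its name
create external table t_by_index (renamed string, b string)
stored as parquet location '/data/example';

-- by name: column order no longer matters and fields can be skipped
set parquet.column.index.access=false;
create external table t_by_name (b string)
stored as parquet location '/data/example';
select b from t_by_name;  -- returns values of parquet field "b"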
Here is how to use this:
The full schema of our parquet files starts with these columns:
create external table taps_2015 (
  header struct<
    guid : string,
    created_at : bigint,
    storage_id : string,
    ...
Let's map a subset of this schema in hive:
$ beeline
!connect jdbc:hive2://namenode001.stg.nj1.tapad.com:10000/one_off oiepishkin xyz org.apache.hive.jdbc.HiveDriver

set parquet.column.index.access=false;

-- map columns in the same order as they are defined in the parquet files
create external table one_off.taps_test1 (
  header struct<
    guid : string,
    created_at : bigint
  >
)
stored as parquet
location '/com/tapad/parquet/event/tap/2016/01/01/00';

-- let's skip the guid field and define storage_id first
create external table one_off.taps_test2 (
  header struct<
    storage_id : string,
    created_at : bigint
  >
)
stored as parquet
location '/com/tapad/parquet/event/tap/2016/01/01/00';

-- let's query the first table
select header.guid from one_off.taps_test1 limit 5;
+---------------------------------------+--+
| guid                                  |
+---------------------------------------+--+
| f1d4df16-b020-11e5-968d-d0431eb4403b  |
| 7c256415-b022-11e5-aeed-4c7625164c8e  |
| b8761f30-b01c-11e5-b269-005056a279d9  |
| da75dbe5-b01f-11e5-9671-d0431eaa95c8  |
| e258bfd5-b01f-11e5-a096-ecf4bbcd0dc0  |
+---------------------------------------+--+
5 rows selected (18.044 seconds)

-- and the second table
select header.storage_id from one_off.taps_test2 limit 5;
+---------------------------------------+--+
| storage_id                            |
+---------------------------------------+--+
| Kzr9afR++MEAOQbpNNIXeW3vUhs=          |
| 17a4535b-75da-4cf4-a896-cb1d332e4e26  |
| y6b265lGrugom5Ool/icLpl+YIA=          |
| NULL                                  |
| NULL                                  |
+---------------------------------------+--+
So everything works as expected here: hive references the fields in the parquet files by name, and the two queries return different results.
Now let's query the same tables in Impala (we use hive metastore).
$ impala-shell -i localhost -d one_off
Connected to localhost:21000
Server version: impalad version 2.3.0-cdh5.5.1 RELEASE (build 73bf5bc5afbb47aa7eab06cfbf6023ba8cb74f3c)
***********************************************************************************
[localhost:21000] > invalidate metadata;
Query: invalidate metadata
Fetched 0 row(s) in 3.78s
[localhost:21000] > show tables;
Query: show tables
+---------------------+
| name                |
+---------------------+
| taps_test1          |
| taps_test2          |
+---------------------+
Fetched 2 row(s) in 0.01s
[localhost:21000] > describe one_off.taps_test1;
Query: describe one_off.taps_test1
+--------+---------------------+---------+
| name   | type                | comment |
+--------+---------------------+---------+
| header | struct<             |         |
|        |   guid:string,      |         |
|        |   created_at:bigint |         |
|        | >                   |         |
+--------+---------------------+---------+
Fetched 1 row(s) in 0.02s
[localhost:21000] > select header.guid from one_off.taps_test1 limit 5;
Query: select header.guid from one_off.taps_test1 limit 5
+--------------------------------------+
| header.guid                          |
+--------------------------------------+
| f1d4df16-b020-11e5-968d-d0431eb4403b |
| 7c256415-b022-11e5-aeed-4c7625164c8e |
| b8761f30-b01c-11e5-b269-005056a279d9 |
| da75dbe5-b01f-11e5-9671-d0431eaa95c8 |
| e258bfd5-b01f-11e5-a096-ecf4bbcd0dc0 |
+--------------------------------------+
Fetched 5 row(s) in 0.11s
[localhost:21000] > describe one_off.taps_test2;
Query: describe one_off.taps_test2
+--------+----------------------+---------+
| name   | type                 | comment |
+--------+----------------------+---------+
| header | struct<              |         |
|        |   storage_id:string, |         |
|        |   created_at:bigint  |         |
|        | >                    |         |
+--------+----------------------+---------+
Fetched 1 row(s) in 0.01s
[localhost:21000] > select header.storage_id from one_off.taps_test2 limit 5;
Query: select header.storage_id from one_off.taps_test2 limit 5
+--------------------------------------+
| header.storage_id                    |
+--------------------------------------+
| f1d4df16-b020-11e5-968d-d0431eb4403b |
| 7c256415-b022-11e5-aeed-4c7625164c8e |
| b8761f30-b01c-11e5-b269-005056a279d9 |
| da75dbe5-b01f-11e5-9671-d0431eaa95c8 |
| e258bfd5-b01f-11e5-a096-ecf4bbcd0dc0 |
+--------------------------------------+
Fetched 5 row(s) in 0.06s
Both queries return the same results. This is wrong: in the storage_id query we are seeing values from the guid field.
Are there any workarounds for Impala?
If there is no ticket for this in impala's jira, let me know and I'll create one.
Created 01-11-2016 10:44 AM
Here is a simpler way to reproduce this in impala. I'm going to use impala to create a parquet file and then map an external table to that file.
$ impala-shell -i localhost -d one_off
[localhost:21000] > create table parquet_table (field1 string, field2 string) stored as parquet;
Query: create table parquet_table (field1 string, field2 string) stored as parquet
Fetched 0 row(s) in 0.14s
[localhost:21000] > insert into parquet_table values (('f1', 'f2'));
Query: insert into parquet_table values (('f1', 'f2'))
Inserted 1 row(s) in 4.89s
[localhost:21000] > select * from parquet_table;
Query: select * from parquet_table
+--------+--------+
| field1 | field2 |
+--------+--------+
| f1     | f2     |
+--------+--------+
Fetched 1 row(s) in 0.26s

-- I went to the hive metastore and found the location of the data in hdfs:
-- /user/hive/warehouse/one_off.db/parquet_table

[localhost:21000] > create external table parquet_subset (field2 string) stored as parquet location '/user/hive/warehouse/one_off.db/parquet_table';
Query: create external table parquet_subset (field2 string) stored as parquet location '/user/hive/warehouse/one_off.db/parquet_table'
Fetched 0 row(s) in 0.17s
[localhost:21000] > select * from parquet_subset;
Query: select * from parquet_subset
+--------+
| field2 |
+--------+
| f1     |
+--------+
Fetched 1 row(s) in 4.01s
-- how do I create the parquet_subset table so that its field2 column is mapped to the field2 column of the parquet file?
Created 01-12-2016 07:06 PM
Hi!
I'm afraid that Impala currently does not support resolving the table schema to the file metadata by name - it resolves only by index. There are tradeoffs between the two approaches, e.g., in which ALTER TABLE operations each one allows.
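To make that tradeoff concrete, here is a hypothetical sketch (the events table is made up) of what index-based resolution implies for schema evolution:

-- with index-based resolution the Nth table column always reads the Nth parquet field

-- safe: appending a column at the end; files written before the change
-- simply return NULL for it
alter table events add columns (new_field string);

-- risky: dropping or reordering columns shifts the index mapping, so the
-- remaining columns silently start reading a neighbouring parquet field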
There's an old but still relevant and interesting discussion on this topic here:
https://issues.cloudera.org/browse/IMPALA-779
For the time being, there is no way to make resolution by name work, but it's a frequently requested feature that we will consider adding.
Alex
Created on 01-12-2016 07:14 PM - edited 01-12-2016 07:25 PM
Thanks for the reply! Resolving by index makes schema evolution expensive to maintain:
1. We have to make sure that all our engineers always add new fields at the end of the records in our avro schemas.
2. To map our historical data in impala, we would have to create a separate table for each avro schema version where this requirement was not met.
3. For one-off tasks, it would be nice to be able to create a small table with just a few columns mapped to a parquet file that has lots of fields.
I've created a new ticket that duplicates the one you mentioned (my bad - I couldn't find it):
Created 01-12-2016 09:59 PM
Those are all valid and understandable points, thanks for filing the JIRA and your feedback!
Created 04-13-2017 12:40 PM
Since CDH 5.8 you can use
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=1;
to look up columns within Parquet files by column name.
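For example, against the taps_test2 table from the beginning of this thread, the option can be set per session (a sketch; the symbolic value NAME should also be accepted in place of 1):

[localhost:21000] > set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
[localhost:21000] > select header.storage_id from one_off.taps_test2 limit 5;
-- header.storage_id is now resolved by name, so this returns storage_id
-- values instead of the guid values shown earlier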
For more, see: