Created on 01-08-2016 11:54 AM - edited 09-16-2022 02:56 AM
We have parquet files with relatively deeply nested structures (up to 4-5 levels) and map them to external tables in hive/impala.
The default hive behaviour is to reference fields by their position (index) in the table definition; it doesn't matter how you name a column. If a column is first in your table, hive maps it to the first field in your parquet file. This is bad for schema evolution and for cases where you want to skip some parquet fields.
By setting parquet.column.index.access=false you can make hive reference fields by name instead. This supports schema evolution (for example, adding new fields to your parquet files somewhere other than at the end of the schema) and other neat tricks.
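As a minimal sketch of the difference between the two modes (the table names and the /data/example path are hypothetical):

-- assume a parquet file at /data/example written with two fields: a, b (hypothetical)

-- default (by position): the first table column reads the first parquet field,
-- so "renamed" returns the values of parquet field "a" regardless of its name
create external table t_by_index (renamed string, b string)
stored as parquet location '/data/example';

-- by name: column order no longer matters and fields can be skipped
set parquet.column.index.access=false;
create external table t_by_name (b string)
stored as parquet location '/data/example';
select b from t_by_name;  -- returns values of parquet field "b"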
Here is how to use this:
The full schema of our parquet files starts with these columns:
create external table taps_2015 (
  header struct<
    guid : string,
    created_at : bigint,
    storage_id : string,
    ...
Let's map a subset of this schema in hive:
$ beeline
!connect jdbc:hive2://namenode001.stg.nj1.tapad.com:10000/one_off oiepishkin xyz org.apache.hive.jdbc.HiveDriver

set parquet.column.index.access=false;

-- map columns in the same order as they are defined in the parquet files
create external table one_off.taps_test1 (
  header struct<
    guid : string,
    created_at : bigint
  >
)
stored as parquet
location '/com/tapad/parquet/event/tap/2016/01/01/00';

-- let's skip the guid field and define storage_id first
create external table one_off.taps_test2 (
  header struct<
    storage_id : string,
    created_at : bigint
  >
)
stored as parquet
location '/com/tapad/parquet/event/tap/2016/01/01/00';

-- let's query the first table
select header.guid from one_off.taps_test1 limit 5;
+---------------------------------------+--+
| guid                                  |
+---------------------------------------+--+
| f1d4df16-b020-11e5-968d-d0431eb4403b  |
| 7c256415-b022-11e5-aeed-4c7625164c8e  |
| b8761f30-b01c-11e5-b269-005056a279d9  |
| da75dbe5-b01f-11e5-9671-d0431eaa95c8  |
| e258bfd5-b01f-11e5-a096-ecf4bbcd0dc0  |
+---------------------------------------+--+
5 rows selected (18.044 seconds)

-- and the second table
select header.storage_id from one_off.taps_test2 limit 5;
+---------------------------------------+--+
| storage_id                            |
+---------------------------------------+--+
| Kzr9afR++MEAOQbpNNIXeW3vUhs=          |
| 17a4535b-75da-4cf4-a896-cb1d332e4e26  |
| y6b265lGrugom5Ool/icLpl+YIA=          |
| NULL                                  |
| NULL                                  |
+---------------------------------------+--+
So everything works as expected here: hive references the fields in the parquet files by name, and the two queries return different results.
Now let's query the same tables in Impala (we use hive metastore).
$ impala-shell -i localhost -d one_off
Connected to localhost:21000
Server version: impalad version 2.3.0-cdh5.5.1 RELEASE (build 73bf5bc5afbb47aa7eab06cfbf6023ba8cb74f3c)
***********************************************************************************
[localhost:21000] > invalidate metadata;
Query: invalidate metadata
Fetched 0 row(s) in 3.78s
[localhost:21000] > show tables;
Query: show tables
+---------------------+
| name                |
+---------------------+
| taps_test1          |
| taps_test2          |
+---------------------+
Fetched 2 row(s) in 0.01s
[localhost:21000] > describe one_off.taps_test1;
Query: describe one_off.taps_test1
+--------+---------------------+---------+
| name   | type                | comment |
+--------+---------------------+---------+
| header | struct<             |         |
|        |   guid:string,      |         |
|        |   created_at:bigint |         |
|        | >                   |         |
+--------+---------------------+---------+
Fetched 1 row(s) in 0.02s
[localhost:21000] > select header.guid from one_off.taps_test1 limit 5;
Query: select header.guid from one_off.taps_test1 limit 5
+--------------------------------------+
| header.guid                          |
+--------------------------------------+
| f1d4df16-b020-11e5-968d-d0431eb4403b |
| 7c256415-b022-11e5-aeed-4c7625164c8e |
| b8761f30-b01c-11e5-b269-005056a279d9 |
| da75dbe5-b01f-11e5-9671-d0431eaa95c8 |
| e258bfd5-b01f-11e5-a096-ecf4bbcd0dc0 |
+--------------------------------------+
Fetched 5 row(s) in 0.11s
[localhost:21000] > describe one_off.taps_test2;
Query: describe one_off.taps_test2
+--------+----------------------+---------+
| name   | type                 | comment |
+--------+----------------------+---------+
| header | struct<              |         |
|        |   storage_id:string, |         |
|        |   created_at:bigint  |         |
|        | >                    |         |
+--------+----------------------+---------+
Fetched 1 row(s) in 0.01s
[localhost:21000] > select header.storage_id from one_off.taps_test2 limit 5;
Query: select header.storage_id from one_off.taps_test2 limit 5
+--------------------------------------+
| header.storage_id                    |
+--------------------------------------+
| f1d4df16-b020-11e5-968d-d0431eb4403b |
| 7c256415-b022-11e5-aeed-4c7625164c8e |
| b8761f30-b01c-11e5-b269-005056a279d9 |
| da75dbe5-b01f-11e5-9671-d0431eaa95c8 |
| e258bfd5-b01f-11e5-a096-ecf4bbcd0dc0 |
+--------------------------------------+
Fetched 5 row(s) in 0.06s
Both queries return the same results. This is wrong: in the storage_id query we are seeing values from the guid field.
Are there any workarounds for Impala?
If there is no ticket for this in impala's jira, let me know and I'll create one.
Created 01-11-2016 10:44 AM
Here is a simpler way to reproduce this in impala. I'm going to use impala to create a parquet file and then map an external table to that file.
$ impala-shell -i localhost -d one_off
[localhost:21000] > create table parquet_table (field1 string, field2 string) stored as parquet;
Query: create table parquet_table (field1 string, field2 string) stored as parquet
Fetched 0 row(s) in 0.14s
[localhost:21000] > insert into parquet_table values (('f1', 'f2'));
Query: insert into parquet_table values (('f1', 'f2'))
Inserted 1 row(s) in 4.89s
[localhost:21000] > select * from parquet_table;
Query: select * from parquet_table
+--------+--------+
| field1 | field2 |
+--------+--------+
| f1     | f2     |
+--------+--------+
Fetched 1 row(s) in 0.26s

-- I went to the hive metastore and found the location of the data in hdfs:
-- /user/hive/warehouse/one_off.db/parquet_table

[localhost:21000] > create external table parquet_subset (field2 string) stored as parquet location '/user/hive/warehouse/one_off.db/parquet_table';
Query: create external table parquet_subset (field2 string) stored as parquet location '/user/hive/warehouse/one_off.db/parquet_table'
Fetched 0 row(s) in 0.17s
[localhost:21000] > select * from parquet_subset;
Query: select * from parquet_subset
+--------+
| field2 |
+--------+
| f1     |
+--------+
Fetched 1 row(s) in 4.01s
-- how do I create the parquet_subset table so that its field2 column is mapped to the field2 column of the parquet file?
Created 01-12-2016 07:06 PM
Hi!
I'm afraid that Impala currently does not support resolving the table schema to the file metadata by name - it resolves only by index. There are tradeoffs between the two approaches, e.g., in which ALTER TABLE operations each one allows.
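To make that tradeoff concrete, here is a hypothetical sketch (the events table is made up) of what index-based resolution implies for schema evolution:

-- with index-based resolution the Nth table column always reads the Nth parquet field

-- safe: appending a column at the end; files written before the change
-- simply return NULL for it
alter table events add columns (new_field string);

-- risky: dropping or reordering columns shifts the index mapping, so the
-- remaining columns silently start reading a neighbouring parquet field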
There's an old but still relevant and interesting discussion on this topic here:
https://issues.cloudera.org/browse/IMPALA-779
For the time being, there is no way to make resolution by name work, but it's a frequently requested feature that we will consider adding.
Alex
Created on 01-12-2016 07:14 PM - edited 01-12-2016 07:25 PM
Thanks for the reply! Resolving by index makes schema evolution expensive to maintain:
1. We have to make sure that all our engineers always add new fields at the end of the records in our avro schemas.
2. To map our historical data in impala, we would have to create a separate table for each avro schema version where this requirement was not met.
3. For one-off tasks, it would be nice to be able to create a small table with just a few columns mapped to a parquet file that has lots of fields.
I've created a new ticket that duplicates the one you mentioned (my bad - I couldn't find it):
Created 01-12-2016 09:59 PM
Those are all valid and understandable points, thanks for filing the JIRA and your feedback!
Created 04-13-2017 12:40 PM
Since CDH 5.8 you can use
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=1;
to look up columns within Parquet files by column name.
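For example, against the taps_test2 table from the beginning of this thread, the option can be set per session (a sketch; the symbolic value NAME should also be accepted in place of 1):

[localhost:21000] > set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
[localhost:21000] > select header.storage_id from one_off.taps_test2 limit 5;
-- header.storage_id is now resolved by name, so this returns storage_id
-- values instead of the guid values shown earlier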
For more, see: