IMPALA: scale that does not match the table metadata scale. File metadata scale: 2 Table metadata scale: 6

Rising Star

1. We have a Hive table with the below schema:

create external table precision_1
(
id String,
prec decimal(7,2)
)
partitioned by (date string)
stored as parquet;

2. insert into precision_1 partition(date='2019-02-01') values ('1', 2.34);

3. insert into precision_1 partition(date='2019-02-01') values ('2', 3.45);

4. 'SELECT * from precision_1' works in both Impala and Hive.

5. We did the following in Hive and then invalidated metadata in Impala:

ALTER TABLE precision_1 CHANGE prec prec decimal(9,6);

insert into precision_1 partition(date='2019-01-02')  values ('1', 3.765894);

Now 'SELECT * from precision_1' works in Hive but not in Impala.

Impala throws the error below:

Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='pq)"\xf9LNN\x00\x00\x00\x00\xb1Z\x17{', guid='pq)"\xf9LNN\x00\x00\x00\x00\xb1Z\x17{')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=None, errorMessage="File 'hdfs://devana1/user/hive/warehouse/precision_1/date=2019-02-01/000000_0_copy_1' column 'prec' has a scale that does not match the table metadata scale. File metadata scale: 2 Table metadata scale: 6\n", sqlState='HY000', infoMessages=None, statusCode=3), results=None, hasMoreRows=None)

 

Kindly suggest a solution.

 

3 REPLIES

Expert Contributor

Hi @ravikumashi,

This issue happens because each individual INSERT statement opens a new Parquet file, which means that files written after the ALTER are created with the new schema. Although Hive is able to read Parquet files whose schema has a different precision than the table metadata, this feature is still under development in Impala; please see IMPALA-7087. The Parquet schema can be checked with "parquet-tools schema", which is deployed with CDH and, in this case, should give output similar to this:

# Pre-Alter
# parquet-tools schema f04187260a7a7eb5-b49e8a5000000000_938468120_data.0.parq
message schema {
optional fixed_len_byte_array(4) prec (DECIMAL(7,2));
}
# Post-Alter
# parquet-tools schema f14ce5c3b035caed-8f4c958400000000_1136483050_data.0.parq
message schema {
optional fixed_len_byte_array(4) prec (DECIMAL(9,6));
}  
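If it is not obvious which data files belong to which partition, Impala can list the files of a partition before you inspect them with parquet-tools. A quick sketch, using the table and partition from your example:

-- List the data files of the affected partition in impala-shell
SHOW FILES IN precision_1 PARTITION (date='2019-02-01');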

 

Rising Star

@tmater Thank you for the update.

What if someone alters the table without knowing that it will not work in Impala?

If we have a huge data set (historical data), the user will be unable to view it in Impala.

What is the best way to overcome this?

1. Is there a way we can edit the Parquet file metadata, either through Spark or from the command line?

2. We tried creating a temporary table with the new schema and copying data from the old table with LATERAL VIEW explode (on the array/complex data types) in Hive, manually with Hive queries, but it takes a lot of our cluster resources and terminates without doing anything (see the sketch after this list).

3. We even tried splitting the historical (source) data by date range, and it is not helping us.
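
Roughly, the approach from point 2 looks like the sketch below; the source table precision_src and the array column items are placeholders, not our real schema:

-- Allow dynamic partitioning for the copy
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Temporary table with the new decimal scale
CREATE TABLE precision_tmp (
id STRING,
prec DECIMAL(9,6),
item STRING
)
PARTITIONED BY (date STRING)
STORED AS PARQUET;

-- Copy one date range at a time, flattening the array column (points 2 and 3)
INSERT OVERWRITE TABLE precision_tmp PARTITION (date)
SELECT s.id, s.prec, e.item, s.date
FROM precision_src s
LATERAL VIEW explode(s.items) e AS item
WHERE s.date BETWEEN '2019-01-01' AND '2019-03-31';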

 

I believe LATERAL VIEW explode takes a lot of resources on the historical data and causes the jobs to terminate.

 

Note: we have complex data types in our source table, so we are using LATERAL VIEW explode in Hive.

 

Any workaround in Spark, Hive, or Impala that would help us deal with both historical and current data in Impala would be really helpful.

 

Thank you for your help on this

Expert Contributor

@ravikumashi, if the table is altered by another user, the next Impala query is likely to fail.

Parquet stores the schema internally per column chunk; I assume changing this schema directly would mess up the addressing in the Parquet file (please see the Parquet file format details). I can see two possible solutions; please note that I am unaware of the use case:

  • Option 1: Create a new table with the new schema and re-create the Parquet files in Impala with INSERT INTO ... SELECT. At the end of this operation, new Parquet files with the new schema will be created in the new location.
  • Option 2: With UNION the two schemas can be merged. For this, the new and old data have to be split into two tables. Additionally, a VIEW could be created to hide this abstraction (see the sketch after the examples below).

Examples for the above two options are below; let me know if these are suitable for this use case.

# Test tables
CREATE TABLE parquet_1 (id STRING, value DECIMAL(2,2));
INSERT INTO parquet_1 values ('1', 0.11);
CREATE TABLE parquet_2 (id STRING, value DECIMAL(4,3));
INSERT INTO parquet_2 values ('2', 1.111);

# Option 1
INSERT INTO TABLE parquet_2 SELECT * FROM parquet_1;

# Option 2
SELECT * FROM parquet_1 UNION ALL SELECT * FROM parquet_2;
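
As mentioned in Option 2, a VIEW can hide the split between the old and new tables. A small sketch building on the test tables above; the view name and the explicit CAST are just an illustration:

-- Option 2 with a VIEW: expose old and new data as one table
CREATE VIEW parquet_merged AS
SELECT id, CAST(value AS DECIMAL(4,3)) AS value FROM parquet_1
UNION ALL
SELECT id, value FROM parquet_2;

-- Queries then go against the view instead of the individual tables
SELECT * FROM parquet_merged;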