Support Questions

Find answers, ask questions, and share your expertise

How to fix Parquet schema: optional int64 amount [i:10 d:1 r:0]?

avatar
Contributor

I have a table and when I am trying to query it through imapala it throws me this error:

File 'hdfs:/path/table/1_data.0.parq' has an incompatible Parquet schema for column 'db.table.parameter_11'. Column type: STRING, Parquet schema: optional int64 amount [i:10 d:1 r:0] 

But when I query it through hive it returns results correctly. How can I fix it?

1 REPLY 1

avatar
Master Collaborator

 

The error message indicates that there is an inconsistency between the expected schema for the column 'db.table.parameter_11' and the actual schema found in the Parquet file 'hdfs:/path/table/1_data.0.parq'. The column type is expected to be a STRING, but the Parquet schema suggests that it is an optional int64 (integer) column.

To resolve this issue, you'll need to investigate and potentially correct the schema mismatch. Here are some steps you can take:

  1. Verify the Expected Schema:

    • Check the definition of the 'db.table.parameter_11' column in the Impala metadata or Hive metastore. Ensure that it is defined as a STRING type.
  2. Inspect the Parquet File Schema:

    • You can use tools like parquet-tools to inspect the schema of the Parquet file directly. Run the following command in the terminal:

      bash

 

    • parquet-tools schema 1_data.0.parq
    • Look for the 'db.table.parameter_11' column and check its data type in the Parquet schema.

  • Compare Expected vs. Actual Schema:

    • Compare the expected schema for 'db.table.parameter_11' with the actual schema found in the Parquet file. Identify any differences in data types.
  • Investigate Data Inconsistencies:

    • If there are data inconsistencies, investigate how they might have occurred. It's possible that there was a schema evolution or a mismatch during the data writing process.
  • Resolve Schema Mismatch:

    • Depending on your findings, you may need to correct the schema mismatch. This could involve updating the metadata in Impala or Hive to match the actual schema or adjusting the Parquet file schema.
  • Update Impala Statistics:

    • After resolving the schema mismatch, it's a good practice to update Impala statistics for the affected table. This can be done using the COMPUTE STATS command in Impala:

 

    • This step ensures that Impala has up-to-date statistics for query optimization.

Here's a high-level example of what the Parquet schema inspection might look like:

parquet-tools schema 1_data.0.parq

Look for the 'db.table.parameter_11' column and check its data type in the Parquet schema. If the data type in the Parquet schema is incorrect, you may need to investigate how the data was written and whether there were any issues during that process. Correcting the schema mismatch and updating Impala statistics should help resolve the issue.