Impala not working with some Parquet files when HDFS caching is activated

Explorer

Hi all

 

When we activate HDFS caching for a partitioned table in our cluster (CDH 5.9.0), we randomly get errors for some of the cached files.
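
For reference, we activate the caching roughly like this (the pool name and limit here are just placeholders for our real settings):

hdfs cacheadmin -addPool dev_pool -limit 100000000000
ALTER TABLE table SET CACHED IN 'dev_pool';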

Here is an example case: we have a partitioned table with many partitions (4 partition columns, with 1-20 partitions at each level). For simplicity I boiled the test case down to three of these partitions; let's take, for example, the months 10, 11, and 12.

For partitions 10 and 11 everything works fine. For partition 12, in 50% of all cases, the simple query:

SELECT COUNT(*) FROM table WHERE month=12

fails with the error message:

File hdfs://path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet has invalid file metadata at file offset 40165660. Error = couldn't deserialize thrift msg:
TProtocolException: Invalid data

In the other 50% of the cases I get the correct result.

Once I deactivate caching, the file is always read correctly.
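
By deactivating I mean removing the cache directive again, roughly:

ALTER TABLE table SET UNCACHED;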

 

Does anybody have an idea what causes this issue and how to solve it?

If more information is needed, just let me know.

 

Kind Regards

4 REPLIES


I'm not sure that I've seen a problem exactly like that before. That error message occurs when Impala can't read the Parquet file footer correctly.

 

We sometimes see problems like this when overwriting files in place, because Impala's metadata about file sizes gets out of sync with the actual state of the filesystem. Does your workload involve anything like that? Have you tried running "REFRESH <table>" to force a refresh of the file metadata?
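
For example, something like this (with your own database and table name):

REFRESH your_db.your_table;
SELECT COUNT(*) FROM your_db.your_table WHERE month=12;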

 

Just to check - is the HDFS caching addressing a specific performance problem? Often it's not necessary because the operating system is pretty effective at caching frequently-accessed files.

Explorer

Hi Tim

 

Thanks a lot for your answer.

I tried INVALIDATE METADATA <table> as well as REFRESH <table> to force refreshing the metadata of this table.

Unfortunately, the problem remains exactly the same.

 

We want to address a specific performance problem with caching.

We have one quite large table in HDFS (in total much larger than our memory) and we want to cache some of its partitions in memory to speed up development.
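
Concretely, the idea is to cache just a few partitions with a single cached replica, along the lines of the following (the table name and the partition keys other than month are made up here, and the pool is the placeholder from above):

ALTER TABLE big_table PARTITION (year=2016, month=12, day=1, hour=0) SET CACHED IN 'dev_pool' WITH REPLICATION = 1;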


This sounds like it could be a bug then. Is there any way you could provide us with more information about the files that are causing problems so we can try to reproduce it in-house? Having the actual data is ideal, but even information about the file sizes may be helpful.

Explorer

Unfortunately, the content of this file is under NDA, so I can't provide you with the file.

Some information that I can give is summarized here:

 

  • Output from "hdfs dfs -ls":
    • -rwxrwx--x+  3 hive hive 1093251527 2016-09-30 21:15 /path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet
  • We have a _metadata and a _common_metadata file in the same directory (I tried removing them, but this did not resolve the issue)
  • Compression: snappy
  • It was created using: parquet-mr version 1.5.0-cdh5.7.1 (build ${buildNumber}) (output from parquet-tools, version 1.9.0; see the commands just after this list)
  • Software used for creation: Bundled Spark 1.6.0 from CDH 5.7.1 (in the meantime we are using CDH 5.9.0)
  • The file contains 713 row groups
  • The file contains 867 columns (of types int64, double and binary)
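
The creator, row group, and column figures above were read with parquet-tools, roughly like this (assuming a local copy of the file):

parquet-tools meta part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet
parquet-tools schema part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet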

One further thing that I tried is copying the problematic file to a separate directory (without the two metadata files), creating a new table from this file with Impala, and running the test there. Unfortunately this produces exactly the same behaviour: when the file is cached I get the error message, when it is not cached everything works fine.
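
Roughly, that reproduction looked like this (the target directory and table name are placeholders, and the cache pool is the placeholder from above):

hdfs dfs -mkdir /tmp/parquet_repro
hdfs dfs -cp /path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet /tmp/parquet_repro/
CREATE EXTERNAL TABLE repro_table LIKE PARQUET '/tmp/parquet_repro/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet' STORED AS PARQUET LOCATION '/tmp/parquet_repro';
ALTER TABLE repro_table SET CACHED IN 'dev_pool';
SELECT COUNT(*) FROM repro_table;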

 

Let me know if this helps you in understanding the problem or if you need further information (apart from the contents of the file).

 

Thanks a lot already!

Kind Regards