When we activate HDFS caching for a partitioned table on our cluster (CDH 5.9.0), we randomly get errors for some of the cached files. Here is an example case:
We have a partitioned table with many partitions (4 partition columns, with 1-20 partitions on each level). For simplicity, I boiled the test case down to three of these partitions: the months 10, 11, and 12.
For partitions 10 and 11 everything works fine. For partition 12, in 50% of all cases, for the simple query:
SELECT COUNT(*) FROM table WHERE month=12
I get the error message:
File hdfs://path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet has invalid file metadata at file offset 40165660. Error = couldn't deserialize thrift msg: TProtocolException: Invalid data
In the other 50% of the cases I get the correct result.
Once I deactivate caching the file is always read correctly.
Does anybody have an idea on the cause of this issue and on solving it?
If more information is needed, just let me know.
I'm not sure that I've seen a problem exactly like that before. That error message occurs when Impala can't read the Parquet file footer correctly.
We sometimes see problems like this when files are overwritten in place, because Impala's metadata about file sizes gets out of sync with the actual state of the filesystem. Does your workload involve anything like that? Have you tried running "REFRESH <table>" to force a refresh of the file metadata?
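For reference, a Parquet file ends with the serialized footer metadata, followed by a 4-byte little-endian footer length and the 4-byte magic PAR1. Below is a minimal sketch, assuming a local copy of the file (for example one fetched with hdfs dfs -get), that reads only these trailing bytes. The function name read_footer_info is hypothetical and is not part of Impala or any Parquet library.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def read_footer_info(path):
    """Return (file_size, footer_length, footer_offset) for a local Parquet file.

    Hypothetical helper: it only inspects the 8-byte trailer and does not
    deserialize the thrift metadata itself.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)  # seek to end of file to learn its size
        file_size = f.tell()
        # Smallest possible layout: 4-byte header magic + 8-byte trailer.
        if file_size < 12:
            raise ValueError("file too small to be a Parquet file")
        f.seek(file_size - 8)
        tail = f.read(8)

    # Trailer = 4-byte little-endian footer length + 4-byte magic.
    footer_length = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")

    footer_offset = file_size - 8 - footer_length
    if footer_offset < 4:
        raise ValueError("footer length is larger than the file allows")
    return file_size, footer_length, footer_offset
```

It may be worth comparing the computed footer offset and file size with the offset reported in the error message. This check says nothing about whether the cached copy differs from the copy on disk; it only confirms that the file trailer itself is well-formed.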
Just to check - is the HDFS caching addressing a specific performance problem? Often it's not necessary because the operating system is pretty effective at caching frequently-accessed files.
Thanks a lot for your answer.
I tried INVALIDATE METADATA <table> as well as REFRESH <table> to force refreshing the metadata of this table.
Unfortunately, the problem remains exactly the same.
We want to address a specific performance problem with caching.
We have one quite large table in our HDFS (in total much larger than our memory) and we want to cache some of its partitions in memory to speed up the development.
This sounds like it could be a bug then. Is there any way you could provide us with more information about the files that are causing problems so we can try to reproduce it in-house? Having the actual data is ideal, but even information about the file sizes may be helpful.
Unfortunately, the content of this file is under NDA, so I can't provide you with the file.
Some information that I can give is summarized here:
One further thing that I tried is copying the problematic file to a separate directory (without the two metadata files), creating a new table from this file with Impala, and running the test there. Unfortunately, this produces exactly the same behaviour: when the file is cached I get the error message, and when it is not cached everything works fine.
Let me know if this helps you in understanding this problem or if you need further information (except for the contents of the file).
Thanks a lot already!