impala - `recover partitions` points to old data


I have an external table pointing to partitioned Parquet data in an AWS S3 bucket. I realized our write-out process was creating too many files within a partition, so I tweaked our code and overwrote the Parquet data in that S3 location to be more compact.

I then dropped the table and re-ran the `CREATE EXTERNAL TABLE` and `ALTER TABLE ... RECOVER PARTITIONS` statements. The issue I'm running into now is that the table seems to be pointing to both the old and new Parquet data. If I run a `SHOW FILES IN` command, I see both old and new files listed for the table. This leads to errors when I try to access data in the table, as it seems to be trying to read from a file that no longer exists in S3. Is there a cache or something similar that needs to be cleared in these types of situations?
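For anyone trying to confirm which entries are stale, one option is to compare the paths reported by `SHOW FILES IN` with what the bucket actually contains. Below is a minimal sketch in Python, assuming both listings have already been exported to plain lists of paths. The `find_stale_files` and `strip_scheme` helpers, the bucket name, and the file names are all hypothetical and only for illustration.

```python
def strip_scheme(path):
    # Drop a leading "s3://", "s3a://", etc. so that listings produced
    # by different tools can be compared on bucket/key alone.
    return path.split("://", 1)[-1]


def find_stale_files(table_files, bucket_files):
    """Return paths the table lists that are absent from the bucket listing."""
    present = {strip_scheme(p) for p in bucket_files}
    return sorted(p for p in table_files if strip_scheme(p) not in present)


# Hypothetical listings: the first as reported for the table,
# the second as listed directly from the bucket.
table_files = [
    "s3a://example-bucket/events/day=1/old-part-0.parq",
    "s3a://example-bucket/events/day=1/new-part-0.parq",
]
bucket_files = [
    "s3://example-bucket/events/day=1/new-part-0.parq",
]

for path in find_stale_files(table_files, bucket_files):
    print("listed by table but missing in bucket:", path)
```

Any path this prints is one the table metadata still references even though the object is gone, which matches the read errors described above.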


Yes, that was exactly it. Since our data is created by a separate pipeline that predates our CDP usage, the S3Guard table was out of sync with what was actually in the bucket. Disabling S3Guard let me get things up and running again while I learn more about it. I'd missed that re:Invent announcement, so thanks for the help on multiple fronts!

Glad to help! I'm excited about the S3 changes just because they simplify ingestion so much.

I'll add a disclaimer here in case other people read the solution. There's *some* potential for a performance impact when disabling S3Guard for S3-based tables with large partition counts, just because of the difference in implementation: retrieving the listing from DynamoDB may be quicker than retrieving it from S3 in some scenarios.