Support Questions

kueyama · ‎12-07-2020

I have an external table pointing to partitioned parquet data in an AWS S3 bucket. I realized our write-out process was creating too many files within a partition, so I tweaked our code and overwrote the parquet data in that S3 location to be more compact.

I then dropped the table and re-ran the `CREATE EXTERNAL TABLE` and `ALTER TABLE ... RECOVER PARTITIONS` statements. The issue I'm running into now is that the table seems to be pointing to both the old and new parquet data. If I run a `SHOW FILES IN` command, I see both old and new files listed for the table. This leads to errors when I try to access data in the table, as it seems to be trying to read from files that no longer exist in S3. Is there a cache or something similar that needs to be cleared in these types of situations?

Tim Armstrong · ‎12-07-2020

If you have objects that have been deleted in S3 but are showing up in file listings after refreshing the table (which sounds like the case since you dropped and recreated the table), it's possible that there's some inconsistency between the state in s3guard and the state in s3. https://docs.cloudera.com/runtime/7.0.2/cloud-data-access/topics/cr-cda-s3guard-operational-issues.h... has some background on s3guard. I'm not an s3guard expert (it's a layer Impala builds on) so don't have much to add about how you would debug/address this beyond what we have in the docs there.
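
One way to confirm that kind of mismatch is to compare the paths the table reports with what is really in the bucket. The sketch below is only an illustration: `stale_paths` is a hypothetical helper, the bucket and file names are made up, and it assumes you have already copied the paths from `SHOW FILES IN` output and from a direct bucket listing (for example `aws s3 ls --recursive`) and normalized them to the same form.

```python
from typing import Iterable, Set


def stale_paths(listed_by_table: Iterable[str], present_in_s3: Iterable[str]) -> Set[str]:
    """Return paths the table lists that are not actually present in the bucket."""
    return set(listed_by_table) - set(present_in_s3)


# Hypothetical example data: paths as reported by SHOW FILES IN, and paths
# from a direct listing of the bucket, both written as full s3a:// URIs.
table_files = [
    "s3a://example-bucket/tbl/dt=2020-12-01/old-0001.parq",
    "s3a://example-bucket/tbl/dt=2020-12-01/old-0002.parq",
    "s3a://example-bucket/tbl/dt=2020-12-01/compact-0001.parq",
]
bucket_files = [
    "s3a://example-bucket/tbl/dt=2020-12-01/compact-0001.parq",
]

# Anything printed here is listed for the table but missing from S3.
for path in sorted(stale_paths(table_files, bucket_files)):
    print("stale:", path)
```

If this prints any paths, the listing the table is built from disagrees with the bucket contents, which is consistent with an out-of-sync s3guard table.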

One option to consider is to disable s3guard to avoid it entirely. Very recently S3 improved its consistency model to address the main problems s3guard solved (https://aws.amazon.com/s3/consistency/), so you could try disabling s3guard for that bucket to see if it solves the problem.
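
For reference, here is a rough sketch of what a per-bucket override could look like. It assumes the Hadoop S3A per-bucket configuration pattern (`fs.s3a.bucket.<bucket>.<option>`) and the `NullMetadataStore` class as the "no s3guard" setting; the helper names and the bucket name are hypothetical, and in a CDP cluster the property would normally be set through Cloudera Manager rather than by hand, so check the docs for your version before relying on it.

```python
from typing import Tuple

# Assumption: this metadata store class is what S3A uses to mean "s3guard off".
NULL_STORE = "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore"


def s3guard_off_property(bucket: str) -> Tuple[str, str]:
    """Build the per-bucket (name, value) pair that turns s3guard off for one bucket."""
    return (f"fs.s3a.bucket.{bucket}.metadatastore.impl", NULL_STORE)


def as_core_site_xml(name: str, value: str) -> str:
    """Render a single property in core-site.xml style."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


# "example-bucket" is a placeholder; substitute the bucket backing the table.
name, value = s3guard_off_property("example-bucket")
print(as_core_site_xml(name, value))
```

Scoping the override to one bucket leaves s3guard in place for any other buckets that still rely on it.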

kueyama · ‎12-08-2020

Yes, that was exactly it. Since our data is created by a separate pipeline that predates our CDP usage, the s3guard table was out of sync with what was actually in the bucket. Disabling let me get things up and running again while I learn more about s3guard. I'd missed that re:invent announcement, so thanks for the help on multiple fronts!

Tim Armstrong · ‎12-08-2020

Glad to help! I'm excited about the S3 changes just because they simplify ingestion so much.

I'll add a disclaimer here in case other people read the solution. There's *some* potential for performance impact when disabling s3guard for S3-based tables with large partition counts, just because of the difference in implementation: retrieving the listing from DynamoDB may be quicker than retrieving it from S3 in some scenarios.

impala - `recover partitions` points to old data