
INSERT OVERWRITE a partition in a large table..what happens if partition is being read by another job?


Suppose I have a large Hive table, partitioned by date.

I INSERT OVERWRITE a partition while a read query is currently using that table.

What happens to the read query? Assume the read query is either a Hive SQL job, or a Spark SQL job.

Is whatever happens deterministic? I.e., if I repeat it exactly, will I get the same results or different ones?

Thanks!


Re: INSERT OVERWRITE a partition in a large table..what happens if partition is being read by another job?

No, it is not deterministic under Hive 1.x with the default of hive.support.concurrency=false. Hive 1.x does have a non-ACID, ZooKeeper-based lock manager, but it makes readers wait and is not recommended. The ACID implementation doesn't block readers, but it is not available in the current HDP releases.

It may also be worth looking at EXCHANGE PARTITION. It is not strictly atomic either; it only narrows the window for the non-determinism.

Without locks, the new files are written to a new directory in HDFS, and that directory is then renamed into place. With no locking, this creates a race condition: queries that started against the old directory can fail. (LLAP is an exception, because it references files by inode rather than by filename.)
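
To make the race concrete, here is a rough sketch that replays one possible interleaving on a local filesystem. It is only an analogy, not Hive or HDFS code: the partition path, the file names, and the helper functions are all made up for illustration, and a real query has many more steps between planning and reading.

```python
import os
import shutil
import tempfile


def overwrite_partition(partition_dir, new_files):
    """Hypothetical writer: stage files in a new dir, then rename it into place."""
    staging = partition_dir + ".staging"
    os.makedirs(staging)
    for name, content in new_files.items():
        with open(os.path.join(staging, name), "w") as f:
            f.write(content)
    shutil.rmtree(partition_dir)       # old files disappear here
    os.rename(staging, partition_dir)  # new files appear under the old path


def simulate_race():
    """Replay one interleaving: reader plans, writer overwrites, reader reads."""
    root = tempfile.mkdtemp()
    try:
        part = os.path.join(root, "dt=2016-01-01")  # made-up partition dir
        os.makedirs(part)
        with open(os.path.join(part, "000000_0"), "w") as f:
            f.write("old rows\n")

        # Reader resolves its list of input files up front ("planning").
        planned = [os.path.join(part, n) for n in sorted(os.listdir(part))]

        # Writer replaces the partition before the reader opens anything.
        overwrite_partition(part, {"000000_0_copy_1": "new rows\n"})

        # Reader now tries to open a path that was valid at planning time.
        try:
            with open(planned[0]) as f:
                return f.read()
        except FileNotFoundError:
            return "reader failed: planned file no longer exists"
    finally:
        shutil.rmtree(root)


result = simulate_race()
print(result)
```

In this sketch the reader fails because the file it planned to read is gone. If the overwrite had happened before planning, or after the read finished, the reader would have succeeded with the new or the old data respectively, which is why the outcome depends on timing.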
