INSERT OVERWRITE a partition in a large table: what happens if the partition is being read by another job?
Labels: Apache Hive
Created ‎02-06-2018 07:28 PM
Suppose I have a large Hive table, partitioned by date.
I INSERT OVERWRITE a partition while a read query is currently using that table.
What happens to the read query? Assume the read query is either a Hive SQL job, or a Spark SQL job.
Is whatever happens deterministic? That is, if I repeat it exactly, will I get the same results or different ones?
Thanks!
Created ‎02-06-2018 08:22 PM
No, it is not deterministic in Hive 1.x with the default of hive.support.concurrency=false. Hive 1.x does have a non-ACID, ZooKeeper-based lock manager, but it makes readers wait and is not recommended. The ACID implementation does not block readers, but it is not available in the current HDP releases.
It may also be worth looking at EXCHANGE PARTITION. It is not exactly atomic either, though; it only narrows the window for the non-determinism.
Without locks, the new files are written to a new directory in HDFS, and that directory is then renamed into place. Nothing coordinates the rename with readers that are already running, so there is a race condition: queries that started on the basis of the old directory could fail. (LLAP is an exception to this rule because it uses inodes, not filenames, as references.)
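To make the race concrete, here is a rough local-filesystem analogy in Python. It is only a sketch, not Hive's or HDFS's actual code: the directory name `dt=2018-02-06`, the file names `000000_0` and `000001_0`, and the helpers `overwrite_partition` and `demo` are all made up for illustration, and the swap is modeled as "delete the old directory, then rename the staging directory", which is an assumption. The reader lists the partition's files first (as a query planner would), the writer then swaps the directory, and the reader's stale file list no longer resolves.

```python
import os
import shutil
import tempfile


def overwrite_partition(partition_dir, new_files):
    """Hypothetical writer: stage new files, then swap the directory in."""
    parent = os.path.dirname(partition_dir)
    staging = tempfile.mkdtemp(dir=parent, prefix=".staging_")
    for name, data in new_files.items():
        with open(os.path.join(staging, name), "w") as f:
            f.write(data)
    shutil.rmtree(partition_dir)        # old files disappear here
    os.rename(staging, partition_dir)   # staged directory takes the old name


def demo():
    root = tempfile.mkdtemp()
    part = os.path.join(root, "dt=2018-02-06")
    os.mkdir(part)
    with open(os.path.join(part, "000000_0"), "w") as f:
        f.write("old rows\n")

    # Reader "plans" its query: it records the files it intends to read.
    planned = [os.path.join(part, n) for n in sorted(os.listdir(part))]

    # Writer overwrites the partition between planning and reading.
    overwrite_partition(part, {"000001_0": "new rows\n"})

    # Reader now tries to open the files from its stale plan.
    results = []
    for path in planned:
        try:
            with open(path) as f:
                results.append(("ok", f.read()))
        except FileNotFoundError:
            results.append(("missing", os.path.basename(path)))

    after = sorted(os.listdir(part))
    shutil.rmtree(root)
    return results, after


if __name__ == "__main__":
    print(demo())
```

In this sketch the reader fails because the file it planned to read is gone. If the swap had happened before the reader listed the directory, it would have read the new rows instead, and if the new file had happened to reuse the old name, the reader would have silently picked up new data. The outcome depends on timing, which is the non-determinism described above.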
