Suppose I have a large Hive table, partitioned by date.
I INSERT OVERWRITE a partition while a read query is currently using that table.
What happens to the read query? Assume the read query is either a Hive SQL job, or a Spark SQL job.
Is whatever happens deterministic? I.e., if I repeat it exactly, will I get the same result every time?
No, it is not deterministic under Hive 1.x with the default hive.support.concurrency=false. Hive 1.x does have a non-ACID, ZooKeeper-based lock manager, but it makes readers wait and is not recommended. The ACID implementation doesn't block readers, but it is not available in the current HDP releases.
It may also be worth looking at EXCHANGE PARTITION. It is not exactly atomic either; it just gives a smaller window for the non-determinism.
Without locks, the new files are written to a new directory in HDFS, and that directory is then renamed into place. Nothing coordinates the rename with readers that are already running, so there is a race condition: queries that started against the old directory could fail. (LLAP is an exception to this rule because it references files by inode, not by filename.)
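To make the race concrete, here is a minimal sketch that imitates the write-then-rename pattern on a local filesystem instead of HDFS. It is not Hive's actual code: the function names, the `.staging` suffix, the file names `000000_0` and `000001_0`, and the step that removes the old directory before the rename are all assumptions made for illustration. The reader "plans" by listing files up front and opens them later, which is the part that makes the outcome depend on timing.

```python
import os
import shutil
import tempfile


def write_files(directory, file_name, rows):
    """Create a directory holding one data file (stand-in for a partition dir)."""
    os.makedirs(directory)
    with open(os.path.join(directory, file_name), "w") as f:
        f.write("\n".join(rows))


def overwrite_partition(partition_dir, file_name, rows):
    """Hypothetical overwrite: write to a staging dir, then swap it into place."""
    staging = partition_dir + ".staging"
    write_files(staging, file_name, rows)
    shutil.rmtree(partition_dir)       # assumption: the old dir is dropped first
    os.rename(staging, partition_dir)  # the new dir takes over the old name


def run_reader(overwrite_before_open):
    """Plan a read, optionally let an overwrite happen, then open the files."""
    table = tempfile.mkdtemp()
    part = os.path.join(table, "dt=2020-01-01")
    write_files(part, "000000_0", ["old"])

    # "Planning": the reader records file paths now and opens them later.
    planned = [os.path.join(part, n) for n in sorted(os.listdir(part))]

    if overwrite_before_open:
        # The writer wins the race and produces a differently named file.
        overwrite_partition(part, "000001_0", ["new"])

    try:
        rows = []
        for path in planned:
            with open(path) as f:
                rows.extend(f.read().splitlines())
        return rows
    except FileNotFoundError:
        return "FAILED"
    finally:
        shutil.rmtree(table)


print(run_reader(overwrite_before_open=False))
print(run_reader(overwrite_before_open=True))
```

In this sketch the reader returns the old rows when it opens its files before the swap, and fails when the swap happens between planning and opening. Which of the two a real query hits depends on timing, which is why repeating the same experiment need not give the same result.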