Created 11-07-2017 10:24 AM
Hi,
In my organization, Hive is used with hive.support.concurrency set to false.
I am wondering what the consequences are of inserting data during a select (and vice versa).
For an insert, I think the table's metadata is updated at the very end of the Map/Reduce job.
Thus, a select should not be disturbed, because I think the files involved in the select are determined at the very beginning of the M/R job...
For an insert overwrite, I think this is pretty similar, but I couldn't find confirmation during my research...
Could you validate (or not ;)) my thoughts?
Thanks 🙂
Created 11-07-2017 05:51 PM
If your competing read and insert target a single partition, this should be safe, since Hive uses the file system 'rename' operation at the end of the insert to make the new files visible. Rename is atomic on HDFS. If your insert is a dynamic partition insert, then you are writing multiple partitions, and the data for each partition is made visible by its own 'rename' operation. This means that a concurrent read could see a set of files that reflects only part of the insert.
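To make the partial-visibility point concrete, here is a minimal sketch that uses a local temporary directory as a stand-in for the table location. It is not Hive's code: the helper names (stage_file, publish, visible_files), the ".staging" directory and the partition names dt=1 and dt=2 are all made up for illustration, and os.rename on a local disk only approximates the HDFS rename described above.

```python
import os
import tempfile


def visible_files(table_dir):
    """List data files a reader would see, skipping hidden (dot-prefixed) dirs."""
    found = []
    for root, dirs, files in os.walk(table_dir):
        # Prune hidden directories such as the hypothetical ".staging" area.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), table_dir))
    return sorted(found)


def stage_file(table_dir, partition, name, payload):
    """Write a file under the hidden staging area; readers do not see it yet."""
    staging = os.path.join(table_dir, ".staging", partition)
    os.makedirs(staging, exist_ok=True)
    path = os.path.join(staging, name)
    with open(path, "w") as f:
        f.write(payload)
    return path


def publish(table_dir, partition, staged_path):
    """Make one staged file visible with a single rename."""
    target_dir = os.path.join(table_dir, partition)
    os.makedirs(target_dir, exist_ok=True)
    os.rename(staged_path, os.path.join(target_dir, os.path.basename(staged_path)))


def simulate_dynamic_partition_insert():
    """Publish two partitions one after the other, snapshotting what is visible."""
    table = tempfile.mkdtemp()
    partitions = ("dt=1", "dt=2")  # hypothetical partition names
    staged = {p: stage_file(table, p, "000000_0", "rows") for p in partitions}
    snapshots = [visible_files(table)]  # before any rename
    for p in partitions:
        publish(table, p, staged[p])
        snapshots.append(visible_files(table))  # after each rename
    return snapshots


if __name__ == "__main__":
    for step, files in enumerate(simulate_dynamic_partition_insert()):
        print(step, files)
```

The middle snapshot contains only the first partition's file, which is the state a read could observe if it lists the table between the two renames.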
Insert overwrite actually deletes existing files so this can conflict with a concurrent read.
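The conflict can be sketched the same way, again on a local directory and not with Hive itself. The functions plan_read, overwrite and read_planned are hypothetical names, and the sketch assumes the reader fixes its list of input files before the overwrite runs. What the reader then experiences depends on whether the replacement file reuses the old name, which is also an assumption here.

```python
import os
import tempfile


def plan_read(table_dir):
    """Fix the list of input files up front, ignoring hidden entries."""
    return [
        os.path.join(table_dir, name)
        for name in sorted(os.listdir(table_dir))
        if not name.startswith(".")
    ]


def overwrite(table_dir, new_name, payload):
    """Delete every existing data file, then write the replacement file."""
    for path in plan_read(table_dir):
        os.remove(path)
    with open(os.path.join(table_dir, new_name), "w") as f:
        f.write(payload)


def read_planned(paths):
    """Open each file from the earlier plan; fails if one has disappeared."""
    rows = []
    for path in paths:
        with open(path) as f:
            rows.append(f.read())
    return rows


def simulate_overwrite_conflict(new_name):
    """Plan a read, run an overwrite in between, then try to finish the read."""
    table = tempfile.mkdtemp()
    with open(os.path.join(table, "000000_0"), "w") as f:
        f.write("old rows")
    planned = plan_read(table)  # reader starts before the overwrite
    overwrite(table, new_name, "new rows")
    try:
        return read_planned(planned)
    except FileNotFoundError:
        return None  # a planned input file was deleted under the reader


if __name__ == "__main__":
    print(simulate_overwrite_conflict("000001_0"))
    print(simulate_overwrite_conflict("000000_0"))
```

In this toy model the reader either fails because its planned file is gone, or silently reads the new contents when the file name is reused. Neither outcome is what a query that started before the overwrite would expect.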
Created 11-09-2017 09:55 AM
Thanks, it helps.
Before OVERWRITE:
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0
During OVERWRITE:
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 2 items
0 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/.hive-staging_hive_2017-11-09_10-35-38_682_2619781700846007196-1
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0
After OVERWRITE:
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/000000_0
What I understand is that a query involving the file in this example, running since 10:15 for instance and still executing at 10:35, is not guaranteed to execute correctly (though I presume the file, especially because it is small here, will already have been processed in an early stage of the M/R job).
Is that so?
I am wondering whether OVERWRITE is a good way to build an intermediate table in this case... Without the LOCK functionality enabled, would you suggest a better way?