What is the behaviour of SELECT during an INSERT in Hive with hive.support.concurrency=false?

Hi,

In my organization, Hive is used with hive.support.concurrency set to false.

I am wondering what the consequences are of inserting data during a select (and vice versa).

For an insert, I think the table's metadata is updated at the very end of the Map/Reduce job.

Thus, a select should not be disturbed, because I think the files involved in the select are determined at the very beginning of the M/R job...

For an insert overwrite, I think this is pretty similar, but I didn't find confirmation during my research...

Could you validate (or not ;)) my thoughts?

Thanks 🙂

2 REPLIES

Super Collaborator

If your competing read and insert target a single partition, this should be safe, since Hive uses the file system 'rename' operation at the end of the insert to make the new files visible. Rename is atomic on HDFS. If your insert is a dynamic partition insert, then you are writing multiple partitions, and the data for each partition is made visible by its own 'rename' operation. This means that a concurrent read could see a set of files that reflects only part of the insert.
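The partial-visibility case can be sketched on a local file system. This is only an illustration, assuming a temporary directory as a stand-in for the table directory on HDFS; the `.staging` directory, the `part=a` and `part=b` partitions, and the `visible_files` helper are hypothetical names, not what Hive actually creates.

```shell
#!/bin/sh
# Local stand-in for a partitioned table directory (hypothetical layout, not HDFS).
set -eu
table=$(mktemp -d)
mkdir -p "$table/part=a" "$table/part=b" \
         "$table/.staging/part=a" "$table/.staging/part=b"

# The "insert" first writes its output under a hidden staging directory.
echo "new rows a" > "$table/.staging/part=a/000000_0"
echo "new rows b" > "$table/.staging/part=b/000000_0"

# Count the data files a reader would see (the staging directory is skipped).
visible_files() {
  (cd "$table" && find . -type f ! -path './.staging/*' | wc -l | tr -d ' ')
}

before=$(visible_files)

# Publish partition a with a rename; partition b is not published yet.
mv "$table/.staging/part=a/000000_0" "$table/part=a/000000_0"
midway=$(visible_files)

# Publish partition b with a second, separate rename.
mv "$table/.staging/part=b/000000_0" "$table/part=b/000000_0"
after=$(visible_files)

echo "visible files: before=$before midway=$midway after=$after"
rm -rf "$table"
```

A reader that lists the table between the two `mv` calls sees the new data of one partition only, which is the partial view described above.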

Insert overwrite actually deletes the existing files, so it can conflict with a concurrent read.
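A rough local sketch of that conflict, again assuming a temporary directory in place of HDFS. The exact order in which Hive deletes the old files and moves the new ones is an assumption here and may differ between versions; the `.staging` directory and the variable names are hypothetical.

```shell
#!/bin/sh
# Local stand-in for an unpartitioned table directory (hypothetical, not HDFS).
set -eu
table=$(mktemp -d)
echo "old rows" > "$table/000000_0"

# The reader fixes its list of input files when the query is planned.
planned="$table/000000_0"

# The overwrite prepares its output in a hidden staging directory.
mkdir "$table/.staging"
echo "new rows" > "$table/.staging/000000_0"

# Assumed order: the old file is deleted before the new one is moved in.
rm "$table/000000_0"

# A reader that opens its planned file in this window finds nothing.
if cat "$planned" >/dev/null 2>&1; then
  read_during=ok
else
  read_during=missing
fi

# The new file is then published under the same name.
mv "$table/.staging/000000_0" "$table/000000_0"
read_after=$(cat "$planned")

echo "during overwrite: $read_during; after overwrite: $read_after"
rm -rf "$table"
```

Depending on timing, the reader in this sketch either fails to open the file it planned to read or silently reads the new contents; neither matches the snapshot it started from.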

Thanks, it helps.

Before OVERWRITE:

$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0

During OVERWRITE:

$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 2 items
0 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/.hive-staging_hive_2017-11-09_10-35-38_682_2619781700846007196-1
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0

After OVERWRITE:

$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/000000_0

What I understand is that a query that involves the file in this example and that has been running, say, since 10:15 and is still executing at 10:35 is not guaranteed to execute correctly (although I presume that the file, especially because it is small here, will already have been processed in an early stage of the M/R job).

Is that so?

I am wondering whether OVERWRITE is a good way to build an intermediate table in this case... Without the LOCK functionality enabled, can you suggest a better way?