Appending to Hive table from multiple parallel threads

Hi, I am using HDP 2.6.5, Spark 2.3 and Hive 1.2. I am running a PySpark program in which a method is called multiple times with different date ranges (begin and end date). Originally this method was called sequentially in a loop. I have modified the program to call it in parallel using x threads, with each thread getting a different begin and end date. I am observing some weird behavior: the threads do seem to append data to the Hive table, but the row counts differ on every run of the program. I am using a Hive managed table (not an external table) in ORC format. The data is not partitioned.
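To make the setup easier to discuss, here is a minimal sketch of the pattern as I understand it, using only the Python standard library. All names (`split_date_range`, `process_range`, `run_parallel`, `appended`) are hypothetical, and the in-memory list is only a stand-in for the Hive table; the real method would run its Spark job and append where the placeholder is. The lock around the append step is an assumption on my part: it is one way to check whether the concurrent appends are related to the differing counts, not a confirmed diagnosis or fix.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
import threading


def split_date_range(begin, end, chunks):
    """Split the inclusive range [begin, end] into contiguous, non-overlapping ranges."""
    total_days = (end - begin).days + 1
    size = -(-total_days // chunks)  # ceiling division: days per range
    ranges = []
    start = begin
    while start <= end:
        stop = min(start + timedelta(days=size - 1), end)
        ranges.append((start, stop))
        start = stop + timedelta(days=1)
    return ranges


# Hypothetical stand-in for the Hive table; the real code would append via Spark.
appended = []
_append_lock = threading.Lock()


def process_range(begin, end):
    """Placeholder for the real per-range method: build rows, then append them."""
    days = (end - begin).days + 1
    rows = [(begin + timedelta(days=i)).isoformat() for i in range(days)]
    # Assumption: only one thread appends at a time, to rule out overlapping writes.
    with _append_lock:
        appended.extend(rows)
    return len(rows)


def run_parallel(begin, end, num_threads):
    """Run process_range over each sub-range in a thread pool; return ranges and counts."""
    ranges = split_date_range(begin, end, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        counts = list(pool.map(lambda r: process_range(*r), ranges))
    return ranges, counts
```

Calling `run_parallel(date(2019, 1, 1), date(2019, 1, 31), 4)` returns the per-range row counts. Comparing their sum with the table count after each run should show whether rows go missing at the append step or earlier in the per-range computation.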
