Insert Overwrite running too slow when inserting data in partitioned table

New Contributor

I am using Spark with Hive in my project. In the Spark job, I run an INSERT OVERWRITE on an external table that has partition columns. The Spark job runs fine without any errors, and I can see in the web UI that all tasks for the job have completed.

Now comes the painful part: I can see in the logs that the Spark processing is complete and Hive is now trying to move the HDFS files from the staging area to the actual directory of the Hive table. This is taking forever. Any input on how to fix this would be highly appreciated. Please let me know if you need more details.

Note: when I run the same INSERT OVERWRITE logic directly from a Hive script, it completes within a few minutes (execution engine is Tez).

4 REPLIES

Re: Insert Overwrite running too slow when inserting data in partitioned table

Rising Star

Can you do a "dfs -ls" on the output for Spark job? The total # of files might be very different between SparkSQL and Hive-Tez.
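
One way to make that comparison is to save the recursive listing from each run and count the files per partition directory. A minimal sketch, assuming the listing comes from `hdfs dfs -ls -R` on the table directory in the usual eight-column format; the sample paths and the helper name `count_files_per_dir` are hypothetical:

```python
from collections import Counter
import posixpath


def count_files_per_dir(ls_output):
    """Count plain files per parent directory in `hdfs dfs -ls -R` output."""
    counts = Counter()
    for line in ls_output.splitlines():
        # Expected columns: perms, replication, owner, group, size, date, time, path
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips "Found N items" headers and blank lines
        perms, path = fields[0], fields[7]
        if perms.startswith("d"):
            continue  # directories are not data files
        counts[posixpath.dirname(path)] += 1
    return counts


# Hypothetical listing, for illustration only.
sample = """\
drwxr-xr-x   - hive hdfs          0 2016-04-01 10:00 /apps/hive/warehouse/t/dt=2016-04-01
-rw-r--r--   3 hive hdfs       1024 2016-04-01 10:00 /apps/hive/warehouse/t/dt=2016-04-01/part-00000
-rw-r--r--   3 hive hdfs       2048 2016-04-01 10:00 /apps/hive/warehouse/t/dt=2016-04-01/part-00001
drwxr-xr-x   - hive hdfs          0 2016-04-01 10:00 /apps/hive/warehouse/t/dt=2016-04-02
-rw-r--r--   3 hive hdfs        512 2016-04-01 10:00 /apps/hive/warehouse/t/dt=2016-04-02/part-00000
"""

per_dir = count_files_per_dir(sample)
total_files = sum(per_dir.values())
```

Running the same helper on the listing from the Spark SQL run and from the Hive-on-Tez run gives two totals that can be compared directly.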

Re: Insert Overwrite running too slow when inserting data in partitioned table

New Contributor

Thanks Gopal. I don't have the count of files produced by Hive as of now; I am trying to get that. Spark SQL had produced 400-odd files before it got stuck, and had it run further it might have produced more. Do you think the number of files produced by Spark SQL is why it is taking so much time?

Re: Insert Overwrite running too slow when inserting data in partitioned table

New Contributor

Spark SQL is producing around 2200 files, whereas Tez is producing around 60 files.
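
If the file count is indeed what slows the final move, one commonly used approach is to shuffle rows by the partition columns before writing, so that each partition's rows are written by fewer tasks. A minimal sketch, assuming the statement is submitted through Spark SQL and that `DISTRIBUTE BY` is accepted by the Spark version in use; the table and column names and the helper `build_insert_overwrite` are hypothetical, and the effect on this particular job is untested:

```python
def build_insert_overwrite(target, source, columns, partition_cols):
    """Build a dynamic-partition INSERT OVERWRITE that clusters rows by partition."""
    if not partition_cols:
        raise ValueError("at least one partition column is required")
    # Dynamic partition columns must come last in the SELECT list.
    select_list = ", ".join(list(columns) + list(partition_cols))
    part_list = ", ".join(partition_cols)
    return (
        "INSERT OVERWRITE TABLE {t} PARTITION ({p}) "
        "SELECT {s} FROM {src} "
        "DISTRIBUTE BY {p}"
    ).format(t=target, p=part_list, s=select_list, src=source)


# Hypothetical names, for illustration only.
stmt = build_insert_overwrite("sales_ext", "sales_staging", ["id", "amount"], ["dt"])
```

The resulting string would be passed to whatever SQL entry point the job already uses; the dynamic partitioning settings the existing job relies on still apply.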

Re: Insert Overwrite running too slow when inserting data in partitioned table

New Contributor

It looks very similar to these issues that other people have faced: https://issues.apache.org/jira/browse/HIVE-13382

http://mail-archives.apache.org/mod_mbox/hive-user/201507.mbox/%3CCAG97e2E=0DQKPFSz1Gmy9=0te3i4uU0PL...

Is this patch available in HDP 2.3.4?
