
Insert Overwrite running too slow when inserting data in partitioned table

Explorer

I am using Spark with Hive in my project. In the Spark job, I do an INSERT OVERWRITE into an external table that has partition columns. The Spark job runs without any errors, and in the web UI I can see that all tasks for the job have completed.

Now comes the painful part: the logs show that the Spark processing is complete, and Hive is now moving the HDFS files from the staging area to the table's actual directory. This step is taking forever. Any input on fixing this would be highly appreciated. Please let me know if you need more details.

Note: when I run the same INSERT OVERWRITE logic directly from a Hive script, it completes within a few minutes (the execution engine is Tez).
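The pattern in question looks roughly like this (a minimal sketch; the table and column names are hypothetical, and dynamic partitioning is assumed to be enabled):

```sql
-- Hypothetical names; a sketch of the kind of statement described above.
-- Dynamic partitioning must be enabled for a partitioned INSERT OVERWRITE.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE db.events_ext PARTITION (event_date)
SELECT user_id, payload, event_date
FROM db.events_staging;
```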

4 Replies

Expert Contributor

Can you run `hdfs dfs -ls` on the Spark job's output directory? The total number of files might be very different between Spark SQL and Hive-on-Tez.
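To turn that listing into per-partition counts, a small helper along these lines can tally the data files in `hdfs dfs -ls -R` output (a sketch only; the sample paths below are hypothetical):

```python
from collections import Counter

def count_files_per_partition(ls_output: str) -> Counter:
    """Tally data files per partition directory from `hdfs dfs -ls -R` text.

    Lines whose permission string starts with '-' are files; 'd' lines are
    directories. The partition is taken as the parent directory of each file.
    """
    counts = Counter()
    for line in ls_output.splitlines():
        parts = line.split()
        if len(parts) < 8 or not parts[0].startswith("-"):
            continue  # skip directories, headers, and blank lines
        path = parts[-1]
        partition_dir = path.rsplit("/", 1)[0]
        counts[partition_dir] += 1
    return counts

# Hypothetical sample listing, in the usual `hdfs dfs -ls -R` format
sample = """\
drwxr-xr-x   - hive hdfs          0 2016-02-01 10:00 /apps/hive/warehouse/t/dt=2016-01-01
-rw-r--r--   3 hive hdfs    1048576 2016-02-01 10:01 /apps/hive/warehouse/t/dt=2016-01-01/part-00000
-rw-r--r--   3 hive hdfs    1048576 2016-02-01 10:01 /apps/hive/warehouse/t/dt=2016-01-01/part-00001
-rw-r--r--   3 hive hdfs     524288 2016-02-01 10:02 /apps/hive/warehouse/t/dt=2016-01-02/part-00000
"""

print(count_files_per_partition(sample))
```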

Explorer

Thanks Gopal. I don't have the count of files produced by Hive yet; I am trying to get it. Spark SQL had produced 400-odd files before it got stuck, and had it run further it might have produced more. Do you think the number of files produced by Spark SQL is why it is taking so much time?

Explorer

Spark SQL is producing around 2,200 files, whereas Tez is producing around 60 files.
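A file count that high usually means each Spark task wrote a separate file into every partition it touched. One common mitigation is to shuffle rows by the partition column before the insert, so that each partition's data is written by a single task. A hedged sketch, with the same hypothetical names as above:

```sql
-- Hypothetical names. DISTRIBUTE BY routes all rows for a given
-- event_date to one task, yielding roughly one file per partition.
INSERT OVERWRITE TABLE db.events_ext PARTITION (event_date)
SELECT user_id, payload, event_date
FROM db.events_staging
DISTRIBUTE BY event_date;
```

Fewer, larger files also make the final move from the staging directory to the table directory much cheaper, which is the step reported as slow above.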

Explorer

It looks very similar to these issues that other people have faced: https://issues.apache.org/jira/browse/HIVE-13382

http://mail-archives.apache.org/mod_mbox/hive-user/201507.mbox/%3CCAG97e2E=0DQKPFSz1Gmy9=0te3i4uU0PL...

Is this patch available in HDP 2.3.4?