I am using Spark with Hive in my project. In the Spark job, I am doing an insert overwrite into an external table with partitioned columns. The Spark job runs fine without any errors, and I can see in the web UI that all tasks for the job are completed.
Now comes the painful part: I can see in the logs that the Spark processing is complete, and Hive is now trying to move the HDFS files from the staging area to the actual table directory of the Hive table. This is taking forever. Any inputs to fix this would be highly appreciated. Please let me know if you want more details.
Note: However, when I run the same insert overwrite logic directly from a Hive script, it completes within a few minutes. (The execution engine is Tez.)
Can you do a "dfs -ls" on the output of the Spark job? The total number of files might be very different between Spark SQL and Hive-on-Tez.
Thanks Gopal. I don't have the count of files produced by Hive as of now; I am trying to get that. But Spark SQL had produced 400-odd files before it got stuck, and had it run further, it might have produced more. Do you think the number of files produced by Spark SQL is why it's taking so much time?
Spark SQL is producing around 2,200 files, whereas Tez is producing around 60 files.
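If the slow move phase is indeed driven by the large number of small files, one common mitigation is to make Spark SQL cluster rows by the partition column before writing, so each Hive partition ends up with far fewer files. A minimal sketch, assuming a hypothetical target table `my_table` partitioned by `part_col` and a hypothetical source table `source_table` (both names and the parallelism value are illustrative, not from the thread):

```sql
-- Lower the shuffle parallelism so fewer reducer tasks (and hence fewer
-- output files) are created; 60 here is only an illustrative value.
SET spark.sql.shuffle.partitions=60;

-- DISTRIBUTE BY routes all rows with the same partition value to the same
-- task, so each Hive partition is written by only a handful of files
-- instead of one file per shuffle task.
INSERT OVERWRITE TABLE my_table PARTITION (part_col)
SELECT col1, col2, part_col
FROM source_table
DISTRIBUTE BY part_col;
```

From the DataFrame API, repartitioning on the partition column before the write (e.g. `df.repartition(col("part_col"))`) has a similar effect.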
It looks very similar to this issue which other people have faced: https://issues.apache.org/jira/browse/HIVE-13382
Is this patch available in HDP 2.3.4?