New Contributor
Posts: 2
Registered: ‎09-01-2017

Hive-on-Spark tasks never finish

Migrating from Hive on MR to Hive on Spark 

I'm wondering why a Hive query run through an Oozie action (oozie:hive2-action:0.1) with Spark as the engine (set hive.execution.engine=spark) is so much slower than Hive on MapReduce. 
Note: I included set hive.execution.engine=spark; in my queries, and in the Oozie workflow I used hive2-action:0.1 in the xmlns and provided the JDBC URL. The job completes successfully (I checked the logs), but it takes much more wall-clock time than the usual MapReduce runs.

Using Cloudera 5.9
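
For reference, this is a minimal version of the engine switch from my queries, plus the standard Spark resource settings that Hive on Spark accepts at the session level (the executor values are illustrative placeholders, not my actual configuration):

```sql
-- Switch the execution engine for this session (this is what I set in my queries)
set hive.execution.engine=spark;

-- Hive on Spark also honors Spark resource settings at the session level;
-- the values below are placeholders, not tuned recommendations
set spark.executor.instances=4;
set spark.executor.memory=4g;
set spark.executor.cores=2;
```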

Posts: 8
Registered: ‎03-03-2017

Re: Hive-on-Spark tasks never finish

I am also a developer like you. Based on your post, below are my observations:

1) Check whether your data partitions are being spread evenly across all the executors; skewed partitions leave most executors idle.

2) If your data is partitioned, ensure your files are not only a few KB each; many tiny files mean many tiny tasks.

3) Avoid DataFrames where you can; sometimes going through sqlContext directly is faster.

4) Ensure you haven't broadcast a large table; broadcast joins only work well on small datasets.

5) If your SQL context or Hive query contains a join, ensure the small table is on the left and the large table is on the right.

6) It would also help if your data is partitioned on the columns you filter by most.
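
A rough HiveQL sketch of points 2), 4) and 5); the settings shown are standard Hive ones, but the threshold values are illustrative and the table names are made up, so adjust them to your data:

```sql
-- 2) Merge the many small output files Spark tasks can produce
set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=134217728;  -- target ~128 MB average file size

-- 4) Map-side (broadcast) joins only pay off when the small side fits
--    under this threshold; lower it if your "small" table is not small
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;  -- ~10 MB

-- 5) Small table on the left, large table on the right: Hive streams the
--    rightmost table, so the big fact table should come last
--    (dim_small / fact_large are hypothetical names)
SELECT d.name, count(*)
FROM dim_small d
JOIN fact_large f ON f.dim_id = d.id
GROUP BY d.name;
```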


Please follow the link below; it explains how to distribute your data across all the executors
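
In case the link does not come through, the gist is HiveQL's DISTRIBUTE BY clause, which hashes rows to reducers (and hence executors) by a key you choose; the table and column names below are made up for illustration:

```sql
-- Spread rows evenly across reducers/executors by hashing on a
-- well-distributed key, instead of letting skewed keys pile up on one task
-- (sales_balanced / sales_raw / customer_id are hypothetical names)
INSERT OVERWRITE TABLE sales_balanced
SELECT *
FROM sales_raw
DISTRIBUTE BY customer_id;
```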


Please let me know if you are facing more issues


Currently incubating in Cloudera Labs:

Spark Runner for Beam SDK
Time Series for Spark