New Contributor
Posts: 2
Registered: ‎09-01-2017

Hive-on-Spark tasks never finish

Migrating from Hive on MR to Hive on Spark 


I'm wondering why a Hive query run through an Oozie action [oozie:hive2-action:0.1] on Spark [set hive.execution.engine=spark] is so much slower than the same query on Hive on MapReduce.

Note: I included set hive.execution.engine=spark; in my queries, and in the Oozie workflow I used hive2-action:0.1 in the [xmlns] and provided the JDBC [url]. The job completes successfully (I checked the logs), but it takes much more wall-clock time than it did on MR.

Using Cloudera 5.9
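For reference, this is roughly how my workflow is wired up. A minimal sketch only; the workflow name, host, and script path below are placeholders, not my actual values:

```xml
<!-- Sketch of the hive2 action described above; names, JDBC URL, and
     script path are placeholders. The script itself begins with
     "set hive.execution.engine=spark;". -->
<workflow-app name="hive2-on-spark-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="hive2-node"/>
  <action name="hive2-node">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>jdbc:hive2://hiveserver2-host:10000/default</jdbc-url>
      <script>query.hql</script>
    </hive2>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive2 action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```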

New Contributor
Posts: 5
Registered: ‎03-03-2017

Re: Hive-on-Spark tasks never finish

I am also a developer working on something similar. Based on your post, below are my observations:

1) Check whether your data partitions are being spread evenly across all the executors.

2) If your data is partitioned, make sure the files are not just a few KB each; lots of tiny files hurt Spark badly.

3) Avoid the DataFrame API where you can; sometimes running the query through sqlContext is much faster than DataFrames.

4) Make sure you haven't used broadcast functions on a large table; broadcasting only works for small datasets.

5) If your SQL context or Hive query contains a join, make sure the left table is the small one and the right table is the large one.

6) In general, it is better if your data is partitioned.
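To illustrate points 2) and 5), these Hive settings merge small output files and let Hive auto-convert joins to map joins when one side is small. A sketch only; the size thresholds are example values, not recommendations, so tune them for your cluster:

```sql
-- Sketch: settings addressing small files (point 2) and joins (point 5).
set hive.execution.engine=spark;

-- Merge the small files produced by a Hive-on-Spark job into larger ones.
set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=16000000;   -- merge when avg file < ~16 MB
set hive.merge.size.per.task=256000000;       -- target ~256 MB per merged file

-- Auto-convert a join to a map join when the small side fits in memory.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=20971520;  -- ~20 MB
```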

 

Please follow the link below; it explains how to distribute your data across all the executors:

https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-me....
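For Hive on Spark specifically, you can set executor sizing per session with plain set statements. A sketch; the numbers are examples, so derive the real values from your node sizes as the linked answer describes:

```sql
-- Sketch: session-level Spark executor settings for Hive on Spark.
-- Values are illustrative; size them from your cluster's nodes.
set spark.executor.memory=4g;
set spark.executor.cores=4;
set spark.dynamicAllocation.enabled=true;  -- let YARN scale executor count
```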

 

Please let me know if you face more issues.
