09-01-2017 09:42 AM
Migrating from Hive on MR to Hive on Spark
I'm wondering why a Hive query run through an Oozie action (oozie:hive2-action:0.1) on Spark (set hive.execution.engine=spark) is much slower than the same query with Hive on MapReduce.
Note: I included set hive.execution.engine=spark; in my queries, and in the Oozie workflow I used hive2-action:0.1 as the xmlns and provided the JDBC URL. The job runs successfully and I have checked the logs, but it takes much more wall-clock time than the usual MR run.
Using Cloudera 5.9
11-15-2017 05:03 AM
I am also a developer like you. Based on your post, here are my observations:
1) It looks like your data partitions are not being distributed evenly across all the executors.
2) If your data is partitioned, make sure the individual files are not tiny (only a few KB each).
3) Avoid DataFrames where you can; in some cases running the query through sqlContext is much faster than using the DataFrame API.
4) Make sure you are not broadcasting large tables; broadcast only works well on small datasets.
5) If your sqlContext or Hive query contains a join, make sure the left table is the small one and the right table is the large one.
6) It is generally better to keep your data partitioned.
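As a rough sketch of how some of the points above could be expressed as session settings at the top of the Hive script: the property names are standard Hive and Spark settings, but every value below is an illustrative assumption, not a recommendation for your cluster, so please test them against your own data sizes.

```sql
-- Illustrative values only (assumptions); tune for your cluster and data volume.
set hive.execution.engine=spark;

-- Executor sizing. If dynamic allocation is enabled on your cluster,
-- the instance count may not apply as written.
set spark.executor.instances=10;
set spark.executor.cores=4;
set spark.executor.memory=8g;

-- Merge small output files at the end of a Hive on Spark job,
-- so later queries do not read many KB-sized files.
set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
set hive.merge.size.per.task=256000000;

-- Let Hive convert a join to a map join only when the small side
-- is below this size in bytes, so large tables are not broadcast.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=20971520;
```

Comparing the Spark stage timings in the job logs before and after changing one setting at a time should show which of these, if any, matters for your query.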
Please follow the link below, which explains how to distribute your data across all the executors.
Please let me know if you face any more issues.