06-15-2017 07:31 AM
Thank you for the feedback.
1. Increasing spark.sql.shuffle.partitions led to this error: "Total size of serialized results of 153680 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)".
2. Using CLUSTER BY in the SELECT reduced data shuffling from 250 GB to 1 GB, and execution time dropped from 13 min to 5 min, so that is a good gain. However, I was expecting to be able to persist this bucketing to keep shuffling to a minimum, but it seems that is not possible: Hive and Spark are not really compatible on this topic.
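For anyone finding this thread later, here is a minimal sketch of the CLUSTER BY approach described above. The table and column names (db.orders, db.customers, customer_id) are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterByJoin")
  .enableHiveSupport()
  .getOrCreate()

// CLUSTER BY hash-partitions (and sorts) each side on the join key,
// so the subsequent sort-merge join can reuse that partitioning
// instead of triggering its own full shuffle of both tables.
// Table and column names below are hypothetical.
val left  = spark.sql("SELECT * FROM db.orders CLUSTER BY customer_id")
val right = spark.sql("SELECT * FROM db.customers CLUSTER BY customer_id")

val joined = left.join(right, "customer_id")
joined.write.mode("overwrite").saveAsTable("db.orders_enriched")
```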
06-15-2017 07:27 AM
Thanks for the feedback. Broadcast variables are not really applicable in my case, since both tables are large. As for filter pushdown, it did not bring any improvement; on the contrary, execution time got longer.
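Just to illustrate the broadcast point for other readers: a broadcast join only pays off when one side is small enough to ship to every executor, which is not the case here. A minimal sketch, with hypothetical table names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("BroadcastJoinSketch")
  .enableHiveSupport()
  .getOrCreate()

// Spark broadcasts the smaller side automatically when its estimated size
// is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the
// broadcast() hint forces it explicitly. With two large tables neither
// side fits, so Spark falls back to a shuffle-based sort-merge join.
// Both table names below are hypothetical.
val bigTable = spark.table("db.orders")        // large fact table
val smallDim = spark.table("db.country_codes") // small dimension table

val joined = bigTable.join(broadcast(smallDim), Seq("country_id"))
joined.show(5)
```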
06-12-2017 07:00 AM
Hello, I am loading data from a Hive table with Spark and applying several transformations, including a join between two datasets. This join causes a large volume of data shuffling (read), which makes the operation quite slow. To avoid this shuffling, I imagine the data in Hive should be split across the nodes according to the fields used in the join. But how can this be done in practice? Using Hive bucketing? Thank you in advance for your suggestions.
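For readers with the same question: one way to pre-split the data on the join key is to persist bucketed tables from Spark itself, roughly like the sketch below. The bucket count and all table/column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BucketedWrite")
  .enableHiveSupport()
  .getOrCreate()

// Persist both join sides bucketed and sorted on the join key; a later
// sort-merge join between two tables bucketed the same way (same key,
// same bucket count) can skip the shuffle entirely. Caveat: Spark's
// bucket layout differs from Hive's, so Hive will not recognize these
// as Hive-bucketed tables. All names below are hypothetical.
spark.table("db.orders_raw")
  .write
  .bucketBy(64, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("db.orders_bucketed")

spark.table("db.customers_raw")
  .write
  .bucketBy(64, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("db.customers_bucketed")
```

Note that bucketBy only works with saveAsTable, not with plain save to a path.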
Labels:
- Apache Hive
- Apache Spark