06-15-2017 07:31 AM
Thank you for the feedback.
1. Increasing spark.sql.shuffle.partitions led to this error: "Total size of serialized results of 153680 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)".
2. Using CLUSTER BY in the SELECT reduced data shuffling from 250 GB to 1 GB, and execution time dropped from 13 min to 5 min, so that is a good gain. However, I was expecting to be able to persist this bucketing to keep shuffling to a minimum, but it seems that is not possible: Hive and Spark are not really compatible on this topic.
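For anyone finding this thread later, here is a minimal sketch of the CLUSTER BY approach described above. The table and column names (db.orders, db.customers, customer_id) are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterByJoin")
  .enableHiveSupport()
  .getOrCreate()

// CLUSTER BY hash-partitions (and sorts) each side on the join key,
// so the subsequent sort-merge join can reuse that partitioning
// instead of triggering its own full shuffle of both tables.
// Table and column names below are hypothetical.
val left  = spark.sql("SELECT * FROM db.orders CLUSTER BY customer_id")
val right = spark.sql("SELECT * FROM db.customers CLUSTER BY customer_id")

val joined = left.join(right, "customer_id")
joined.write.mode("overwrite").saveAsTable("db.orders_enriched")
```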
06-15-2017 07:27 AM
Thanks for the feedback. Broadcast variables are not really applicable in my case, since both tables are large. As for filter pushdown, it did not bring any improvement; on the contrary, execution time got longer.
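Just to illustrate the broadcast point for other readers: a broadcast join only pays off when one side is small enough to ship to every executor, which is not the case here. A minimal sketch, with hypothetical table names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("BroadcastJoinSketch")
  .enableHiveSupport()
  .getOrCreate()

// Spark broadcasts the smaller side automatically when its estimated size
// is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the
// broadcast() hint forces it explicitly. With two large tables neither
// side fits, so Spark falls back to a shuffle-based sort-merge join.
// Both table names below are hypothetical.
val bigTable = spark.table("db.orders")        // large fact table
val smallDim = spark.table("db.country_codes") // small dimension table

val joined = bigTable.join(broadcast(smallDim), Seq("country_id"))
joined.show(5)
```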
06-12-2017 07:00 AM
Hello, I am loading data from a Hive table with Spark and applying several transformations, including a join between two datasets. This join causes a large volume of data shuffling (read), which makes the operation quite slow. To avoid this shuffling, I imagine the data in Hive should be split across the nodes according to the fields used in the join. But how can this be done in practice? Using Hive bucketing? Thank you in advance for your suggestions.
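For readers with the same question: one way to pre-split the data on the join key is to persist bucketed tables from Spark itself, roughly like the sketch below. The bucket count and all table/column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BucketedWrite")
  .enableHiveSupport()
  .getOrCreate()

// Persist both join sides bucketed and sorted on the join key; a later
// sort-merge join between two tables bucketed the same way (same key,
// same bucket count) can skip the shuffle entirely. Caveat: Spark's
// bucket layout differs from Hive's, so Hive will not recognize these
// as Hive-bucketed tables. All names below are hypothetical.
spark.table("db.orders_raw")
  .write
  .bucketBy(64, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("db.orders_bucketed")

spark.table("db.customers_raw")
  .write
  .bucketBy(64, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("db.customers_bucketed")
```

Note that bucketBy only works with saveAsTable, not with plain save to a path.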
Labels:
- Apache Hive
- Apache Spark