Member since
02-17-2017
Posts: 71
Kudos Received: 17
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4453 | 03-02-2017 04:19 PM
 | 32342 | 02-20-2017 10:44 PM
 | 19011 | 01-10-2017 06:51 PM
04-20-2018
08:30 PM
Could be a data skew issue. Check whether any partition holds a huge chunk of the data compared to the rest. From https://github.com/adnanalvee/spark-assist/blob/master/spark-assist.scala, copy the function "partitionStats" and pass in your data as a DataFrame. It will show the maximum, minimum, and average number of rows across your partitions, like below:

+------+-----+------------------+
|MAX   |MIN  |AVERAGE           |
+------+-----+------------------+
|135695|87694|100338.61149653122|
+------+-----+------------------+
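For reference, a minimal sketch of what a partitionStats-style helper can look like (a re-creation of the idea, not the exact code from the linked repo): count the rows in each partition, then report the max, min, and average.

import org.apache.spark.sql.DataFrame

// Sketch of a partitionStats-style helper: counts the rows in each
// partition and prints max / min / average.
def partitionStats(df: DataFrame): Unit = {
  val counts = df.rdd
    .mapPartitions(it => Iterator(it.size.toLong)) // rows per partition
    .collect()
  println(s"MAX=${counts.max} MIN=${counts.min} AVERAGE=${counts.sum.toDouble / counts.length}")
}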
01-17-2018
04:07 PM
Why are you using 10g of driver memory? What is the size of your dataset, and how many partitions does it have? I would suggest the configuration below:

--executor-memory 32G \
--num-executors 20 \
--driver-memory 4g \
--executor-cores 3 \
--conf spark.driver.maxResultSize=3g
10-03-2017
05:38 PM
@Marcos Da Silva This should solve the problem, as it did for me.

select column1, column2 from table where partition_column in
(select max(distinct partition_column) from table)
07-14-2017
03:46 PM
NOTES: Tried different numbers of executors, from 10 to 60, but performance doesn't improve. Saving in Parquet format saves 1 minute, but I don't want Parquet.
07-13-2017
10:49 PM
I am looping over a dataset of 1000 partitions and running an operation on each as I go. I'm using Spark 2.0 and doing an expensive join for each of the partitions. The join takes less than a second when I call .show(), but when I try to save the data, which is around 59 million rows, it takes 5 minutes (I tried repartitioning too). 5 minutes * 1000 partitions is 5000 minutes; I cannot wait that long. Any idea on optimizing the saveAsTextFile performance?
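For context, a rough sketch of the loop pattern described above (bigDF, lookupDF, part_id, and join_col are placeholder names, not from the original post). The timing gap is expected: .show() only computes enough rows to display, while saveAsTextFile materializes the full ~59M-row result.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("loop-join-save").getOrCreate()
import spark.implicits._

val bigDF = spark.read.parquet("/data/big")       // assumed inputs
val lookupDF = spark.read.parquet("/data/lookup")

(1 to 1000).foreach { key =>
  val joined = bigDF.filter($"part_id" === key)   // one logical partition
    .join(lookupDF, "join_col")                   // the expensive join
  // .show() here would be fast (computes a handful of rows); the save
  // below runs the whole join, which is where the minutes go.
  joined.rdd.map(_.mkString(",")).saveAsTextFile(s"/out/part_$key")
}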
Labels:
- Apache Hadoop
- Apache Spark
04-25-2017
02:56 PM
Thanks a lot!
04-11-2017
07:58 PM
2 Kudos
Does Hortonworks have plans for introducing a Big Data architect certification similar to IBM's?
Labels:
- Certification
04-04-2017
03:49 PM
1 Kudo
If you are running in cluster mode, you need to set the number of executors when submitting the JAR, or you can set it manually in the code. The former is preferable:

spark-submit \
--master yarn-cluster \
--class com.yourCompany.code \
--executor-memory 32G \
--num-executors 5 \
--driver-memory 4g \
--executor-cores 3 \
--queue parsons \
YourJARfile.jar
If running interactively, set the values on spark-shell:

spark-shell --master yarn --num-executors 6 --driver-memory 5g --executor-memory 7g
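As a sketch of the in-code alternative mentioned above (the property names are standard Spark settings; the app name and values are placeholders):

import org.apache.spark.sql.SparkSession

// Setting executor resources in code instead of on the spark-submit line.
val spark = SparkSession.builder
  .appName("YourApp")
  .config("spark.executor.instances", "5")
  .config("spark.executor.memory", "32g")
  .config("spark.executor.cores", "3")
  .getOrCreate()

Note that in YARN cluster mode these resource settings generally need to be fixed before the application launches, which is one reason the spark-submit flags are preferable.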
03-27-2017
05:11 PM
1 Kudo
@Dinesh Das Coursera has a popular one. https://www.coursera.org/specializations/scala