Member since: 02-17-2017
Posts: 71
Kudos Received: 17
Solutions: 3

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1099 | 03-02-2017 04:19 PM |
 | 20291 | 02-20-2017 10:44 PM |
 | 11157 | 01-10-2017 06:51 PM |
04-20-2018
08:30 PM
This could be a data skew issue. Check whether any partition holds a huge chunk of the data compared to the rest. https://github.com/adnanalvee/spark-assist/blob/master/spark-assist.scala From the link above, copy the function "partitionStats" and pass in your data as a DataFrame. It will show the maximum, minimum, and average number of records across your partitions, like below.

+------+-----+------------------+
|MAX   |MIN  |AVERAGE           |
+------+-----+------------------+
|135695|87694|100338.61149653122|
+------+-----+------------------+
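If you prefer not to pull in the whole helper, here is a minimal sketch of the same idea; the function body below is my own approximation, not necessarily what the linked file does:

import org.apache.spark.sql.DataFrame

// Count records per partition, then report max/min/average.
// A rough approximation of what a partition-stats helper could do.
def partitionStats(df: DataFrame): Unit = {
  val counts = df.rdd
    .mapPartitions(rows => Iterator(rows.size.toLong)) // one count per partition
    .collect()
  if (counts.nonEmpty) {
    val avg = counts.sum.toDouble / counts.length
    println(s"MAX=${counts.max} MIN=${counts.min} AVERAGE=$avg")
  }
}

// Usage: partitionStats(yourDataFrame)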
04-20-2018
08:26 PM
There is a working Scala version that I have tested and used: https://community.hortonworks.com/questions/77130/how-to-iterate-multiple-hdfs-files-in-spark-scala.html
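As a rough illustration of the approach discussed in that thread, you can list the files with the Hadoop FileSystem API and process them one by one. A minimal sketch, assuming an existing SparkSession named spark; the directory path is a placeholder:

import org.apache.hadoop.fs.{FileSystem, Path}

// List every file under an HDFS directory (placeholder path) ...
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val files = fs.listStatus(new Path("/data/input"))
  .filter(_.isFile)
  .map(_.getPath.toString)

// ... and iterate over them, reading each one as a text DataFrame.
files.foreach { file =>
  val df = spark.read.text(file)
  println(s"$file -> ${df.count()} lines")
}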
04-13-2018
02:39 PM
1 Kudo
@Simran Kaur How about using Hive queries inside Spark? It has a built-in Catalyst optimizer, so give it a shot 🙂 My suggestion: load the data in Spark, store it in Parquet format, and then do the aggregations on it.
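A minimal sketch of that suggestion, assuming an existing SparkSession named spark; the input path, Parquet path, and column names are placeholders, not from the original thread:

// Load the source data through Spark (or spark.table("your_hive_table")).
val raw = spark.read.option("header", "true").csv("/data/input.csv")

// Persist it as Parquet once...
raw.write.mode("overwrite").parquet("/data/input_parquet")

// ...then aggregate against the Parquet copy with Spark SQL (Catalyst-optimized).
val events = spark.read.parquet("/data/input_parquet")
events.createOrReplaceTempView("events")
spark.sql("SELECT some_column, COUNT(*) AS cnt FROM events GROUP BY some_column").show()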
04-11-2018
04:48 PM
Try this (HCC has buggy formatting issues; ignore that and note the import and the spark val):

import org.apache.spark.sql.SparkSession

object TestCode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SilverTailParser")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .enableHiveSupport()
      .getOrCreate()

    /*
     * Sample values below; replace them with your business logic and push the
     * result into the table. The for loop stands in for that business logic.
     */
    for (i <- 0 to 100 - 1) {
      var fstring = "fstring" + i
      var cmd = "cmd" + i
      var idpath = "idpath" + i
      import spark.implicits._ // NOTE
      val sDF = Seq((fstring, cmd, idpath)).toDF("t_als_s_path", "t_als_s_cmd", "t_als_s_pd")
      sDF.write.insertInto("l_sequence")
      println("write data ==> " + i)
    }
  }
}
04-11-2018
04:36 PM
1. Import SparkSession (the older HiveContext/SQLContext entry points are deprecated):

import org.apache.spark.sql.SparkSession

2. Check that you have hive-site.xml in the "/usr/lib/spark/conf" directory. You can also try passing it to spark-submit:

--files /usr/hdp/current/spark-client/conf/hive-site.xml
04-11-2018
02:58 PM
// Count the total number of tab-separated fields ("words") with an accumulator.
val data = sc.textFile("/sample.txt")
val word_count = sc.accumulator(0L, "Total Words")

data.foreach { line =>
  val fields = line.split("\t")
  println(fields(0)) // prints the first field on the executors, not the driver
  fields.foreach { _ => word_count += 1 }
}

println(word_count.value)
01-17-2018
04:07 PM
Why are you using 10g of driver memory? What is the size of your dataset and how many partitions does it have? I would suggest using the config below:

--executor-memory 32G \
--num-executors 20 \
--driver-memory 4g \
--executor-cores 3 \
--conf spark.driver.maxResultSize=3g
10-03-2017
05:38 PM
@Marcos Da Silva This should solve the problem, as it did for mine:

select column1, column2 from table
where partition_column in (select max(distinct partition_column) from table)
07-14-2017
03:46 PM
NOTES: I tried different numbers of executors, from 10 to 60, but performance doesn't improve. Saving in Parquet format saves 1 minute, but I don't want Parquet.
07-13-2017
10:51 PM
I was planning to avoid a broadcast; that's why I asked. Thanks!
07-13-2017
10:49 PM
I am looping over a dataset of 1000 partitions and running operations as I go. I'm using Spark 2.0 and doing an expensive join for each of the partitions. The join takes less than a second when I call .show, but when I try to save the data, which is around 59 million rows, it takes 5 minutes (I tried repartitioning too). 5 minutes * 1000 partitions is 5000 minutes; I cannot wait that long. Any ideas on optimizing the saveAsTextFile performance?
07-10-2017
05:43 AM
I have two dataframes, one with 131 rows and another with 54 million rows. I need to join the first one with the second, thereby generating 6 billion rows.
It's taking forever, even doing a broadcast hash join along with executor/memory tuning trials.
I need help with the syntax.
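For the syntax, a broadcast hash join of a small DataFrame onto a large one typically looks like the sketch below; the frame names, columns, and sample rows are placeholders for your actual data, and spark is assumed to be an existing SparkSession (as in spark-shell):

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Stand-ins for your 131-row and 54-million-row frames.
val small_df = Seq((1, "x"), (2, "y")).toDF("id", "small_col")
val large_df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "large_col")

// Broadcast the small frame so each executor joins it locally
// against its partitions of the large frame.
val joined = large_df.join(broadcast(small_df), Seq("id"))
joined.show()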
04-25-2017
02:56 PM
Thanks a lot!
04-14-2017
09:08 PM
@rahul gulati This is how I did mine:

val outer_join = df1.join(df2, df1("id") === df2("id"), "left_outer")
04-11-2017
07:58 PM
2 Kudos
Does Hortonworks have plans to introduce a Big Data Architect certification similar to IBM's?
04-04-2017
03:49 PM
1 Kudo
If you are running in cluster mode, you need to set the number of executors while submitting the JAR, or you can set it manually in the code. The former way is better:

spark-submit \
--master yarn-cluster \
--class com.yourCompany.code \
--executor-memory 32G \
--num-executors 5 \
--driver-memory 4g \
--executor-cores 3 \
--queue parsons \
YourJARfile.jar

If running locally:

spark-shell --master yarn --num-executors 6 --driver-memory 5g --executor-memory 7g
04-04-2017
03:05 PM
I think you are comparing apples and oranges. An RDD is a container of instructions for materializing big (arrays of) distributed data and for splitting it into partitions so that Spark, using its executors, can hold some of them; an executor, on the other hand, is merely a container on YARN. When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container.
03-29-2017
02:32 PM
Is there a way to run a Pig UDF in parallel across the cluster? So far in YARN I'm seeing only one container being used. I'm running Pig on Tez with a Java UDF doing some heavy lifting. The tuple I'm passing to the UDF is a grouped bag.
03-27-2017
05:11 PM
1 Kudo
@Dinesh Das Coursera has a popular one. https://www.coursera.org/specializations/scala
03-13-2017
05:35 PM
This is just a suggestion, but have you tried running Hive on Tez? It's a much faster and more efficient execution engine. Try this before you execute your code:

set hive.execution.engine=tez;
03-10-2017
05:41 PM
I don't know of any other way of comparing two dataframes other than joining them first. Here is your action item:

1. Join tab1 and tab2 using a broadcast hash join on the key column "location".
2. Filter the result by pole = N.

Here is my code based on your sample data. Paste it into a spark shell and see.
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
case class tab1x(id:Int,phno:Long,location:String,address:String,name:String,temp:String)
case class tab2x(id:Int,phno:Long,location:String,pole:String)
val tab1 = Seq(
  tab1x(1, 656, "IND", "Street no1", "X", "30F"),
  tab1x(2, 657, "USA", "RHGD no 23", "Y", "23F"),
  tab1x(3, 658, "RUS", "YWKY 58", "Z", "20F")
).toDF
val tab2 = Seq(
  tab2x(1, 656, "IND", "S"),
  tab2x(2, 657, "USA", "N"),
  tab2x(3, 658, "RUS", "N")
).toDF
val joined_df = tab1.join(broadcast(tab2), "location")
val z = joined_df.filter($"pole" === "N")
z.show()
03-09-2017
05:26 PM
Hi @Evan Willett The official Spark documentation says this: "The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type." Link: http://spark.apache.org/docs/latest/tuning.html#data-serialization
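To actually switch to Kryo, the configuration usually looks roughly like the sketch below; MyCaseClass is a hypothetical application class used only to illustrate the registration step:

import org.apache.spark.SparkConf

case class MyCaseClass(id: Int, name: String) // hypothetical class to register

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes avoids storing full class names with each serialized object.
  .registerKryoClasses(Array(classOf[MyCaseClass]))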
03-09-2017
04:28 PM
Wow. ORC took me from 3 TB (PigStorage) down to 60 GB. This is insane. I didn't notice any performance improvement, though, but I am happy with the savings in storage. Thanks! 🙂
03-08-2017
06:11 PM
What is the error you are getting when you try to use it, then? This is what I used in Spark 1.6.1:

import org.apache.spark.sql.functions.broadcast
val joined_df = df1.join(broadcast(df2), "key")
03-08-2017
05:42 PM
2 Kudos
Hi @X Long The official documentation does cover it: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables Here is one tutorial using Spark 2: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
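For a quick feel of the API, here is a minimal broadcast-variable sketch; the lookup map and sample values are made up for illustration, and sc is assumed to be an existing SparkContext (as in spark-shell):

// Ship a small read-only lookup table to every executor once,
// instead of serializing it with each task.
val countryNames = Map("IND" -> "India", "USA" -> "United States") // illustrative data
val bc = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IND", "USA", "IND"))
val named = codes.map(code => bc.value.getOrElse(code, "unknown"))
named.collect().foreach(println)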
03-08-2017
04:00 PM
1 Kudo
Hi @PJ I have faced these vertex failure issues while using Tez. They usually happen when you have a huge workload, for example doing expensive joins one after another and running out of memory.

One quick fix is to save intermediate data to disk, then load it again and continue processing. For example, if you have 4 joins that you can't optimize any further, save the dataset after 2 joins and then load it for further wrangling (see the sketch below). It really depends on the size of the data and what you are trying to do, so test it out and play with this method; it works perfectly for me.

The second way is to increase the memory, but on my side the system admin has locked me out of doing that. You can try it out. Here is more info on that: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html
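The same break-the-pipeline-in-two idea, sketched here in Spark/Scala rather than Hive on Tez; the DataFrames and paths are tiny placeholders, and spark is assumed to be an existing SparkSession. The point is only that the first half is materialized to disk before the second half runs:

import spark.implicits._

// Tiny placeholder inputs; in a real job these are the large join inputs.
val a = Seq((1, "a")).toDF("id", "colA")
val b = Seq((1, "b")).toDF("id", "colB")
val c = Seq((1, "c")).toDF("id", "colC")
val d = Seq((1, "d")).toDF("id", "colD")
val e = Seq((1, "e")).toDF("id", "colE")

// First half of the pipeline: two joins, materialized to disk.
a.join(b, "id").join(c, "id").write.mode("overwrite").parquet("/tmp/stage1")

// Second half: reload the intermediate result and continue, so the engine
// never has to evaluate the whole four-join plan in one go.
val stage1 = spark.read.parquet("/tmp/stage1")
stage1.join(d, "id").join(e, "id").write.mode("overwrite").parquet("/tmp/final")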
03-07-2017
06:25 PM
Oh, that worked! Thanks a lot!
03-07-2017
04:51 PM
I am trying to run some Spark Streaming examples I found online, but even before I start, I'm getting this error:

Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)

I tried the following, but it doesn't help:

conf.set("spark.driver.allowMultipleContexts", "true");

Sample code I was trying to run in HDP 2.5:

import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))
03-07-2017
04:27 PM
Hi @soumyabrata kole Here is one from my own blog; I wrote it a while back. Check this example out.