Member since: 02-17-2017

- 71 Posts
- 17 Kudos Received
- 3 Solutions

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 5616 | 03-02-2017 04:19 PM |
|  | 34024 | 02-20-2017 10:44 PM |
|  | 20669 | 01-10-2017 06:51 PM |

04-20-2018 08:30 PM

Could be a data skew issue. Check whether any partition holds a huge chunk of the data compared to the rest.

https://github.com/adnanalvee/spark-assist/blob/master/spark-assist.scala

From the link above, copy the function "partitionStats" and pass in your data as a DataFrame. It will show the maximum, minimum, and average amount of data across your partitions, like below:

    +------+-----+------------------+
    |MAX   |MIN  |AVERAGE           |
    +------+-----+------------------+
    |135695|87694|100338.61149653122|
    +------+-----+------------------+
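
If you don't want to pull in the whole file, a rough sketch of what such a helper does (my paraphrase, not the actual spark-assist code; `df` stands for your DataFrame):

    import org.apache.spark.sql.DataFrame

    // Count the rows in each partition, then report max / min / average.
    def partitionStats(df: DataFrame): Unit = {
      val counts = df.rdd
        .mapPartitions(rows => Iterator(rows.size.toLong)) // one count per partition
        .collect()
      val avg = counts.sum.toDouble / counts.length
      println(s"MAX=${counts.max} MIN=${counts.min} AVERAGE=$avg")
    }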

01-17-2018 04:07 PM

Why are you using 10g of driver memory? What is the size of your dataset, and how many partitions does it have?

I would suggest the config below:

    --executor-memory 32G \
    --num-executors 20 \
    --driver-memory 4g \
    --executor-cores 3 \
    --conf spark.driver.maxResultSize=3g
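
If you're not sure about the partition count or row count, a quick check from the shell (assuming `df` is the DataFrame in question) is:

    // Quick size/parallelism check (assumes `df` is your DataFrame)
    val numPartitions = df.rdd.getNumPartitions
    val rowCount = df.count()
    println(s"partitions=$numPartitions rows=$rowCount")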

10-03-2017 05:38 PM

@Marcos Da Silva This should solve the problem, as it did for mine:

    select column1, column2 from table
    where partition_column in (select max(distinct partition_column) from table)
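
As a concrete illustration, using the Spark shell's built-in `spark` session against a hypothetical table named `events` partitioned by `dt` (all names here are placeholders), the same pattern reads only the latest partition:

    // Hypothetical example: `events`, `dt`, `col1`, `col2` are placeholder names.
    val latest = spark.sql(
      """SELECT col1, col2
        |FROM events
        |WHERE dt IN (SELECT MAX(dt) FROM events)""".stripMargin)
    latest.show()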

07-14-2017 03:46 PM

NOTES: Tried different numbers of executors, from 10 to 60, but performance doesn't improve. Saving in Parquet format saves 1 minute, but I don't want Parquet.

07-13-2017 10:49 PM

I am looping over a dataset of 1000 partitions and running an operation as I go. I'm using Spark 2.0 and doing an expensive join for each of the partitions. The join takes less than a second when I call .show, but when I try to save the data (around 59 million), it takes 5 minutes (tried repartitioning too). The loop looks roughly like the sketch below. 5 minutes * 1000 partitions is 5000 minutes; I cannot wait that long. Any ideas on optimizing the saveAsTextFile performance?
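
A minimal sketch of that pattern (all names and paths are placeholders, not my actual code):

    // Placeholder names throughout: bigDF, lookupDF, part_key, id, /output/...
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("per-partition-join").getOrCreate()
    import spark.implicits._

    val bigDF    = spark.read.parquet("/data/big")      // ~1000 logical partitions
    val lookupDF = spark.read.parquet("/data/lookup")
    val partKeys = bigDF.select("part_key").distinct().as[String].collect()

    partKeys.foreach { key =>
      val joined = bigDF.filter($"part_key" === key).join(lookupDF, "id")
      joined.show(5)                              // fast: only a few rows are evaluated
      joined.rdd
        .map(_.mkString("\t"))
        .saveAsTextFile(s"/output/$key")          // slow: full join is computed and written
    }

(.show only evaluates a handful of rows, while saveAsTextFile forces the whole join to run and be written out, which is why the timings differ so much.)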

Labels:
- Apache Hadoop
- Apache Spark

04-25-2017 02:56 PM

Thanks a lot!

04-11-2017 07:58 PM (2 Kudos)

Does Hortonworks have plans to introduce a Big Data architect certification similar to IBM's?

Labels:
- Certification

04-04-2017 03:49 PM (1 Kudo)

If you are running in cluster mode, you need to set the number of executors while submitting the JAR, or you can set it manually in the code. The former is the better approach:

    spark-submit \
    --master yarn-cluster \
    --class com.yourCompany.code \
    --executor-memory 32G \
    --num-executors 5 \
    --driver-memory 4g \
    --executor-cores 3 \
    --queue parsons \
    YourJARfile.jar

If running locally:

    spark-shell --master yarn --num-executors 6 --driver-memory 5g --executor-memory 7g
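
For completeness, the in-code alternative mentioned above would look roughly like this sketch (app name is a placeholder; note that executor settings need to be set before the session starts, and driver memory in particular cannot be changed from inside an already-running driver):

    // Sketch: same settings applied programmatically instead of via spark-submit flags.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("MyApp")                           // placeholder app name
      .config("spark.executor.instances", "5")
      .config("spark.executor.memory", "32g")
      .config("spark.executor.cores", "3")
      .getOrCreate()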

03-27-2017 05:11 PM (1 Kudo)

@Dinesh Das Coursera has a popular one.
https://www.coursera.org/specializations/scala