Member since: 02-17-2017
- 71 Posts
- 17 Kudos Received
- 3 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5621 | 03-02-2017 04:19 PM |
| | 34026 | 02-20-2017 10:44 PM |
| | 20670 | 01-10-2017 06:51 PM |
04-17-2020 03:02 PM

@testingsauce I am also facing this issue. I saved a DataFrame to Hive using saveAsTable, but when I try to fetch the results with hiveContext.sql(query), it doesn't return anything. I'm badly stuck. Please help.
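For context, a minimal sketch of the failing pattern described above (Spark 1.6-era API; the table name and query are illustrative, not from the thread):

```scala
// Hypothetical repro: save a DataFrame as a Hive table, then query it back.
df.write.saveAsTable("my_table")                        // illustrative table name

val result = hiveContext.sql("SELECT * FROM my_table")  // reportedly comes back empty
result.show()
```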
04-20-2018 08:30 PM

Could be a data-skew issue. Check whether any partition holds a huge chunk of the data compared to the rest. https://github.com/adnanalvee/spark-assist/blob/master/spark-assist.scala

From the link above, copy the function "partitionStats" and pass in your data as a DataFrame. It will show the maximum, minimum, and average amount of data across your partitions, like below:

```
+------+-----+------------------+
|MAX   |MIN  |AVERAGE           |
+------+-----+------------------+
|135695|87694|100338.61149653122|
+------+-----+------------------+
```
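If you'd rather not pull in the repo, a minimal sketch of such a helper might look like this (illustrative only, not the actual partitionStats implementation from the link):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of a partition-stats helper: count the rows in each
// partition, then report the max, min, and average row counts.
def partitionStats(df: DataFrame): Unit = {
  val counts = df.rdd
    .mapPartitions(it => Iterator(it.size.toLong)) // rows in this partition
    .collect()
  val avg = counts.sum.toDouble / counts.length
  println(s"MAX=${counts.max} MIN=${counts.min} AVERAGE=$avg")
}
```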
07-14-2017 04:23 PM

You can add compression when you write your data. This will speed up the save because the data will be smaller. Also, increase the number of partitions.
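A minimal sketch of both suggestions (the codec, partition count, and output path are illustrative assumptions, not from the original post):

```scala
// Hypothetical example: repartition for more write parallelism and
// compress the output to shrink the data on disk.
df.repartition(200)                  // illustrative partition count
  .write
  .option("compression", "snappy")   // codec choice is an assumption
  .parquet("/tmp/output")            // illustrative output path
```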
05-10-2017 01:24 PM

Thanks! I would be interested to learn more when you are ready to announce it.
01-09-2018 06:06 PM

IBM offers free courses in Scala and other languages. There are tests at the end of each course; once you pass, you can earn badges and showcase them. https://cognitiveclass.ai/
03-07-2017 06:25 PM

oh! that worked. Thanks a lot!
03-04-2017 12:42 AM (6 Kudos)

These might help:
https://community.hortonworks.com/questions/39017/can-someone-point-me-to-a-good-tutorial-on-spark-s.html
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
03-02-2017 04:06 PM

A quick hack would be to use Scala's "substring": http://alvinalexander.com/scala/scala-string-examples-collection-cheat-sheet

What you can do is write a UDF, run the "new_time" column through it, and grab up to the part of the timestamp you want. For example, if you want just "yyyy-MM-dd HH:mm" as seen when you run df.show, your substring call will be new_time.substring(0, 16), which will yield "2015-12-06 12:40".

```scala
import org.apache.spark.sql.functions.udf

// Truncate the timestamp string to "yyyy-MM-dd HH:mm" (first 16 characters).
def getDateTimeSplit = udf((new_time: String) => new_time.substring(0, 16))
```
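A hedged usage sketch (df and the new_time column are from the thread; the output column name is illustrative):

```scala
// Apply the UDF to the "new_time" column and inspect the result.
val trimmed = df.withColumn("new_time_short", getDateTimeSplit(df("new_time")))
trimmed.show()
```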
03-09-2017 04:36 PM

@Adnan Alvee that is impressive indeed; ORC has additional benefits you will see on the Hive side. Glad you found it of use.
04-28-2019 02:56 PM (1 Kudo)

We can use a rank approach, which is faster than max; max scans the table twice. Here, the partition column is load_date:

```sql
select
  ld_dt.txnno,
  ld_dt.txndate,
  ld_dt.custno,
  ld_dt.amount,
  ld_dt.productno,
  ld_dt.spendby,
  ld_dt.load_date
from (
  select *, dense_rank() over (order by load_date desc) as dt_rnk
  from datastore_s2.transactions
) ld_dt
where ld_dt.dt_rnk = 1;
```