Member since 05-02-2017

- Posts: 360
- Kudos Received: 65
- Solutions: 22

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 15719 | 02-20-2018 12:33 PM |
|  | 2050 | 02-19-2018 05:12 AM |
|  | 2382 | 12-28-2017 06:13 AM |
|  | 7925 | 09-28-2017 09:25 AM |
|  | 13515 | 09-25-2017 11:19 AM |
			
    
	
		
		
01-30-2018 05:13 AM

Hi, I have a set of questions about Spark that I'm trying to understand, listed below:

- What is the best compression codec that can be used in Spark? In Hadoop we should not use gz compression unless it is cold data, since input splits are of little use with gzip. If we were to choose any of the other codecs (lzo/bzip2/snappy, etc.), based on what parameters should we choose between them?
- Does Spark make use of input splits if the files are compressed?
- How does Spark handle compression compared with MR?
- Does compression increase the amount of data being shuffled?

Thanks in advance!!
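For reference, a minimal sketch (assuming a SparkSession as in spark-shell; the paths and codec values are illustrative assumptions, not recommendations) of the two places a codec choice usually appears in Spark: the internal/shuffle data and the files written out.

```scala
import org.apache.spark.sql.SparkSession

// spark.io.compression.codec controls compression of Spark's internal data
// (shuffle outputs, spills, broadcast blocks), not the input files being read.
val spark = SparkSession.builder()
  .appName("compression-sketch")
  .config("spark.io.compression.codec", "lz4")
  .getOrCreate()

// Output-file compression is chosen per writer; snappy is a common choice for
// columnar formats such as Parquet/ORC. The paths here are placeholders.
val df = spark.read.json("/tmp/input")
df.write.option("compression", "snappy").parquet("/tmp/output")
```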
						
					
Labels: Apache Hadoop, Apache Spark
    
	
		
		
01-05-2018 05:26 AM

This would also work:

```scala
import java.io.File

def getListOfFiles(dir: File): List[File] =
  dir.listFiles.filter(_.isFile).toList

val files = getListOfFiles(new File("/tmp"))
```
						
					
01-05-2018 05:23 AM

@Chaitanya D This is possible with a combination of Unix commands and Spark.

```
hadoop fs -ls /filedirectory/*txt_processed
```

The above command will return the files you need; then pass the result to Spark and process the files as needed. Alternatively, you can capture the listing from within Spark using the command below:

```scala
import sys.process._

val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://filedirectory/*txt_processed").!!
```

Hope it helps!
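If the goal is simply to process the matched files, a minimal sketch (assuming plain text files and a spark-shell session; the path is a placeholder) is to hand the glob pattern straight to Spark's reader rather than shelling out first:

```scala
// Spark's built-in readers accept glob patterns, so the shell listing step can be
// skipped when the files only need to be read. Path and format are assumptions.
val processed = spark.read.textFile("hdfs:///filedirectory/*txt_processed")
processed.show(5, truncate = false)
```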
						
					
01-03-2018 12:05 PM

@Alexandros Biratsis I believe you are not using INSERT OVERWRITE when inserting the incremental records into the target; assuming that, it is odd that the data is being overridden.

For the union part: if you want to avoid the union, you may have to perform a left join between the incremental data and the target to apply the transformations (assuming you are performing SCD type 1), as sketched below. If you just want to append the data, insert the incremental data into the target through multiple queries, one by one. But if you insert the data multiple times, the number of jobs will increase, which is more or less equal to performing the union.

Sorry for the late reply.
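A minimal sketch of the join-based merge described above, assuming a SparkSession with Hive support (as in spark-shell) and hypothetical tables keyed on "id" with a single payload column "value"; a full outer join is used here so that brand-new keys are picked up in the same step:

```scala
import org.apache.spark.sql.functions.{coalesce, col}

// Hypothetical tables: db.target (existing records) and db.incremental (changes).
// Incremental values win over target values, i.e. SCD type 1 semantics.
val target      = spark.table("db.target").alias("t")
val incremental = spark.table("db.incremental").alias("i")

val merged = target
  .join(incremental, col("t.id") === col("i.id"), "full_outer")
  .select(
    coalesce(col("i.id"), col("t.id")).as("id"),
    coalesce(col("i.value"), col("t.value")).as("value")
  )

// Write into a work table rather than straight back into a table used in the join.
merged.write.mode("overwrite").insertInto("db.target_work")
```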
						
					
12-29-2017 06:18 AM (1 Kudo)

Hi @Alexandros Biratsis, I could see the workaround in the link you mentioned. Anyway, let me add a few points on top of it (see the sketch after this list):

- Create a work table.
- Perform a union between the target and the incremental data and insert the result into the newly created work table.
- Assuming you are using only external tables: drop the work table, then re-create the target table pointing to the work table's location, so that you avoid re-loading the target from the work table.

Hope it helps!
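A minimal sketch of those steps through spark.sql with Hive support enabled (database, table, and location names are hypothetical; ALTER TABLE ... SET LOCATION is used here as an equivalent way of repointing the external target at the work table's data, rather than dropping and re-creating it):

```scala
// 1. Load the work table with the union of the current target and the incremental data.
spark.sql("""
  INSERT OVERWRITE TABLE db.target_work
  SELECT * FROM db.target
  UNION ALL
  SELECT * FROM db.incremental
""")

// 2. For external tables, repoint the target at the work table's data instead of
//    copying it back. The location below is a placeholder.
spark.sql("ALTER TABLE db.target SET LOCATION 'hdfs:///data/db/target_work'")
```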
						
					
12-28-2017 06:13 AM

@Sebastien F The background execution of Tez and MR has many similarities; the differences lie in where the data is placed to transform it. Tez uses a DAG to process the data, whereas MR does not. This link would answer your question. Hope it helps!!
						
					
12-27-2017 09:44 AM

@Ashnee Sharma How many executors are in place? Also, are you firing the query in spark-sql directly? What is the size of the table you are fetching? Try increasing the number of partitions manually instead of letting Spark decide it; the number of partitions can be chosen based on the table size that has to be split across the executors (see the sketch below).

Also set spark.driver.maxResultSize, in any of the following ways:

- via SparkConf: `conf.set("spark.driver.maxResultSize", "3g")`
- via spark-defaults.conf: `spark.driver.maxResultSize 3g`
- when calling spark-submit: `--conf spark.driver.maxResultSize=3g`

I believe the above property should work. I can see that you have already increased the driver size; if so, ignore the driver-size change.
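A minimal sketch of increasing the partition count manually (the table name and partition count are hypothetical; a suitable number depends on the table size and the executor cores available):

```scala
// More partitions after shuffles; the default is often too low for large tables.
spark.conf.set("spark.sql.shuffle.partitions", "400")

// Explicitly repartition the input so the work is split across the executors.
val df = spark.table("db.big_table").repartition(400)
df.count()
```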
						
					
12-20-2017 09:44 AM

@Sandeep SIngh No, Hive doesn't maintain any lock history.

```
show locks;
```

The above command shows the user who currently holds a lock on a table in Hive. However, once the lock is released you will no longer be able to see which user had acquired it; no history of locks is recorded, as it is not needed for any computation. Hope it helps!
						
					
12-20-2017 08:46 AM

Hi @Ashnee Sharma, based on the logs I can see that when you run a count query it triggers a MapReduce job, which takes time. Could you run this command and verify that its value is true?

```
set hive.stats.fetch.column.stats;
```

When this property is enabled, the statistics are fetched from the information available in the metastore, so a count query should not trigger any jobs. It should work regardless of whether you are using mr or tez as your execution engine. Hope it helps!!
						
					
12-20-2017 06:09 AM

Got it! Thanks @James Dinkel
						
					