Member since: 01-03-2017
Posts: 181
Kudos Received: 44
Solutions: 24

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2265 | 12-02-2018 11:49 PM |
| | 3122 | 04-13-2018 06:41 AM |
| | 2678 | 04-06-2018 01:52 AM |
| | 2965 | 01-07-2018 09:04 PM |
| | 6505 | 12-20-2017 10:58 PM |

07-31-2017 04:49 AM

Hi @mravipati,

Can you please check whether Dynamic Resource Allocation is enabled (spark.dynamicAllocation.enabled = true)? When it is enabled, Spark will request as many executors as the available cluster resources allow, which may be causing the problem.

This behaviour can be capped by setting spark.dynamicAllocation.maxExecutors = <max limit>. Please note that the driver also consumes a container, so you need to manage the memory allocation for both the executors and the driver. For instance, if the YARN minimum container size is 2 GB and each executor requests about 2 GB, around 4 GB per executor can end up being allocated once spark.yarn.executor.memoryOverhead is also accounted for. The following KB article explains in more detail why Spark takes more resources.
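A minimal sketch of how the cap might be applied, assuming Spark 2.x on YARN; the property names are the ones mentioned above, while the numeric values are placeholders rather than recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical values -- tune them to your cluster; they are not recommendations.
val spark = SparkSession.builder()
  .appName("capped-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")        // required by dynamic allocation on YARN
  .config("spark.dynamicAllocation.maxExecutors", "10")   // upper bound on executors Spark may request
  .config("spark.executor.memory", "2g")
  .config("spark.yarn.executor.memoryOverhead", "512")    // off-heap overhead per executor, in MB
  .getOrCreate()

spark.range(1000).count()   // any job; the executor count now stays within the cap
spark.stop()
```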

07-25-2017 04:47 AM

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

Reducing the number of tasks may impact the overall performance (and note that the ALTER statement itself also runs a MapReduce job and consumes resources). After your insert, you can run an ALTER TABLE ... CONCATENATE statement to merge the small ORC files. More on this can be found in the ORC documentation: https://orc.apache.org/docs/hive-ddl.html

07-04-2017 03:28 AM

Hi @tariq abughofa, could you please check whether SELinux is disabled on the driver? It looks like it is preventing the new dynamic ports from accepting connections.

06-15-2017 12:31 AM (1 Kudo)

Hi @Abhijeet Rajput,

In response to handling the huge SQL: Spark uses lazy evaluation, which means you can split your code into multiple blocks and write it as multiple DataFrames. Everything is evaluated only at the end, and Spark uses the optimal execution plan it can build for the whole operation. Example:

var subquery1 = sql("select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2")
subquery1.registerTempTable("res1")
var subquery2 = sql("select c1, c2, c3 from res1 join tbl3 on condition4 and condition5")
... and so on.

On the other question, there is no difference between using the DataFrame-based API and SQL, as the same execution plan is generated for both; you can verify this from the DAG in the Spark UI while the job is executing.
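A runnable sketch of the same idea, assuming Spark 2.x (where createOrReplaceTempView plays the role of registerTempTable); the table and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder tables and columns, Spark 2.x.
val spark = SparkSession.builder().appName("split-big-sql").enableHiveSupport().getOrCreate()

val subquery1 = spark.sql(
  "SELECT a.c1, a.c2, b.c3 FROM tbl1 a JOIN tbl2 b ON a.c1 = b.c1")
subquery1.createOrReplaceTempView("res1")

val subquery2 = spark.sql(
  "SELECT r.c1, r.c2, t.c3 FROM res1 r JOIN tbl3 t ON r.c2 = t.c2")
subquery2.createOrReplaceTempView("res2")

// Nothing has executed yet: the plan across all the blocks is optimized as a
// whole and only runs when an action is called.
spark.table("res2").count()
```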

06-14-2017 07:08 AM

Hi @rahul gulati,

Apparently the number of partitions of your DataFrame / RDD is creating the issue. This can be controlled by adjusting the spark.default.parallelism parameter on the Spark context, or by using .repartition(<desired number>); a short sketch follows at the end of this reply.

When you run in spark-shell, please check the mode and the number of cores allocated for the execution, and adjust the value to whatever works for shell mode. Alternatively, you can observe the same from the Spark UI and come to a conclusion on the number of partitions.

From the Spark documentation on spark.default.parallelism:

- For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD.
- For operations like parallelize with no parent RDDs, it depends on the cluster manager:
  - Local mode: number of cores on the local machine
  - Others: total number of cores on all executor nodes or 2, whichever is larger
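A minimal sketch, assuming Spark 2.x; the DataFrame and the partition counts below are placeholders used only to show where the knobs sit:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values; inspect first, then pick a partition count that matches
// the cores actually available to the shell or application.
val spark = SparkSession.builder()
  .appName("partition-tuning")
  .config("spark.default.parallelism", "8")   // default for RDD shuffle operations
  .getOrCreate()

val df = spark.range(0L, 1000000L)            // stand-in for the real DataFrame
println(df.rdd.getNumPartitions)              // current number of partitions

val tuned = df.repartition(8)                 // explicitly set the partition count
println(tuned.rdd.getNumPartitions)
```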

06-14-2017 04:33 AM (1 Kudo)

Hi @Jean-Sebastien Gourdet,

There are a couple of options available to reduce the shuffle (though not to eliminate it in some cases):

- Use broadcast variables. By using a broadcast variable you can eliminate the shuffle of the big table; however, you must broadcast the small table across all the executors, which may not be feasible in all cases, for example when both tables are big. (A sketch follows at the end of this reply.)
- The other alternative (and a good practice to implement) is predicate pushdown for Hive data. This filters, at the Hive level, only the data required for the computation and extracts a small amount of data. It may not avoid the shuffle completely, but it certainly speeds it up, since the amount of data pulled into memory is reduced significantly (in some cases). Use sqlContext.setConf("spark.sql.orc.filterPushdown", "true") if you are using ORC files, or spark.sql.parquet.filterPushdown in the case of Parquet files.
- The last, and not recommended, approach is to collapse to a single partition by applying .repartition(1) to the DataFrame. You avoid the shuffle, but you lose all the parallelism, as a single executor performs the whole operation.

On the other note, the shuffle will be quick if the data is evenly distributed on the key being used to join the tables.
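A minimal sketch of the first two options, assuming Spark 2.x with Hive support; the table names, filter, and join key are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Placeholder table names, filter, and join key.
val spark = SparkSession.builder()
  .appName("shuffle-reduction")
  .enableHiveSupport()
  .getOrCreate()

// Option 2: push the filter down to the ORC reader so less data reaches Spark.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
val big = spark.table("big_hive_table").filter("event_date = '2017-06-14'")

// Option 1: broadcast the small side so the big table is not shuffled for the join.
val small  = spark.table("small_dim_table")
val joined = big.join(broadcast(small), Seq("join_key"))

joined.count()
```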

05-25-2017 09:23 AM

Hi @Sridhar Babu,

Apparently there is a library incompatibility with the 2.11:1.3.0 and 2.11:1.4.0 builds. Please use the version com.databricks:spark-csv_2.10:1.4.0 instead.
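Once the matching package is on the classpath (for example via the --packages option of spark-shell or spark-submit), usage would look roughly like this; it assumes the sqlContext provided by a Spark 1.x shell, and the path and options are placeholders:

```scala
// Hypothetical usage of the spark-csv data source on Spark 1.x;
// the path and options are placeholders.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // infer column types instead of reading all strings
  .load("/tmp/example.csv")

df.printSchema()
```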

05-22-2017 03:27 AM (1 Kudo)

Hi @Mehdi Hosseinzadeh,

From the requirements perspective, the following is a simple approach that is in line with the technologies you proposed:

1. Read the data from HTTP using a Spark Streaming job and write it into Kafka.
2. Read and process the data from the Kafka topic as batches/streams and save it into HDFS as Parquet / Avro / ORC, etc. (sketched below).
3. Build external tables in Hive on top of the data processed in step 2, so that the data is available as soon as it lands in HDFS.

Accessing the data from external tables has been discussed here.
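A minimal sketch of step 2, assuming Spark 2.x Structured Streaming with the spark-sql-kafka-0-10 package available; the broker, topic, and HDFS paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder broker, topic, and paths.
val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "incoming_events")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw_event")

// Continuously write the records to HDFS as Parquet; a Hive external table
// can then be defined over /data/events (step 3).
val query = events.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .start()

query.awaitTermination()
```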

05-21-2017 10:20 PM

Hi @Sushant,

Glad that worked for you; in that case, can you please accept the answer?

In response to controlling the resource monitoring: not that I am aware of, but I believe you may not need to prevent one user from seeing the applications submitted by another user, as this does not expose any data (unless it is explicitly printed to STDOUT). On the other hand, you can manage access to the web UI with SPNEGO (authorization).

05-16-2017 10:11 PM

@Sudeep Mishra

Please pass the user keytab along with the spark-submit command:

--files /<key_tab_location>/<user_keytab.keytab>

This is because the executors are not authenticated to read data from the HBase region servers (or any other components). By passing the keytab, all the executors will have it and will be able to communicate.