Member since 08-11-2014

481 Posts
92 Kudos Received
72 Solutions

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 3454 | 01-26-2018 04:02 AM |
| | 7090 | 12-22-2017 09:18 AM |
| | 3538 | 12-05-2017 06:13 AM |
| | 3858 | 10-16-2017 07:55 AM |
| | 11231 | 10-04-2017 08:08 PM |
			
    
	
		
		
08-19-2015 12:45 AM

If you believe it's a problem and you have a support contract, you would need to contact Cloudera Support. That said, I have successfully added Spark Gateway nodes to a live cluster without issues, so I suspect something else is at work here.
						
					

08-18-2015 07:30 AM

It sounds like you want one process, not two, if the two phases are so tightly tied together. Also consider using a message queue like Kafka with Spark Streaming to process the output of one separate job in another in near-real-time. I would not over-complicate it.

Tachyon is also an option, but as far as I know it's not necessarily finished or completely integrated with Spark. I don't know if it will be.
						
					

08-18-2015 06:53 AM
1 Kudo

An RDD is bound to an application, so it can't be shared across apps. You simply persist the data (e.g. on HDFS) and read it from the other app as an RDD.

I know people think that is slow, or slower than sharing an RDD somehow, but it isn't if you think about what's necessary to maintain fault tolerance across apps. You'd still be persisting something somewhere besides memory. And HDFS caching can make a lot of the reading from HDFS an in-memory operation anyway.
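
A rough sketch of that handoff, written against the PySpark-style calls `saveAsTextFile` and `textFile`. The function names, the path, and the comma-joined record format are made up for illustration; the functions only assume they are handed an RDD or a SparkContext, so nothing here is tied to a particular cluster setup.

```python
# Hypothetical handoff location; any storage both applications can reach would do.
HANDOFF_PATH = "hdfs:///tmp/handoff/stage1-output"


def publish_stage_output(rdd, path=HANDOFF_PATH):
    """Run in the first application: write the RDD's records out as text."""
    # Turn each record (assumed here to be a tuple of fields) into one line.
    lines = rdd.map(lambda record: ",".join(str(field) for field in record))
    lines.saveAsTextFile(path)
    return path


def load_stage_output(sc, path=HANDOFF_PATH):
    """Run in the second application: rebuild an RDD from the persisted files."""
    # Fields come back as strings; parse them further as needed.
    return sc.textFile(path).map(lambda line: line.split(","))
```

The second app never sees the first app's RDD object, only the files it left behind, which is the point: the persisted copy is what gives you something to recover from.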
						
					

08-06-2015 02:36 AM

							 I don't think it has to do with functional programming per se, but yes, it's because the function/code being executed has to be sent from the driver to the executors, and so the function object itself must be serializable. It has no relation to security. 
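
To make the serialization point concrete, here is a small analogy using Python's standard `pickle` module. This is not Spark's actual mechanism, just an illustration of the same constraint: a function object that captures only plain data can be turned into bytes and shipped, while one that captures a live resource (a lock, in this made-up example) cannot. The names `can_ship` and `locked_add` are hypothetical.

```python
import functools
import operator
import pickle
import threading


def can_ship(func):
    """Return True if the function object survives a serialization round trip."""
    try:
        restored = pickle.loads(pickle.dumps(func))
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return callable(restored)


def locked_add(lock, offset, x):
    """Add an offset while holding a lock (stands in for any live resource)."""
    with lock:
        return x + offset


# Captures only an int: fine to send elsewhere.
add_three = functools.partial(operator.add, 3)

# Captures a lock, which has no meaningful serialized form.
guarded_add = functools.partial(locked_add, threading.Lock(), 3)
```

Here `can_ship(add_three)` succeeds and `can_ship(guarded_add)` does not, which mirrors the usual "task not serializable" situation: the problem is what the function drags along with it, not the function's logic.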
						
					

08-05-2015 11:05 AM

Calling persist() on an RDD only marks it for persistence. The data in the RDD is persisted later, when something causes it to be computed for the first time. It is not evaluated immediately.
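
A toy model of that behavior in plain Python, in case it helps. `LazyDataset` is a made-up stand-in, not Spark's implementation or API; it only shows the ordering: `persist()` records an intent, and the first action both computes and stores the data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: holds a recipe for the data, not the data."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the records
        self._persist = False     # set by persist(); nothing is computed yet
        self._cached = None
        self.compute_calls = 0    # how many times the recipe actually ran

    def persist(self):
        # Only records the intent to keep the data; does not evaluate anything.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        records = list(self._compute())
        if self._persist:
            self._cached = records
        return records

    def count(self):
        # An "action": this is what finally triggers the computation.
        return len(self._materialize())


dataset = LazyDataset(lambda: range(5)).persist()
calls_after_persist = dataset.compute_calls   # still 0: persist() did no work
dataset.count()                               # first action computes and stores
dataset.count()                               # second action reuses the stored copy
calls_after_two_counts = dataset.compute_calls
```

Without the `persist()` call, the recipe in this model would run once per `count()`.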
						
					

07-28-2015 10:59 PM

That I don't know. There should be something in the logs at startup, and that should be available pretty soon. I would expect you can see the logs with that command. It could be some other issue with the ports and so on, but then I think you'd see errors from YARN saying that it can't get to the AM container or something.
						
					

07-28-2015 01:33 PM

You can background the spark-submit process like any other Linux process, by putting it into the background in the shell. In your case, the spark-submit job actually then runs the driver on YARN, so it's baby-sitting a process that's already running asynchronously on another machine via YARN. Running is good; it means all is well. You can redirect this log output where you like.

In yarn-cluster mode, killing the driver will cause YARN to restart it. You want to kill the spark-submit process, really.

I don't know why you don't see logs. Try browsing to the Spark UI of the driver to see what's happening.
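
If you would rather launch it from a script than from the shell, something like the following is one way to do it. This is only a sketch: `launch_in_background` is a hypothetical helper built on the standard `subprocess` module, and the class name, jar and log file in the commented example are placeholders.

```python
import subprocess


def launch_in_background(command, log_path):
    """Start a command without waiting for it, appending its output to a log file."""
    log_file = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=log_file,
        stderr=subprocess.STDOUT,   # keep stdout and stderr in the same log
    )
    log_file.close()  # the child process keeps its own handle on the file
    return process


# Hypothetical invocation; the class name and jar are placeholders.
# proc = launch_in_background(
#     ["spark-submit", "--master", "yarn-cluster",
#      "--class", "com.example.MyApp", "my-app.jar"],
#     "spark-submit.log",
# )
# proc.terminate()  # signals the local spark-submit process
```

The returned `Popen` object is the local spark-submit process, so that is the handle to keep if you later want to stop it.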
						
					

07-27-2015 01:41 PM
2 Kudos

I suspect it's some issue with the version of tar you have on your system. BSD vs GNU? Just a guess. That, or maybe a corrupted file? The latest rmr2 archive uncompressed OK for me on OS X. https://github.com/RevolutionAnalytics/rmr2/releases
						
					

07-27-2015 04:16 AM

The first case is: read - shuffle - persist - count
The second case is: read (from persisted copy) - count

You are right that coalesce does not always shuffle, but it may in this case. It depends on whether you started with more or fewer partitions. You should look at the Spark UI to see whether a shuffle occurred.
						
					

07-26-2015 11:53 PM

Hm, is that surprising? You described why it is faster in your message. The second time, "result" does not have to be recomputed since it is available on disk. It is the result of a potentially expensive shuffle operation (coalesce).
						
					