Member since 05-17-2016

- 190 Posts
- 46 Kudos Received
- 11 Solutions

**My Accepted Solutions**
| Title | Views | Posted |
|---|---|---|
|  | 1744 | 09-07-2017 06:24 PM |
|  | 2310 | 02-24-2017 06:33 AM |
|  | 3441 | 02-10-2017 09:18 PM |
|  | 7978 | 01-11-2017 08:55 PM |
|  | 5921 | 12-15-2016 06:16 PM |
**06-08-2016 01:53 PM**
Thanks @clukasik. Is there any performance difference in choosing client deploy mode over cluster mode? If I use the default client deploy mode, I get control over where my driver program runs. However, I wanted to be sure that it does not cause any performance issues.
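For reference, a minimal sketch of where each deploy mode puts the driver, assuming a standard spark-submit workflow. The mode is normally passed as `--deploy-mode` on the command line; `spark.submit.deployMode` is the matching configuration key, and the app name here is made up:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DeployModeDemo {
    public static void main(String[] args) {
        // "client": the driver runs on the machine that invoked spark-submit,
        // so driver <-> executor traffic crosses the network from that machine.
        // "cluster": the driver is launched inside the cluster, closer to the
        // executors, which can matter for jobs with chatty driver-side logic.
        SparkConf conf = new SparkConf()
                .setAppName("deploy-mode-demo")
                .set("spark.submit.deployMode", "client");

        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... job logic ...
        jsc.stop();
    }
}
```

Broadly, executor-side work dominates most jobs, so the choice mainly affects driver <-> executor round trips and where driver output lands, rather than raw throughput.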
**06-08-2016 01:36 PM**
@Rajkumar Singh: Yes, but here the file resides on the machine from which we trigger the spark-submit. So I was looking for a way to read it in the driver without actually having to move it to all the workers, or even to HDFS.
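A minimal sketch of that idea, assuming client deploy mode so the driver runs on the machine that holds the file; the path and class name are illustrative. The file is read with plain Java I/O on the driver, and its contents (not the file itself) are broadcast to the executors:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class DriverSideRead {
    public static void main(String[] args) throws IOException {
        JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("driver-side-read"));

        // Plain Java I/O: the file only has to exist on the machine running
        // the driver, i.e. where spark-submit was invoked in client mode.
        List<String> lines = Files.readAllLines(Paths.get("/local/path/lookup.txt"));

        // Broadcast the contents so executor tasks can use them without the
        // file ever being copied to the workers or uploaded to HDFS.
        Broadcast<List<String>> lookup = jsc.broadcast(lines);

        // Demo task that consults the broadcast copy on the executors.
        long n = jsc.parallelize(lines)
                .filter(line -> lookup.value().contains(line))
                .count();
        System.out.println(n);

        jsc.stop();
    }
}
```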
**06-08-2016 01:33 PM**
Thanks for the suggestion @Jitendra Yadav. But since the file is small (under ~500 KB), I was wondering whether we really need to load it into HDFS. I was looking for some "hack".
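One such workaround, sketched on the assumption that shipping the file at job start is acceptable: `SparkContext.addFile` copies a driver-local file to each executor's working directory, and `SparkFiles.get` resolves its location inside a task. The path and class name are illustrative:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileDemo {
    public static void main(String[] args) {
        JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("add-file-demo"));

        // Ships the driver-local file to every executor; no HDFS upload needed.
        jsc.addFile("/local/path/lookup.txt");

        jsc.parallelize(Arrays.asList(1, 2, 3)).foreach(i -> {
            // Inside a task, ask Spark where it placed the shipped file.
            String path = SparkFiles.get("lookup.txt");
            System.out.println("task " + i + " reads " + path);
        });

        jsc.stop();
    }
}
```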
**06-08-2016 01:26 PM**
Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it as file:///, but for this to work a copy of the file needs to be on every worker, or every worker needs access to a common shared drive, such as an NFS mount.

Is there any other way of achieving this?
Labels: Apache Spark
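For concreteness, a sketch of the file:/// pattern described above; the path is illustrative, and the read only succeeds if that exact path is readable on every worker:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalFileRead {
    public static void main(String[] args) {
        JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("local-file-read"));

        // Partitions are read by whichever executor runs each task, so the
        // file must exist at this same path on every worker node (or on a
        // shared mount such as NFS that all of them can see).
        JavaRDD<String> lines = jsc.textFile("file:///etc/app/lookup.txt");
        System.out.println(lines.count());

        jsc.stop();
    }
}
```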
**06-06-2016 02:28 PM**
Thanks @clukasik. Got it!!
**06-06-2016 02:16 PM**
Thanks @clukasik. That solves the problem; I was going in an unwanted circle to address this. Also, on the second part of the question: does it make any sense to parallelize a list before actually storing it to a file, as in the last two lines of my code?
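A sketch of the alternative this question is circling, reusing `fixedFileRdd` and `doWhatever` from the snippet in the 01:21 PM post below (both are the poster's placeholders). It replaces the last lines of that snippet: operate on the values only and save directly from the executors.

```java
// Nothing is collected to the driver, so there is no driver-memory
// bottleneck and no parallelize() round trip before the write.
fixedFileRdd.values()
        .map(bytes -> doWhatever(new String(bytes.copyBytes())))
        .saveAsTextFile("out-dir");
```

Collecting to a List and re-parallelizing only pays off when the aggregated result genuinely has to live on the driver; for a plain transform-and-save it routes everything through driver memory for no benefit.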
**06-06-2016 01:21 PM**
Hi All,

I need a recommendation on the best approach to the problem below; I have included the code snippet I have so far.

I read an HDFS file using a custom input format and in turn get a PairRDD. I am interested in operating on the values one at a time, and I am not bothered about the keys.

Is a Java List a scalable data structure to hold the values? Please have a look at the code below and suggest alternatives. Also, does the parallelize at the end of the code give any benefit?

```java
JavaPairRDD<LongWritable, BytesWritable> fixedFileRdd = getItSomeHow();
List<String> zeroValue = new ArrayList<String>();

Function2<List<String>, Tuple2<LongWritable, BytesWritable>, List<String>> seqOp =
        new Function2<List<String>, Tuple2<LongWritable, BytesWritable>, List<String>>() {
    public List<String> call(List<String> valueList,
                             Tuple2<LongWritable, BytesWritable> eachKeyValue) throws Exception {
        valueList.add(doWhatever(new String(eachKeyValue._2.copyBytes())));
        return valueList;
    }

    private String doWhatever(String string) {
        // will be an external utility method call; this is for representational purposes only
        return System.currentTimeMillis() + "-" + string;
    }
};

Function2<List<String>, List<String>, List<String>> combOp =
        new Function2<List<String>, List<String>, List<String>>() {
    public List<String> call(List<String> listOne, List<String> listTwo) throws Exception {
        listOne.addAll(listTwo);
        return listOne;
    }
};

List<String> resultantList = fixedFileRdd.aggregate(zeroValue, seqOp, combOp);
JavaRDD<String> resultantRdd = jsc.parallelize(resultantList);
resultantRdd.saveAsTextFile("out-dir");
```
Labels: Apache Spark
**06-01-2016 05:19 AM**
Got it. I added an action, first(), to forcefully trigger it. And yes, the reason you mentioned, that Spark evaluates transformations lazily, was what stopped me.
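A minimal illustration of that behavior; the names are made up, and setMaster is only set so the snippet runs standalone. Transformations just record lineage, and nothing executes until an action such as first() forces it:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvalDemo {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("lazy-eval-demo").setMaster("local[*]"));

        // map() is a transformation: it is only recorded, not executed.
        JavaRDD<Integer> doubled = jsc.parallelize(Arrays.asList(1, 2, 3))
                .map(x -> x * 2); // nothing has run yet

        // first() is an action: it triggers the actual computation.
        System.out.println(doubled.first()); // prints 2

        jsc.stop();
    }
}
```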