Member since: 06-20-2016

- 251 Posts
- 196 Kudos Received
- 36 Solutions

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 10789 | 11-08-2017 02:53 PM |
| | 2539 | 08-24-2017 03:09 PM |
| | 8956 | 05-11-2017 02:55 PM |
| | 8655 | 05-08-2017 04:16 PM |
| | 2492 | 04-27-2017 08:05 PM |
			
    
	
		
		
06-28-2016 03:31 PM

@Ravikumar Kumashi Make sure your VM is up and that sshd is running and listening on port 2222: sudo netstat -anp | grep sshd. Also make sure no firewall rules are getting in the way. If that checks out, try using 127.0.0.1 instead of localhost; if that doesn't work, edit your hosts file so that sandbox.hortonworks.com resolves to 127.0.0.1 and then use the FQDN sandbox.hortonworks.com instead of localhost.
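A minimal sketch of those checks, assuming the usual sandbox setup where SSH is forwarded to port 2222 on the host (adjust to your configuration):

```bash
# Inside the sandbox VM: confirm sshd is running and see what it is listening on
sudo netstat -anp | grep sshd

# From the host machine: try the IP address instead of localhost
# (port 2222 is the typical Hortonworks sandbox SSH forward)
ssh -p 2222 root@127.0.0.1
```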
						
					
06-28-2016 03:07 PM

@Ravikumar Kumashi In Cygwin, you can access the root of your C: drive via the directory /cygdrive/c, so the path would be /cygdrive/c/Users/rnkumashi/Downloads/sample.txt. This is one of the reasons I recommended pscp.
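For illustration, an scp invocation from the Cygwin shell using that path might look like the sketch below; the destination address, port, and target directory are assumptions carried over from the rest of this thread, so substitute your own:

```bash
# Run from the Cygwin shell on the Windows host
scp -P 2222 /cygdrive/c/Users/rnkumashi/Downloads/sample.txt root@127.0.0.1:/root/
```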
						
					
06-28-2016 02:42 PM

@mayki wogno It's essentially non-HDFS data in dfs.datanode.data.dir. This could include log files, intermediate shuffle output from MapReduce jobs, local data files (if you put them on a data node), etc. You can use du or a similar tool to investigate further.
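A sketch of that kind of investigation with du, assuming the DataNode directory is /hadoop/hdfs/data (an example path; substitute the value of dfs.datanode.data.dir on your cluster):

```bash
# Total usage under the DataNode data directory (example path)
sudo du -sh /hadoop/hdfs/data

# Per-subdirectory breakdown to spot large non-HDFS files on the same volume
sudo du -h --max-depth=1 /hadoop/hdfs/data | sort -h
```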
						
					
06-28-2016 02:34 PM

@mayki wogno "Non DFS used" can be calculated with the following formula:

Non DFS Used = Configured Capacity - Remaining Space - DFS Used

Noting that Configured Capacity = Total Disk Space - Reserved Space, it follows that:

Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used

Reserved Space is set by the property dfs.datanode.du.reserved.
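A worked example with made-up numbers (purely illustrative, not taken from this thread):

```
Total Disk Space    = 1000 GB
Reserved Space      =  100 GB   (dfs.datanode.du.reserved)
Configured Capacity = 1000 - 100 = 900 GB
DFS Used            =  300 GB
Remaining Space     =  450 GB
Non DFS Used        = 900 - 450 - 300 = 150 GB
```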
						
					
06-28-2016 02:14 PM (1 Kudo)

I would recommend looking into pscp if you are on a Windows platform. If using Cygwin, you'll need to install scp; see http://stackoverflow.com/questions/18688502/how-do-i-download-scp-and-ssh-on-cygwin. scp is part of the openssh package, as noted there.
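After installing the openssh package through the Cygwin setup program, a quick way to confirm that scp is actually available (a sketch):

```bash
# From the Cygwin shell: check that ssh and scp are on the PATH
which ssh scp

# scp with no arguments prints its usage text if it is installed
scp
```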
						
					
06-28-2016 02:13 PM (1 Kudo)

@alain TSAFACK

    // assumes hiveContext is an existing HiveContext instance (e.g. new HiveContext(sc))
    val VAL1 = "testcol"
    // single quotes around the interpolated value make it a SQL string literal
    val df = hiveContext.sql(s"SELECT * FROM src WHERE col1 = '$VAL1'")
 
						
					
06-28-2016 01:55 PM

@Ravikumar Kumashi Yes, that is correct: you want to run the command from your local machine, since that is where the file you are scp'ing over to the sandbox lives. You can invoke it via Cygwin, or you can use pscp (from the makers of PuTTY) and run it from the Windows command line.
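As a sketch of the pscp variant from a regular Windows Command Prompt (the source file is the one mentioned earlier in this thread; the sandbox address, port, and target directory are assumptions, so adjust them to your setup):

```
pscp -P 2222 C:\Users\rnkumashi\Downloads\sample.txt root@127.0.0.1:/root/
```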
						
					
06-28-2016 01:39 PM

@Simran Kaur An example of using Hive for data cleansing is in this article (see section 3.5 in particular). Regarding Spark, it is widely used for extract, transform, and load (ETL) logic and is usually well suited to those kinds of use cases. Both MapReduce and Spark are very general computation paradigms, so it would help to know what data cleaning transformations you have in mind.
						
					
06-28-2016 01:23 PM

@Ravikumar Kumashi The scp command is missing the port number (notice that the "usage" text was returned by the command, which means the syntax was incorrect). Please try specifying 2222 after the -P switch.
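For example (a sketch; the local file and remote directory are placeholders, and the sandbox is assumed to be reachable at 127.0.0.1 with SSH forwarded to port 2222):

```bash
# scp uses a capital -P for the port option
scp -P 2222 /path/to/localfile root@127.0.0.1:/root/
```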
						
					
06-27-2016 06:32 PM

@Bharath Kumar K You may want to look into pscp if you want to run from the command line and resolve network drive mappings in the Windows fashion. I am not sure how you are running #1 (from Cygwin, maybe?), but the syntax in the first example is essentially correct. #3 should work; what error are you receiving? One thing you might try is creating a hosts entry so that sandbox.hortonworks.com resolves to 127.0.0.1, and then using sandbox.hortonworks.com as your hostname/IP, as sketched below.
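That hosts entry would look like the line below (on Windows the file is typically C:\Windows\System32\drivers\etc\hosts; on Linux or macOS it is /etc/hosts):

```
127.0.0.1   sandbox.hortonworks.com
```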
						
					