Member since 09-26-2015
      
135 Posts
85 Kudos Received
26 Solutions
About
Steve's a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3459 | 02-27-2018 04:47 PM |
|  | 5929 | 03-03-2017 10:04 PM |
|  | 3554 | 02-16-2017 10:18 AM |
|  | 1884 | 01-20-2017 02:15 PM |
|  | 11891 | 01-20-2017 02:02 PM |
			
    
	
		
		
06-23-2021 10:09 AM
@Arjun_bedi I'm afraid you've just hit a problem which we've only just started encountering: HADOOP-17771, "S3AFS creation fails 'Unable to find a region via the region provider chain.'"

This failure surfaces when _all_ of the following conditions are met:

- Deployment outside EC2.
- The configuration option `fs.s3a.endpoint` is unset.
- The file `~/.aws/config` does not exist, or does not set a region.
- The JVM system property `aws.region` does not declare a region.
- The environment variable `AWS_REGION` does not declare a region.

You can make this go away by setting the S3 endpoint to s3.amazonaws.com in core-site.xml:

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>

or in your Scala code:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")

Even better, if you know the actual region your data lives in, set fs.s3a.endpoint to the regional endpoint (sketch below); that saves an HTTP request to the central endpoint whenever an S3A filesystem instance is created.

We are working on the fix for this and will be backporting it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.
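As an illustrative sketch of that regional-endpoint setting, assuming (purely as an example) the data lived in eu-west-1:

<!-- illustrative regional endpoint: substitute the region that actually holds your bucket -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.eu-west-1.amazonaws.com</value>
</property>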
						
					
03-11-2020 09:08 AM
I'm going to point you at the S3A troubleshooting docs, where we try to match error messages to root causes, though "bad request" is a broad issue, and one AWS doesn't provide details on for security reasons:

https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Troubleshooting_S3A

For a us-west-2 endpoint you can, and should, just stick with the main endpoint. If you do change it, you may have to worry about S3 signing algorithms. Depending on the specific version of CDH you are using, that's a Hadoop config option; for older versions it's a JVM property, which is tricky to propagate over Hadoop application deployments.

Summary:
* Try to just stick to the central endpoint.
* If you need to use a "V4-only endpoint", try to use the most recent version of CDH you can and use the fs.s3a.signing.algorithm option (sketch after this list).
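A minimal sketch of that option, assuming a CDH release whose S3A connector supports fs.s3a.signing.algorithm, and using eu-central-1 purely as an example of a V4-only region:

<!-- illustrative V4-only regional endpoint; substitute your own region -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.eu-central-1.amazonaws.com</value>
</property>
<!-- force V4 request signing -->
<property>
  <name>fs.s3a.signing.algorithm</name>
  <value>AWSS3V4SignerType</value>
</property>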
						
					
11-30-2018 04:20 PM
Sorry, missed this. The issue here is that S3 isn't a "real" filesystem: there's no file/directory rename, so instead we have to list every file created and copy it over. That relies on the listings being correct, which S3, being eventually consistent, doesn't always guarantee. It looks like you've hit an inconsistency on a job commit.

- To get consistent listings (HDP 3), enable S3Guard.
- To avoid the slow rename process and the problems caused by inconsistency within a single query, switch to the "S3A Committers" which come with Spark on HDP 3.0. These are specially designed to safely write work into S3. (There's a config sketch after this list.)
- If you can't do either of those, you cannot safely use S3 as a direct destination of work. Write into HDFS and then, afterwards, copy it to S3.
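A minimal core-site.xml sketch, assuming an HDP 3 / Hadoop 3.1-era S3A connector; the committer shown is the directory committer, so check the S3Guard and committer documentation for your exact release:

<!-- S3Guard: consistent listings backed by DynamoDB -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<!-- use an S3A committer instead of rename-based commits -->
<property>
  <name>fs.s3a.committer.name</name>
  <value>directory</value>
</property>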
						
					
11-30-2018 04:15 PM
@Indra s: with the S3A connector you can use per-bucket configuration options to set a different username/password for the remote bucket:

fs.s3a.bucket.myaccounts3.access.key=AAA12
fs.s3a.bucket.myaccounts3.secret.key=XXXYYY

Then when you read or write s3a://myaccounts3/, these specific credentials are used. For other S3A buckets, the default ones are picked up:

fs.s3a.access.key=BBBB
fs.s3a.secret.key=ZZZZZ

Please switch to using the s3a:// connector everywhere: it's got much better performance and functionality than the older S3N one, which has recently been removed entirely.
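For completeness, the same per-bucket options from above written as core-site.xml entries (the bucket name and key values are the placeholders from this example):

<!-- credentials used only for s3a://myaccounts3/ -->
<property>
  <name>fs.s3a.bucket.myaccounts3.access.key</name>
  <value>AAA12</value>
</property>
<property>
  <name>fs.s3a.bucket.myaccounts3.secret.key</name>
  <value>XXXYYY</value>
</property>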
						
					
06-06-2018 11:57 AM
Dominika: I need to add: S3 is not a real filesystem. You cannot safely use AWS S3 as a replacement for HDFS without a metadata consistency layer, and even then the eventual consistency of S3 updates and deletes causes problems.

You can safely use it as a source of data. Using it as a direct destination of work takes care: consult the documentation specific to the version of Hadoop you are using before trying to make S3 the default filesystem.

Special case: third-party object stores with full consistency. The fact that directory renames are not atomic may still cause problems with commit algorithms and the like, but the risk of corrupt data in the absence of failures is gone.
						
					
04-02-2018 01:22 PM
What's the full stack?

If the job doesn't create a _SUCCESS file then the overall job failed. A destination directory will have been created, because job attempts are created underneath it. When tasks and jobs are committed they rename things; if there's a failure in any of those operations then something has gone wrong.

Like I said, post the stack trace and I'll try to make sense of it.
						
					
04-02-2018 01:19 PM
If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of every mapper, so that the (mappers * bandwidth) value is less than the bandwidth off the file server.
						
					
02-27-2018 04:56 PM
Sounds like something is failing in that 200 GB upload. I'd turn off fs.s3a.fast.upload (see the sketch at the end of this reply). In HDP 2.5 it buffers into RAM, and if more data is queued for upload than there's room for in the JVM heap, the JVM will fail, which will trigger the retry. You will also need enough space on the temp disk for the whole file.

In HDP 2.6+ we've added disk buffering for the in-progress uploads, and enabled that by default.
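A minimal core-site.xml sketch of that change for HDP 2.5 (the option name is as given above):

<!-- disable in-memory fast upload buffering -->
<property>
  <name>fs.s3a.fast.upload</name>
  <value>false</value>
</property>

On HDP 2.6+, assuming a Hadoop 2.8-era S3A connector, the buffering mechanism is controlled by fs.s3a.fast.upload.buffer, whose "disk" setting corresponds to the default behaviour described above.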
						
					
02-27-2018 04:47 PM
1. Please avoid putting secrets in your paths; they invariably end up in a log somewhere. Set the options fs.s3a.access.key and fs.s3a.secret.key instead (sketch below).
2. Try backing up to a subdirectory. Root directories are "odd".
3. What happens when you run hadoop fs -ls s3a://bucket/path-to-backup? That should show whether the file is there.
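A core-site.xml sketch of point 1, with placeholder values rather than real credentials:

<!-- placeholder credentials: set your own values here, never in the path or URL -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>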
						
					
02-13-2018 04:00 PM
Let's just say there's "ambiguity" about how root directories are treated in object stores and filesystems, and rename() is a key trouble spot everywhere. It's known there are quirks here, but as normal s3/wasb/adl usage goes to subdirectories, nobody has ever sat down with HDFS to argue the subtleties of renaming something into the root directory.
						
					