Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer, mostly working on cloud integration
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1324 | 02-27-2018 04:47 PM
 | 3706 | 03-03-2017 10:04 PM
 | 1293 | 02-16-2017 10:18 AM
 | 697 | 01-20-2017 02:15 PM
 | 6962 | 01-20-2017 02:02 PM
06-23-2021
10:09 AM
@Arjun_bedi I'm afraid you've just hit a problem which we've only just started encountering: HADOOP-17771, "S3AFS creation fails: Unable to find a region via the region provider chain." This failure surfaces when _all_ of the following conditions are met:
* The deployment is outside EC2.
* The configuration option `fs.s3a.endpoint` is unset.
* The file `~/.aws/config` doesn't exist, or doesn't set a region.
* The JVM system property `aws.region` doesn't declare a region.
* The environment variable `AWS_REGION` doesn't declare a region.
You can make this go away by setting the S3 endpoint to s3.amazonaws.com in core-site.xml:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
or in your Scala code:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
Even better, if you know the actual region your data lives in, set fs.s3a.endpoint to the regional endpoint. This saves an HTTP request to the central endpoint whenever an S3A filesystem instance is created. We are working on the fix for this and will be backporting it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.
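For example, if the data lives in us-west-2 (a made-up choice here; substitute your own region), the regional endpoint setting would look something like:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-west-2.amazonaws.com</value>
</property>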
03-11-2020
09:08 AM
I'm going to point you at the S3A troubleshooting docs, where we try to match error messages to root causes, though "bad request" is a broad issue, and one AWS doesn't provide details on, for security reasons: https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Troubleshooting_S3A For a us-west-2 endpoint you can/should just stick with the main endpoint. If you do change it, you may have to worry about S3 signing algorithms. Depending on the specific version of CDH you are using, that's a Hadoop config option; for the older versions, it's a JVM property, which is tricky to propagate across Hadoop application deployments. Summary:
* try to just stick to the central endpoint
* if you need to use a "V4-only endpoint", try to use the most recent version of CDH you can and use the fs.s3a.signing-algorithm option
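If you do end up having to set the signer on a recent enough version, a hedged sketch of what that looks like in core-site.xml (AWS4SignerType is one of the values the AWS SDK accepts; check the docs for your release before relying on it):
<property>
  <name>fs.s3a.signing-algorithm</name>
  <value>AWS4SignerType</value>
</property>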
11-30-2018
04:20 PM
Sorry, missed this. The issue here is that "S3" isn't a "real" filesystem: there's no file/directory rename, so instead we have to list every file created and copy it over. That relies on listings being correct, which S3, being eventually consistent, doesn't always guarantee. It looks like you've hit an inconsistency on a job commit.
* To get consistent listings (HDP 3), enable S3Guard (a config sketch follows below).
* To avoid the slow rename process and the problems caused by inconsistency within a single query, switch to the "S3A Committers" which come with Spark on HDP-3.0. These are specially designed to safely write work into S3.
If you can't do either of those, you cannot safely use S3 as a direct destination of work. You should write into HDFS and then, afterwards, copy it to S3.
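As a rough sketch of the S3Guard route on HDP 3 / Hadoop 3.x (illustrative only; the full set of S3Guard options, including the DynamoDB table and region settings, is documented per release):
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>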
11-30-2018
04:15 PM
@Indra s: with the S3A connector you can use per-bucket configuration options to set a different username/password for the remote bucket:
fs.s3a.bucket.myaccounts3.access.key=AAA12
fs.s3a.bucket.myaccounts3.secret.key=XXXYYY
Then when you read or write s3a://myaccounts3/ these specific credentials are used. For other S3A buckets, the defaults are picked up:
fs.s3a.access.key=BBBB
fs.s3a.secret.key=ZZZZZ
Please switch to using the s3a:// connector everywhere: it's got much better performance and functionality than the older S3N one, which has recently been removed entirely.
06-06-2018
12:17 PM
@quilkpoac If your SAN supports the AWS authentication mechanisms then yes, you can use it. I'll call out the Western Digital store as one I know works: they've been very busy on the open source side of things. For other stores, tuning the authentication options is the usual troublespot.
Start by pointing the clients at your local store by setting fs.s3a.endpoint to the hostname of the service. Probably also set fs.s3a.path.style.access to true, unless your system creates a DNS entry for every bucket.
After that, it's down to playing with authentication. The property fs.s3a.signing-algorithm is passed straight down to the AWS SDK here; a quick glance at its implementation implies it can be one of: NoOpSignerType, AWS4UnsignedPayloadSignerType, AWS3SignerType, AWS4SignerType and QueryStringSignerType. The v4 signing API is new and unlikely to work; the S3A default is the v3 one.
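As a first-pass sketch for core-site.xml (the hostname here is a placeholder for your own store):
<property>
  <name>fs.s3a.endpoint</name>
  <value>storage.example.internal</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>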
06-06-2018
11:57 AM
Dominika: I need to add: S3 is not a real filesystem. You cannot safely use AWS S3 as a replacement for HDFS without a metadata consistency layer, and even then the eventual consistency of S3 updates and deletes causes problems. You can safely use it as a source of data. Using it as a direct destination of work takes care: consult the documentation specific to the version of Hadoop you are using before trying to make S3 the default filesystem. Special case: third party object stores with full consistency. The fact that directory renames are not atomic may still cause problems with commit algorithms and the like, but the risk of corrupt data in the absence of failures is gone.
04-02-2018
01:28 PM
I'd consider setting the delimiter to the main one, ";", then use String.indexOf/String.substring to split the field values up, and emit the values into some structure which isolates each one, preferably ORC. Once you've saved it as ORC, you can do queries over that, again, into a structured format. For the final conversion into your own chosen standard, well, unless you are going to implement your own format (worth considering, actually), just do a query which selects the columns you want and then use String.format() to build up the strings. You'll need a story for null there, though. Finally, while ORC is a great format for querying, you might want to think about Avro as a simple data exchange format, as it includes schemas and is straightforward to parse.
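A minimal Scala sketch of the splitting step, with a made-up record layout (the delimiters and field positions are assumptions, not your actual data):
// hypothetical record: ";" as the main delimiter, "," inside one field
val record = "2018-04-02;alpha,beta;42"
val fields = record.split(';')            // coarse split on the main delimiter
val subFields = fields(1).split(',')      // finer split of one field
// or, with indexOf/substring as suggested above:
val firstSep = record.indexOf(';')
val firstField = record.substring(0, firstSep)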
04-02-2018
01:22 PM
What's the full stack? If the job doesn't create a _SUCCESS file then the overall job failed. A destination directory will have been created, because job attempts are created underneath it. When tasks and jobs are committed they rename things...if there's a failure in any of those operations then something has gone wrong. Like I said, post the stack trace and I'll try to make sense of it.
04-02-2018
01:19 PM
If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of every mapper, so that the (mappers * bandwidth) value is less than the bandwidth available from the file server.
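A sketch of what that might look like (paths, mapper count and the per-mapper bandwidth in MB/s are all made up):
hadoop distcp -m 8 -bandwidth 10 file://store/path hdfs://hdfsserver:8020/path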
04-02-2018
01:15 PM
Can I add that putting secrets in your s3a:// path is dangerous, as it will end up in Hadoop logs across the cluster.
Best: put them in a JCEKS file in HDFS or another secure keystore.
Good: have some options in the hadoop/hbase configurations.
Weak: setting them on the command line with -D options (visible with a ps command).
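A sketch of the JCEKS route (the HDFS path is a placeholder; each create command prompts for the secret rather than taking it on the command line):
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/backup/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/backup/s3.jceks
hadoop fs -D hadoop.security.credential.provider.path=jceks://hdfs/user/backup/s3.jceks -ls s3a://mybucket/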
02-27-2018
04:56 PM
Sounds like something is failing in that 200GB upload. I'd turn off fs.s3a.fast.upload. In HDP 2.5 it's buffering into RAM, and if more data is queued for upload than there's room for in the JVM heap, the JVM will fail...which will trigger the retry. You will also need enough space on the temp disk for the whole file. In HDP 2.6+ we've added disk buffering for the in-progress uploads, and enabled that by default.
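For reference, on HDP 2.6+ / Apache Hadoop 2.8+ the buffering mechanism is chosen with fs.s3a.fast.upload.buffer; a sketch of the disk-buffering setting:
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>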
02-27-2018
04:47 PM
1. Please avoid putting secrets in your paths; it invariably ends up in a log somewhere. Set the options fs.s3a.access.key and fs.s3a.secret.key instead.
2. Try backing up to a subdirectory. Root directories are "odd".
3. What happens when you run hadoop fs -ls s3a://bucket/path-to-backup? That should show whether the file is there.
02-13-2018
04:00 PM
Let's just say there's "ambiguity" about how root directories are treated in object stores and filesystems, and rename() is a key troublespot everywhere. It's known there are quirks here, but as normal s3/wasb/adl usage goes to subdirectories, nobody has ever sat down with HDFS to argue the subtleties of renaming something into the root directory.
01-12-2018
09:34 AM
There's a risk here that you are being burned by Jackson versions. The AWS SDK needs one set of Jackson jars; Spark uses another. On a normal spark-submit everything works because Spark has shaded theirs. The IDE doesn't do that (lovely as IntelliJ is), so it refuses to play. FWIW, I hit the same problem. The workaround I use is: start the job as an executable but have the spark-submit run pause for a while, and then attach the IDE to it via "attach to a local process". How to get it to wait? Simplest: put a sleep() in. Most flexible: have it poll for a file's existence, sleeping for a second between checks. That way, all you have to do is create that file and the job will set off.
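A minimal sketch of that polling trick in Scala (the trigger path is made up):
// block until /tmp/start-job exists; attach the debugger, then create the file to continue
val trigger = new java.io.File("/tmp/start-job")
while (!trigger.exists()) {
  Thread.sleep(1000)
}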
09-11-2017
10:32 AM
S3A actually has an extra option to let you set per-bucket JCEKS files: fs.s3a.security.credential.provider.path. This takes the same values as the normal one, but lets you take advantage of the per-bucket config feature of S3A, where every bucket-specific option of the form fs.s3a.bucket.* is remapped to fs.s3a.* before the bucket is set up.
You should be able to add a reference to it like so:
spark.hadoop.fs.s3a.bucket.b.security.credential.provider.path hdfs:///something.jceks
Hopefully this helps. One challenge we always have with the authentication work is that we can't log it at the detail we'd like, because that would leak secrets too easily...so even when logging at debug, not enough information gets printed. Sorry
see also: https://hortonworks.github.io/hdp-aws/s3-security/index.html
Oh, one more thing. spark-submit copies your local AWS_ environment variables over to the fs.s3a.secret.key and fs.s3a.access.key values. Try unsetting them before you submit work and see if that makes a difference.
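That is, something like this in the shell before submitting (these are the standard AWS variable names):
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY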
09-11-2017
10:24 AM
Hadoop is pretty fussy about networking and DNS, including, sometimes, reverse DNS. So looking at the DNS cache could be a step in the right direction. One thing I'd suspect is that it's taking reverse DNS too long to give up, or there's some attempt being made to look up the local hostname every time some part of the system tries to connect to another. Ideally, everything should be listening on and connecting to localhost, and there's an entry in /etc/hosts mapping localhost to 127.0.0.1. This isn't an Ubuntu Linux box, is it, incidentally? They do very odd things with the host table.
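For reference, the sort of /etc/hosts entries to check for (the hostname is a placeholder; Ubuntu's habit of mapping the machine's hostname to 127.0.1.1 is one of the oddities referred to):
127.0.0.1   localhost
127.0.1.1   myhost.example.com   myhost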
06-29-2017
07:54 PM
2 Kudos
Ok, you've found a new problem. Congratulations. Or commiserations. Filing a bug against that ().
The codepath triggering this should only be reached if fs.s3a.security.credential.provider.path is set. That should only be needed if you are hoping to provide a specific set of credentials for different buckets, customising it per bucket (fs.s3a.bucket.dev-1.security.credential.provider.path=/secrets/dev.jceks) etc. If you have one set of secrets for all S3 buckets, set it in the main config for everything. Which is what you are trying to do on the second attempt. Maybe @lmccay has some suggestions.
06-29-2017
07:45 PM
That error from AWS is suspected to be the S3 connection being broken, with the XML parser in the Amazon SDK hitting the end of the document and failing. I'm surprised you are seeing it frequently though; it's generally pretty rare (i.e. rare enough that we've not got that much detail on what is going on). It might be that fs.s3a.connection.timeout is the parameter to tune, but the other possibility is that you have too many threads/tasks talking to S3 and either your network bandwidth is used up or AWS S3 is actually throttling you. Try smaller values of fs.s3a.threads.max (say 64 or fewer) and of fs.s3a.max.total.tasks (try 128). That cuts down the number of threads which may write at a time, and then has a smaller queue of waiting blocks to write before it blocks whatever thread is actually generating lots of data.
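In core-site.xml terms, a sketch of those two settings (using the figures above as starting points, not recommendations):
<property>
  <name>fs.s3a.threads.max</name>
  <value>64</value>
</property>
<property>
  <name>fs.s3a.max.total.tasks</name>
  <value>128</value>
</property>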
03-24-2017
11:19 AM
Yes, I'm afraid that fast upload can overload the buffers in Hadoop 2.5, as it uses JVM heap to store blocks while it uploads them. The bigger the mismatch between the rate data is generated (i.e. how fast things can be read) and the upload bandwidth, the more heap you need. On a long-haul upload you usually have limited bandwidth, and the more distcp workers there are, the more that bandwidth is divided between them, and so the bigger the mismatch.
In Hadoop 2.5 you can get away with tuning the fast uploader to use less heap. It's tricky enough to configure that in the HDP 2.5 docs we chose not to mention the fs.s3a.fast.upload option at all. It was just too confusing and we couldn't come up with good defaults which would work reliably. Which is why I rewrote it completely for HDP 2.6: the HDP 2.6/Apache Hadoop 2.8 (and already in HDCloud) block output stream can buffer on disk (the default), or via byte buffers, as well as on heap, and tries to do better queueing of writes.
For HDP 2.5, the tuning options are covered in the Hadoop 2.7 docs. Essentially, a lower value of fs.s3a.threads.core and fs.s3a.threads.max keeps the number of buffered blocks down, while changing fs.s3a.multipart.size to something like 10485760 (10 MB) and setting fs.s3a.multipart.threshold to the same value reduces the buffer size before the uploads begin. Like I warned, you can end up spending time tuning, because the heap consumed increases with the threads.max value, and decreases with the multipart threshold and size values. And over a remote connection, the more workers you have in the distcp operation (controlled by the -m option), the less bandwidth each one gets, so again: more heap overflows. And you will invariably find out on the big uploads that there are limits.
As a result, in HDP 2.5 I'd recommend avoiding the fast upload except in the special case where you have a very high speed connection to an S3 server in the same infrastructure, and you use it for code generating data, rather than for big distcp operations, which can read data as fast as it can be streamed off multiple disks.
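A sketch of that HDP 2.5 tuning in core-site.xml (the thread counts here are illustrative assumptions; the multipart figures are the 10 MB values mentioned above):
<property>
  <name>fs.s3a.threads.core</name>
  <value>5</value>
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <value>10</value>
</property>
<property>
  <name>fs.s3a.multipart.size</name>
  <value>10485760</value>
</property>
<property>
  <name>fs.s3a.multipart.threshold</name>
  <value>10485760</value>
</property>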
03-21-2017
01:10 PM
Good to hear it is fixed. In future, have a look at this list of causes of this exception which commonly surface in Hadoop. In the core Hadoop networking we automatically add a link to that page and more diagnostics (e.g. destination hostname:port) to socket exceptions...maybe I should see if somehow we can wrap the exceptions coming up from the ASF libraries too.
03-21-2017
12:56 PM
1. You should be using the latest version of HDP or HDCloud you can, to get the speedups on S3A read and write. HDP 2.5 has the read pipeline speedup, but not the listing code (used in partitioning) or the write pipeline.
2. Write your data back to HDFS, then at the end of the work, copy it to S3 (see the sketch below). That gives significantly better performance in the downstream jobs, and avoids the fundamental mismatch between how work is committed (Hive uses renames) and how S3 works (there are no renames, only slow copies).
Have a look at this document on Hive on S3 for more advice, including which options to set for maximum IO speedup.
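A sketch of that final copy step with distcp (the paths and namenode address are placeholders):
hadoop distcp hdfs://namenode:8020/output/run-001 s3a://mybucket/output/run-001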
03-21-2017
12:49 PM
Constantin's answer seems a good way to install a packaged release. Do grab the relevant 2.6, 2.7 or 2.8 version of the Windows executables you'll need underneath: https://github.com/steveloughran/winutils . Or, you can set up the Windows bits by installing HDP 2.5 for Windows, then turning off any Hadoop services it sets to start automatically. That will put the Hadoop 2.7.x binaries on your classpath. The other way is to check out and build Spark yourself, which you can do from Maven, or, with an IDE like IntelliJ IDEA, have it import the Spark POM and do the build. You'll still need a native Windows HADOOP_HOME/bin directory though.
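If you go the winutils route, the environment setup is roughly this (the install path is a made-up example; winutils.exe goes under %HADOOP_HOME%\bin):
set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin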
03-14-2017
08:04 PM
The main principle is "use more than one machine for workload scalability", with "commodity parts" for financial scalability, and a filesystem and execution platform which handle the failures that follow from both scale and hardware choices. It doesn't mean you should rush to have a 20-node cluster over a ten-node one, not if that ten-node cluster can get your work done faster. In the cloud, if you can do that and so either finish your work more rapidly or rent less machine time, you get a good outcome. Spark, in particular, loves having lots of RAM, so it can cache the generated results of RDDs. If it has to discard work due to running out of memory, then, if it needs that data later, it will need recalculating, costing time. Without recommending any specific machines, then: look for more RAM when you use Spark. 7GB isn't that much these days; it's less than consumer laptops ship with.
03-14-2017
07:55 PM
That message is odd. At a guess (and this is a guess, as HDFS isn't something I know the internals of), HDFS is rejecting the attempt to close the file because the namenode doesn't think the file is open. Now, does this happen every time? I could imagine this being a transient event as a namenode rebooted or something, but I'd be very surprised to see it repeatedly.
03-14-2017
07:19 PM
1 Kudo
It sounds like an authentication problem, but this can sometimes surface with clock problems and JVM & classpath issues. See: https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html and: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
03-03-2017
10:07 PM
This is a recurrent error message in Spark: it means that the classes you are trying to use within the map are only available in the driver, not in the workers. The Spark context is one of these.
03-03-2017
10:04 PM
1 Kudo
EMRFS is an Amazon-proprietary replacement for HDFS for cluster storage.
We work on S3A, which is the open source client for reading and writing data in S3: this is not something you can replace HDFS with.
In HDP and HDCloud clusters running in EC2, you must use HDFS for the cluster filesystem, with the S3A client to read data from S3 and write it back at the end of a workflow.
We are doing lots of work on S3A performance, much of which is available in HDCloud and HDP2.5. Note that you can use S3A for remote access to S3 data: between S3 regions and
from physical clusters wherever they live. This lets you use S3 as a backup repository of your Hadoop cluster data.
03-02-2017
03:07 PM
"Task not serializable" is unrelated and very common. The way the Scala API works, operations on RDDs like map() work by having the state of the lambda expression copied over to all the worker nodes and then executed. For this to happen, every object referenced inside the expression must be "Serializable", in the strict Java API way: it is declared as something which can be serialized to a byte stream, sent over the network and reconstructed at the far end. Something you have declared outside the map, which you are trying to use inside it, isn't serializable. At a guess: one of the Jetty classes, like the "exchange" variable. Workaround? Create the object inside the lambda expression, out of data that has been serialized (strings etc).
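A minimal Scala sketch of that workaround, with made-up names standing in for your Jetty exchange; the point is just that the non-serializable object is built inside the lambda, from serializable strings:
val urls = sc.parallelize(Seq("http://example.org/a", "http://example.org/b"))
val sizes = urls.map { u =>
  // constructed per task on the worker; never serialized from the driver
  val conn = new java.net.URL(u).openConnection()
  conn.getContentLength
}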
02-23-2017
09:26 PM
Python is easier to learn...Scala is a complex language. But, as a Java developer, having some scala knowledge may be good for your resume, and learning it in a notebook is an easy way to learn the language compared to writing a complex program. One way to learn is to start with very small amounts of data and write tests in scalatest, run them from maven. That way you can use the API you are used to. But the interactive notebooks are a great way to play fast and iterate rapidly without running builds.
02-17-2017
01:45 PM
It's not so much that hotswap is difficult, but that with a 3-node cluster, a copy of every block is kept on every node. A cold swap, where HDFS notices things are missing, is the traumatic one, as it cannot re-replicate all the blocks and will be complaining about under-replication. If you can do a hot swap in OS & hardware, then you should stop the DN before doing that, and start it afterwards. It will examine its directories and report all the blocks it has to the namenode. If the cluster has under-replicated blocks, the DN will be told to copy them from the other two datanodes, which will take a time dependent on the number of blocks which were on the swapped disk (and which haven't already been considered missing and re-replicated onto other disks on the same datanode). Maybe @Arpit Agarwal has some other/different advice. Arpit, presumably the new HDD will be unbalanced compared to the rest of the disks on the DN. What can be done about that in HDFS?