Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2571 | 02-27-2018 04:47 PM |
| | 5032 | 03-03-2017 10:04 PM |
| | 2597 | 02-16-2017 10:18 AM |
| | 1357 | 01-20-2017 02:15 PM |
| | 10457 | 01-20-2017 02:02 PM |
06-23-2021
10:09 AM
@Arjun_bedi I'm afraid you've just hit a problem which we've only just started encountering: HADOOP-17771, "S3AFS creation fails: Unable to find a region via the region provider chain." This failure surfaces when _all_ of the following conditions are met:

* The deployment is outside EC2.
* The configuration option `fs.s3a.endpoint` is unset.
* The file `~/.aws/config` does not exist, or it does not set a region.
* The JVM system property `aws.region` does not declare a region.
* The environment variable `AWS_REGION` does not declare a region.

You can make this go away by setting the S3 endpoint to s3.amazonaws.com in core-site.xml:

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>

Or in your Scala code:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")

Even better, if you know the actual region your data lives in, set fs.s3a.endpoint to that regional endpoint (example below). This saves an HTTP request to the central endpoint whenever an S3A filesystem instance is created. We are working on the fix for this and will be backporting it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.
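As a concrete sketch, assuming the data lives in us-west-2 (the region here is only an illustration; substitute the endpoint of whatever region your bucket is actually in):

<property>
  <name>fs.s3a.endpoint</name>
  <!-- illustrative regional endpoint; use your bucket's own region -->
  <value>s3.us-west-2.amazonaws.com</value>
</property>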
03-11-2020
09:08 AM
I'm going to point you at the S3A troubleshooting docs, where we try to match error messages to root causes, though "bad request" is a broad issue, and one AWS doesn't provide details on for security reasons: https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Troubleshooting_S3A

For a us-west-2 endpoint you can/should just stick with the main endpoint. If you do change it, you may have to worry about S3 signing algorithms. Depending on the specific version of CDH you are using, that's a Hadoop configuration option; for the older versions it's a JVM property, which is tricky to propagate across Hadoop application deployments.

Summary:
* Try to just stick to the central endpoint.
* If you need to use a "V4-only endpoint", try to use the most recent version of CDH you can, and use the fs.s3a.signing.algorithm option (see the sketch below).
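A rough sketch of what that can look like in core-site.xml, assuming a V4-only region such as eu-central-1 (endpoint and signer name are illustrative; check the S3A documentation for your exact release before setting either):

<property>
  <name>fs.s3a.endpoint</name>
  <!-- illustrative V4-only regional endpoint -->
  <value>s3.eu-central-1.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.signing.algorithm</name>
  <!-- signer name comes from the AWS SDK; only set this if the default signing fails -->
  <value>AWSS3V4SignerType</value>
</property>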
11-30-2018
04:20 PM
Sorry, missed this. The issue here is that S3 isn't a "real" filesystem: there's no file/directory rename, so instead we have to list every file created and copy it over. That relies on listings being correct, which S3, being eventually consistent, doesn't always hold up. Looks like you've hit an inconsistency on a job commit.

* To get consistent listings (HDP 3), enable S3Guard.
* To avoid the slow rename process and the problems caused by inconsistency within a single query, switch to the "S3A committers" which come with Spark on HDP 3.0. These are specially designed to safely write work into S3 (see the configuration sketch below).
* If you can't do either of those, you cannot safely use S3 as a direct destination of work. You should write into HDFS and then, afterwards, copy it to S3.
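A minimal sketch of the kind of settings involved on HDP 3, assuming a DynamoDB-backed S3Guard table and the "directory" staging committer (check your release's S3Guard and committer documentation for the exact values it supports):

<property>
  <name>fs.s3a.metadatastore.impl</name>
  <!-- assumption: S3Guard backed by a DynamoDB table -->
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.committer.name</name>
  <!-- assumption: the "directory" staging committer; "partitioned" and "magic" are the other options -->
  <value>directory</value>
</property>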
11-30-2018
04:15 PM
@Indra s: with the S3A connector you can use per-bucket configuration options to set a different username/password for the remote bucket:

fs.s3a.bucket.myaccounts3.access.key=AAA12
fs.s3a.bucket.myaccounts3.secret.key=XXXYYY

Then when you read or write s3a://myaccounts3/ these specific credentials are used. For other S3A buckets, the default ones are picked up:

fs.s3a.access.key=BBBB
fs.s3a.secret.key=ZZZZZ

Please switch to using the s3a:// connector everywhere: it's got much better performance and functionality than the older S3N one, which has recently been removed entirely. (A core-site.xml sketch of the per-bucket settings follows below.)
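For completeness, the same per-bucket settings expressed as core-site.xml entries (bucket name and key values are just the placeholders from above):

<!-- per-bucket credentials for s3a://myaccounts3/ ; placeholder values -->
<property>
  <name>fs.s3a.bucket.myaccounts3.access.key</name>
  <value>AAA12</value>
</property>
<property>
  <name>fs.s3a.bucket.myaccounts3.secret.key</name>
  <value>XXXYYY</value>
</property>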
06-06-2018
11:57 AM
Dominika: I need to add: S3 is not a real filesystem. You cannot safely use AWS S3 as a replacement for HDFS without a metadata consistency layer, and even then the eventual consistency of S3 updates and deletes causes problems. You can safely use it as a source of data. Using it as a direct destination of work takes care: consult the documentation specific to the version of Hadoop you are using before trying to make S3 the default filesystem (the sketch below shows what that setting looks like).

Special case: third-party object stores with full consistency. The fact that directory renames are not atomic may still cause problems with commit algorithms and the like, but the risk of corrupt data in the absence of failures is gone.
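For reference, "making S3 the default filesystem" means pointing fs.defaultFS at a bucket, along these lines (hypothetical bucket name; this is exactly the setup the warning above applies to):

<property>
  <name>fs.defaultFS</name>
  <!-- hypothetical bucket; read the version-specific S3A documentation before doing this -->
  <value>s3a://example-bucket</value>
</property>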
04-02-2018
01:22 PM
What's the full stack? If the job doesn't create a _SUCCESS file then the overall job failed. A destination directory will have been created, because job attempts are created underneath it. When tasks and jobs are committed they rename things... if there's a failure in any of those operations then something has gone wrong. Like I said, post the stack trace and I'll try to make sense of it.
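For what it's worth, that _SUCCESS marker is only written when the committer's marker option is left at its default; a sketch of the setting which controls it (true is the default):

<property>
  <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
  <!-- when true (the default), a successful job commit writes a _SUCCESS marker file -->
  <value>true</value>
</property>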
04-02-2018
01:19 PM
If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of every mapper, so that the (mappers * bandwidth) value is less than the bandwidth available off the file server.
02-27-2018
04:56 PM
Sounds like something is failing in that 200GB upload. I'd turn off fs.s3a.fast.upload (sketch below). In HDP 2.5 it buffers into RAM, and if more data is queued for upload than there's room for in the JVM heap, the JVM will fail... which will trigger the retry. You will also need enough space on the temp disk for the whole file. In HDP 2.6+ we've added disk buffering for the in-progress uploads, and enabled that by default.
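A core-site.xml sketch of turning that off on HDP 2.5, plus (as an assumption about your local layout) pointing the buffer directory at a disk with enough free space:

<property>
  <name>fs.s3a.fast.upload</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <!-- assumption: a local directory with room for the blocks being uploaded -->
  <value>/tmp/s3a</value>
</property>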
02-27-2018
04:47 PM
1. Please avoid putting secrets in your paths; they invariably end up in a log somewhere. Set the options fs.s3a.access.key and fs.s3a.secret.key instead (see the sketch below).
2. Try backing up to a subdirectory. Root directories are "odd".
3. What happens when you do a hadoop fs -ls s3a://bucket/path-to-backup? That should show whether the file is there.
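A core-site.xml sketch of keeping those secrets out of the path (placeholder values; better still, store them with the Hadoop credential provider rather than in plain text):

<property>
  <name>fs.s3a.access.key</name>
  <!-- placeholder; your real access key -->
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <!-- placeholder; your real secret key -->
  <value>YOUR_SECRET_KEY</value>
</property>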
02-13-2018
04:00 PM
Let's just say there's "ambiguity" about how root directories are treated in object stores and filesystems, and rename() is a key trouble spot everywhere. It's known there are quirks here, but as normal s3/wasb/adl usage goes to subdirectories, nobody has ever sat down with HDFS to argue the subtleties of renaming something into the root directory.