Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer, mostly working on cloud integration
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1324 | 02-27-2018 04:47 PM
 | 3706 | 03-03-2017 10:04 PM
 | 1293 | 02-16-2017 10:18 AM
 | 697 | 01-20-2017 02:15 PM
 | 6962 | 01-20-2017 02:02 PM
06-23-2021
10:09 AM
@Arjun_bedi I'm afraid you've just hit a problem which we've only just started encountering: HADOOP-17771, "S3AFS creation fails: Unable to find a region via the region provider chain." This failure surfaces when _all_ of the following conditions are met:
* The deployment is outside EC2.
* The configuration option `fs.s3a.endpoint` is unset.
* The file `~/.aws/config` doesn't exist, or doesn't set a region.
* The JVM system property `aws.region` doesn't declare a region.
* The environment variable `AWS_REGION` doesn't declare a region.
You can make this go away by setting the S3 endpoint to s3.amazonaws.com in core-site.xml:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
or in your Scala code:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
Even better, if you know the actual region your data lives in, set fs.s3a.endpoint to the regional endpoint. This saves an HTTP request to the central endpoint whenever an S3A filesystem instance is created. We are working on the fix for this and will be backporting it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.
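For example, if the data lives in us-west-2 (a made-up choice here; substitute your own region), the regional endpoint setting would look something like:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-west-2.amazonaws.com</value>
</property>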
03-11-2020
09:08 AM
I'm going to point you at the S3A troubleshooting docs, where we try to match error messages to root causes, though "bad request" is a broad issue, and one AWS doesn't provide details on, for security reasons: https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Troubleshooting_S3A For a us-west-2 endpoint you can/should just stick with the main endpoint. If you do change it, you may have to worry about S3 signing algorithms. Depending on the specific version of CDH you are using, that's a Hadoop config option; for the older versions, it's a JVM property, which is tricky to propagate across Hadoop application deployments. Summary:
* try to just stick to the central endpoint
* if you need to use a "V4-only endpoint", try to use the most recent version of CDH you can and use the fs.s3a.signing-algorithm option
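If you do end up having to set the signer on a recent enough version, a hedged sketch of what that looks like in core-site.xml (AWS4SignerType is one of the values the AWS SDK accepts; check the docs for your release before relying on it):
<property>
  <name>fs.s3a.signing-algorithm</name>
  <value>AWS4SignerType</value>
</property>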
11-30-2018
04:20 PM
Sorry, missed this. The issue here is that "S3" isn't a "real" filesystem: there's no file/directory rename, so instead we have to list every file created and copy it over. That relies on listings being correct, which S3, being eventually consistent, doesn't always guarantee. It looks like you've hit an inconsistency on a job commit.
* To get consistent listings (HDP 3), enable S3Guard (a config sketch follows below).
* To avoid the slow rename process and the problems caused by inconsistency within a single query, switch to the "S3A Committers" which come with Spark on HDP-3.0. These are specially designed to safely write work into S3.
If you can't do either of those, you cannot safely use S3 as a direct destination of work. You should write into HDFS and then, afterwards, copy it to S3.
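As a rough sketch of the S3Guard route on HDP 3 / Hadoop 3.x (illustrative only; the full set of S3Guard options, including the DynamoDB table and region settings, is documented per release):
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>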
11-30-2018
04:15 PM
@Indra s: with the S3A connector you can use per-bucket configuration options to set a different username/password for the remote bucket:
fs.s3a.bucket.myaccounts3.access.key=AAA12
fs.s3a.bucket.myaccounts3.secret.key=XXXYYY
Then when you read or write s3a://myaccounts3/ these specific credentials are used. For other S3A buckets, the defaults are picked up:
fs.s3a.access.key=BBBB
fs.s3a.secret.key=ZZZZZ
Please switch to using the s3a:// connector everywhere: it's got much better performance and functionality than the older S3N one, which has recently been removed entirely.
06-06-2018
12:17 PM
@quilkpoac If your SAN supports the AWS authentication mechanisms then yes, you can use it. I'll call out the Western Digital store as one I know works: they've been very busy on the open source side of things. For other stores, tuning the authentication options is the usual troublespot.
Start by pointing the clients at your local store by setting fs.s3a.endpoint to the hostname of the service. Probably also set fs.s3a.path.style.access to true, unless your system creates a DNS entry for every bucket.
After that, it's down to playing with authentication. The property fs.s3a.signing-algorithm is passed straight down to the AWS SDK here; a quick glance at its implementation implies it can be one of: NoOpSignerType, AWS4UnsignedPayloadSignerType, AWS3SignerType, AWS4SignerType and QueryStringSignerType. The v4 signing API is new and unlikely to work; the S3A default is the v3 one.
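As a first-pass sketch for core-site.xml (the hostname here is a placeholder for your own store):
<property>
  <name>fs.s3a.endpoint</name>
  <value>storage.example.internal</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>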
06-06-2018
11:57 AM
Dominika: I need to add: S3 is not a real filesystem. You cannot safely use AWS S3 as a replacement for HDFS without a metadata consistency layer, and even then the eventual consistency of S3 updates and deletes causes problems. You can safely use it as a source of data. Using it as a direct destination of work takes care: consult the documentation specific to the version of Hadoop you are using before trying to make S3 the default filesystem. Special case: third party object stores with full consistency. The fact that directory renames are not atomic may still cause problems with commit algorithms and the like, but the risk of corrupt data in the absence of failures is gone.
04-02-2018
01:28 PM
I'd consider setting the delimiter to the main one, ";", then use String.indexOf/String.substring to split the field values up, and emit the values into some structure which isolates each one, preferably ORC. Once you've saved it as ORC, you can do queries over that, again, into a structured format. For the final conversion into your own chosen standard, well, unless you are going to implement your own format (worth considering, actually), just do a query which selects the columns you want and then use String.format() to build up the strings. You'll need a story for null there, though. Finally, while ORC is a great format for querying, you might want to think about Avro as a simple data exchange format, as it includes schemas and is straightforward to parse.
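A minimal Scala sketch of the splitting step, with a made-up record layout (the delimiters and field positions are assumptions, not your actual data):
// hypothetical record: ";" as the main delimiter, "," inside one field
val record = "2018-04-02;alpha,beta;42"
val fields = record.split(';')            // coarse split on the main delimiter
val subFields = fields(1).split(',')      // finer split of one field
// or, with indexOf/substring as suggested above:
val firstSep = record.indexOf(';')
val firstField = record.substring(0, firstSep)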
04-02-2018
01:22 PM
What's the full stack? If the job doesn't create a _SUCCESS file then the overall job failed. A destination directory will have been created, because job attempts are created underneath it. When tasks and jobs are committed they rename things...if there's a failure in any of those operations then something has gone wrong. Like I said, post the stack trace and I'll try to make sense of it.
04-02-2018
01:19 PM
If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of every mapper, so that the (mappers * bandwidth) value is less than the bandwidth available from the file server.
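A sketch of what that might look like (paths, mapper count and the per-mapper bandwidth in MB/s are all made up):
hadoop distcp -m 8 -bandwidth 10 file://store/path hdfs://hdfsserver:8020/path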
04-02-2018
01:15 PM
Can I add that putting secrets in your s3a:// path is dangerous, as it will end up in Hadoop logs across the cluster.
Best: put them in a JCEKS file in HDFS or another secure keystore.
Good: have some options in the hadoop/hbase configurations.
Weak: setting them on the command line with -D options (visible with a ps command).
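A sketch of the JCEKS route (the HDFS path is a placeholder; each create command prompts for the secret rather than taking it on the command line):
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/backup/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/backup/s3.jceks
hadoop fs -D hadoop.security.credential.provider.path=jceks://hdfs/user/backup/s3.jceks -ls s3a://mybucket/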
02-27-2018
04:56 PM
Sounds like something is failing in that 200GB upload. I'd turn off fs.s3a.fast.upload. In HDP 2.5 it's buffering into RAM, and if more data is queued for upload than there's room for in the JVM heap, the JVM will fail...which will trigger the retry. You will also need enough space on the temp disk for the whole file. In HDP 2.6+ we've added disk buffering for the in-progress uploads, and enabled that by default.
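For reference, on HDP 2.6+ / Apache Hadoop 2.8+ the buffering mechanism is chosen with fs.s3a.fast.upload.buffer; a sketch of the disk-buffering setting:
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>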
02-27-2018
04:47 PM
1. Please avoid putting secrets in your paths; it invariably ends up in a log somewhere. Set the options fs.s3a.access.key and fs.s3a.secret.key instead.
2. Try backing up to a subdirectory. Root directories are "odd".
3. What happens when you run hadoop fs -ls s3a://bucket/path-to-backup? That should show whether the file is there.
02-13-2018
04:00 PM
Let's just say there's "ambiguity" about how root directories are treated in object stores and filesystems, and rename() is a key troublespot everywhere. It's known there are quirks here, but as normal s3/wasb/adl usage goes to subdirectories, nobody has ever sat down with HDFS to argue the subtleties of renaming something into the root directory.
01-12-2018
09:34 AM
There's a risk here that you are being burned by Jackson versions. The AWS SDK needs one set of Jackson jars; Spark uses another. On a normal spark-submit everything works because Spark has shaded theirs. The IDE doesn't do that (lovely as IntelliJ is), so it refuses to play. FWIW, I hit the same problem. The workaround I use is: start the job as an executable but have the spark-submit run pause for a while, and then attach the IDE to it via "attach to a local process". How to get it to wait? Simplest: put a sleep() in. Most flexible: have it poll for a file's existence, sleeping for a second between checks. That way, all you have to do is create that file and the job will set off.
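A minimal sketch of that polling trick in Scala (the trigger path is made up):
// block until /tmp/start-job exists; attach the debugger, then create the file to continue
val trigger = new java.io.File("/tmp/start-job")
while (!trigger.exists()) {
  Thread.sleep(1000)
}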
09-11-2017
10:32 AM
S3A actually has an extra option to let you set per-bucket JCEKS files: fs.s3a.security.credential.provider.path. This takes the same values as the normal one, but lets you take advantage of the per-bucket config feature of S3A, where every bucket-specific option of the form fs.s3a.bucket.* is remapped to fs.s3a.* before the bucket is set up.
You should be able to add a reference to it like so:
spark.hadoop.fs.s3a.bucket.b.security.credential.provider.path hdfs:///something.jceks
Hopefully this helps. One challenge we always have with the authentication work is that we can't log it at the detail we'd like, because that would leak secrets too easily...so even when logging at debug, not enough information gets printed. Sorry
see also: https://hortonworks.github.io/hdp-aws/s3-security/index.html
Oh, one more thing. spark-submit copies your local AWS_ environment variables over to the fs.s3a.secret.key and fs.s3a.access.key values. Try unsetting them before you submit work and see if that makes a difference.
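That is, something like this in the shell before submitting (these are the standard AWS variable names):
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY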
09-11-2017
10:24 AM
Hadoop is pretty fussy about networking and DNS, including, sometimes, reverse DNS. So looking at the DNS cache could be a step in the right direction. One thing I'd suspect is that it's taking reverse DNS too long to give up, or there's some attempt being made to look up the local hostname every time some part of the system tries to connect to another. Ideally, everything should be listening on and connecting to localhost, and there's an entry in /etc/hosts mapping localhost to 127.0.0.1. This isn't an Ubuntu Linux box, is it, incidentally? They do very odd things with the host table.
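For reference, the sort of /etc/hosts entries to check for (the hostname is a placeholder; Ubuntu's habit of mapping the machine's hostname to 127.0.1.1 is one of the oddities referred to):
127.0.0.1   localhost
127.0.1.1   myhost.example.com   myhost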
06-29-2017
07:54 PM
2 Kudos
Ok, you've found a new problem. Congratulations. Or commiserations. Filing a bug against that ().
The codepath triggering this should only be reached if fs.s3a.security.credential.provider.path is set. That should only be needed if you are hoping to provide a specific set of credentials for different buckets, customising it per bucket (fs.s3a.bucket.dev-1.security.credential.provider.path=/secrets/dev.jceks) etc. If you have one set of secrets for all S3 buckets, set it in the main config for everything. Which is what you are trying to do on the second attempt. Maybe @lmccay has some suggestions.
06-29-2017
07:45 PM
That error from AWS is suspected to be the S3 connection being broken, with the XML parser in the Amazon SDK hitting the end of the document and failing. I'm surprised you are seeing it frequently though; it's generally pretty rare (i.e. rare enough that we've not got that much detail on what is going on). It might be that fs.s3a.connection.timeout is the parameter to tune, but the other possibility is that you have too many threads/tasks talking to S3 and either your network bandwidth is used up or AWS S3 is actually throttling you. Try smaller values of fs.s3a.threads.max (say 64 or fewer) and of fs.s3a.max.total.tasks (try 128). That cuts down the number of threads which may write at a time, and then has a smaller queue of waiting blocks to write before it blocks whatever thread is actually generating lots of data.
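In core-site.xml terms, a sketch of those two settings (using the figures above as starting points, not recommendations):
<property>
  <name>fs.s3a.threads.max</name>
  <value>64</value>
</property>
<property>
  <name>fs.s3a.max.total.tasks</name>
  <value>128</value>
</property>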
03-24-2017
11:19 AM
Yes, I'm afraid that fast upload can overload the buffers in Hadoop 2.5, as it uses JVM heap to store blocks while it uploads them. The bigger the mismatch between the rate data is generated (i.e. how fast things can be read) and the upload bandwidth, the more heap you need. On a long-haul upload you usually have limited bandwidth, and the more distcp workers there are, the more that bandwidth is divided between them, and so the bigger the mismatch.
In Hadoop 2.5 you can get away with tuning the fast uploader to use less heap. It's tricky enough to configure that in the HDP 2.5 docs we chose not to mention the fs.s3a.fast.upload option at all. It was just too confusing and we couldn't come up with good defaults which would work reliably. Which is why I rewrote it completely for HDP 2.6: the HDP 2.6/Apache Hadoop 2.8 (and already in HDCloud) block output stream can buffer on disk (the default), or via byte buffers, as well as on heap, and tries to do better queueing of writes.
For HDP 2.5, the tuning options are covered in the Hadoop 2.7 docs. Essentially, a lower value of fs.s3a.threads.core and fs.s3a.threads.max keeps the number of buffered blocks down, while changing fs.s3a.multipart.size to something like 10485760 (10 MB) and setting fs.s3a.multipart.threshold to the same value reduces the buffer size before the uploads begin. Like I warned, you can end up spending time tuning, because the heap consumed increases with the threads.max value, and decreases with the multipart threshold and size values. And over a remote connection, the more workers you have in the distcp operation (controlled by the -m option), the less bandwidth each one gets, so again: more heap overflows. And you will invariably find out on the big uploads that there are limits.
As a result, in HDP 2.5 I'd recommend avoiding the fast upload except in the special case where you have a very high speed connection to an S3 server in the same infrastructure, and you use it for code generating data, rather than for big distcp operations, which can read data as fast as it can be streamed off multiple disks.
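A sketch of that HDP 2.5 tuning in core-site.xml (the thread counts here are illustrative assumptions; the multipart figures are the 10 MB values mentioned above):
<property>
  <name>fs.s3a.threads.core</name>
  <value>5</value>
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <value>10</value>
</property>
<property>
  <name>fs.s3a.multipart.size</name>
  <value>10485760</value>
</property>
<property>
  <name>fs.s3a.multipart.threshold</name>
  <value>10485760</value>
</property>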
03-21-2017
01:10 PM
Good to hear it is fixed. In future, have a look at this list of causes of this exception which commonly surface in Hadoop. In the core Hadoop networking we automatically add a link to that page and more diagnostics (e.g. destination hostname:port) to socket exceptions...maybe I should see if somehow we can wrap the exceptions coming up from the ASF libraries too.
03-21-2017
12:56 PM
1. You should be using the latest version of HDP or HDCloud you can, to get the speedups on S3A read and write. HDP 2.5 has the read pipeline speedup, but not the listing code (used in partitioning) or the write pipeline.
2. Write your data back to HDFS, then at the end of the work, copy it to S3 (see the sketch below). That gives significantly better performance in the downstream jobs, and avoids the fundamental mismatch between how work is committed (Hive uses renames) and how S3 works (there are no renames, only slow copies).
Have a look at this document on Hive on S3 for more advice, including which options to set for maximum IO speedup.
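A sketch of that final copy step with distcp (the paths and namenode address are placeholders):
hadoop distcp hdfs://namenode:8020/output/run-001 s3a://mybucket/output/run-001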
03-21-2017
12:49 PM
Constantin's answer seems a good way to install a packaged release. Do grab the relevant 2.6, 2.7 or 2.8 version of the Windows executables you'll need underneath: https://github.com/steveloughran/winutils . Or, you can set up the Windows bits by installing HDP 2.5 for Windows, then turning off any Hadoop services it sets to start automatically. That will put the Hadoop 2.7.x binaries on your classpath. The other way is to check out and build Spark yourself, which you can do from Maven, or, with an IDE like IntelliJ IDEA, have it import the Spark POM and do the build. You'll still need a native Windows HADOOP_HOME/bin directory though.
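If you go the winutils route, the environment setup is roughly this (the install path is a made-up example; winutils.exe goes under %HADOOP_HOME%\bin):
set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin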
03-14-2017
08:04 PM
The main principle is "use more than one machine for workload scalability", with "commodity parts" for financial scalability, and a filesystem and execution platform which handle the failures that follow from both scale and hardware choices. It doesn't mean you should rush to have a 20-node cluster over a ten-node one, not if that ten-node cluster can get your work done faster. In the cloud, if you can do that and so either finish your work more rapidly or rent less machine time, you get a good outcome. Spark, in particular, loves having lots of RAM, so it can cache the generated results of RDDs. If it has to discard work due to running out of memory, then, if it needs that data later, it will need recalculating, costing time. Without recommending any specific machines, then: look for more RAM when you use Spark. 7GB isn't that much these days; it's less than consumer laptops ship with.
03-14-2017
07:55 PM
That message is odd. At a guess (and this is a guess, as HDFS isn't something I know the internals of), HDFS is rejecting the attempt to close the file because the namenode doesn't think the file is open. Now, does this happen every time? I could imagine this being a transient event as a namenode rebooted or something, but I'd be very surprised to see it repeatedly.
03-14-2017
07:19 PM
1 Kudo
It sounds like an authentication problem, but this can sometimes surface with clock problems and JVM & classpath issues. See: https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html and: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
03-03-2017
10:07 PM
This is a recurrent error message in Spark: it means that the classes you are trying to use within the map are only available in the driver, not in the workers. The Spark context is one of these.
03-03-2017
10:04 PM
1 Kudo
EMRFS is an Amazon-proprietary replacement for HDFS for cluster storage.
We work on S3A, which is the open source client for reading and writing data in S3: this is not something you can replace HDFS with.
In HDP and HDCloud clusters running in EC2, you must use HDFS for the cluster filesystem, with the S3A client to read data from S3 and write it back at the end of a workflow.
We are doing lots of work on S3A performance, much of which is available in HDCloud and HDP2.5. Note that you can use S3A for remote access to S3 data: between S3 regions and
from physical clusters wherever they live. This lets you use S3 as a backup repository of your Hadoop cluster data.
03-02-2017
03:07 PM
"Task not serializable" is unrelated and very common. The way the Scala API works, operations on RDDs like map() work by having the state of the lambda expression copied over to all the worker nodes and then executed. For this to happen, every object referenced inside the expression must be "Serializable", in the strict Java API way: it is declared as something which can be serialized to a byte stream, sent over the network and reconstructed at the far end. Something you have declared outside the map, which you are trying to use inside it, isn't serializable. At a guess: one of the Jetty classes, like the "exchange" variable. Workaround? Create the object inside the lambda expression, out of data that has been serialized (strings etc).
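A minimal Scala sketch of that workaround, with made-up names standing in for your Jetty exchange; the point is just that the non-serializable object is built inside the lambda, from serializable strings:
val urls = sc.parallelize(Seq("http://example.org/a", "http://example.org/b"))
val sizes = urls.map { u =>
  // constructed per task on the worker; never serialized from the driver
  val conn = new java.net.URL(u).openConnection()
  conn.getContentLength
}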
02-23-2017
09:26 PM
Python is easier to learn...Scala is a complex language. But, as a Java developer, having some scala knowledge may be good for your resume, and learning it in a notebook is an easy way to learn the language compared to writing a complex program. One way to learn is to start with very small amounts of data and write tests in scalatest, run them from maven. That way you can use the API you are used to. But the interactive notebooks are a great way to play fast and iterate rapidly without running builds.
02-17-2017
01:45 PM
It's not so much that hotswap is difficult, but that with a 3-node cluster, a copy of every block is kept on every node. A cold swap, where HDFS notices things are missing, is the traumatic one, as it cannot re-replicate all the blocks and will be complaining about under-replication. If you can do a hot swap in OS & hardware, then you should stop the DN before doing that, and start it afterwards. It will examine its directories and report all the blocks it has to the namenode. If the cluster has under-replicated blocks, the DN will be told to copy them from the other two datanodes, which will take a time dependent on the number of blocks which were on the swapped disk (and which haven't already been considered missing and re-replicated onto other disks on the same datanode). Maybe @Arpit Agarwal has some other/different advice. Arpit, presumably the new HDD will be unbalanced compared to the rest of the disks on the DN. What can be done about that in HDFS?