Member since
09-26-2015
135
Posts
85
Kudos Received
26
Solutions
About
Steve's a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
Views | Posted |
---|---|
2662 | 02-27-2018 04:47 PM |
5133 | 03-03-2017 10:04 PM |
2704 | 02-16-2017 10:18 AM |
1424 | 01-20-2017 02:15 PM |
10626 | 01-20-2017 02:02 PM |
09-11-2017
10:32 AM
S3A actually has an extra option to let you set per-bucket JCEKS files: fs.s3a.security.credential.provider.path. This takes the same values as the normal one (hadoop.security.credential.provider.path), but lets you take advantage of the per-bucket config feature of S3A, where every bucket-specific option of fs.s3a.bucket.* is remapped to fs.s3a.* before the bucket is set up.
You should be able to add a reference to it like so:
spark.hadoop.fs.s3a.bucket.b.security.credential.provider.path hdfs:///something.jceks
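To make the remapping concrete, here is a minimal sketch in plain Python. It is a toy model of the behaviour described above, not the Hadoop code: the function names are made up for illustration, the timeout value is a placeholder, and it ignores details such as bucket names containing dots.

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."
S3A_PREFIX = "fs.s3a."
S3A_BUCKET_PREFIX = "fs.s3a.bucket."


def hadoop_options(spark_conf):
    """Keep only spark.hadoop.* entries and strip that prefix (hypothetical helper)."""
    return {
        key[len(SPARK_HADOOP_PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(SPARK_HADOOP_PREFIX)
    }


def resolve_for_bucket(hadoop_conf, bucket):
    """Toy model: copy fs.s3a.bucket.<bucket>.* onto fs.s3a.*, overriding base values."""
    bucket_prefix = f"{S3A_BUCKET_PREFIX}{bucket}."
    resolved = dict(hadoop_conf)
    for key, value in hadoop_conf.items():
        if key.startswith(bucket_prefix):
            resolved[S3A_PREFIX + key[len(bucket_prefix):]] = value
    return resolved


spark_conf = {
    "spark.hadoop.fs.s3a.bucket.b.security.credential.provider.path": "hdfs:///something.jceks",
    "spark.hadoop.fs.s3a.connection.timeout": "200000",  # placeholder value
}
resolved = resolve_for_bucket(hadoop_options(spark_conf), "b")
print(resolved["fs.s3a.security.credential.provider.path"])
```

In this model only bucket "b" picks up the JCEKS path; any other bucket keeps whatever the base fs.s3a.* options say.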
Hopefully this helps. One challenge we always have with the authentication work is that we can't log it at the detail we'd like, because that would leak secrets too easily, so even when logging at debug, not enough information gets printed. Sorry.
see also: https://hortonworks.github.io/hdp-aws/s3-security/index.html
Oh, one more thing: spark-submit copies your local AWS_ environment variables over to the fs.s3a.secret.key and fs.s3a.access.key values. Try unsetting them before you submit work and see if that makes a difference.
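One way to try that without touching your shell is to launch the job with a cleaned copy of the environment. This is only a sketch: the exact variable names are an assumption (the common AWS credential variables), and the helper name is made up.

```python
import os

# Assumed names: the usual AWS credential variables. Adjust to what your shell actually sets.
AWS_CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def env_without_aws_credentials(env=None):
    """Return a copy of the environment with the AWS credential variables removed."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items() if key not in AWS_CREDENTIAL_VARS}


clean_env = env_without_aws_credentials()
# Pass clean_env as the env= argument of subprocess.run() when launching spark-submit.
```

If the job then authenticates correctly from the JCEKS file, the environment variables were overriding it.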
06-29-2017
07:54 PM
2 Kudos
OK, you've found a new problem. Congratulations. Or commiserations. Filing a bug against that.
The codepath triggering this should only be reached if fs.s3a.security.credential.provider.path is set. That should only be needed if you want to provide a specific set of credentials for different buckets, customising it per bucket (fs.s3a.bucket.dev-1.security.credential.provider.path=/secrets/dev.jceks, etc.). If you have one set of secrets for all S3 buckets, set it in the main config for everything, which is what you are trying to do on the second attempt. Maybe @lmccay has some suggestions.
06-29-2017
07:45 PM
That error from AWS is suspected to be caused by the S3 connection being broken, with the XML parser in the Amazon SDK hitting the end of the document and failing. I'm surprised you are seeing it frequently though; it's generally pretty rare (i.e. rare enough that we've not got many details on what is going on). It might be that fs.s3a.connection.timeout is the parameter to tune, but the other possibility is that you have too many threads/tasks talking to S3 and either your network bandwidth is used up or AWS S3 is actually throttling you. Try smaller values of fs.s3a.threads.max (say 64 or fewer) and of fs.s3a.max.total.tasks (try 128). That cuts down the number of threads which may write at a time, and gives a smaller queue of waiting blocks to write before it blocks whatever thread is actually generating lots of data.
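The "threads plus queue, then block the producer" behaviour can be sketched with a toy bounded executor. This is an illustration in Python, not the S3A implementation; the class and parameter names are invented, with max_threads standing in for fs.s3a.threads.max and max_queued for fs.s3a.max.total.tasks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedUploader:
    """Toy model: at most max_threads uploads run and max_queued wait; then submit() blocks."""

    def __init__(self, max_threads, max_queued):
        self._pool = ThreadPoolExecutor(max_workers=max_threads)
        self._slots = threading.Semaphore(max_threads + max_queued)

    def submit(self, upload, *args, timeout=None):
        # The producing thread waits here once every running and queued slot is taken.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("all upload slots are busy")
        try:
            future = self._pool.submit(upload, *args)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the upload finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _done: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

With smaller limits, fewer blocks are in flight or waiting at any moment, and the thread generating data gets held back sooner instead of piling up more work against S3.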
03-24-2017
11:19 AM
Yes, I'm afraid that fast upload can overload the buffers in HDP 2.5, as it uses JVM heap to store blocks while it uploads them. The bigger the mismatch between the rate at which data is generated (i.e. how fast things can be read) and the upload bandwidth, the more heap you need. On a long-haul upload you usually have limited bandwidth, and the more distcp workers there are, the more that bandwidth is divided between them, and the bigger the mismatch.
In HDP 2.5 you can get away with tuning the fast uploader to use less heap. It's tricky enough to configure that in the HDP 2.5 docs we chose not to mention the fs.s3a.fast.upload option at all. It was just too confusing and we couldn't come up with good defaults which would work reliably. Which is why I rewrote it completely for HDP 2.6. The HDP 2.6/Apache Hadoop 2.8 (and already in HDCloud) block output stream can buffer on disk (the default) or via byte buffers, as well as on the heap, and tries to do better queueing of writes.
For HDP 2.5, the tuning options are covered in the Hadoop 2.7 docs. Essentially, a lower value of fs.s3a.threads.core and fs.s3a.threads.max keeps the number of buffered blocks down, while changing fs.s3a.multipart.size to something like 10485760 (10 MB) and setting fs.s3a.multipart.threshold to the same value reduces how much is buffered before the uploads begin. Like I warned, you can end up spending time tuning, because the heap consumed increases with the threads.max value and decreases as the multipart threshold and size values are lowered. And over a remote connection, the more workers you have in the distcp operation (controlled by the -m option), the less bandwidth each one gets, so again: more heap overflows. And you will invariably find out on the big uploads that there are limits.
As a result, in HDP 2.5 I'd recommend avoiding the fast upload except in one special case: you have a very high speed connection to an S3 server in the same infrastructure, and use it for code generating data, rather than for big distcp operations, which can read data as fast as it can be streamed off multiple disks.
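If you do want to experiment with those settings on HDP 2.5, a small helper can assemble the distcp command line so the options stay consistent. This is a sketch only: the helper name, the thread count, the mapper count and the paths are placeholders, not recommended values; only the option names and the 10485760 size come from the text above.

```python
def fast_upload_distcp_command(source, dest, threads, part_size=10 * 1024 * 1024, mappers=4):
    """Build (but do not run) a distcp command line with the fast-upload tuning options."""
    options = {
        "fs.s3a.fast.upload": "true",
        "fs.s3a.threads.core": str(threads),
        "fs.s3a.threads.max": str(threads),
        # Same value for size and threshold, as suggested above.
        "fs.s3a.multipart.size": str(part_size),
        "fs.s3a.multipart.threshold": str(part_size),
    }
    command = ["hadoop", "distcp"]
    command += [f"-D{key}={value}" for key, value in options.items()]
    command += ["-m", str(mappers), source, dest]
    return command


# Placeholder paths and thread count, for illustration only.
print(" ".join(fast_upload_distcp_command("hdfs:///data", "s3a://example-bucket/data", threads=4)))
```

Lowering threads, part_size or mappers each pushes heap use down; expect to iterate on all three.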
03-21-2017
01:10 PM
Good to hear it is fixed. In future, have a look at this list of causes of this exception which commonly surface in Hadoop. In the core Hadoop networking we automatically add a link to that page and more diagnostics (e.g. destination hostname:port) to socket exceptions. Maybe I should see if we can somehow wrap the exceptions coming up from the ASF libraries too.
03-21-2017
12:56 PM
1. You should be using the latest version of HDP or HDCloud you can, to get the speedups on S3A read and write. HDP 2.5 has the read pipeline speedup, but not the listing code (used in partitioning) or the write pipeline.
2. Write your data back to HDFS, then at the end of the work, copy to S3. That gives significantly better performance in the downstream jobs, and avoids the fundamental mismatch between how work is committed (Hive uses renames) and how S3 works (there are no renames, only slow copies).
Have a look at this document on Hive on S3 for more advice, including which options to set for maximum IO speedup.
03-14-2017
07:55 PM
That message is odd. At a guess (and this is a guess, as HDFS isn't something I know the internals of), HDFS is rejecting the attempt to close the file as the namenode doesn't think the file is open. Now, does this happen every time? I could imagine this being a transient event as a namenode rebooted or something, but I'd be very surprised to see it repeatedly.
03-03-2017
10:04 PM
1 Kudo
EMRFS is an Amazon-proprietary replacement for HDFS for cluster storage.
We work on S3A, which is the open source client for reading and writing data in S3: this is not something you can replace HDFS with.
In HDP and HDCloud clusters running in EC2, you must use HDFS for the cluster filesystem, with the S3A client to read data from S3 and write it back at the end of a workflow.
We are doing lots of work on S3A performance, much of which is available in HDCloud and HDP 2.5. Note that you can use S3A for remote access to S3 data: between S3 regions and from physical clusters wherever they live. This lets you use S3 as a backup repository of your Hadoop cluster data.
03-02-2017
03:07 PM
"Task not serializable" is unrelated and very common. The way the Scala API works, operations on RDDs like map() work by having the state of the lambda expression copied over to all the worker nodes and then executed. For this to happen, every object referenced inside the expression must be "Serializable", in the strict Java API way: it is declared as something which can be serialized to a byte stream, sent over the network and reconstructed at the far end. Something you have declared outside the map, which you are trying to use inside it, isn't serializable. At a guess: one of the Jetty classes, like the "exchange" variable. Workaround? Create the object inside the lambda expression, out of data that has been serialized (strings etc.).
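The same failure and workaround can be shown in a few lines of Python, with pickle standing in for Java serialization. This is an analogue rather than Spark code: Client is a made-up stand-in for the non-serializable object, and ship() only imitates sending a task to a worker.

```python
import pickle
import threading


class Client:
    """Hypothetical stand-in for a non-serializable object such as an HTTP exchange."""

    def __init__(self, url):
        self.url = url
        self._lock = threading.Lock()  # locks cannot be pickled

    def fetch(self, item):
        return f"{self.url}/{item}"


class BadTask:
    """Captures a Client built on the driver, so the whole Client must be serialized."""

    def __init__(self, client):
        self.client = client

    def __call__(self, item):
        return self.client.fetch(item)


class GoodTask:
    """Captures only a string and builds the Client where the task runs."""

    def __init__(self, url):
        self.url = url

    def __call__(self, item):
        return Client(self.url).fetch(item)


def ship(task):
    """Imitate sending a task to a worker: serialize, then rebuild it."""
    return pickle.loads(pickle.dumps(task))
```

Calling ship(BadTask(Client(url))) fails with a TypeError because of the captured lock, while ship(GoodTask(url)) works, since only the URL string crosses the wire.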
02-23-2017
09:26 PM
Python is easier to learn; Scala is a complex language. But, as a Java developer, having some Scala knowledge may be good for your resume, and learning it in a notebook is an easier way to learn the language than writing a complex program. One way to learn is to start with very small amounts of data, write tests in ScalaTest and run them from Maven. That way you can use the API you are used to. But the interactive notebooks are a great way to play fast and iterate rapidly without running builds.