Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2571 | 02-27-2018 04:47 PM
 | 5033 | 03-03-2017 10:04 PM
 | 2598 | 02-16-2017 10:18 AM
 | 1357 | 01-20-2017 02:15 PM
 | 10457 | 01-20-2017 02:02 PM
09-11-2017
10:32 AM
S3A actually has an extra option to let you set per-bucket JCEKS files: fs.s3a.security.credential.provider.path. This takes the same values as the normal hadoop.security.credential.provider.path option, but lets you take advantage of the per-bucket config feature of S3A, where every bucket-specific option of the form fs.s3a.bucket.<bucket>.* is remapped to fs.s3a.* before that bucket's client is set up.
You should be able to add a reference to it like so:
spark.hadoop.fs.s3a.bucket.b.security.credential.provider.path hdfs:///something.jceks
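For illustration, here is the same idea set programmatically; a minimal sketch only, reusing the placeholder bucket name "b" and the hdfs:///something.jceks path from above.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the bucket name "b" and the jceks path are placeholders.
// Everything under fs.s3a.bucket.b.* is remapped to fs.s3a.* when s3a://b/ is opened,
// so only that bucket picks up this credential provider.
val spark = SparkSession.builder()
  .appName("per-bucket-credentials")
  .config("spark.hadoop.fs.s3a.bucket.b.security.credential.provider.path",
    "hdfs:///something.jceks")
  .getOrCreate()

val df = spark.read.csv("s3a://b/data/")
```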
Hopefully this helps. One challenge we always have with the authentication work is that we can't log it at the detail we'd like, because that would leak secrets too easily...so even when logging at debug, not enough information gets printed. Sorry
see also: https://hortonworks.github.io/hdp-aws/s3-security/index.html
Oh, one more thing: spark-submit copies your local AWS_ environment variables over to the fs.s3a.secret.key and fs.s3a.access.key values. Try unsetting them before you submit work and see if that makes a difference.
06-29-2017
07:54 PM
2 Kudos
OK, you've found a new problem. Congratulations. Or commiserations. Filing a bug against that ().
The codepath triggering this should only be reached if fs.s3a.security.credential.provider.path is set. That should only be needed if you are hoping to provide a specific set of credentials for different buckets, so you customise it per bucket (fs.s3a.bucket.dev-1.security.credential.provider.path=/secrets/dev.jceks), etc. If you have one set of secrets for all S3 buckets, set it in the main config for everything, which is what you are trying to do on the second attempt. Maybe @lmccay has some suggestions.
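As a sketch of the two approaches (reusing the example names above; treat the jceks paths as placeholders):

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()

// Per-bucket: only s3a://dev-1/ uses these secrets.
conf.set("fs.s3a.bucket.dev-1.security.credential.provider.path", "/secrets/dev.jceks")

// Global: one set of secrets for every S3 bucket.
conf.set("fs.s3a.security.credential.provider.path", "/secrets/all.jceks")
```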
06-29-2017
07:45 PM
That error from AWS is suspected to be the S3 connection being broken, and the XML parser in the Amazon SDK hitting the end of the document and failing. I'm surprised you are seeing it frequently though; it's generally pretty rare (i.e. rare enough that we've not got much detail on what is going on). It might be that fs.s3a.connection.timeout is the parameter to tune, but the other possibility is that you have too many threads/tasks talking to S3, and either your network bandwidth is used up or AWS S3 is actually throttling you. Try smaller values of fs.s3a.threads.max (say 64 or fewer) and of fs.s3a.max.total.tasks (try 128). That cuts down the number of threads which may write at a time, and then leaves a smaller queue of waiting blocks to write before it blocks whatever thread is actually generating lots of data.
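If you want to experiment, a sketch of those settings (the numbers are just the starting points suggested above, not recommendations):

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.setInt("fs.s3a.threads.max", 64)       // fewer threads writing to S3 at once
conf.setInt("fs.s3a.max.total.tasks", 128)  // smaller queue of blocks waiting to upload
// If the connection itself looks flaky, fs.s3a.connection.timeout (in milliseconds)
// is the other knob to experiment with.
```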
03-24-2017
11:19 AM
Yes, I'm afraid that fast upload can overload the buffers in Hadoop 2.5, as it uses JVM heap to store blocks while it uploads them. The bigger the mismatch between the rate data is generated (i.e. how fast things can be read) and the upload bandwidth, the more heap you need. On a long-haul upload you usually have limited bandwidth, and the more distcp workers there are, the more that bandwidth is divided between them, so the bigger the mismatch.

In Hadoop 2.5 you can get away with tuning the fast uploader to use less heap. It's tricky enough to configure that in the HDP 2.5 docs we chose not to mention the fs.s3a.fast.upload option at all: it was just too confusing and we couldn't come up with good defaults which would work reliably. Which is why I rewrote it completely for HDP 2.6. The HDP 2.6/Apache Hadoop 2.8 (and already in HDCloud) block output stream can buffer on disk (the default) or via byte buffers, as well as in heap, and tries to do better queueing of writes.

For HDP 2.5, the tuning options are documented in the Hadoop 2.7 docs. Essentially, a lower value of fs.s3a.threads.core and fs.s3a.threads.max keeps the number of buffered blocks down, while setting fs.s3a.multipart.size to something like 10485760 (10 MB) and fs.s3a.multipart.threshold to the same value reduces how much is buffered before the uploads begin. Like I warned, you can end up spending time tuning, because the heap consumed increases with the threads.max value and decreases with the multipart threshold and size values. And over a remote connection, the more workers you have in the distcp operation (controlled by the -m option), the less bandwidth each one gets, so again: more heap overflows. You will invariably find out on the big uploads that there are limits.

As a result, in HDP 2.5 I'd recommend avoiding the fast upload except in one special case: you have a very high-speed connection to an S3 server in the same infrastructure, and you use it for code generating data, rather than big distcp operations, which can read data as fast as it can be streamed off multiple disks.
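As a rough sketch of that HDP 2.5 tuning (the thread counts here are illustrative low values; the multipart numbers are the 10 MB figures above):

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.setBoolean("fs.s3a.fast.upload", true)
conf.setInt("fs.s3a.threads.core", 2)                  // illustrative low thread counts:
conf.setInt("fs.s3a.threads.max", 4)                   // fewer threads => fewer blocks held in heap
conf.setLong("fs.s3a.multipart.size", 10485760L)       // 10 MB parts
conf.setLong("fs.s3a.multipart.threshold", 10485760L)  // start multipart uploads at 10 MB
```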
03-21-2017
01:10 PM
Good to hear it is fixed. In future, have a look at this list of causes of this exception which commonly surface in Hadoop. In the core Hadoop networking code we automatically add a link to that page, plus more diagnostics (e.g. destination hostname:port), to socket exceptions... maybe I should see if we can somehow wrap the exceptions coming up from the ASF libraries too.
03-21-2017
12:56 PM
1. You should be using the latest version of HDP or HDCloud you can, to get the speedups on the S3A read and write paths. HDP 2.5 has the read pipeline speedup, but not the listing code (used in partitioning) or the write pipeline.
2. Write your data back to HDFS, then at the end of the work, copy it to S3 (see the sketch below). That gives significantly better performance in the downstream jobs, and avoids the fundamental mismatch between how work is committed (Hive uses renames) and how S3 works (there are no renames, only slow copies).

Have a look at this document on Hive on S3 for more advice, including which options to set for maximum IO speedup.
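A minimal sketch of the "write to HDFS first, copy at the end" pattern; the paths and bucket name are placeholders, and for large outputs a distcp job is the usual tool for the final copy.

```scala
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()
val df = spark.read.parquet("hdfs:///warehouse/input")

// 1. Commit the job output to HDFS, where renames are fast and atomic.
df.write.parquet("hdfs:///warehouse/output")

// 2. Copy the finished output up to S3 once, at the end of the workflow.
val conf = spark.sparkContext.hadoopConfiguration
val src = new Path("hdfs:///warehouse/output")
val dst = new Path("s3a://my-bucket/warehouse/output")
FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst,
  false /* deleteSource */, conf)
```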
03-14-2017
07:55 PM
That message is odd. At a guess (and this is a guess, as HDFS isn't something I know the internals of), HDFS is rejecting the attempt to close the file because the namenode doesn't think the file is open. Now, does this happen every time? I could imagine this being a transient event as a namenode rebooted or something, but I'd be very surprised to see it repeatedly.
03-03-2017
10:04 PM
1 Kudo
EMRFS is an Amazon-proprietary replacement for HDFS for cluster storage.
We work on S3A, which is the open source client for reading and writing data in S3: this is not something you can replace HDFS with.
In HDP and HDCloud clusters running in EC2, you must use HDFS for the cluster filesystem, with the S3A client to read data from S3 and write it back at the end of a workflow.
We are doing lots of work on S3A performance, much of which is available in HDCloud and HDP 2.5. Note that you can use S3A for remote access to S3 data: between S3 regions, and from physical clusters wherever they live. This lets you use S3 as a backup repository for your Hadoop cluster data.
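For example, a minimal sketch of that remote access (the bucket name is a placeholder; credentials come from your usual fs.s3a.* settings):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val s3 = FileSystem.get(new URI("s3a://my-backup-bucket/"), conf)

// List a backup directory in the remote bucket, wherever the cluster itself lives.
s3.listStatus(new Path("/backups/")).foreach(status => println(status.getPath))
```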
03-02-2017
03:07 PM
Task not serializable is unrelated and very common. The way the Scala API works, operations on RDDs like map() work by having the state of the lambda expression copied over to all the worker nodes and then executed there. For this to happen, every object referenced inside the expression must be "Serializable" in the strict Java API sense: declared as something which can be serialized to a byte stream, sent over the network and reconstructed at the far end. Something you have declared outside the map, and are trying to use inside it, isn't serializable. At a guess: one of the Jetty classes, like the "exchange" variable. Workaround? Create the object inside the lambda expression, out of data that has been serialized (strings etc.).
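Here's a sketch of the pattern; HttpHelper below is a made-up stand-in for any non-serializable object, such as a Jetty exchange.

```scala
import org.apache.spark.sql.SparkSession

// A made-up class standing in for something non-serializable (e.g. a Jetty exchange).
class HttpHelper(val endpoint: String)   // does not extend Serializable

val spark = SparkSession.builder().appName("closure-demo").getOrCreate()
val ids = spark.sparkContext.parallelize(Seq("a", "b", "c"))

// This would fail with "Task not serializable": the driver-side helper is captured by the closure.
// val helper = new HttpHelper("http://example.org")
// ids.map(id => helper.endpoint + "/" + id).collect()

// Workaround: build the object inside the lambda, from serializable data such as the string.
val results = ids.mapPartitions { iter =>
  val helper = new HttpHelper("http://example.org")   // created on the executor
  iter.map(id => helper.endpoint + "/" + id)
}.collect()
```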
02-23-2017
09:26 PM
Python is easier to learn... Scala is a complex language. But as a Java developer, having some Scala knowledge may be good for your resume, and learning it in a notebook is an easy way to pick up the language compared to writing a complex program. One way to learn is to start with very small amounts of data, write tests in ScalaTest, and run them from Maven; that way you can use the APIs you are used to. But the interactive notebooks are a great way to play and iterate rapidly without running builds.
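For instance, a tiny ScalaTest sketch (assuming a recent ScalaTest on the test classpath, run with mvn test; the logic under test is just a stand-in for your own code):

```scala
import org.scalatest.funsuite.AnyFunSuite

class WordSplitSuite extends AnyFunSuite {

  test("a line splits into the expected words") {
    val words = "hello hadoop hello spark".split(" ")
    assert(words.length === 4)
    assert(words.count(_ == "hello") === 2)
  }
}
```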