Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26

About
Steve is a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1915 | 02-27-2018 04:47 PM |
| | 4480 | 03-03-2017 10:04 PM |
| | 1941 | 02-16-2017 10:18 AM |
| | 1027 | 01-20-2017 02:15 PM |
| | 9256 | 01-20-2017 02:02 PM |
01-16-2017
12:00 PM
One problem may be partitioning: the Spark app may not know how to divide processing of the .tar.gz amongst many workers, so it hands the whole thing off to one. That's a problem with .gz files in general. I haven't done any XML/tar processing work in Spark myself, so I'm not confident about where to begin. You could look at the history server to see how the work was split up. Otherwise: try throwing the work at Spark as a directory full of XML files (maybe gzipped individually), rather than a single .tar.gz. If that speeds things up, it's a sign that partitioning is the problem. It then becomes a matter of working out how to split those original 400MB source files into a smaller set (e.g. 20 x 20MB files) and seeing whether that parallelizes better.
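If it helps narrow that down, here's a rough sketch of how to compare the partitioning from the Spark shell; the paths are placeholders, and this assumes each individually-gzipped XML file is small enough to read whole:

```scala
// Hypothetical check: compare how many partitions Spark assigns to one big archive
// versus a directory of individually-gzipped XML files. `sc` is the shell's
// SparkContext; all paths are made up for illustration.
val archive = sc.binaryFiles("hdfs:///data/dump.tar.gz")
println(s"partitions for the single .tar.gz: ${archive.getNumPartitions}")   // typically 1

val split = sc.wholeTextFiles("hdfs:///data/split/*.xml.gz", minPartitions = 20)
println(s"partitions for the split files:    ${split.getNumPartitions}")     // roughly one per file
```

If the second number is much higher and the job speeds up to match, partitioning was the bottleneck.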
01-14-2017
01:34 PM
1 Kudo
One more thing: S3A doesn't handle R/W buckets with different ACLs attached to different parts of them, as the code expects to have write access anywhere in a read/write bucket. You should restrict access by bucket, not by trying to use some kind of ACL within a bucket.
01-14-2017
01:31 PM
1 Kudo
S3A on HDP 2.5 supports server-side encryption:

  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>AES256</value>
    <description>Specify a server-side encryption algorithm for s3a: file system.
      Unset by default, and the only other currently allowable value is AES256.
    </description>
  </property>
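If you'd rather set it programmatically than in core-site.xml, here's a minimal sketch (bucket and path are placeholders; assumes the HDP 2.5 s3a client libraries are on the classpath):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask S3A to request SSE-S3 (AES256) on every object written through this filesystem.
val conf = new Configuration()
conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")

val fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)
val out = fs.create(new Path("s3a://my-bucket/encrypted/example.txt"))
out.write("hello".getBytes("UTF-8"))
out.close()
```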
01-14-2017
01:27 PM
Tom: you don't need to set the fs.s3a.access.key / fs.s3a.secret.key properties if running on EC2; S3A will pick up the auth details automatically from the IAM metadata made available to processes in the VM.
01-14-2017
01:24 PM
2 Kudos
If things aren't working with HDP 2.5 or HDCloud, I'd recommend starting with [Troubleshooting S3a](https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html). If you are using ASF released binaries, those docs are mostly valid too, though as we pulled in many of the later features coming in S3A on Hadoop 2.8 (after writing them!), the docs are a bit inconsistent. The closest ASF docs on troubleshooting are those for [Hadoop 2.8](https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md#troubleshooting-s3a).

As Kasper pointed out, this is due to AWS JAR versioning. The Amazon SDK has been pretty brittle against change, and you *must* run with the same version of the AWS SDK that Hadoop was built with (which in turn needs a consistent version of Jackson, ...).

- Hadoop 2.7.x: AWS SDK 1.7.4
- Hadoop 2.8.x: AWS SDK 1.10.6
- Hadoop 2.9+: probably 1.10.11+ or later, with Jackson bumped up to 2.7.8 to match.
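A quick way to see what you've actually ended up with on the classpath (a diagnostic sketch of my own, not from those docs) is to print where the SDK and S3A classes were loaded from, then check the JAR names against the versions above:

```scala
// Print the JARs the AWS SDK S3 client and the S3A filesystem were loaded from;
// a version mismatch usually shows up right in the file names.
val awsSdkJar = classOf[com.amazonaws.services.s3.AmazonS3Client]
  .getProtectionDomain.getCodeSource.getLocation
val hadoopAwsJar = Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getProtectionDomain.getCodeSource.getLocation

println(s"AWS SDK:    $awsSdkJar")
println(s"hadoop-aws: $hadoopAwsJar")
```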
01-09-2017
09:45 AM
Like I said, if you are running on EC2, you should be able to play with Netflix's Chaos Monkey directly. I haven't used it for a while; https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide covers starting it. I think it's become more complex than in the early days, when it was more of a CLI thing.
01-06-2017
08:20 PM
Good Q. Not explicitly, AFAIK. We do have an integral chaos monkey in Slider (incubating), which you just turn on, give a sleep time, and then schedule multiple actions (worker death, AM death). If you are working with an EC2 cluster, you can just use Netflix's Chaos Monkey library and have it do the killing. Otherwise, the general best practice is to have something automated SSH in and find/kill processes. I don't have any up-to-date code for this; I used to, somewhere, but it relates to older Linux versions and has probably aged by now. I'm afraid you'll have to look around online for that. What is really, really slick for testing HA failover is code to turn real/virtual network switches off. This is good because it lets you rigorously test what happens if there's a network partition and everything stays running, just unreachable. Pro tip: issuing a kill -SIGSTOP is a great way to simulate a hung (as opposed to a failed) process.
12-14-2016
02:48 PM
1 Kudo
The Spark logging code is Spark's Logging trait, which does lazy evaluation of expressions like logInfo(s"status $value"). Sadly, that's private to the Spark code, so you can't use it outside it. See [SPARK-13928](https://issues.apache.org/jira/browse/SPARK-13928) for the discussion, and know that I don't really agree with the decision. When I was moving some code from org.apache.spark to a different package, I ended up having to copy & paste the entire Spark logging class into my own code. Not ideal, but it works: CloudLogging.scala. Bear in mind that underneath, Spark uses SLF4J and whatever backs it, such as Log4j; you can use SLF4J directly for its lazy evaluation of log.info("status {}", value). However, the Spark lazy string evaluation is easier to use, and I believe it is even lazy about evaluating functions inside the strings (e.g. s"users = ${users.count()}"), so it can be more efficient. The CloudLogging class I've linked to shows how Spark binds to SLF4J; feel free to grab and use it.
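If you don't want to copy the whole class, the core of the pattern is tiny; here's a minimal sketch (my own names, not the real CloudLogging code) of a lazy-logging trait backed by SLF4J:

```scala
import org.slf4j.{Logger, LoggerFactory}

// The by-name parameter (msg: => String) means the interpolated string, and any
// function calls inside it, are only evaluated when that log level is enabled.
trait LazyLogging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logDebug(msg: => String): Unit =
    if (log.isDebugEnabled) log.debug(msg)
}

// Usage: mix it in, and the count() below only runs if INFO is enabled.
// class MyJob extends LazyLogging { logInfo(s"users = ${users.count()}") }
```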
12-12-2016
10:56 AM
3 Kudos
You can use different buckets which a single account has access to, but, no, you can't do work across accounts, because of that single fs.s3a.access.key property. There is one *dangerous* way to work around that, which is to put the key and secret in the URL, of the form s3a://key:secret@bucket/path . That encodes the secret in the URL, and takes precedence over anything in the configuration. But those URLs will end up being logged in places, so there's a risk of the secrets getting into the logs; this is why warning messages are printed when you authenticate this way. This is something which is going to have to be fixed, not just for the authentication but to deal with the rollout of Amazon's V4 authentication mechanism, where you need to specify the S3 endpoint for the region you want to work with (Frankfurt and Seoul so far). Supporting multiple regions is a similar problem to having multiple accounts: different buckets need different settings.
11-29-2016
12:30 PM
1 Kudo
You need to set the s3a properties to log in; these are separate from the s3n ones.

See: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

See also: http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html
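As a concrete (if hedged) example of the difference from a Spark shell, where `sc` is the SparkContext: the property names below are the real s3a ones, the key values and bucket are placeholders:

```scala
// Set the S3A credentials on the Hadoop configuration Spark uses; note these are
// the fs.s3a.* properties, not the older fs.s3n.* ones.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_ACCESS_KEY")

val lines = sc.textFile("s3a://your-bucket/path/to/data.csv")
println(lines.count())
```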