Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26

About
Steve is a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1915 | 02-27-2018 04:47 PM |
| | 4480 | 03-03-2017 10:04 PM |
| | 1941 | 02-16-2017 10:18 AM |
| | 1027 | 01-20-2017 02:15 PM |
| | 9256 | 01-20-2017 02:02 PM |
01-16-2017
12:00 PM
One problem may be partitioning: the Spark app may not know how to divide processing of the .tar.gz amongst many workers, so it hands the whole thing off to one. That's a problem with .gz files in general. I haven't done any XML/tar processing work in Spark myself, so I'm not confident about where to begin. You could look at the history server to see how the work was split up. Otherwise: try throwing the work at Spark as a directory full of XML files (maybe gzipped individually), rather than a single .tar.gz. If that speeds things up, it's a sign that partitioning is the problem. It then becomes a matter of working out how to split those original 400MB source files into a smaller set (e.g. 20 x 20MB files) and seeing whether that parallelizes better.
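If it helps narrow that down, here's a rough sketch of how to compare the partitioning from the Spark shell; the paths are placeholders, and this assumes each individually-gzipped XML file is small enough to read whole:

```scala
// Hypothetical check: compare how many partitions Spark assigns to one big archive
// versus a directory of individually-gzipped XML files. `sc` is the shell's
// SparkContext; all paths are made up for illustration.
val archive = sc.binaryFiles("hdfs:///data/dump.tar.gz")
println(s"partitions for the single .tar.gz: ${archive.getNumPartitions}")   // typically 1

val split = sc.wholeTextFiles("hdfs:///data/split/*.xml.gz", minPartitions = 20)
println(s"partitions for the split files:    ${split.getNumPartitions}")     // roughly one per file
```

If the second number is much higher and the job speeds up to match, partitioning was the bottleneck.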
01-14-2017
01:34 PM
1 Kudo
One more thing: S3A doesn't handle R/W buckets with different ACLs attached to different parts of them, as the code expects to have write access anywhere in a read/write bucket. You should restrict access by bucket, not by trying to use some kind of ACL within a bucket.
01-14-2017
01:31 PM
1 Kudo
S3A on HDP 2.5 supports server-side encryption:

  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>AES256</value>
    <description>Specify a server-side encryption algorithm for s3a: file system.
      Unset by default, and the only other currently allowable value is AES256.
    </description>
  </property>
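If you'd rather set it programmatically than in core-site.xml, here's a minimal sketch (bucket and path are placeholders; assumes the HDP 2.5 s3a client libraries are on the classpath):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask S3A to request SSE-S3 (AES256) on every object written through this filesystem.
val conf = new Configuration()
conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")

val fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)
val out = fs.create(new Path("s3a://my-bucket/encrypted/example.txt"))
out.write("hello".getBytes("UTF-8"))
out.close()
```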
01-14-2017
01:27 PM
Tom: you don't need to set the fs.s3a.access.key / fs.s3a.secret.key properties if running on EC2; S3A will pick up the auth details automatically from the IAM metadata made available to processes in the VM.
01-14-2017
01:24 PM
2 Kudos
If things aren't working with HDP 2.5 or HDCloud, I'd recommend starting with [Troubleshooting S3a](https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html). If you are using ASF released binaries, those docs are mostly valid too, though as we pulled in many of the later features coming in S3A on Hadoop 2.8 (after writing them!), the docs are a bit inconsistent. The closest ASF docs on troubleshooting are those for [Hadoop 2.8](https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md#troubleshooting-s3a).

As Kasper pointed out, this is due to AWS JAR versioning. The Amazon SDK has been pretty brittle against change, and you *must* run with the same version of the AWS SDK that Hadoop was built with (which in turn needs a consistent version of Jackson, ...).

- Hadoop 2.7.x: AWS SDK 1.7.4
- Hadoop 2.8.x: AWS SDK 1.10.6
- Hadoop 2.9+: probably 1.10.11+ or later, with Jackson bumped up to 2.7.8 to match.
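A quick way to see what you've actually ended up with on the classpath (a diagnostic sketch of my own, not from those docs) is to print where the SDK and S3A classes were loaded from, then check the JAR names against the versions above:

```scala
// Print the JARs the AWS SDK S3 client and the S3A filesystem were loaded from;
// a version mismatch usually shows up right in the file names.
val awsSdkJar = classOf[com.amazonaws.services.s3.AmazonS3Client]
  .getProtectionDomain.getCodeSource.getLocation
val hadoopAwsJar = Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getProtectionDomain.getCodeSource.getLocation

println(s"AWS SDK:    $awsSdkJar")
println(s"hadoop-aws: $hadoopAwsJar")
```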
01-09-2017
09:45 AM
Like I said, if you are running on EC2, you should be able to play with Netflix's Chaos Monkey directly. I haven't used it for a while; https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide covers starting it. I think it's become more complex than in the early days, when it was more of a CLI thing.
01-06-2017
08:20 PM
Good Q. Not explicitly, AFAIK. We do have an integral chaos monkey in Slider (incubating), which you just turn on, give a sleep time, and then schedule multiple actions (worker death, AM death). If you are working with an EC2 cluster, you can just use Netflix's Chaos Monkey library and have it do the killing. Otherwise, the general best practice is to have something automated SSH in and find/kill processes. I don't have any up-to-date code for this; I used to, somewhere, but it relates to older Linux versions and has probably aged by now. I'm afraid you'll have to look around online for that. What is really, really slick for testing HA failover is code to turn real/virtual network switches off. This is good because it lets you rigorously test what happens if there's a network partition and everything stays running, just unreachable. Pro tip: issuing a kill -SIGSTOP is a great way to simulate a hung (as opposed to a failed) process.
12-14-2016
02:48 PM
1 Kudo
The Spark logging code is Spark's Logging trait, which does lazy evaluation of expressions like logInfo(s"status $value"). Sadly, that's private to the Spark code, so you can't use it outside it. See [SPARK-13928](https://issues.apache.org/jira/browse/SPARK-13928) for the discussion, and know that I don't really agree with the decision. When I was moving some code from org.apache.spark to a different package, I ended up having to copy & paste the entire Spark logging class into my own code. Not ideal, but it works: CloudLogging.scala. Bear in mind that underneath, Spark uses SLF4J and whatever backs it, such as Log4j; you can use SLF4J directly for its lazy evaluation of log.info("status {}", value). However, the Spark lazy string evaluation is easier to use, and I believe it is even lazy about evaluating functions inside the strings (e.g. s"users = ${users.count()}"), so it can be more efficient. The CloudLogging class I've linked to shows how Spark binds to SLF4J; feel free to grab and use it.
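If you don't want to copy the whole class, the core of the pattern is tiny; here's a minimal sketch (my own names, not the real CloudLogging code) of a lazy-logging trait backed by SLF4J:

```scala
import org.slf4j.{Logger, LoggerFactory}

// The by-name parameter (msg: => String) means the interpolated string, and any
// function calls inside it, are only evaluated when that log level is enabled.
trait LazyLogging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logDebug(msg: => String): Unit =
    if (log.isDebugEnabled) log.debug(msg)
}

// Usage: mix it in, and the count() below only runs if INFO is enabled.
// class MyJob extends LazyLogging { logInfo(s"users = ${users.count()}") }
```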
12-12-2016
10:56 AM
3 Kudos
You can use different buckets which a single account has access to, but, no, you can't do work across accounts, because of that single fs.s3a.access.key property. There is one *dangerous* way to work around that, which is to put the key and secret in the URL, of the form s3a://key:secret@bucket/path . That encodes the secret in the URL, and takes precedence over anything in the configuration. But those URLs will end up being logged in places, so there's a risk of the secrets getting into the logs; this is why warning messages are printed when you authenticate this way. This is something which is going to have to be fixed, not just for the authentication but to deal with the rollout of Amazon's V4 authentication mechanism, where you need to specify the S3 endpoint for the region you want to work with (Frankfurt and Seoul so far). Supporting multiple regions is a similar problem to having multiple accounts: different buckets need different settings.
11-29-2016
12:30 PM
1 Kudo
You need to set the s3a properties to log in; these are separate from the s3n ones.

See: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

See also: http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html
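As a concrete (if hedged) example of the difference from a Spark shell, where `sc` is the SparkContext: the property names below are the real s3a ones, the key values and bucket are placeholders:

```scala
// Set the S3A credentials on the Hadoop configuration Spark uses; note these are
// the fs.s3a.* properties, not the older fs.s3n.* ones.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_ACCESS_KEY")

val lines = sc.textFile("s3a://your-bucket/path/to/data.csv")
println(lines.count())
```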