Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer, mostly working on cloud integration
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3296 | 02-27-2018 04:47 PM
 | 5812 | 03-03-2017 10:04 PM
 | 3441 | 02-16-2017 10:18 AM
 | 1810 | 01-20-2017 02:15 PM
 | 11748 | 01-20-2017 02:02 PM
12-14-2015
07:30 PM
Its actual title is "Hadoop and Kerberos: The Madness Beyond the Gate"; there's an H.P. Lovecraft theme of "forbidden knowledge which will drive you insane", which is less a joke and more commentary. It's actually rendered on GitBook. If you are working with Kerberos, get a copy of the O'Reilly Hadoop Security book too. My little e-book was written to cover the bits that were left out: to extend rather than replace. Finally, being open source: contributions are welcome.
12-14-2015
07:26 PM
Thank you. View it as working notes, written to avoid me having to send emails to colleagues trying to understand things. And being working notes, it only covers the problems I've encountered. There are many more out there; in fact I am having serious problems with Kerberos right now which have even me defeated. So don't expect it to solve all your problems.
12-13-2015
03:41 PM
I must disagree. Dedicating machines via labels is not always the right choice. Imagine you give 20 nodes in a 100-node cluster the label "spark", with only spark-queue work able to run on them. When there's no work on that queue, the machines are idle. When there is work in the queue, it'll only run on those 20 nodes. There's also replication and data locality to consider: if the data you need isn't on one of those 20 nodes, it'll be read remotely, which can also hurt performance. You really need to look at the cluster and the workload to make a good choice.
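For reference, attaching a label to nodes goes through the YARN admin CLI; a minimal sketch, assuming a label called "spark" and a hypothetical hostname:
  yarn rmadmin -addToClusterNodeLabels "spark"
  yarn rmadmin -replaceLabelsOnNode "node01.example.com=spark"
After that, only queues given access to the "spark" label can place containers on those nodes, which is exactly where the idle-capacity tradeoff above comes from.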
12-13-2015
03:35 PM
1 Kudo
If you are running Spark applications on a YARN cluster then you do not need to directly allocate memory or machines to them. You can dedicate machines via labels, either for exclusive workloads or to handle heterogeneous hardware better. If there is some application where latency and the ability to respond immediately to spikes in load matter, then dedicated labels work; for example, HBase in interactive applications. If different parts of the cluster have different hardware configurations (example: RAM, GPU, SSD for local storage), then labels help you schedule jobs which need those features onto only those machines.
Once you start using labels, the labelled hosts will be underutilized when that specific work isn't running: that's the permanent tradeoff. If you are just running queries on a cluster where latency isn't so critical that you want to pre-allocate capacity on isolated machines, then using queues is more efficient. You can also set up queue priorities and pre-emption, so your important Spark queries can actually pre-empt (i.e. kill) ongoing work from lower-priority applications.
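As a rough sketch of the queue route (queue names and capacities here are made up and need tuning for your cluster), a two-queue split in capacity-scheduler.xml:
  yarn.scheduler.capacity.root.queues = default,spark
  yarn.scheduler.capacity.root.default.capacity = 60
  yarn.scheduler.capacity.root.spark.capacity = 40
and pre-emption switched on in yarn-site.xml:
  yarn.resourcemanager.scheduler.monitor.enable = true
  yarn.resourcemanager.scheduler.monitor.policies = org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy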
What is important for Spark is having your jobs ask for the memory they really need: Spark likes a lot of memory, and if the Spark JVM/Python code consumes more than was allocated to it in the YARN container request, the processes may get killed.
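A sketch of what that looks like on the command line; the numbers are purely illustrative:
  spark-submit --master yarn-cluster \
    --driver-memory 2g \
    --executor-memory 4g \
    --conf spark.yarn.executor.memoryOverhead=768 \
    ...
The memoryOverhead value (in MB) is the headroom YARN grants above the executor JVM heap; if Python workers or off-heap buffers push the process past the total container size, YARN kills the container.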
12-12-2015
02:09 PM
2 Kudos
There isn't really much in the way of Ceph integration. There is a published filesystem client JAR which, if you get it on your classpath, should let you refer to data using ceph:// as the path. You also appear to need its native lib on the path, which is a bit trickier. This comes from the Ceph team, not the Hadoop people, and:
1. I don't know how up to date/in sync it is with recent Hadoop versions.
2. It doesn't get released or tested by the Hadoop team: we don't know how well it works, or how it goes wrong.
Filesystems are an interesting topic in Hadoop. It's a core, critical part of the system: you don't want to lose data. And while there's lots of support for different filesystem implementations in Hadoop (s3n, avs, ftp, swift, file:), HDFS is the one things are built and tested against. Object stores (s3, swift) are not real filesystems, and cannot be used in place of HDFS as the direct output of MR, Tez or Spark jobs; and absolutely never to run HBase or Accumulo atop. I don't know where Ceph fits in here. It's probably safe to use it as a source of data; it's as the destination where the differences usually show up.
Finally: HDP is not tested on Ceph, so cannot be supported. We do test on HDFS, against Azure storage (in HDInsight), and on other filesystems (e.g. Isilon). I don't know of anyone else who tests Hadoop on Ceph, the way, say, Red Hat does with GlusterFS.
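If you do want to experiment, the wiring follows the standard Hadoop fs.<scheme>.impl pattern; a rough sketch for core-site.xml, where the class name is an assumption you should verify against the Ceph plugin version you download:
  fs.ceph.impl = org.apache.hadoop.fs.ceph.CephFileSystem
With the JAR on the classpath and the native library on the library path, something like hadoop fs -ls ceph://your-cluster/path (hypothetical URI) should then resolve through that client.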
12-12-2015
02:01 PM
4 Kudos
This looks like it's being triggered by the Spark -> timeline server integration, as ATS is going OOM when handling Spark events. Which means it's my code running in the Spark jobs triggering this. What kind of jobs are you running? Short-lived? Long-lived? Many executors? The best short-term fix is for you to disable the timeline server integration, and set the Spark applications up to log to HDFS instead, with the history server reading from there. The details of this are covered in Spark Monitoring.
1. In the Spark job configuration you need to disable the ATS publishing. Find the line
  spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
and delete it. Then set the property spark.history.fs.logDirectory to an HDFS directory which must be writeable by everyone, for example hdfs://shared/logfiles:
  spark.eventLog.enabled true
  spark.eventLog.compress true
  spark.history.fs.logDirectory hdfs://shared/logfiles
2. In the history server you need to switch to the filesystem log provider:
  spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
  spark.history.fs.logDirectory hdfs://shared/logfiles
The next Spark release we'll have up for download (soon!) will log fewer events to the timeline server, which should hopefully reduce the problems there. There's also a lot of work going on in the timeline server for future Hadoop versions to handle larger amounts of data, by mixing data kept in HDFS with the leveldb data. For now, switching to the filesystem provider is your best bet.
12-09-2015
07:17 PM
A key one is straightforward: HDFS is where the data is, and YARN schedules work next to that data. YARN clusters are very widely deployed; Spark on YARN lets you run Spark queries against such a cluster without you even needing to ask permission from the cluster ops team. To them, it's just another client job.
12-09-2015
07:14 PM
Note that Spark 1.5+ is needed for Spark jobs of duration > 72h not to fail when their Kerberos tickets expire. And you'll need to supply a keytab with which the Spark AM can renew tickets. For short-lived queries, this problem should not surface.
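For the long-lived case, a sketch of the submission; the principal and keytab path here are hypothetical:
  spark-submit --master yarn-cluster \
    --principal etl-user@EXAMPLE.COM \
    --keytab /etc/security/keytabs/etl-user.keytab \
    ...
The AM holds on to the keytab and uses it to log in again as the original tickets approach expiry.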
12-09-2015
01:42 PM
1 Kudo
Note also you are going to get less IO bandwidth, as you move from 3 replicas (and hence 3 places to run code locally) to what is essentially a single replica, with the data spread across the network. Erasure coding is best for storing cold data, where the improvement in storage density is tangible; it will hurt performance through:
- loss of locality (network layer)
- loss of replicas (disk IO layer)
- the need to rebuild the raw data (CPU overhead)
I don't think we have any figures yet on the impact. On a brighter note, 10GbE ToR switches are falling in price, so you could think about going to 10 Gb on-rack, even if the backbone remains a bottleneck.
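When erasure coding does ship (it's a Hadoop 3 feature), the intent is that you mark specific directories for it rather than the whole filesystem; a sketch of tagging a cold-data directory, with a hypothetical path:
  hdfs ec -setPolicy -path /data/archive -policy RS-6-3-1024k
Data written under that path is then striped rather than replicated, which is exactly where the locality and rebuild costs above apply.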
11-30-2015
06:14 PM
Primarily so that Ambari can use it to deploy and manage things via Slider; it doesn't need to be installed on other machines in the cluster.