Member since: 09-26-2015
Posts: 135
Kudos Received: 85
Solutions: 26
About
Steve's a Hadoop committer, mostly working on cloud integration
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3296 | 02-27-2018 04:47 PM
 | 5812 | 03-03-2017 10:04 PM
 | 3441 | 02-16-2017 10:18 AM
 | 1810 | 01-20-2017 02:15 PM
 | 11748 | 01-20-2017 02:02 PM
12-14-2015
07:30 PM
Its actual title is "Hadoop and Kerberos: The Madness Beyond the Gate"; there's an H.P. Lovecraft theme of "forbidden knowledge which will drive you insane", which is less a joke and more commentary. It's actually rendered on GitBook. If you are working with Kerberos, get a copy of the O'Reilly Hadoop Security book too. My little e-book was written to cover the bits that were left out: to extend rather than replace. Finally, being open source: contributions are welcome.
12-14-2015
07:26 PM
Thank you. View it as working notes, written to avoid me having to send emails to colleagues trying to understand things. And being working notes, it only covers the problems I've encountered. There are many more out there; in fact I am having serious problems with Kerberos right now which have even me defeated. So don't expect it to solve all your problems.
12-13-2015
03:41 PM
I must disagree. Dedicating machines via labels is not always the right choice. Imagine you give 20 nodes in a 100-node cluster the label "spark", with only spark-queue work able to run on them. When there's no work on that queue, the machines are idle. When there is work in the queue, it'll only run on those 20 nodes. There's also replication and data locality to consider: if the data you need isn't on one of those 20 nodes, it'll be read remotely, which can also hurt performance. You really need to look at the cluster and the workload to make a good choice.
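For reference, attaching a label to nodes goes through the YARN admin CLI; a minimal sketch, assuming a label called "spark" and a hypothetical hostname:
  yarn rmadmin -addToClusterNodeLabels "spark"
  yarn rmadmin -replaceLabelsOnNode "node01.example.com=spark"
After that, only queues given access to the "spark" label can place containers on those nodes, which is exactly where the idle-capacity tradeoff above comes from.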
12-13-2015
03:35 PM
1 Kudo
If you are running Spark applications on a YARN cluster then you do not need to directly allocate memory or machines to them. You can dedicate machines via labels, either for exclusive workloads or to handle heterogeneous hardware better. If there is some application where latency and the ability to respond immediately to spikes in load matter, then dedicated labels work; for example, HBase in interactive applications. If different parts of the cluster have different hardware configurations (example: RAM, GPU, SSD for local storage), then labels help you schedule jobs which need those features onto only those machines.
Once you start using labels, the labelled hosts will be underutilized when that specific work isn't running: that's the permanent tradeoff. If you are just running queries on a cluster where latency isn't so critical that you want to pre-allocate capacity on isolated machines, then using queues is more efficient. You can also set up queue priorities and pre-emption, so your important Spark queries can actually pre-empt (i.e. kill) ongoing work from lower-priority applications.
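As a rough sketch of the queue route (queue names and capacities here are made up and need tuning for your cluster), a two-queue split in capacity-scheduler.xml:
  yarn.scheduler.capacity.root.queues = default,spark
  yarn.scheduler.capacity.root.default.capacity = 60
  yarn.scheduler.capacity.root.spark.capacity = 40
and pre-emption switched on in yarn-site.xml:
  yarn.resourcemanager.scheduler.monitor.enable = true
  yarn.resourcemanager.scheduler.monitor.policies = org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy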
What is important for Spark is having your jobs ask for the memory they really need: Spark likes a lot of memory, and if the Spark JVM/Python code consumes more than was allocated to it in the YARN container request, the processes may get killed.
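A sketch of what that looks like on the command line; the numbers are purely illustrative:
  spark-submit --master yarn-cluster \
    --driver-memory 2g \
    --executor-memory 4g \
    --conf spark.yarn.executor.memoryOverhead=768 \
    ...
The memoryOverhead value (in MB) is the headroom YARN grants above the executor JVM heap; if Python workers or off-heap buffers push the process past the total container size, YARN kills the container.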
12-12-2015
02:09 PM
2 Kudos
There isn't really much in the way of Ceph integration. There is a published filesystem client JAR which, if you get it on your classpath, should let you refer to data using ceph:// as the path. You also appear to need its native lib on the path, which is a bit trickier. This comes from the Ceph team, not the Hadoop people, and:
1. I don't know how up to date/in sync it is with recent Hadoop versions.
2. It doesn't get released or tested by the Hadoop team: we don't know how well it works, or how it goes wrong.
Filesystems are an interesting topic in Hadoop. It's a core, critical part of the system: you don't want to lose data. And while there's lots of support for different filesystem implementations in Hadoop (s3n, avs, ftp, swift, file:), HDFS is the one things are built and tested against. Object stores (s3, swift) are not real filesystems, and cannot be used in place of HDFS as the direct output of MR, Tez or Spark jobs; and absolutely never to run HBase or Accumulo atop. I don't know where Ceph fits in here. It's probably safe to use it as a source of data; it's as the destination where the differences usually show up.
Finally: HDP is not tested on Ceph, so cannot be supported. We do test on HDFS, against Azure storage (in HDInsight), and on other filesystems (e.g. Isilon). I don't know of anyone else who tests Hadoop on Ceph, the way, say, Red Hat does with GlusterFS.
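If you do want to experiment, the wiring follows the standard Hadoop fs.<scheme>.impl pattern; a rough sketch for core-site.xml, where the class name is an assumption you should verify against the Ceph plugin version you download:
  fs.ceph.impl = org.apache.hadoop.fs.ceph.CephFileSystem
With the JAR on the classpath and the native library on the library path, something like hadoop fs -ls ceph://your-cluster/path (hypothetical URI) should then resolve through that client.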
12-12-2015
02:01 PM
4 Kudos
This looks like it's being triggered by the Spark -> timeline server integration, as ATS is going OOM when handling Spark events. Which means it's my code running in the Spark jobs triggering this. What kind of jobs are you running? Short-lived? Long-lived? Many executors? The best short-term fix is for you to disable the timeline server integration, and set the Spark applications up to log to HDFS instead, with the history server reading from there. The details of this are covered in Spark Monitoring.
1. In the Spark job configuration you need to disable the ATS publishing. Find the line
  spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
and delete it. Then set the property spark.history.fs.logDirectory to an HDFS directory which must be writeable by everyone, for example hdfs://shared/logfiles:
  spark.eventLog.enabled true
  spark.eventLog.compress true
  spark.history.fs.logDirectory hdfs://shared/logfiles
2. In the history server you need to switch to the filesystem log provider:
  spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
  spark.history.fs.logDirectory hdfs://shared/logfiles
The next Spark release we'll have up for download (soon!) will log fewer events to the timeline server, which should hopefully reduce the problems there. There's also a lot of work going on in the timeline server for future Hadoop versions to handle larger amounts of data, by mixing data kept in HDFS with the leveldb data. For now, switching to the filesystem provider is your best bet.
12-09-2015
07:17 PM
A key one is straightforward: HDFS is where the data is, and YARN schedules work next to that data. YARN clusters are very widely deployed; Spark on YARN lets you run Spark queries against such a cluster without you even needing to ask permission from the cluster ops team. To them, it's just another client job.
12-09-2015
07:14 PM
Note that Spark 1.5+ is needed for Spark jobs of duration > 72h not to fail when their Kerberos tickets expire. And you'll need to supply a keytab with which the Spark AM can renew tickets. For short-lived queries, this problem should not surface.
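For the long-lived case, a sketch of the submission; the principal and keytab path here are hypothetical:
  spark-submit --master yarn-cluster \
    --principal etl-user@EXAMPLE.COM \
    --keytab /etc/security/keytabs/etl-user.keytab \
    ...
The AM holds on to the keytab and uses it to log in again as the original tickets approach expiry.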
12-09-2015
01:42 PM
1 Kudo
Note also you are going to get less IO bandwidth, as you move from 3 replicas (and hence 3 places to run code locally) to what is essentially a single replica, with the data spread across the network. Erasure coding is best for storing cold data, where the improvement in storage density is tangible; it will hurt performance through:
- loss of locality (network layer)
- loss of replicas (disk IO layer)
- the need to rebuild the raw data (CPU overhead)
I don't think we have any figures yet on the impact. On a brighter note, 10GbE ToR switches are falling in price, so you could think about going to 10 Gb on-rack, even if the backbone remains a bottleneck.
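When erasure coding does ship (it's a Hadoop 3 feature), the intent is that you mark specific directories for it rather than the whole filesystem; a sketch of tagging a cold-data directory, with a hypothetical path:
  hdfs ec -setPolicy -path /data/archive -policy RS-6-3-1024k
Data written under that path is then striped rather than replicated, which is exactly where the locality and rebuild costs above apply.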
11-30-2015
06:14 PM
Primarily so that Ambari can use it to deploy and manage things via Slider; it doesn't need to be installed on other machines in the cluster.