Member since: 07-11-2014
Posts: 7
Kudos Received: 0
Solutions: 0
07-29-2014
01:15 AM
@Tgrayson wrote: Mike, [...] We do have an API for Cloudera Manager that allows you to inspect and set cluster configuration values programmatically (documented here http://cloudera.github.io/cm_api/apidocs/v6/ and here for examples and historical version reference http://cloudera.github.io/cm_api/). [...]

Did those links go dead? I'm getting 404s now.
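For readers who want to try the API mentioned in the quote, here is a minimal sketch using only the Python standard library. It assumes the v6 REST layout of `/api/v6/clusters/<cluster>/services/<service>/config` and Cloudera Manager's default web port 7180; the host, cluster, and service names are hypothetical placeholders, and the helper functions are illustrative, not part of any Cloudera client library.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def cm_config_url(host, cluster, service, api_version="v6", port=7180, scheme="http"):
    """Build the URL of a service's configuration resource.

    Assumes the path layout /api/<version>/clusters/<cluster>/services/<service>/config.
    Cluster and service names are percent-encoded because they may contain spaces.
    """
    base = f"{scheme}://{host}:{port}/api/{api_version}"
    return (
        f"{base}/clusters/{quote(cluster, safe='')}"
        f"/services/{quote(service, safe='')}/config"
    )


def fetch_config(url, user, password):
    """GET the configuration as parsed JSON using HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = Request(url, headers={"Authorization": f"Basic {token}"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


# Hypothetical names, for illustration only.
url = cm_config_url("cm.example.com", "Cluster 1", "hdfs1")
print(url)
```

Calling `fetch_config(url, user, password)` would then return the service configuration, provided the server is reachable and the credentials are valid.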
07-23-2014
02:45 AM
Thanks, Todd (oh, and hi to a former fellow Navidecer! Been a while... 😉). I'm going to give the API a look, though first I'll try the config dump. I wonder if a staged migration would be possible. First, temporarily lower replication to a minimum, maybe two. Then start phasing out datanodes and letting them be marked as offline. Finally, put those under CM control and migrate data from the other remaining nodes. Need to have a think about this...
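Before phasing out datanodes, it is worth checking that the remaining nodes can still hold every replica. A rough back-of-the-envelope sketch follows; the function name, the 80% headroom figure, and the example numbers are all assumptions for illustration, not measurements from this cluster.

```python
def fits_after_decommission(logical_tb, replication, nodes_remaining,
                            tb_per_node, headroom=0.8):
    """Return True if the remaining datanodes can store all replicas.

    logical_tb      -- size of the data before replication, in TB
    replication     -- HDFS replication factor in effect
    nodes_remaining -- datanodes still in service after the phase-out step
    tb_per_node     -- raw disk capacity per datanode, in TB
    headroom        -- fraction of raw capacity we are willing to fill (assumed)
    """
    raw_needed = logical_tb * replication
    usable = nodes_remaining * tb_per_node * headroom
    return raw_needed <= usable


# Hypothetical cluster: 10 TB of data, 4 datanodes of 8 TB left in service.
print(fits_after_decommission(10, 3, 4, 8))  # replication 3
print(fits_after_decommission(10, 2, 4, 8))  # replication lowered to 2
```

With these example numbers, replication 3 needs 30 TB against 25.6 TB usable, while replication 2 needs only 20 TB, which is the point of lowering replication first.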
07-21-2014
04:29 PM
Hi, I tried a search on this, but the topic is a bit squishy: I'm trying to find a good, non-destructive way to install Cloudera Manager and integrate an *existing* and working cluster. Said cluster was manually installed (for the learning experience), as per the CDH5 docs and the Hadoop Operations book.

Some first minor attempts, using one of the datanodes/nodemanagers as a guinea pig, scared me: CM tried to force-install all sorts of packages without ever checking what was already there, and it made no attempt to read the existing config files and import them first.

I have local repositories (for yum) all set up and working, and I know how to point CM at them. However, I would like to be able to point CM at my existing nodes (namenodes/resourcemanagers, datanodes/nodemanagers, etc.) and integrate their existing config, adjusting what's needed and adding packages only when I expressly tell CM to. Is there a safe way?

Thanks,
Mike
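One way to capture the hand-written configuration before any tool touches it is to parse the `*-site.xml` files into plain dictionaries and diff them against whatever a managed install would write. The sketch below relies only on the standard Hadoop `<configuration>/<property>/<name>/<value>` layout; the helper names, the sample values, and the `/data/nn` and `/dfs/nn` paths are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_site_xml(text):
    """Parse the text of a Hadoop *-site.xml file into a {name: value} dict."""
    root = ET.fromstring(text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_configs(existing, proposed):
    """Return {key: (existing_value, proposed_value)} for every key that differs."""
    keys = sorted(set(existing) | set(proposed))
    return {
        key: (existing.get(key), proposed.get(key))
        for key in keys
        if existing.get(key) != proposed.get(key)
    }


# Hypothetical sample contents; real files would be read from the config directory.
manual = """
<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/data/nn</value></property>
</configuration>
"""
managed = """
<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/dfs/nn</value></property>
</configuration>
"""

for key, (old, new) in diff_configs(parse_site_xml(manual), parse_site_xml(managed)).items():
    print(f"{key}: {old} -> {new}")
```

A diff like this makes it obvious which settings (data directories in particular) would need adjusting before handing the node over.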
Labels:
- Apache Hadoop
- Apache Pig
- Cloudera Manager
07-11-2014
04:13 PM
@srowen wrote: This might help a lot: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ [...] Things get confusing because of at least two things. First, there are two different types of YARN deployment, although I don't think they affect how you think about placing services. But second, there is also "standalone" mode, the default in 0.9.0 and what you are currently using, wherein you actually do separately control where Spark workers run, separately from YARN. I suppose I'd say the thing that matters is: datanodes and nodemanagers and Spark worker services are present on all machines doing work.

Many thanks for the blog link! I need to spend a week just catching up on the Cloudera blog one of these days. You guys produce some fantastic articles in there. Yes, it was the mild confusion about 0.9 using this "standalone" mode instead of YARN which made me try and go for 1.0. Something I read on the Apache Spark pages, or some other link via Google, gave me the impression that the 0.9 RPMs are hard-wired for standalone mode and can't be configured for YARN instead. Could also have been a comment/note box in the CDH5 install guide? Anyway, I'm still digging through the config options to see whether 0.9 can be configured for YARN instead of standalone. May get around to it this weekend, unless I crash and fall asleep, dang World Cup! 😉
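From the point of view of someone launching a job, the two YARN deployment types and standalone mode differ mainly in the `--master` value handed to `spark-submit`. The sketch below assumes the Spark 1.0 conventions (`yarn-client`, `yarn-cluster`, and `spark://host:7077` with the default standalone port); the helper function, class name, jar name, and host are hypothetical.

```python
def spark_submit_args(mode, app_jar, main_class, master_host=None):
    """Build a spark-submit argument list for one of three deployment modes.

    mode -- "standalone", "yarn-client" or "yarn-cluster"
    master_host is only needed for standalone mode, where the driver must be
    told where the Spark master runs; on YARN the resource manager is found
    through the Hadoop configuration instead.
    """
    if mode == "standalone":
        if not master_host:
            raise ValueError("standalone mode needs the Spark master's host")
        master = f"spark://{master_host}:7077"  # 7077: default standalone port
    elif mode in ("yarn-client", "yarn-cluster"):
        master = mode
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ["spark-submit", "--master", master, "--class", main_class, app_jar]


# Hypothetical application and host names.
for mode in ("standalone", "yarn-client", "yarn-cluster"):
    print(" ".join(spark_submit_args(mode, "app.jar", "com.example.Main",
                                     master_host="master01")))
```

Note that only standalone mode names a specific master machine, which is why the placement question matters there and not on YARN.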
07-11-2014
06:16 AM
@srowen wrote: You might wait for CDH 5.1.0, which will be released very soon. This deploys Spark 1.0.0+patches on YARN for you. "Node" means a machine on which you want to run Spark. "Namenode" is for example an HDFS concept. It is not directly related to Spark. You may choose to run a Spark process on a machine that happens to host the namenode, or not. This is why Spark is not described in terms of, say, HDFS roles. You do not need to start the Spark master on the HDFS namenode. You didn't have to start the MR jobtracker on the namenode either. On a cluster I ran, I put the master on the namenode just since it's a simple default choice. But any machine that can see HDFS and YARN would be fine; it need not even be running other Hadoop services. You can easily choose which machines are the Spark workers and which is the master in Cloudera Manager. The Spark master is not the same thing as a client. Its role is like that of the jobtracker really. It would not be run outside the cluster. You may be thinking of a driver for your specific app. The Apache distro is indeed a tarball and it's up to you to deploy it and run it. The role of CDH is to package, deploy and run things for you. The packaging is not at all the same, although the contents (scripts, binaries) are of course the same. You would not try to paste the raw tarball onto CDH nodes. If you want to get adventurous, you can go to all machines and dig into /opt/cloudera/parcels/CDH/lib/spark and replace binaries with a newer compiled version. That's a manual process, and I suppose not 100% guaranteed to work, but you can try it.

Ah, thanks so much for that info! This answered a lot of my questions in one fell swoop. I am of course aware that namenode/datanodes and RM/NM are not synonymous; however, for simplicity's sake I put them together, as they frequently are.
My assumption was that the master was equivalent to a jobtracker, as you said, and therefore would frequently be found on a NN/RM node, whereas the workers would go on a DN/NM/AM. Again, lumping those Hadoop components together. If not, how would the workers access files on HDFS, unless by streaming? What would that do to performance? It's unfortunate that the Apache docs don't give a really detailed view of the architecture and component interaction, both within Spark and with the various Hadoop components. I think it's this that caused my confusion: if we're talking about Spark operating in a YARN environment, then there's a tacit implication of also having a "typical" underlying infrastructure based on your usual Hadoop cluster. Even if we take YARN out of the equation, if your data is on HDFS, then where do the Spark workers need to sit to ensure the maximum access speed? Talk of RDDs is all well and good, but at some point your data is not all in memory, it's on platters whence it must get INTO memory! 🙂 When's 5.1 coming? 😉
07-11-2014
05:00 AM
My apologies for reawakening this dead thread, but the subject line was still applicable: I am trying to get Spark 1.0 to run on CDH5 YARN. That means NOT the provided 0.9 RPMs. Still very new to Spark, I'm trying to get things right between what seems like somewhat contradictory information between Apache's Spark site and the CDH5 installation guide. This begins with precisely *what* is installed on *which* node in the cluster. Apache simply states "on ALL nodes", without any distinction between node roles: namenode? datanode? resource manager? node manager? application master? Then there's the client, say a development machine or similar. The Cloudera docs state to only start the master on one node, yet which machine is best chosen for this? Coming from MapReduce, this would imply the namenode. But it's not made clear that's the right choice. Should it instead be one of the datanodes (in this case they are also the node managers and application masters for YARN, in other words the worker bees of the cluster)? Or maybe it should be the client machine from which jobs are launched through shell or submit? The Apache cluster mode overview isn't much help either. I suppose that with Spark still being fresh and new, the doco has to catch up from a typical case of "this was totally obvious to the creators" for us mere mortals. What makes this more confusing is that the 1.0 tarball from Apache is one large directory, as opposed to the RPMs, which break the packages down. I assume I could simply install the tarball on each and every machine on the cluster, from client to namenode/resource manager to datanode/node manager, and just push out configuration settings to all, but that seems somewhat "sloppy". In the end it would be great if it were clear enough to take the 1.0 source and create my own RPMs, aligned with the current 0.9 CDH5 packages, and install them only where needed, but for that I need a better understanding of what goes where.
Any suggestions and pointers welcome! Thanks!