Hello everyone! I am new to the Hadoop/Cloudera world, and I need to set up a Cloudera cluster on the Microsoft Azure cloud.
If I understood correctly, there are two ways to install Cloudera on a cluster: using Cloudera Manager or through a manual installation.
According to this schema, it seems that a dedicated machine is needed for Cloudera Manager, plus 3 master nodes.
But according to this table, it seems I can install Cloudera Manager directly on the master node.
So here are my doubts/questions:
1) Is it necessary to have Cloudera Manager on a dedicated machine (if yes, why)? Or can it be installed directly on the master node?
2) Why are there 3 master nodes? From what I understood, 2 master nodes can be used for high availability (they mirror each other, with the same configuration and services, and can be used for a hot switch). What is the purpose of the third master node, and why is it different from the other two?
3) What is the purpose of Cloudera Director, and how does it differ from Cloudera Manager? I've read that it can be used for automated deployments to the cloud, but it is not clear to me what exactly I could use it for.
Thanks in advance for any information.
Master nodes: Yes, HDFS has two master roles, but HDFS, YARN, HBase, the Failover Controller and many other applications depend on ZooKeeper, and that has to be deployed on 3 nodes. ZK votes on who should be the leader, and the vote needs a majority, which is why you want an odd number: it can be 1, 3, 5, 7 and so on. But if you choose just one ZK and it fails, everything fails. Choosing 3 lets you tolerate 1 node failure; choosing 5, 2 node failures; and so on.
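To make the arithmetic concrete, here is a minimal sketch of the majority rule behind those numbers. It does not talk to ZooKeeper at all; the function names `quorum_size` and `tolerated_failures` are made up for illustration, and the only assumption is that the ensemble stays available as long as a strict majority of its nodes is up.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can fail while a majority is still alive."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare odd and even ensemble sizes side by side.
    for n in (1, 2, 3, 4, 5, 6, 7):
        print(f"{n} ZK nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Running it shows that 4 nodes tolerate the same single failure as 3, and 6 the same two failures as 5, so an even-sized ensemble costs an extra machine without adding fault tolerance.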
Cloudera Manager: yes, it can be on one of those masters, especially on one where HDFS is not deployed. But in "complex" clusters, where there is an HBase master, Kudu master, Sentry, Hive, etc., there are so many master roles that it is recommended to put CM on a different machine. CM can eat a lot of CPU and I/O, because it collects lots of data for charts/reports.
And if you have many clusters, you can have one CM as a management node and many masters/clusters.
Cloudera Director is just a setup node. I would not advise keeping it up; I would just use it for the deployment and that's it. Later on you realize that many of the changes you need to make on the cluster are not covered by Cloudera Director, so it gets "unsynchronized".
Thank you very much for the answer!
I am still a bit confused about Cloudera Director. I've just configured a cluster on Microsoft Azure using Cloudera Manager. Through CM I've installed all the CDH components and services, and the cluster is up and running. So if I can use Cloudera Manager to set up and configure all the CDH components and the cluster, what should I use Director for?
OK, that is a bit clearer. From what I can see here, there are 3 installation methods for Cloudera (path A/B/C), and Director is not used in any of them. Is there any resource/doc/tutorial that explains how to install a Cloudera cluster using Director?
Thanks for the help