I want to set up a Hadoop cluster on my home network as a training exercise and need advice as to the configuration that best suits the hardware I have available. I would like to use a combination of both virtual (on Windows) and hardware nodes (Ubuntu) with approximately a terabyte of dedicated disk space for the filesystem. The machines I have (or will have) at my disposal are:
I would like to get the VMs up and running first. Will the Hortonworks sandbox VM allow me to create a multi-node cluster with both virtual and hardware nodes?
Will you run an Ubuntu VM on your Windows 7 box? If so, you should be able to install a cluster with some VMs and some hardware nodes without issues. You cannot, however, have Windows machines as nodes in the cluster.
Finally, you shouldn't use the sandbox. Maybe it will work, but I cannot say because I have never done that myself. Installing HDP with Ambari, however, is literally a matter of one hour on four machines. Since you are doing it for the first time it might take 2-3 hours, but it's much easier and better to use Ambari to install an HDP cluster than to try to work with the sandbox, which is built for a single VM.
Thanks for your response!
I do indeed intend to run an Ubuntu VM from within my Windows 7 and 10 boxes - the latter two machines (laptop, 12 core) will have Ubuntu 16.04 LTS installed although the laptop will only have limited SSD space. I'd like to be able to use both the former VMs and the nodes I have running on native Ubuntu in the same cluster.
So, just for clarification, your advice is NOT to use the HDP Sandbox image but rather set up my own Ubuntu VMs and install my HDP cluster on those instead?
The assumption is that you will need 8GB of RAM on each server.
With the hardware below:
8 core / 32GB (dual processor) with 4TB RAID5 array running Windows 7
Install VirtualBox, then 2 Ubuntu 16.04 VMs as datanodes (typically NodeManager, HBase RegionServer, DataNode, JournalNode, ZK client, etc.)
4 core / 16GB with 1TB RAID5 running Windows 10
Install VirtualBox, then 2 Ubuntu 16.04 servers, and install the master nodes on them (NameNode HA, ZooKeeper, HBase Master, etc.). Note that you need a minimum of 3 ZooKeeper servers, so the third one should go on the 12 core / 32GB host.
2 core / 8GB Thinkpad running Ubuntu 16.04 (need to install Linux)
This could be your management node / edge node (Ambari server, all client software)
12 core / 32GB (dual processor) with 1TB RAID5 array running Ubuntu 16.04 (need to assemble hardware and install Linux)
Install VirtualBox, then 2 Ubuntu 16.04 VMs as datanodes (typically NodeManager, HBase RegionServer, DataNode, JournalNode, ZK client, etc.)
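To sanity-check the plan above, here is a rough Python sketch of the memory budget per VirtualBox host. It assumes the 8GB figure applies to each guest VM; the host labels and the 4GB reserve for the host operating system are my own illustrative assumptions, not Hortonworks numbers. The ThinkPad is left out because it runs HDP natively.

```python
# Per-host memory budget for the VirtualBox plan above.
# NODE_RAM_GB comes from the 8GB-per-server assumption in this thread.
# HOST_OS_RESERVE_GB and the host labels are illustrative assumptions.
NODE_RAM_GB = 8
HOST_OS_RESERVE_GB = 4

# label -> (total RAM in GB, number of guest VMs planned on that host)
HOSTS = {
    "win7-8core": (32, 2),
    "win10-4core": (16, 2),
    "ubuntu-12core": (32, 2),
}


def headroom_gb(total_ram_gb, vm_count, per_vm_gb=NODE_RAM_GB):
    """RAM left for the host once every guest VM has its allocation."""
    return total_ram_gb - vm_count * per_vm_gb


def check_hosts(hosts, reserve_gb=HOST_OS_RESERVE_GB):
    """Return {label: (headroom, fits)}; fits means headroom >= reserve."""
    report = {}
    for label, (ram, vms) in hosts.items():
        left = headroom_gb(ram, vms)
        report[label] = (left, left >= reserve_gb)
    return report


if __name__ == "__main__":
    for label, (left, fits) in check_hosts(HOSTS).items():
        status = "ok" if fits else "tight: shrink the VMs or run fewer"
        print(f"{label}: {left} GB left for the host ({status})")
```

Under these assumptions the 16GB Windows 10 machine is the one to watch, since two 8GB guests would leave nothing for Windows itself; giving those VMs less memory or running one of them elsewhere would be worth considering.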
Hope that helps
Thanks for your response!
The latter two machines (Thinkpad and 12 core machine) will have native Ubuntu installed; since it is my intention to install HDP directly on these machines will I still need virtualbox for the 12 core?
Also, would it be possible for me to get the cluster running using only the first three machines (2 VMs and Thinkpad) as these machines are already up and running - and "add" the 12 core to the cluster later on (once it is built)?
Of course, you can add a node, especially a datanode, at a later stage. With 3 machines (2 VMs and a ThinkPad) you can have a working mini cluster.
Ensure you have a static IP for all the hosts; I can imagine you unplugging your ThinkPad. Ideally, the ThinkPad should be your Ambari server, with one VM as a master node and the other as a datanode.
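Once the addresses are fixed, one common way to make the hosts resolve each other on a home network without DNS is an identical set of `/etc/hosts` entries on every node. The sketch below only generates those lines; the hostnames and 192.168.1.x addresses are hypothetical placeholders, so substitute whatever your router actually reserves for each machine.

```python
import ipaddress
from collections import Counter

# Hypothetical names and addresses for the three-machine starting cluster.
CLUSTER_HOSTS = {
    "ambari.hdp.local": "192.168.1.10",     # ThinkPad, Ambari server
    "master1.hdp.local": "192.168.1.11",    # VM, master node
    "datanode1.hdp.local": "192.168.1.12",  # VM, datanode
}


def hosts_entries(hosts):
    """Build /etc/hosts lines: IP, fully qualified name, short alias.

    Sorting through ipaddress.ip_address orders the lines numerically and
    raises ValueError if any address is malformed.
    """
    ordered = sorted(hosts.items(), key=lambda item: ipaddress.ip_address(item[1]))
    return [f"{ip}\t{fqdn}\t{fqdn.split('.')[0]}" for fqdn, ip in ordered]


def duplicate_ips(hosts):
    """Return addresses assigned to more than one hostname."""
    counts = Counter(hosts.values())
    return sorted(ip for ip, n in counts.items() if n > 1)


if __name__ == "__main__":
    clashes = duplicate_ips(CLUSTER_HOSTS)
    if clashes:
        raise SystemExit(f"duplicate addresses: {clashes}")
    print("\n".join(hosts_entries(CLUSTER_HOSTS)))
```

The printed lines would be appended to `/etc/hosts` on every node, VMs included, so that each machine sees the same name for the same address.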
Depending on the memory available on the master, deploy as follows:
## Ambari Server (edge node, client software) -
YARN_CLIENT, ZOOKEEPER_CLIENT, RANGER_ADMIN, RANGER_USERSYNC, METRICS_MONITOR, METRICS_COLLECTOR, ZEPPELIN_MASTER, INFRA_SOLR, INFRA_SOLR_CLIENT, HBASE_CLIENT, HDFS_CLIENT, TEZ_CLIENT, HIVE_CLIENT, MAPREDUCE2_CLIENT, ZooKeeper server (must have at least 3)
## Masternode -
Primary NameNode - MapReduce2 - YARN ResourceManager - HBase Master - Hive Metastore - Ranger (optional) - HiveServer2 - WebHCat server - Spark History server - ZooKeeper server (must have at least 3) - Timeline Server - Zeppelin Notebook (optional)
## Datanode1 -
DataNode - NodeManager - HBase RegionServer - ZooKeeper server
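The point of putting a ZooKeeper server on all three machines can be sketched in a few lines of Python. ZooKeeper stays available only while a strict majority of its servers is up, so three servers tolerate one failure. The host labels and the shortened component sets below are illustrative assumptions, not the full layout.

```python
# Abbreviated version of the layout above; host labels are hypothetical.
LAYOUT = {
    "ambari": {"AMBARI_SERVER", "HDFS_CLIENT", "YARN_CLIENT", "ZOOKEEPER_SERVER"},
    "master1": {"NAMENODE", "RESOURCEMANAGER", "HBASE_MASTER", "ZOOKEEPER_SERVER"},
    "datanode1": {"DATANODE", "NODEMANAGER", "HBASE_REGIONSERVER", "ZOOKEEPER_SERVER"},
}


def hosts_running(layout, component):
    """Hosts that have the given component assigned, in sorted order."""
    return sorted(host for host, comps in layout.items() if component in comps)


def tolerated_failures(server_count):
    """Servers that can fail while a strict majority is still running."""
    return (server_count - 1) // 2


if __name__ == "__main__":
    zk_hosts = hosts_running(LAYOUT, "ZOOKEEPER_SERVER")
    print(f"ZooKeeper servers: {zk_hosts}")
    print(f"Failures tolerated: {tolerated_failures(len(zk_hosts))}")
    if len(zk_hosts) < 3:
        print("Fewer than 3 ZooKeeper servers: add one on another host.")
```

Note that a fourth ZooKeeper server would not help, since four servers still tolerate only one failure; that is why odd counts are the usual choice.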
Could I use a Docker container (ex. https://community.hortonworks.com/repos/75668/a-multi-node-docker-cluster-platform-to-quickly-sp.htm... ) configured as master and data nodes on each of my Windows machines instead of an Ubuntu 16.04 image via VirtualBox?
Docker is certainly a superb technology for running isolated components: they are easier to manage, build and rapidly deploy. Docker is more popular among developers but less widely used in production systems, so operating it at large scale is a road less traveled. It's a good way to get your hands dirty and learn the ins and outs of large cluster deployments.
Keep me posted. If you are satisfied with my previous answers then you can accept and close the initial thread and I will happily help on the HDP docker journey :-).