Please find my answers inline:
1. A node means a server, right? No VMs?
- Yes, a node means a server. The server can be either physical hardware or a virtual machine.
2. How many servers would I need to add to have a healthy cluster?
- It depends on the type of configuration you use in production, and it is a broad topic to discuss. I would recommend deploying the master services on their own dedicated nodes and adding slave nodes as per your requirements.
In case of HA you need to revisit the placement of the above services, for example:
master1 - Active NN, ZK, JN
master2 - Standby NN, ZK, JN, RM, AM, HS
master3 - Ambari, ZK, HIVE, SQOOP, OOZIE, HUE, Ranger, etc.
Slave Nodes - DN, NM, etc.
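If it helps to reason about the placement, here is a minimal Python sketch of the layout above as a plain dictionary. The host names (master1 to master3, slave1, slave2), the assumption that the slave nodes run DN and NM, and the helper functions are all illustrative and not part of any Ambari or HDP API. The quorum check encodes the usual rule of thumb that ZK and JN should run as an odd number of instances, at least three.

```python
from collections import Counter

# Hypothetical placement map; host names and slave count are placeholders.
placement = {
    "master1": ["NN", "ZK", "JN"],                       # active NameNode
    "master2": ["NN", "ZK", "JN", "RM", "AM", "HS"],     # standby NameNode
    "master3": ["Ambari", "ZK", "HIVE", "SQOOP", "OOZIE", "HUE", "Ranger"],
    "slave1": ["DN", "NM"],
    "slave2": ["DN", "NM"],
}


def service_counts(placement):
    """Count how many hosts run each service."""
    return Counter(svc for services in placement.values() for svc in services)


def hosts_running(placement, service):
    """Return the sorted list of hosts that run the given service."""
    return sorted(host for host, services in placement.items() if service in services)


def has_odd_quorum(counts, service):
    """Rule of thumb for quorum services: an odd number of instances, at least three."""
    return counts[service] >= 3 and counts[service] % 2 == 1


counts = service_counts(placement)
print("ZK hosts:", hosts_running(placement, "ZK"))
print("JN hosts:", hosts_running(placement, "JN"))
print("ZK quorum ok:", has_odd_quorum(counts, "ZK"))
print("JN quorum ok:", has_odd_quorum(counts, "JN"))
```

With the layout as written, the check passes for ZK but not for JN, since only two JNs are listed. A third JN on master3 would close that gap.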
3. Which of the above mentioned services should be co-located?
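- For HDFS HA, a JN should preferably run on each of the NameNode hosts. Keep in mind that the JN quorum needs an odd number of instances (at least three), so a third JN has to go on another master node. Also, if possible, give JN and ZK their own dedicated disks.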
4. What should be the distribution like?
- You can go for an n-1 distribution [where n = the latest stable release from HDP], i.e. one release behind the latest.
You can migrate services after installation.
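
As a small illustration of the n-1 rule from answer 4, the sketch below picks the release just before the latest one from a list of stable versions. The version strings are made up for the example and are not a list of actual HDP releases. The function name `n_minus_one` is also just a placeholder.

```python
def parse_version(version):
    """Turn a dotted version string such as '2.10' into a tuple of ints for numeric ordering."""
    return tuple(int(part) for part in version.split("."))


def n_minus_one(releases):
    """Return the release one step behind the latest (n-1) from a list of stable releases."""
    ordered = sorted(set(releases), key=parse_version)
    if len(ordered) < 2:
        raise ValueError("need at least two stable releases to pick n-1")
    return ordered[-2]


# Made-up version numbers, for illustration only.
stable_releases = ["2.3", "2.10", "2.4", "2.5"]
print(n_minus_one(stable_releases))
```

Versions are compared numerically rather than as strings, so "2.10" sorts after "2.5".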