Say I need a 3- or 4-node cluster. What is the best way to set it up? How do I distribute the services across the nodes?
Apart from the basic services like HDFS, MapReduce, YARN, etc., I definitely require Spark and Zeppelin on the cluster. Is it okay to install Spark on each of the nodes?
How do I proceed with choosing services for my cluster?
In other words, can someone suggest a Blueprint.json file for a 3-node and a 4-node cluster setup?
How many services you can install on one node always depends on the sizing of the node and on the workload. I would, however, configure every service to be available on at least two nodes, so that you already have a parallel configuration that can be extended to more nodes when the workload requires it. Some services consume more disk I/O, others more CPU, network I/O, or RAM. It is normally a good idea to co-locate services with different characteristics, e.g. a CPU-intensive service with a disk-I/O-intensive one.
In any case, I wouldn't really go for only 3-4 nodes. I would propose a minimum of 6 nodes plus the NameNode (and possibly a second NameNode for HA). If you take just 3 data nodes (+1 NameNode), an outage of one node increases the load on each of the two remaining nodes by 50%. Also, the default HDFS replication factor is 3, so your usable HDFS capacity would be no higher than the file system capacity of a single node. If you are running a demo cluster, that might still be acceptable.
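To make the arithmetic behind those two numbers concrete, here is a rough sketch in Python. It assumes the load is spread evenly across the data nodes and that every node has the same disk capacity; the function names and the 10 TB per-node figure are made up for illustration, and the 6-node case assumes all 6 nodes are data nodes.

```python
def usable_hdfs_capacity(data_nodes, disk_per_node_tb, replication=3):
    """Approximate usable HDFS capacity in TB.

    Every block is stored `replication` times, so the raw capacity
    is divided by the replication factor. Space reserved for the OS,
    logs, or temporary data is ignored here.
    """
    return data_nodes * disk_per_node_tb / replication


def load_increase_after_failure(data_nodes, failed=1):
    """Relative load increase per surviving node, assuming evenly spread load."""
    remaining = data_nodes - failed
    if remaining <= 0:
        raise ValueError("no data nodes left")
    return data_nodes / remaining - 1.0


if __name__ == "__main__":
    disk_tb = 10  # hypothetical per-node disk capacity
    for nodes in (3, 6):
        capacity = usable_hdfs_capacity(nodes, disk_tb)
        extra = load_increase_after_failure(nodes)
        print(f"{nodes} data nodes: ~{capacity:.0f} TB usable, "
              f"+{extra:.0%} load per node after one outage")
```

With 3 data nodes the usable capacity equals the disk of a single node and one outage adds 50% load to each survivor; with 6 data nodes the capacity doubles and the extra load per survivor drops to about 20%.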
For more specific hints, you will need to describe the actual use case in more detail.