I need to set up a Hortonworks cluster for an IoT use case. All devices will send data to the cluster; we then want to do analysis with Spark (MLlib) and implement real-time monitoring with Spark queries on a web GUI (querying this data on user input). For data ingestion we'll try out Apache NiFi. The data load is ~2GB per day (without replication, "real" load).
The following other components are also planned to be installed:
For the cluster we want to start with 3 nodes (adding more shouldn't be a problem if needed) with the following specs:
We use VMware, so scaling up shouldn't be a problem, but physical servers are unfortunately not an option in this case.
Now to my question: Are there best practices or tips on how to distribute and deploy the frameworks/services named above? What is a good choice? Just all frameworks on every node? Should we use additional nodes? Should we use a dedicated node for just one of the above frameworks?
Thanks in advance!
I am sharing a few links that may help you set up your cluster. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_cluster-planning-guide/bk_cluster-planni... https://community.hortonworks.com/articles/62667/zookeeper-sizing-and-placement-draft.html
Apart from this, I recommend going with at least 8/16-core machines.
Thank you for the links, they're very helpful.
What about the number of nodes? What do you think would be a good start? Maybe 3 nodes are too few.
The number of nodes depends on your use case/POC, data volume, cluster usage, high-availability requirements, etc.
I feel it is good to start with 5 nodes (2 master and 3 data nodes). That way you have the option to enable HA, you can balance the workload across the 2 master nodes, and you can replicate the data with a replication factor of 3. After that you can add nodes as and when needed.
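To sanity-check that 3 data nodes are enough for the ~2GB/day load mentioned above, here is a rough capacity sketch. The retention window (1 year) and the 25% headroom for temp/intermediate data are assumptions, not figures from this thread:

```python
# Rough HDFS capacity estimate for the numbers in this thread.
# Assumed (not from the thread): 1 year retention, 25% headroom
# for temp/shuffle/intermediate data on top of replicated storage.
RAW_GB_PER_DAY = 2       # "real" ingest load from the question
REPLICATION = 3          # replication factor suggested above
RETENTION_DAYS = 365     # assumed retention window
OVERHEAD = 1.25          # assumed headroom factor

def required_storage_gb(days=RETENTION_DAYS):
    """Total cluster storage needed across all data nodes, in GB."""
    return RAW_GB_PER_DAY * days * REPLICATION * OVERHEAD

def per_datanode_gb(data_nodes=3, days=RETENTION_DAYS):
    """Storage each data node must provide, in GB."""
    return required_storage_gb(days) / data_nodes

print(required_storage_gb())  # 2737.5 GB total for one year
print(per_datanode_gb())      # 912.5 GB per data node
```

So under these assumptions, roughly 1TB of disk per data node covers a full year of retained data, which 3 data nodes handle comfortably; the bigger sizing drivers will be CPU/RAM for Spark and NiFi, not disk.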