We are assessing what is required to scale up our current Hadoop cluster to support a larger number of concurrent users. We are looking for help reviewing our assumptions and approach, and for broader input from practitioners on the relevant considerations. Users run queries from Zeppelin through HiveServer2. Our current cluster size, specification, and a few notes on the use cases are listed below.
To maintain good performance, we want to estimate the upsizing needed to support parallel queries from 20+ concurrent Zeppelin users.
I'm not offering a direct answer, but a couple of things to think about.
If your total data size is only 1.5 TB, it can all be held in memory across a small number of nodes. Since the working set fits in memory, you don't strictly need HDFS; you could back the cluster with S3 or EBS instead. Start with 8 nodes at 256 GB of memory each, then add vcores if you need more compute.
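As a quick sanity check on the 8-node / 256 GB suggestion, here is a back-of-envelope calculation. The 75% usable-memory fraction is my own assumption (headroom for the OS, YARN daemons, and query execution scratch space), not a figure from the answer above:

```python
# Back-of-envelope check: can 8 nodes x 256 GB hold a 1.5 TB dataset in memory?
DATA_SIZE_GB = 1.5 * 1024        # 1536 GB of data
NODES = 8
MEM_PER_NODE_GB = 256
USABLE_FRACTION = 0.75           # assumed headroom for OS / JVM / query scratch

total_mem_gb = NODES * MEM_PER_NODE_GB          # 2048 GB raw
usable_mem_gb = total_mem_gb * USABLE_FRACTION  # 1536 GB usable

print(f"raw cluster memory:    {total_mem_gb} GB")
print(f"usable after headroom: {usable_mem_gb:.0f} GB")
print(f"fits in memory:        {usable_mem_gb >= DATA_SIZE_GB}")
```

Under that assumption the dataset only just fits, which suggests 8 × 256 GB is a floor rather than a comfortable target; concurrent queries will need their own working memory on top of the cached data.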