I have a 4-node cluster configured with 1 NameNode and 3 DataNodes. I'm running a TPC-H benchmark, and I would like to know how much data you think my cluster can handle without affecting query response times. Each node has 16 GB of RAM and 8 cores. My total available disk space is ~700 GB.
To answer that, we would need to know the block size, how the nodes are arranged, the network speed and traffic, and how complex the transformation logic is. Are you trying to perform joins? Does the data have unique key-value pairs? How is the data stored?
These are a few of the basic questions you need to ask to understand cluster performance.
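As a rough starting point while those questions are open, here is a minimal back-of-the-envelope sketch in Python. It only estimates how much data fits on disk and how many blocks that produces; it says nothing about query response times. Everything in it is an assumption rather than something known about your cluster: that ~700 GB is the raw capacity across the DataNodes, that replication is 3 and the block size is 128 MB (the usual HDFS defaults, check your own `dfs.replication` and `dfs.blocksize`), and that 25% of the disk is held back for intermediate and temporary query data. The function names are hypothetical.

```python
import math


def usable_capacity_gb(raw_disk_gb, replication=3, scratch_fraction=0.25):
    """Estimate how much unreplicated data fits on the cluster.

    Assumptions (not taken from the question): replication factor 3 and
    25% of raw disk held back for shuffle/temporary data.
    """
    available = raw_disk_gb * (1 - scratch_fraction)
    return available / replication


def block_count(data_gb, block_size_mb=128):
    """Number of HDFS blocks needed for data_gb of data (assumed 128 MB blocks)."""
    return math.ceil(data_gb * 1024 / block_size_mb)


if __name__ == "__main__":
    raw_gb = 700  # figure from the question, assumed to be raw capacity
    fits = usable_capacity_gb(raw_gb)
    print(f"Approximate data that fits: {fits:.0f} GB")
    print(f"Blocks for a 100 GB dataset: {block_count(100)}")
    # 3 DataNodes with 16 GB each, as described in the question
    print(f"Total DataNode RAM: {3 * 16} GB")
```

Under these assumptions, roughly 175 GB of data fits on disk, which is several times the 48 GB of combined DataNode RAM. Whether response times hold up at that size still depends on the answers to the questions above, especially the joins and the storage format.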