I am running terasort and teragen in my 4-node environment. Is there any baseline I can test my performance against? I can't find one by googling around.
No. There are so many factors (CPUs, RAM, number and type of disks, machine types, network, on-prem or cloud, number of worker and master nodes, HDFS, MapReduce, and YARN settings, etc.) that any "baseline" quickly becomes misleading. You can measure times on your own cluster and then compare them with times on similar clusters if you have them. Serious users usually have two or more clusters, such as Prod, Dev, and Preprod.
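To make "measure time on your cluster" concrete, here is a minimal sketch of a timing harness, not a definitive setup. It assumes the `hadoop` CLI is on your PATH and that you pass in the path to your distribution's MapReduce examples jar (the location varies by distribution, so it is left as a parameter). The function names and the HDFS directory layout are hypothetical, chosen only for illustration. TeraGen writes 100-byte rows, so the row count is the target size divided by 100.

```python
import subprocess
import time

ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def rows_for_size(total_bytes):
    """Number of TeraGen rows needed for roughly total_bytes of data."""
    return total_bytes // ROW_BYTES


def build_commands(examples_jar, rows, base_dir):
    """Return (stage name, argv) pairs for teragen, terasort, teravalidate.

    examples_jar and base_dir are assumptions: point them at your own
    examples jar and at an HDFS directory you are allowed to write to.
    """
    gen_dir = f"{base_dir}/teragen"
    sort_dir = f"{base_dir}/terasort"
    report_dir = f"{base_dir}/teravalidate"
    prefix = ["hadoop", "jar", examples_jar]
    return [
        ("teragen", prefix + ["teragen", str(rows), gen_dir]),
        ("terasort", prefix + ["terasort", gen_dir, sort_dir]),
        ("teravalidate", prefix + ["teravalidate", sort_dir, report_dir]),
    ]


def time_stages(commands, run=subprocess.run):
    """Run each stage in order and return {stage: wall-clock seconds}."""
    timings = {}
    for name, argv in commands:
        start = time.perf_counter()
        run(argv, check=True)  # raises if the Hadoop job exits non-zero
        timings[name] = time.perf_counter() - start
    return timings
```

For example, `time_stages(build_commands(jar_path, rows_for_size(10**10), "/tmp/bench"))` would time a run of about 10 GB, where `jar_path` is whatever your install uses. The output directories must not already exist, so remove them between runs. Keep the timings together with the node count and YARN settings, since the numbers only mean something relative to that configuration.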
As Predrag mentioned, there is no standard benchmark to compare your cluster's performance against. However, you can take an initial benchmark of your cluster and compare it with a similar setup, or with your own later runs if you see any issues. I would also set up HiBench, which runs performance tests on streaming as well as Spark workloads.
Hi @Sunile Manjee,
We were trying to establish a baseline by comparing performance against CDH, running teragen and terasort, though not with 1 terabyte of data: since we were using the SE cloud, I was not able to find the resources to spin up more VMs. Not really sure if the attached can help.
Here are the details of the clusters:
HDP measures on a 4 node cluster with m3.medium:
CDH measures on a 5 node cluster:
Thank you all for the great replies. I agree with all of them; however, I would like to see some shared results with specifics: number of nodes, YARN specs, etc. A starting point to compare against would be great. Yes, I understand it would be comparing apples to oranges. However, that is not a bad place to start.