We need to identify SLAs and KPIs for our Hadoop Platform. We would like to define the SLAs and KPIs at the platform level (not for the individual applications that may be using the platform). Any suggestions on the SLAs/KPIs we could consider?
Unfortunately, an exercise in platform-level SLAs and KPIs this broad is unlikely to be meaningful. The performance definitions beneath SLAs and KPIs for Hadoop depend on many conditions:

- cluster size (larger clusters are faster);
- configuration (e.g. Hive configured for LLAP vs. for batch);
- data size (e.g. a job against 1 GB vs. 1 PB);
- concurrency (e.g. a job running alongside 100 concurrent users vs. 1);
- YARN configuration (e.g. your job runs in a YARN queue allocated 50% of total memory vs. 5%, with no preemption of other queues);
- and so on (many more factors, e.g. is the data compressed?).
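To make the YARN point concrete, here is a minimal Capacity Scheduler sketch (the queue names `etl` and `adhoc` and the 50/50 split are illustrative, not a recommendation): the same job lands on very different performance depending on which queue it runs in and whether that queue can be preempted.

```xml
<!-- capacity-scheduler.xml (fragment) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>etl,adhoc</value>
</property>
<property>
  <!-- guaranteed share of cluster resources for the etl queue -->
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>50</value>
</property>
<property>
  <!-- shield etl jobs from preemption so their containers are not reclaimed -->
  <name>yarn.scheduler.capacity.root.etl.disable_preemption</name>
  <value>true</value>
</property>
```

A job submitted to `etl` under this configuration will behave very differently from the same job in a 5% queue with preemption enabled, which is exactly why an SLA has to be tied to a specific queue allocation.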
You can only define an SLA for a specific job and query pattern against a configured YARN resource allocation and, usually, a known data size. It gets very specific. For example, the time to complete a Hive query with a specific join of two tables, fewer than 10 concurrent users, LLAP enabled, a given cluster size, and 1 PB of data will differ from that of an HBase query running against the same data flattened into a single table, and both will differ from the time to load the data in the first place. First understand the job or query you are running for a given cluster and YARN configuration, then define SLAs around that (and perhaps go back and optimize or change YARN settings if you need tighter SLAs).
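Once you have a specific, repeatable job, a practical KPI is a completion-time percentile over repeated runs rather than a single average. A minimal sketch (the run durations and the 60-second SLA target are hypothetical, and the nearest-rank percentile is just one common choice):

```python
import math

def percentile(durations, pct):
    """Nearest-rank percentile of a list of job durations (seconds)."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Ten observed runs of the same Hive query (illustrative numbers).
runs = [42, 45, 44, 41, 120, 43, 46, 44, 45, 43]

p95 = percentile(runs, 95)   # the tail is what an SLA usually cares about
sla_seconds = 60             # hypothetical SLA target for this one job
print(f"p95 = {p95}s, SLA met: {p95 <= sla_seconds}")
```

Note how one slow outlier run (120 s) blows the p95 even though the median is comfortably under target; that is why percentile-based SLAs per job pattern are more honest than a platform-wide average.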