Support Questions

VidyaSargur · 11-10-2024

Hello everyone,

I am Emmanuel Katto, and I am currently evaluating the disk I/O of our CDH (Cloudera Distribution for Hadoop) cluster, which consists of several hundred bare-metal machines. I would like to obtain the following values for each application over a given period of time:

total_io_mb
mapreduce_inputBytes
mapreduce_outputBytes

These values, I believe, are logged in the YARN logs, but I’m not sure how to configure YARN or the logging system to ensure these values are written in the log files.

So far, through Cloudera Manager, we’ve only been able to get metrics such as yarn_application_hdfs_bytes_read_rate, which is not enough to evaluate overall disk I/O.

Could anyone share any advice or alternatives on how to extract these specific I/O values for each application? Also, if there’s a way to configure YARN or Cloudera Manager to write these metrics into the logs, I’d appreciate your insights.

Thanks in advance!

Regards

Emmanuel Katto

VidyaSargur · 11-12-2024

@emmanuelkatto24, welcome to our community! To help you get the best possible answer, I have tagged our Airflow expert @smdas, who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.

Regards,

Vidya Sargur,
Community Manager


ggangadharan · 03-06-2025

I'm assuming this is a MapReduce job, since you're asking about MapReduce I/O counters.

The following script extracts the counters for a single job and computes the total:

[hive@node4 ~]$ cat get_io_counters.sh
#!/bin/bash

# Ensure a job ID is provided
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <job_id>"
    exit 1
fi

JOB_ID=$1

# Extract I/O counters from the MapReduce job status.
# Each counter group header is followed by its value on the next line.
mapred job -status "$JOB_ID" | grep -E -A 1 'File Input Format Counters|File Output Format Counters' | awk -F'=' '
  /File Input Format Counters/ {getline; bytes_read=$2}
  /File Output Format Counters/ {getline; bytes_written=$2}
  END {
    # Convert the combined byte count to MiB (1024 * 1024 bytes)
    total_io_mb = (bytes_read + bytes_written) / (1024 * 1024)
    # %.0f avoids the 32-bit overflow that %d can hit in some awk implementations
    printf "BYTES_READ=%.0f\nBYTES_WRITTEN=%.0f\nTOTAL_IO_MB=%.2f\n", bytes_read, bytes_written, total_io_mb
  }'

[hive@node4 ~]$

Sample Output

[hive@node4 ~]$ ./get_io_counters.sh job_1741272271547_0007
25/03/06 15:38:34 INFO client.RMProxy: Connecting to ResourceManager at node3.playground-ggangadharan.coelab.cloudera.com/10.129.117.75:8032
25/03/06 15:38:35 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
BYTES_READ=288894
BYTES_WRITTEN=348894
TOTAL_IO_MB=0.61
[hive@node4 ~]$
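
The script above reports on one job at a time. To cover several applications, one option is to wrap the same parsing in a function and loop over a list of job IDs. The following is only a minimal sketch: the function names parse_io_counters and report_jobs are hypothetical, it assumes the "Bytes Read=" and "Bytes Written=" lines appear in the `mapred job -status` output as in the sample, and it leaves it to you to supply the job IDs for the period you care about (for example, from `mapred job -list all`).

```shell
#!/bin/bash

# Hypothetical helper: read `mapred job -status` output on stdin and print
# "bytes_read bytes_written total_io_mb" on a single line.
# Assumes the counters appear as "Bytes Read=<n>" and "Bytes Written=<n>".
parse_io_counters() {
  awk -F'=' '
    /Bytes Read=/    { bytes_read = $2 }
    /Bytes Written=/ { bytes_written = $2 }
    END {
      # Missing counters default to 0; the total is reported in MiB
      printf "%.0f %.0f %.2f\n", bytes_read, bytes_written,
             (bytes_read + bytes_written) / (1024 * 1024)
    }'
}

# Hypothetical helper: print one line per job ID given as an argument,
# in the form "job_id bytes_read bytes_written total_io_mb".
report_jobs() {
  local job_id
  for job_id in "$@"; do
    printf '%s ' "$job_id"
    # The client INFO log lines go to stderr, so they are discarded here
    mapred job -status "$job_id" 2>/dev/null | parse_io_counters
  done
}
```

After sourcing these functions, `report_jobs job_1741272271547_0007` should print the job ID followed by the same byte counts and total as in the sample output, all on one line. Passing more job IDs adds one line per job, which is easier to load into a spreadsheet or sum up.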

How to Obtain total_io_mb, mapreduce_inputBytes, and mapreduce_outputBytes for Each Application in Yarn Logs?