distcp - How to determine number of mappers used by distcp job(s) at cluster level?

Sometimes we run into network bandwidth problems because a distcp job runs too many mappers or because too many distcp jobs run at the same time.

Our plan is to trigger a DataDog alert when the total number of mappers used by distcp jobs across the cluster reaches a defined threshold (e.g. 100). We are also open to exploring the "-bandwidth" option.
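
For reference, a rough sketch of what the "-bandwidth" option buys you: distcp's "-m" flag caps the number of simultaneous maps and "-bandwidth" caps each map in MB/s, so their product is an upper bound on one job's throughput. The helper name `distcp_bandwidth_cap` and the cluster names and paths in the comment are made up for illustration.

```shell
# Upper bound on the aggregate throughput (MB/s) of a single distcp job:
# number of maps (-m) multiplied by the per-map limit (-bandwidth, in MB/s).
distcp_bandwidth_cap() {
  local maps=$1 per_map_mb=$2
  echo $(( maps * per_map_mb ))
}

# Example invocation (hypothetical namenodes and paths):
#   hadoop distcp -m 20 -bandwidth 10 hdfs://nn1/src hdfs://nn2/dst
# With those values the job is capped at:
#   distcp_bandwidth_cap 20 10   # prints 200
```

Note that this only bounds a single job. Several concurrent jobs can still add up, which is why a cluster-wide mapper count is still useful.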

Many users submit jobs from different edge nodes, so we don't want to rely on the "ps" command on individual servers.

Please help us rectify the issue. Thanks in advance.

1 REPLY

I was able to find the number of mappers used by a distcp application with the command below:

MAPPERS=$(yarn container -list "$app" | grep 'Total number of containers' | awk -F: '{print $2}')
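
To get a cluster-level number, the same idea can be looped over every running distcp application. Below is a minimal sketch, not a tested production script. It assumes that distcp applications can be recognised by a name starting with "distcp" (the default job name, but users can change it), that `yarn container -list` is given an application attempt ID, and that one container per application is the ApplicationMaster and so is not a mapper. The function names are my own.

```shell
# Pull N out of the "Total number of containers :N" header line (stdin).
container_count() {
  awk -F: '/Total number of containers/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Read tab-separated `yarn application -list` output on stdin and print the
# IDs of applications whose name starts with "distcp" (assumption: default name).
distcp_app_ids() {
  awk -F'\t' '$1 ~ /application_/ && $2 ~ /^[ ]*distcp/ { gsub(/ /, "", $1); print $1 }'
}

# Sum the containers of all running distcp applications, minus one
# ApplicationMaster container per application. Requires the yarn CLI.
total_distcp_mappers() {
  local total=0 app attempt n
  for app in $(yarn application -list -appTypes MAPREDUCE -appStates RUNNING 2>/dev/null | distcp_app_ids); do
    # Take the last attempt listed (assumed to be the current one).
    attempt=$(yarn applicationattempt -list "$app" 2>/dev/null | grep -o 'appattempt_[0-9_]*' | tail -n 1)
    [ -n "$attempt" ] || continue
    n=$(yarn container -list "$attempt" 2>/dev/null | container_count)
    if [ "${n:-0}" -gt 0 ]; then
      total=$(( total + n - 1 ))
    fi
  done
  echo "$total"
}
```

Running `total_distcp_mappers` from a single monitoring host (cron, or a DataDog custom check) and comparing the result with the threshold avoids touching the edge nodes at all.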

The next step is to count only the distcp jobs that copy from/to HDFS (and not to S3).

What's the best way to do that?
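
One possible building block, offered only as a sketch: once the source and target URIs of a job are known, classifying them by scheme is straightforward. How to obtain those URIs per job is left open here and is the real unknown. The function names `path_kind` and `is_hdfs_only` are hypothetical, and treating a path without a scheme as HDFS assumes that fs.defaultFS points at HDFS.

```shell
# Classify a path by its URI scheme.
path_kind() {
  case "$1" in
    hdfs://*)               echo hdfs ;;
    s3://*|s3a://*|s3n://*) echo s3 ;;
    *://*)                  echo other ;;
    *)                      echo hdfs ;;  # no scheme: assumes fs.defaultFS is HDFS
  esac
}

# Succeeds only when both the source and the target are on HDFS.
is_hdfs_only() {
  [ "$(path_kind "$1")" = hdfs ] && [ "$(path_kind "$2")" = hdfs ]
}

# Example:
#   is_hdfs_only hdfs://nn1/a hdfs://nn2/b && echo "count this job"
```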