distcp - How to determine number of mappers used by distcp job(s) at cluster level?

Sometimes we run into network bandwidth issues caused by distcp jobs running too many mappers, or by too many distcp jobs running at once.

Our plan is to trigger a DataDog alert when the total number of mappers used by distcp jobs (at the cluster level) reaches a defined number (e.g. 100). We are also open to exploring the "-bandwidth" option.

We have many users who submit jobs from different edge nodes, so we don't want to use the "ps" command at the server level.

Please help us solve this. Thanks in advance.

1 REPLY

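I was able to find the number of mappers used by a distcp job with the command below:

# $app is expected to hold an application attempt ID; the total also counts the ApplicationMaster container.
MAPPERS=$(yarn container -list "$app" | awk -F: '/Total number of containers/ {print $2}')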
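To get from a single job to the cluster-level total described in the question, the same command could be run for every running distcp application and the results summed. The following is only a rough sketch under several assumptions that are not from the original post: distcp jobs keep their default job name (which starts with "distcp"), each attempt's container total includes one ApplicationMaster that should not be counted as a mapper, and the function names and the THRESHOLD variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: total distcp mappers currently running on the cluster.
# Assumptions: job names start with "distcp" (the default), each attempt's
# container count includes one ApplicationMaster, THRESHOLD is illustrative.

THRESHOLD="${THRESHOLD:-100}"

# IDs of RUNNING applications whose name starts with "distcp".
list_distcp_apps() {
  yarn application -list -appStates RUNNING 2>/dev/null |
    awk -F'\t' '$1 ~ /application_/ && $2 ~ /^[ ]*distcp/ { gsub(/[ ]/, "", $1); print $1 }'
}

# Latest application attempt ID for one application ID.
latest_attempt() {
  yarn applicationattempt -list "$1" 2>/dev/null |
    awk '$1 ~ /^appattempt_/ { id = $1 } END { if (id != "") print id }'
}

# Mapper count for one attempt: containers minus the ApplicationMaster.
mappers_for_attempt() {
  yarn container -list "$1" 2>/dev/null |
    awk -F: '/Total number of containers/ { n = $2 - 1; print (n > 0 ? n : 0); found = 1 }
             END { if (!found) print 0 }'
}

# Sum mappers over every running distcp application.
total_distcp_mappers() {
  local total=0 app attempt
  for app in $(list_distcp_apps); do
    attempt=$(latest_attempt "$app")
    [ -n "$attempt" ] || continue
    total=$(( total + $(mappers_for_attempt "$attempt") ))
  done
  echo "$total"
}

# Print "ALERT <n>" when the total reaches THRESHOLD, otherwise "OK <n>".
report() {
  local total
  total=$(total_distcp_mappers)
  if [ "$total" -ge "$THRESHOLD" ]; then echo "ALERT $total"; else echo "OK $total"; fi
}
```

Calling report from a scheduled job on one node that has the yarn client would avoid running "ps" on every edge node; its output could then be passed to whatever DataDog check is already in place. Each yarn call starts a JVM, so on a busy cluster this may be slow.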

 

The next step is to count only distcp jobs that copy from/to HDFS (and not to S3).

What's the best way to do that?