Expert Contributor
Posts: 357
Registered: ‎01-25-2017

DistCp on too many files and big size increase the AM local file system

I'm trying to do a first DistCp run by creating snapshot S0 on the source and copying S0 to the backup cluster. Because the copied folder contains more than 3,000,000 files and 70 TB, the running DistCp's log is flooding the application master's local file system. Is there a way to solve this? As a workaround, I'm thinking of running DistCp on each subfolder separately, then creating the S0 snapshot on the source and copying it. Any other smart ideas?

Champion
Posts: 768
Registered: ‎05-16-2016

Re: DistCp on too many files and big size increase the AM local file system

I believe there is no way to suppress the logs, because DistCp logs each file it attempts to copy as map output.

Expert Contributor
Posts: 357
Registered: ‎01-25-2017

Re: DistCp on too many files and big size increase the AM local file system

So there is no way to save the log to HDFS instead of the application master's local file system?

Can you think of any other workaround for this?

Posts: 1,885
Kudos: 423
Solutions: 298
Registered: ‎07-31-2013

Re: DistCp on too many files and big size increase the AM local file system

What log lines specifically do you find in the log that is filling the local filesystem? The MR2 AM generally does not log much unless its logger threshold has been set below INFO. Perhaps the CopyCommitter class is the one logging a lot in your case, given the high file count? Could you check and confirm?

In any case, the following per-job properties can be used to control logging of MR2 jobs:

yarn.app.mapreduce.am.log.level - Controls the AM container's log level
mapreduce.map.log.level - Controls the log level of all Map task containers
mapreduce.reduce.log.level - Controls the log level of all Reduce task containers

All of the above default to INFO. You can override these once you have an actual idea of what log specifically fills your disk in your observed scenario.

A random example, using -D app-level overrides (which must appear before any DistCp specific arguments):

hadoop distcp -Dyarn.app.mapreduce.am.log.level=WARN -Dmapreduce.map.log.level=WARN -prb /src/dir /dst/dir
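
To make the argument ordering harder to get wrong, the command above could be wrapped in a small helper. This is only a sketch: build_distcp_cmd is a hypothetical function name, not part of Hadoop, and it only prints the command (a dry run) rather than executing it. The property names and the -prb flag are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): prints a DistCp command line
# with the log-level overrides placed before the DistCp-specific arguments.
build_distcp_cmd() {
    am_level="$1"    # AM container log level, e.g. WARN
    map_level="$2"   # Map task container log level, e.g. WARN
    src="$3"         # source path
    dst="$4"         # destination path
    printf 'hadoop distcp -Dyarn.app.mapreduce.am.log.level=%s -Dmapreduce.map.log.level=%s -prb %s %s\n' \
        "$am_level" "$map_level" "$src" "$dst"
}

# Dry run: print the command instead of executing it.
build_distcp_cmd WARN WARN /src/dir /dst/dir
```

Once the printed command looks right, it can be run as-is. Raising a level to ERROR is done the same way, by passing ERROR as the first or second argument.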
Champion
Posts: 768
Registered: ‎05-16-2016

Re: DistCp on too many files and big size increase the AM local file system

@Harsh J, if I understand correctly, setting the root logger to ERROR or FATAL will likely produce fewer local logs when running hadoop distcp, assuming everything goes smoothly.
