
DistCp on too many files and big size increase the AM local file system


Super Collaborator

I'm trying to do a first DistCp run by creating snapshot S0 on the source and copying S0 to the backup cluster. Because the copied folder contains more than 3,000,000 files and 70 TB of data, the log of the running DistCp job is flooding the application master's local file system. Is there a way to solve this? As a workaround, I'm thinking of running DistCp on each subfolder separately, then creating the S0 snapshot on the source and copying it. Any other smart ideas?


Re: DistCp on too many files and big size increase the AM local file system

Champion

I believe there is no way to suppress the logs, because DistCp keeps a log of each file it attempts to copy as map output.

Re: DistCp on too many files and big size increase the AM local file system

Super Collaborator

So there is no way to save the log to HDFS instead of the application master's local file system?

Can you think of any other workaround for this?

Re: DistCp on too many files and big size increase the AM local file system

Master Guru
Which log lines specifically do you see in the log that is filling the local filesystem? The MR2 AM generally does not log much unless its logger threshold has been set below INFO. Perhaps the CopyCommitter class is logging a lot in your case, given the large number of files? Could you check and confirm?

In any case, the below set of per-job properties can be used to control logging of MR2 jobs:

yarn.app.mapreduce.am.log.level - Controls the AM container's log level
mapreduce.map.log.level - Controls the log level of all Map task containers
mapreduce.reduce.log.level - Controls the log level of all Reduce task containers

All of the above default to INFO. You can override these once you have an actual idea of what log specifically fills your disk in your observed scenario.

A random example, using -D app-level overrides (which must appear before any DistCp specific arguments):

hadoop distcp -Dyarn.app.mapreduce.am.log.level=WARN -Dmapreduce.map.log.level=WARN -prb /src/dir /dst/dir
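If you want to try several log levels without retyping the overrides, something like the following might help. It is only a sketch: the function name build_distcp_cmd is made up for illustration and is not part of Hadoop, and it assumes paths without spaces. It prints the command rather than running it, so you can review it first, and it keeps the -D overrides ahead of the DistCp-specific arguments as described above.

```shell
#!/usr/bin/env bash
# Sketch: assemble the DistCp command line from the example above with a
# chosen log level. build_distcp_cmd is a hypothetical helper, not a Hadoop tool.

build_distcp_cmd() {
  local level="$1" src="$2" dst="$3"
  # The -D overrides must come before DistCp-specific arguments such as -prb.
  printf 'hadoop distcp -Dyarn.app.mapreduce.am.log.level=%s -Dmapreduce.map.log.level=%s -prb %s %s\n' \
    "$level" "$level" "$src" "$dst"
}

# Print the command for review instead of executing it.
build_distcp_cmd WARN /src/dir /dst/dir
```

Once the printed command looks right, it can be copied and run by hand on a node with the Hadoop client installed.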

Re: DistCp on too many files and big size increase the AM local file system

Champion

@Harsh J, if I understand correctly, setting the root logger to ERROR or FATAL will likely produce fewer local logs when running hadoop distcp, assuming everything goes smoothly.
