We see very few mappers created for discp copies. Are these mappers being allocated at the block level or at the file level? I.e., does a mapper copy a physical block or does it copy an entire logical file?
You can specify the number of mappers that will be used for the distcp job.
|Maximum number of simultaneous copies
|Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput.
If nothing is specified, the default should be 20 map tasks.
/* Default number of maps to use for DistCp */ public static final int DEFAULT_MAPS = 20;
Thanks for gettng back. Yes--I'm aware of the -m option, but it appears from the documentation that the mappers get a list of HDFS level files and work on these. I'm trying to find out if my understanding is accurate: that unlike a typical map reduce job that deals in single blocks or splits, the distcp maps each get the URI of an entire file or files to copy. Therefore, you might have hundreds of blocks, but if it's all one file, the same mapper will handle all. Is this the case?
Does a mapper copy a physical block or does it copy an entire logical file?
DistCp map tasks are responsible for copying a list of logical files. This differs from typical MapReduce processing, where each map task consumes an input split, which maps 1:1 (usually) to an individual block of an HDFS file. The reason for this is that DistCp needs to preserve not only the block data at the destination, but also the metadata that links an inode with a named path to all of those blocks. Therefore, DistCp needs to use APIs that operate at the file level, not the block level.
The overall architecture of DistCp is to generate what it calls a "copy listing", which is a list of files from the source that need to be copied to the destination, and then partition the work of copying the files in the copy listing to multiple mappers. The Apache documentation for DistCp contains more details on the policies involved in this partitioning.
It is possible that tuning the number of mappers as described in the earlier answer could improve throughput. Particularly for a large cluster at the source, I'd expect increasing the number of mappers to increase overall parallelism and leverage the NIC available on multiple nodes for the data transfer. It's difficult to give general advice on this though. It might take experimentation to tune it for the particular workload involved.