11-20-2016 06:35 AM
For instance, at the end of a MapReduce job the following is reported:
File System Counters
FILE: Number of bytes read=42924972694
Versus this number....
HDFS: Number of bytes read=272906990810
My assumption is that you want to minimize file system reads. If so, what tuning parameters, configuration items, etc. contribute to this number, or is it all determined by the MapReduce job itself? I'm trying to reconcile performance differences across various clusters.
01-17-2017 05:07 PM
So the answer is really that what you are noticing is job-specific. Depending on the job, the mappers and reducers will write more or fewer bytes to the local file system than to HDFS.
In your mapper's case, a similar amount of data was read from both local and HDFS locations, so there is no problem there. Your mapper code just happens to need to read about the same amount of data locally as it reads from HDFS. Most of the time, mappers are used to analyze more data than fits in their RAM, so it's not surprising to see them writing the data they get from HDFS to a local drive. The bytes read from HDFS and from local disk will not always appear to sum to the local write size (and they don't in your case either).
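If you want to compare the same job across clusters, one option is to pull the two counters out of each job report and look at how many local bytes are read per HDFS byte. Here is a minimal sketch in Python, assuming the report lines look exactly like the ones posted above; the function names `parse_fs_counters` and `local_to_hdfs_read_ratio` are made up for illustration and are not part of Hadoop.

```python
import re

# Matches counter lines such as "FILE: Number of bytes read=42924972694".
# Assumption: the report uses the same "<SCHEME>: Number of bytes <op>=<n>" layout as the post.
COUNTER_RE = re.compile(r"^\s*(\w+): Number of bytes (read|written)=(\d+)\s*$")


def parse_fs_counters(report: str) -> dict:
    """Return {(scheme, op): bytes} for each file system counter line found."""
    counters = {}
    for line in report.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            scheme, op, value = match.groups()
            counters[(scheme, op)] = int(value)
    return counters


def local_to_hdfs_read_ratio(counters: dict) -> float:
    """Local-disk bytes read per byte read from HDFS (hypothetical helper)."""
    return counters[("FILE", "read")] / counters[("HDFS", "read")]


# The counter values quoted in the question.
report = """\
File System Counters
FILE: Number of bytes read=42924972694
HDFS: Number of bytes read=272906990810
"""

counters = parse_fs_counters(report)
ratio = local_to_hdfs_read_ratio(counters)
print(f"local bytes read per HDFS byte read: {ratio:.3f}")
```

Running this against the report from each cluster for the same job and input gives one number per cluster to compare. If the ratio differs a lot between clusters, that would suggest looking at the cluster configuration rather than the job code, though that is only a starting point for investigation.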