Created 05-11-2017 01:35 PM
Hi, I am carrying out TestDFSIO performance tests under MapReduce2 / YARN in order to get a deeper understanding of YARN and its behavior with respect to the number of mappers and reducers, on a single-node sandbox running in Docker.
I understand that the behavior should depend on the number of splits of the input data and on the configuration determining the number of reducers.
A) Number of mappers and reducers. My values are:
hdfs getconf -confKey yarn.nodemanager.resource.memory-mb 2250
hdfs getconf -confKey yarn.nodemanager.resource.cpu-vcores 8
hdfs getconf -confKey mapreduce.map.memory.mb 250
hdfs getconf -confKey mapreduce.map.cpu.vcores 1
hdfs getconf -confKey mapreduce.reduce.cpu.vcores 1
hdfs getconf -confKey mapreduce.reduce.memory.mb 250
Following the formulas in https://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, I am expecting up to 8 simultaneous mappers and reducers.
B) Input splits:
hdfs getconf -confKey dfs.blocksize 134217728 # 128 MB
hdfs getconf -confKey mapred.max.split.size - no value set, so it should default to a very large number and not matter
hdfs getconf -confKey mapred.min.split.size 0
hdfs getconf -confKey dfs.replication 1 - as I am on a sandbox
In my case I would expect the split size to be 128 MB, according to the formula: result = max(min_split_size, min(max_split_size, dfs_blksize))
Now I have set up DFSIO runs in order to test the behavior, always reading and writing 10 GB of data. For example, the command to process 10 files of 1 GB each is:
$ hadoop jar hadoop-*test*.jar TestDFSIO -read|write -nrFiles 10 -fileSize 1000
I have carried out several experiments with corresponding reads and writes, cleaning up after each run. I am having problems understanding the patterns: in particular, I would expect the number of splits to change once the file size exceeds the split size. However, the number of splits corresponds exactly to the number of files, even when a single file exceeds the split size of 128 MB.
I have collated a PDF to clarify that point. I would expect the splits to change in the rows that are marked green. What am I getting wrong here?
Thank you very much! Christian
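For reference, the arithmetic behind the expectations above can be sketched in a few lines of Python (this is my own sanity check, not part of the original post; the variable names are mine, and the values are copied from the hdfs getconf output quoted above):

```python
# Values as quoted in the post (hdfs getconf output).
dfs_blocksize = 134217728        # dfs.blocksize, 128 MB
min_split_size = 0               # mapred.min.split.size
max_split_size = 2**63 - 1       # mapred.max.split.size unset -> effectively "a really big number"

# Split-size formula from the post: max(min_split_size, min(max_split_size, dfs_blksize))
split_size = max(min_split_size, min(max_split_size, dfs_blocksize))
print(split_size)                # 134217728, i.e. 128 MB as expected

nm_memory_mb = 2250              # yarn.nodemanager.resource.memory-mb
nm_vcores = 8                    # yarn.nodemanager.resource.cpu-vcores
map_memory_mb = 250              # mapreduce.map.memory.mb
map_vcores = 1                   # mapreduce.map.cpu.vcores

# A container fits only while both memory and vcores are available,
# so the concurrent-container limit is the smaller of the two ratios.
by_memory = nm_memory_mb // map_memory_mb    # 9
by_vcores = nm_vcores // map_vcores          # 8
print(min(by_memory, by_vcores))             # 8 simultaneous map containers
```

This reproduces the 128 MB expected split size and the limit of up to 8 simultaneous containers mentioned above (the vcore ratio, 8, is tighter than the memory ratio, 9).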
Created 05-16-2017 09:20 AM