
number of splits, file size on single-node sandbox

New Contributor

Hi, I am carrying out TestDFSIO performance tests under MapReduce2 / YARN in order to get a deeper understanding of YARN and its behavior when it comes to the number of mappers and reducers, on a single-node sandbox running in Docker. I understand that the behavior should depend on the number of splits of the input data and on the configuration determining the number of reducers.

A) Number of mappers and reducers. My values are:

hdfs getconf -confKey yarn.nodemanager.resource.memory-mb 2250
hdfs getconf -confKey yarn.nodemanager.resource.cpu-vcores 8
hdfs getconf -confKey mapreduce.map.memory.mb 250
hdfs getconf -confKey mapreduce.map.cpu.vcores 1
hdfs getconf -confKey mapreduce.reduce.memory.mb 250
hdfs getconf -confKey mapreduce.reduce.cpu.vcores 1

Following the formulas in https://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, I am expecting up to 8 simultaneous mappers and reducers.

B) Input splits:

hdfs getconf -confKey dfs.blocksize 134217728 # 128 MB
hdfs getconf -confKey mapred.max.split.size (not set, so it defaults to a very large value and effectively does not constrain the split size)
hdfs getconf -confKey mapred.min.split.size 0
hdfs getconf -confKey dfs.replication 1 (as I am on a sandbox)

In my case I would expect the split size to be 128 MB, according to the formula splitSize = max(minSplitSize, min(maxSplitSize, blockSize)).

Now I have set up DFSIO runs in order to test the behavior, always reading and writing 10 GB of data. For example, the command to process 10 files of size 1 GB is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read|write -nrFiles 10 -fileSize 1000

I have carried out several experiments with corresponding reads and writes, cleaning up after each run. I am having problems understanding the patterns. In particular, I would expect the number of splits to change when the file size exceeds the split size. However, the number of splits corresponds exactly to the number of files, even when a single file exceeds the split size of 128 MB. I have collated a PDF to clarify that point; I would expect the splits to change in the rows marked green. What am I getting wrong here?

Thank you very much! Christian

analyse-dfsio-write-test.pdf

2 Replies

New Contributor

Alas, the formatting of my post was lost, so here it is, redone:

Hi, I am carrying out TestDFSIO performance tests under MapReduce2 / YARN in order to get a deeper understanding of YARN and its behavior when it comes to the number of mappers and reducers, on a single-node sandbox running in Docker.

I understand that the behavior should depend on the number of splits of the input data and on the configuration determining the number of reducers.

A) Number of mappers and reducers. My values are:

hdfs getconf -confKey yarn.nodemanager.resource.memory-mb 2250

hdfs getconf -confKey yarn.nodemanager.resource.cpu-vcores 8

hdfs getconf -confKey mapreduce.map.memory.mb 250

hdfs getconf -confKey mapreduce.map.cpu.vcores 1

hdfs getconf -confKey mapreduce.reduce.cpu.vcores 1

hdfs getconf -confKey mapreduce.reduce.memory.mb 250

Following the formulas in https://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, I am expecting up to 8 simultaneous mappers and reducers: memory allows 2250 / 250 = 9 containers per node, but the 8 vcores cap the count at 8.
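For illustration, a minimal sketch (not the actual scheduler code) of the min-of-resources arithmetic I am assuming, with the values above hard-coded; note it ignores the rounding a real scheduler applies via yarn.scheduler.minimum-allocation-mb:

    public class ContainerBound {
        public static void main(String[] args) {
            long nodeMemoryMb = 2250; // yarn.nodemanager.resource.memory-mb
            long taskMemoryMb = 250;  // mapreduce.{map,reduce}.memory.mb
            long nodeVcores   = 8;    // yarn.nodemanager.resource.cpu-vcores
            long taskVcores   = 1;    // mapreduce.{map,reduce}.cpu.vcores
            // YARN grants a container only if both memory and vcores fit,
            // so the upper bound is the smaller of the two quotients.
            long byMemory = nodeMemoryMb / taskMemoryMb; // 2250 / 250 = 9
            long byVcores = nodeVcores / taskVcores;     // 8 / 1 = 8
            System.out.println(Math.min(byMemory, byVcores)); // prints 8
        }
    }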

B) Input Splits:

hdfs getconf -confKey dfs.blocksize 134217728 # 128MB

hdfs getconf -confKey mapred.max.split.size (not set, so it defaults to a very large value and effectively does not constrain the split size)

hdfs getconf -confKey mapred.min.split.size 0

hdfs getconf -confKey dfs.replication 1 (as I am on a sandbox)

In my case I would expect the split size to be 128 MB, according to the formula splitSize = max(minSplitSize, min(maxSplitSize, blockSize)).
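This max/min expression matches, as far as I can tell, what Hadoop's FileInputFormat.computeSplitSize does; a minimal sketch with my values plugged in:

    public class SplitSize {
        // Mirrors the formula above (and FileInputFormat.computeSplitSize).
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }
        public static void main(String[] args) {
            long blockSize = 134217728L;     // dfs.blocksize (128 MB)
            long minSize   = 0L;             // mapred.min.split.size
            long maxSize   = Long.MAX_VALUE; // mapred.max.split.size unset
            // With max unset and min 0, the block size wins: 134217728.
            System.out.println(computeSplitSize(blockSize, minSize, maxSize));
        }
    }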

Now I have set up DFSIO runs in order to test the behavior, always reading and writing 10 GB of data. For example, the command to process 10 files of size 1 GB is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read|write -nrFiles 10 -fileSize 1000

I have carried out several experiments with corresponding reads and writes, cleaning up after each run. I am having problems understanding the patterns. In particular, I would expect the number of splits to change when the file size exceeds the split size. However, the number of splits corresponds exactly to the number of files, even when a single file exceeds the split size of 128 MB. I have collated a PDF to clarify that point; I would expect the splits to change in the rows marked green. What am I getting wrong here?
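To make my expectation concrete, a minimal sketch of the split count I would expect per file, i.e. ceil(fileSize / splitSize), ignoring the small slop factor FileInputFormat allows on the last split:

    public class ExpectedSplits {
        // Expected splits for one file: ceiling division by the split size.
        static long expectedSplits(long fileSizeBytes, long splitSizeBytes) {
            return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
        }
        public static void main(String[] args) {
            long splitSize = 134217728L;          // 128 MB
            long fileSize  = 1000L * 1024 * 1024; // -fileSize 1000 (MB)
            // One 1000 MB file: 8 splits expected, so 10 files should give
            // 80 map tasks; instead I observe exactly 10, one per file.
            System.out.println(expectedSplits(fileSize, splitSize)); // prints 8
        }
    }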

Thank you very much! Christian

New Contributor

In addition, I had uploaded the wrong result files.

So here it is again, this time with the appropriate link.

Extremely sorry 😞

Christian
