
number of splits, file size on single core sandbox



Hi, I am carrying out TestDFSIO performance tests under MapReduce2 / YARN in order to get a deeper understanding of YARN and its behavior when it comes to the number of mappers and reducers, on a single-node sandbox running in Docker. I understand that the behavior should depend on the number of splits of the input data and on the configuration determining the number of reducers.

A) Number of mappers and reducers. My values are:

hdfs getconf -confKey yarn.nodemanager.resource.memory-mb 2250

hdfs getconf -confKey yarn.nodemanager.resource.cpu-vcores 8

hdfs getconf -confKey mapreduce.map.memory.mb 250

hdfs getconf -confKey mapreduce.map.cpu.vcores 1

hdfs getconf -confKey mapreduce.reduce.cpu.vcores 1

hdfs getconf -confKey mapreduce.reduce.memory.mb 250

Following the formulas here (https://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/), I am expecting up to 8 simultaneous mappers and reducers.

B) Input splits:

hdfs getconf -confKey dfs.blocksize 134217728 # 128 MB

hdfs getconf -confKey mapred.max.split.size - no value is set, so it should default to a very large number and therefore not constrain the split size

hdfs getconf -confKey mapred.min.split.size 0

hdfs getconf -confKey dfs.replication 1 - replication is 1 as I am on a sandbox

In my case I would expect the split size to be 128 MB, according to the formula: split_size = max(min_split_size, min(max_split_size, dfs_blocksize))

Now I have set up DFSIO runs in order to test the behavior, always reading and writing 10 GB of data in total. For example, the command to process 10 files of 1 GB each is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read|write -nrFiles 10 -fileSize 1000

I have carried out several experiments with corresponding reads and writes, cleaning up after each run. I am having problems understanding the patterns: in particular, I would expect the number of splits to change when the file size exceeds the split size. However, the number of splits corresponds exactly to the number of files, even if a single file exceeds the split size of 128 MB.

I have collated a PDF to clarify that point; I would expect the splits to change in the rows that are marked green. What am I getting wrong here?

Thank you very much! Christian

analyse-dfsio-write-test.pdf

2 Replies

Re: number of splits, file size on single core sandbox


Alas, the formatting of my post was lost, so here it is, redone:

Hi, I am carrying out TestDFSIO performance tests under MapReduce2 / YARN in order to get a deeper understanding of YARN and its behavior when it comes to the number of mappers and reducers, on a single-node sandbox running in Docker.

I understand that the behavior should depend on the number of splits of the input data and on the configuration determining the number of reducers.

A) Number of mappers and reducers. My values are:

hdfs getconf -confKey yarn.nodemanager.resource.memory-mb 2250

hdfs getconf -confKey yarn.nodemanager.resource.cpu-vcores 8

hdfs getconf -confKey mapreduce.map.memory.mb 250

hdfs getconf -confKey mapreduce.map.cpu.vcores 1

hdfs getconf -confKey mapreduce.reduce.cpu.vcores 1

hdfs getconf -confKey mapreduce.reduce.memory.mb 250

Following the formulas here, I am expecting up to 8 simultaneous mappers and reducers: https://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
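To make that expectation concrete, here is a rough back-of-the-envelope check in plain bash (my own sketch, assuming the blog post's rule that the node supports min(memory-based, vcore-based) containers and that map and reduce containers are sized identically):

node_mem_mb=2250        # yarn.nodemanager.resource.memory-mb
node_vcores=8           # yarn.nodemanager.resource.cpu-vcores
container_mem_mb=250    # mapreduce.map.memory.mb (same as mapreduce.reduce.memory.mb)
container_vcores=1      # mapreduce.map.cpu.vcores (same as mapreduce.reduce.cpu.vcores)

by_mem=$(( node_mem_mb / container_mem_mb ))   # 9 containers fit by memory
by_cpu=$(( node_vcores / container_vcores ))   # 8 containers fit by vcores
echo $(( by_mem < by_cpu ? by_mem : by_cpu ))  # -> up to 8 concurrent containers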

B) Input Splits:

hdfs getconf -confKey dfs.blocksize 134217728 # 128MB

hdfs getconf -confKey mapred.max.split.size - no value is set, so it should default to a very large number and therefore not constrain the split size

hdfs getconf -confKey mapred.min.split.size 0

hdfs getconf -confKey dfs.replication 1 - replication is 1 as I am on a sandbox

In my case I would expect the split size to be 128 MB, according to the formula: split_size = max(min_split_size, min(max_split_size, dfs_blocksize))
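Spelled out with the values above (a small bash sketch; the max_split value is just a stand-in I picked for "no value set, effectively unbounded"):

blocksize=134217728                # dfs.blocksize (128 MB)
min_split=0                        # mapred.min.split.size
max_split=9223372036854775807      # mapred.max.split.size: not set, assumed effectively unbounded
inner=$(( max_split < blocksize ? max_split : blocksize ))   # min(max_split, blocksize) = 134217728
split=$(( min_split > inner ? min_split : inner ))           # max(min_split, inner)     = 134217728
echo "expected split size: $split bytes"                     # i.e. 128 MB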

Now I have set up DFSIO runs in order to test the behavior, always reading and writing 10 GB of data in total. For example, the command to process 10 files of 1 GB each is:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read|write -nrFiles 10 -fileSize 1000
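Concretely, one complete experiment cycle (the write, the corresponding read, and the cleanup between runs, here sketched with TestDFSIO's -clean option) looks roughly like this:

$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
$ hadoop jar hadoop-*test*.jar TestDFSIO -clean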

I have carried out several experiments with corresponding reads and writes, cleaning up after each run. I am having problems understanding the patterns: in particular, I would expect the number of splits to change when the file size exceeds the split size. However, the number of splits corresponds exactly to the number of files, even if a single file exceeds the split size of 128 MB.

I have collated a PDF to clarify that point; I would expect the splits to change in the rows that are marked green. What am I getting wrong here?

Thank you very much! Christian


Re: number of splits, file size on single core sandbox


In addition, I had uploaded the wrong result files.

So here it is again, this time with the appropriate link.

Extremely sorry :-(

Christian