
HDFS : Blocksize in ingest process


Explorer

Hi all,

My HDFS cluster receives files from external partners. Each file is large, around 40 GB, and arrives split into parts of about 80 MB each.

My HDFS block size is set to 128 MB.

My question: is it better to split the big file into parts of around 128 MB, or can I leave them at 80 MB?

What is the impact on performance and wasted space?

thanks
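
For context, here is a minimal sketch of how one of those ingested parts could be inspected and compared against the 128 MB block size. It assumes the standard org.apache.hadoop.fs Java API; the path /data/ingest/part_001 is only a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path to one of the ingested 80 MB part files
        Path part = new Path("/data/ingest/part_001");
        FileStatus status = fs.getFileStatus(part);

        System.out.println("File length : " + status.getLen());
        System.out.println("Block size  : " + status.getBlockSize());

        // One BlockLocation per block; an 80 MB part with a 128 MB block size
        // should show a single block whose length is the actual file size
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset() + " length=" + b.getLength());
        }
        fs.close();
    }
}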


Re: HDFS : Blocksize in ingest process

Super Guru
@mayki wogno

It really depends on your record sizes. Imagine a 300 MB file and a 128 MB block size. Further imagine that the file has only 3 records and each record is 100 MB (I am assuming 100 MB for simplicity; change this number to your specific record size and do the math).

Given the 128 MB block size, the block boundaries will not line up with record boundaries, which means some records will be split across blocks. Your first block will get 100 MB, which is the entire first record, but it will also get 28 MB of the second record. Your second record is therefore distributed between block 1 and block 2, where block 2 gets its remaining 72 MB. The same block 2 will also get 56 MB of record 3, and block 3 will hold the remaining 44 MB of record 3.

Now, when a mapper reads block 1, it cannot process record 2 because it only has part of it. This is where the input split comes in. The input split divides the data along the logical boundaries of the records. With an input split of 100 MB, you will have 3 mappers and each mapper will get 1 record.

If I make the input split 200 MB, then I will have only two mappers, where the first mapper processes 2 records and the second processes 1 record. This means less parallelism.
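
To make the mapper counts above concrete, here is a minimal sketch of where the split size is controlled. It assumes the new-API org.apache.hadoop.mapreduce classes; the job name and input path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        FileInputFormat.addInputPath(job, new Path("/data/ingest"));

        // Split size is max(minSplitSize, min(maxSplitSize, blockSize)).
        // Forcing ~100 MB splits gives 3 mappers for a 300 MB file.
        FileInputFormat.setMinInputSplitSize(job, 100L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);

        // To get 200 MB splits (2 mappers for the same file), raise the
        // minimum split size to 200 MB instead.

        // ... set mapper/reducer classes and the output path, then submit.
    }
}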

This is why the input split should ideally equal the block size. Now, depending on your record size, ideally you want your split to be 128 MB and not 80 MB. Calculate that for yourself and see if an 80 MB split size makes sense. And if 80 MB makes sense, then why not change the block size to 80 MB for this particular file?
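
If you do decide to use a different block size just for these files, it can be set per file at write time. Here is a minimal sketch assuming the org.apache.hadoop.fs Java API; the target path is hypothetical and the 80 MB figure is from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 80L * 1024 * 1024;   // 80 MB blocks for this file only
        short replication = 3;                // typical default replication
        int bufferSize = 4096;

        // This create() overload overrides dfs.blocksize for the new file
        Path target = new Path("/data/ingest/part_001");
        try (FSDataOutputStream out =
                 fs.create(target, true, bufferSize, replication, blockSize)) {
            // ... write the bytes of the incoming part file here ...
        }
        fs.close();
    }
}

The same override is commonly done from the shell with something like hdfs dfs -D dfs.blocksize=83886080 -put <localfile> /data/ingest/ (check the exact behaviour on your version).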


Re: HDFS : Blocksize in ingest process

Explorer

I will explain my workflow in more detail. My application receives 50 part_*** files, each around 80 MB. My understanding is that mappers run where the data is, so I have 50 mappers running in parallel. That works fine.

After the map phase I have 60 files of different sizes. The problem comes when the mapper output is sent to the reducers; the reducer count is set to 50. This reduce phase takes far too long to finish.

My cluster configuration: 56 datanodes, 64 GB of memory and 8 CPUs.

How can I check where the bottleneck is?

Where are the reducers instantiated? Near the data, or wherever resources are available? Thanks
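
For reference, the reducer count mentioned above is a per-job setting. A minimal sketch assuming the new-API org.apache.hadoop.mapreduce.Job (the value 50 is taken from the description in this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "ingest-job");

        // Equivalent to passing -D mapreduce.job.reduces=50 on the command line
        job.setNumReduceTasks(50);

        // ... set mapper/reducer classes and paths, then job.waitForCompletion(true)
    }
}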
