Support Questions
Find answers, ask questions, and share your expertise

Best practices between size block , size file and replication factor in HDFS

Super Collaborator

Hi:

If I have files in HDFS with 500mb diary, which recomend me about size blocks, replication factor??

Thanks.

1 ACCEPTED SOLUTION

It really depends on your scenario which block size is better. As a simple example:

Let's assume your cluster has 50 task slots. Let's for simplicity assume that your task needs 5minutes to analyze 128MB and 1minute to set up a map task.

So if you want to analyze 1.28GB of data. You need 10 tasks which can run in the cluster in parallel. So in total your job takes 5+1 minutes = 6 minutes.

If you have 256MB blocks you need 5 tasks. They will take 10+1 = 11 minutes and will be slower. So 128MB blocks are faster.

If you have 128GB of data you need 1000 tasks at 128MB block size or 20 waves. This means you need 20 * 6 = 120 minutes.

If you have 256MB blocks you need 10 waves or 10 * 10+1minutes = 110 minutes. So your task is faster because you have less task setup time.

It all gets more complicated if you take into account Tez task reuse, compression, the type of analytics you run etc.

ORC for example already has 256MB blocks per default because it normally can skip a lot of data internally. On the other hand if you run heavy analytic tasks on smaller data (like data mining) a smaller block size might be better because your task will be heavily CPU bound and a single block could take a long time. So the answer as usually is:

It depends and you have to try it out for yourself what works in your specific scenario. 128MB is a good default but 256MB might work as well. Ir not.

For the rest: What Artem said, 3 times + replication, really small files are bad and HAR files can be a good way to adjust for them.

View solution in original post

11 REPLIES 11

Mentor

@Roberto Sancho

1. 128mb block

2.larger files the better

3. Minimum factor of 3

It's also good to merge these files either using HAR or rewrite into bigger files

It really depends on your scenario which block size is better. As a simple example:

Let's assume your cluster has 50 task slots. Let's for simplicity assume that your task needs 5minutes to analyze 128MB and 1minute to set up a map task.

So if you want to analyze 1.28GB of data. You need 10 tasks which can run in the cluster in parallel. So in total your job takes 5+1 minutes = 6 minutes.

If you have 256MB blocks you need 5 tasks. They will take 10+1 = 11 minutes and will be slower. So 128MB blocks are faster.

If you have 128GB of data you need 1000 tasks at 128MB block size or 20 waves. This means you need 20 * 6 = 120 minutes.

If you have 256MB blocks you need 10 waves or 10 * 10+1minutes = 110 minutes. So your task is faster because you have less task setup time.

It all gets more complicated if you take into account Tez task reuse, compression, the type of analytics you run etc.

ORC for example already has 256MB blocks per default because it normally can skip a lot of data internally. On the other hand if you run heavy analytic tasks on smaller data (like data mining) a smaller block size might be better because your task will be heavily CPU bound and a single block could take a long time. So the answer as usually is:

It depends and you have to try it out for yourself what works in your specific scenario. 128MB is a good default but 256MB might work as well. Ir not.

For the rest: What Artem said, 3 times + replication, really small files are bad and HAR files can be a good way to adjust for them.

Super Collaborator

Hi:

In whicj cases is better have 256MB block size?

@Benjamin Leonhardi Nicely explained!!! Thanks!

(part 1 of 2) Just a quick comment on replication: It is completely separate from issues of block size and file size. It has been established (there's a white paper somewhere) that replication = 3 makes it astronomically unlikely that you'll ever actually lose data within a single site due to hardware failure, as long as

  1. HDFS has plenty of datanodes and drives provisioned,
  2. dead drives on datanodes get replaced within a day or two of dying, and
  3. your site integrity doesn't suffer a catastrophic site loss.

I need to find the paper sometimes. I actually had that question at a couple customers. I was sure 3x is safe but I didn't have any data to back me up.

(part 2 of 2) Increasing replication to 4 or more only makes sense in DR situations, where you want, say, 2 copies at each of two sites, but then you should be using Falcon or similar to maintain the DR copies, not HDFS replication. On the other hand, replication = 2 definitely leaves your data vulnerable to "it can't happen to me" scenarios -- it only takes the wrong two disk drives to fail at the same time, and at some point they will. So decreasing replication below 3 only makes sense if it is low value data or you're using RAID or other highly reliable storage to protect your data.

Super Collaborator

Hi:

I dont undestand why:

"If you have 256MB blocks you need 5 tasks. They will take 10+1 = 11 minutes and will be slower. So 128MB blocks are faster."

If you can run the query in one go. I.e. if you have 50 task slots free in your cluster. It would be theoretically fastest if you could run 50 tasks at the same time. So for small data amounts, small block sizes will result in more tasks and more parallelity ergo more speed. So smaller blocks guarantee high parallelity and fast response times. But they have a task creation overhead. I tried to give the example in which concrete cases you would see better results with small or big blocks.

; ;