Created 02-11-2016 08:52 AM
Hi:
If I have files arriving in HDFS at about 500 MB daily, what block size and replication factor would you recommend?
Thanks.
Created 02-11-2016 10:07 AM
Which block size is better really depends on your scenario. As a simple example:
Let's assume your cluster has 50 task slots, and for simplicity assume a task needs 5 minutes to analyze 128MB plus 1 minute of map task setup.
If you want to analyze 1.28GB of data at a 128MB block size, you need 10 tasks, which can all run in the cluster in parallel. In total your job takes 5 + 1 = 6 minutes.
With 256MB blocks you need only 5 tasks, but each takes 10 + 1 = 11 minutes, so the job is slower. Here 128MB blocks are faster.
If you have 128GB of data, you need 1000 tasks at a 128MB block size, or 20 waves of 50. That means 20 * 6 = 120 minutes.
With 256MB blocks you need 500 tasks, or 10 waves, i.e. 10 * (10 + 1) = 110 minutes. Now the job is faster because you spend less time on task setup.
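As a back-of-the-envelope sketch of that arithmetic (the 50 slots, 5 minutes per 128MB and 1 minute of setup are the assumed numbers from above, not measured values; 1GB is treated as 1000MB to match the round figures in the example):

```python
import math

def job_minutes(data_mb, block_mb, slots=50, min_per_128mb=5, setup_min=1):
    """Rough wave model: one map task per block, tasks run in waves of `slots`."""
    tasks = math.ceil(data_mb / block_mb)            # one map task per block
    waves = math.ceil(tasks / slots)                 # how often the slots must be refilled
    task_minutes = (block_mb / 128) * min_per_128mb + setup_min
    return waves * task_minutes

for data_mb in (1280, 128_000):                      # 1.28GB and 128GB, as in the example
    for block_mb in (128, 256):
        print(f"{data_mb/1000:6.2f}GB @ {block_mb}MB blocks -> "
              f"{job_minutes(data_mb, block_mb):.0f} min")
```

Running it reproduces the numbers above: 6 vs. 11 minutes for the small job, 120 vs. 110 minutes for the large one.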
It all gets more complicated if you take into account Tez task reuse, compression, the type of analytics you run etc.
ORC, for example, already uses 256MB blocks by default because it can normally skip a lot of data internally. On the other hand, if you run heavy analytic tasks on smaller data (like data mining), a smaller block size might be better because your tasks will be heavily CPU bound and a single block could take a long time. So the answer, as usual, is:
It depends, and you have to try out what works in your specific scenario. 128MB is a good default, but 256MB might work as well. Or it might not.
For the rest: what Artem said. Replication factor of 3, really small files are bad, and HAR files can be a good way to work around them.
Created 02-11-2016 08:55 AM
1. 128MB block size
2. The larger the files, the better
3. Minimum replication factor of 3
It's also good to merge these small files, either with HAR or by rewriting them into bigger files (a rough sketch of the rewrite approach is below).
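As an illustration of the "rewrite into bigger files" option, here is a minimal local sketch that concatenates many small files into one larger file before it is put into HDFS. The paths and file pattern are hypothetical; in practice you might use `hadoop archive` (HAR) or a format-aware merge rather than a raw byte concatenation:

```python
import glob
import shutil

# Hypothetical staging directory with many small daily files. Merging them
# means the NameNode tracks one file instead of hundreds of tiny ones.
small_files = sorted(glob.glob("/data/staging/events-*.log"))

with open("/data/staging/events-merged.log", "wb") as merged:
    for path in small_files:
        with open(path, "rb") as part:
            shutil.copyfileobj(part, merged)   # stream-copy, no full read into memory

# Afterwards upload the merged file, e.g. with `hdfs dfs -put`.
```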
Created 02-11-2016 10:07 AM
Hi:
In which cases is it better to have a 256MB block size?
Created 02-11-2016 10:25 AM
@Benjamin Leonhardi Nicely explained!!! Thanks!
Created 02-11-2016 06:53 PM
(part 1 of 2) Just a quick comment on replication: It is completely separate from issues of block size and file size. It has been established (there's a white paper somewhere) that replication = 3 makes it astronomically unlikely that you'll ever actually lose data within a single site due to hardware failure, as long as
Created 02-12-2016 06:06 PM
I need to find that paper sometime. I actually had that question from a couple of customers. I was sure 3x is safe, but I didn't have any data to back me up.
Created 02-12-2016 07:21 PM
@Benjamin Leonhardi, please see this post: https://community.hortonworks.com/articles/16719/hdfs-data-durability-and-availability-with-replica..... Please up-vote it if you like it 🙂
Created 02-11-2016 06:54 PM
(part 2 of 2) Increasing replication to 4 or more only makes sense in DR situations, where you want, say, 2 copies at each of two sites, but then you should be using Falcon or similar to maintain the DR copies, not HDFS replication. On the other hand, replication = 2 definitely leaves your data vulnerable to "it can't happen to me" scenarios -- it only takes the wrong two disk drives to fail at the same time, and at some point they will. So decreasing replication below 3 only makes sense if it is low value data or you're using RAID or other highly reliable storage to protect your data.
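To make the 2-versus-3 intuition concrete, here is a small back-of-the-envelope sketch (my own illustration, not taken from the white paper). It assumes replicas are placed uniformly at random across disks, which is a simplification of HDFS's rack-aware placement, and the cluster size and failure count are made-up example numbers:

```python
from math import comb

def p_block_lost(total_disks, failed_disks, replication):
    """Probability that one block has *all* of its replicas on the failed disks,
    assuming uniformly random replica placement (a simplification of HDFS)."""
    if failed_disks < replication:
        return 0.0
    return comb(failed_disks, replication) / comb(total_disks, replication)

# Made-up example: 200 disks, 2 of them die before re-replication catches up.
for rep in (2, 3):
    print(f"replication={rep}: P(one block lost) = {p_block_lost(200, 2, rep):.2e}")
```

Even the tiny per-block probability at replication = 2 adds up once you have millions of blocks, whereas with replication = 3 you need three overlapping failures before any block is at risk.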
Created 02-12-2016 12:43 PM
Hi:
I don't understand why:
"With 256MB blocks you need only 5 tasks, but each takes 10 + 1 = 11 minutes, so the job is slower. Here 128MB blocks are faster."