Best practices for block size, file size and replication factor in HDFS

Master Collaborator

Hi:

If I have files arriving in HDFS at about 500 MB per day, what block size and replication factor would you recommend?

Thanks.

1 ACCEPTED SOLUTION

Master Guru

It really depends on your scenario which block size is better. As a simple example:

Let's assume your cluster has 50 task slots, and for simplicity that a task needs 5 minutes to analyze 128 MB plus 1 minute to set up a map task.

If you want to analyze 1.28 GB of data, you need 10 tasks, which can all run in the cluster in parallel. In total your job takes 5 + 1 = 6 minutes.

With 256 MB blocks you only need 5 tasks, but each takes 10 + 1 = 11 minutes, so the job is slower. Here the 128 MB blocks are faster.

If you have 128 GB of data, you need roughly 1000 tasks at a 128 MB block size, i.e. 20 waves of 50 tasks. This means the job takes 20 * 6 = 120 minutes.

With 256 MB blocks you only need 10 waves, i.e. 10 * (10 + 1) = 110 minutes. Here the job is faster because you pay less task setup time.
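
To make the arithmetic concrete, here is a minimal sketch of that simplified wave model (fixed setup time, processing time proportional to block size, tasks running 50 at a time). The numbers are the assumed ones from above, not measurements:

```java
// Simplified wave model: processing time scales with block size, each task
// pays a fixed setup cost, and tasks run in waves of <slots> at a time.
// The inputs (50 slots, 1 min setup, 5 min per 128 MB) are assumptions.
public class BlockSizeModel {

    static double jobMinutes(double dataMb, double blockMb, int slots,
                             double setupMin, double minPer128Mb) {
        long tasks = (long) Math.ceil(dataMb / blockMb);           // one map task per block
        long waves = (long) Math.ceil((double) tasks / slots);     // tasks run <slots> at a time
        double taskMin = setupMin + (blockMb / 128.0) * minPer128Mb; // duration of one task
        return waves * taskMin;                                    // waves run back to back
    }

    public static void main(String[] args) {
        // 1.28 GB of data: one wave either way
        System.out.println(jobMinutes(1280, 128, 50, 1, 5));    // ~6 minutes
        System.out.println(jobMinutes(1280, 256, 50, 1, 5));    // ~11 minutes
        // 128 GB of data: setup overhead starts to favour bigger blocks
        System.out.println(jobMinutes(128_000, 128, 50, 1, 5)); // ~120 minutes
        System.out.println(jobMinutes(128_000, 256, 50, 1, 5)); // ~110 minutes
    }
}
```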

It all gets more complicated if you take into account Tez task reuse, compression, the type of analytics you run etc.

ORC, for example, already uses 256 MB blocks by default because it normally can skip a lot of data internally. On the other hand, if you run heavy analytic tasks on smaller data (like data mining), a smaller block size might be better, because your tasks will be heavily CPU bound and a single block could take a long time. So the answer, as usual, is:

It depends, and you have to try out for yourself what works in your specific scenario. 128 MB is a good default, but 256 MB might work as well. Or it might not.

For the rest: what Artem said, a replication factor of 3, really small files are bad, and HAR files can be a good way to compensate for them.
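
For the original question about the 500 MB daily files, here is a minimal sketch, assuming the standard Hadoop FileSystem Java API, of how block size and replication can be set per file at write time (cluster-wide defaults come from dfs.blocksize and dfs.replication in hdfs-site.xml). The path and sizes below are just placeholders:

```java
// Sketch only: writes one file with an explicit replication factor and block
// size through the Hadoop FileSystem API. The path and values are examples,
// not recommendations for every workload.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path out = new Path("/data/daily/2016-05-01.csv"); // hypothetical path
        short replication = 3;                             // the usual default
        long blockSize = 256L * 1024 * 1024;               // 256 MB blocks for this file

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream stream =
                 fs.create(out, true, 4096, replication, blockSize)) {
            stream.writeBytes("example record\n");
        }

        // Replication can also be changed after the file exists
        fs.setReplication(out, (short) 2);
    }
}
```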


11 REPLIES

Master Guru

It would theoretically be fastest if you could run the query in one go, i.e. if you have 50 free task slots in your cluster and can run all 50 tasks at the same time. So for small amounts of data, small block sizes result in more tasks and more parallelism, and therefore more speed; smaller blocks give you high parallelism and fast response times. But they come with a task creation overhead. I tried to give an example of the concrete cases in which you would see better results with small or big blocks.

Master Mentor