Created 02-11-2016 08:52 AM
Hi:
If I have files arriving in HDFS at about 500 MB daily, what block size and replication factor would you recommend?
Thanks.
Created 02-11-2016 10:07 AM
Which block size is better really depends on your scenario. As a simple example:
Let's assume your cluster has 50 task slots, and for simplicity assume a task needs 5 minutes to analyze 128MB plus 1 minute of map task setup.
If you want to analyze 1.28GB of data at a 128MB block size, you need 10 tasks, which can all run in the cluster in parallel. In total your job takes 5 + 1 = 6 minutes.
With 256MB blocks you need only 5 tasks, but each takes 10 + 1 = 11 minutes, so the job is slower. Here 128MB blocks are faster.
If you have 128GB of data, you need 1000 tasks at a 128MB block size, or 20 waves of 50. That means 20 * 6 = 120 minutes.
With 256MB blocks you need 500 tasks, or 10 waves, i.e. 10 * (10 + 1) = 110 minutes. Now the job is faster because you spend less time on task setup.
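As a back-of-the-envelope sketch of that arithmetic (the 50 slots, 5 minutes per 128MB and 1 minute of setup are the assumed numbers from above, not measured values; 1GB is treated as 1000MB to match the round figures in the example):

```python
import math

def job_minutes(data_mb, block_mb, slots=50, min_per_128mb=5, setup_min=1):
    """Rough wave model: one map task per block, tasks run in waves of `slots`."""
    tasks = math.ceil(data_mb / block_mb)            # one map task per block
    waves = math.ceil(tasks / slots)                 # how often the slots must be refilled
    task_minutes = (block_mb / 128) * min_per_128mb + setup_min
    return waves * task_minutes

for data_mb in (1280, 128_000):                      # 1.28GB and 128GB, as in the example
    for block_mb in (128, 256):
        print(f"{data_mb/1000:6.2f}GB @ {block_mb}MB blocks -> "
              f"{job_minutes(data_mb, block_mb):.0f} min")
```

Running it reproduces the numbers above: 6 vs. 11 minutes for the small job, 120 vs. 110 minutes for the large one.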
It all gets more complicated if you take into account Tez task reuse, compression, the type of analytics you run etc.
ORC, for example, already uses 256MB blocks by default because it can normally skip a lot of data internally. On the other hand, if you run heavy analytic tasks on smaller data (like data mining), a smaller block size might be better because your tasks will be heavily CPU bound and a single block could take a long time. So the answer, as usual, is:
It depends, and you have to try out what works in your specific scenario. 128MB is a good default, but 256MB might work as well. Or it might not.
For the rest: what Artem said. Replication factor of 3, really small files are bad, and HAR files can be a good way to work around them.
Created 02-11-2016 08:55 AM
1. 128MB block size
2. The larger the files, the better
3. Minimum replication factor of 3
It's also good to merge these small files, either with HAR or by rewriting them into bigger files (a rough sketch of the rewrite approach is below).
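As an illustration of the "rewrite into bigger files" option, here is a minimal local sketch that concatenates many small files into one larger file before it is put into HDFS. The paths and file pattern are hypothetical; in practice you might use `hadoop archive` (HAR) or a format-aware merge rather than a raw byte concatenation:

```python
import glob
import shutil

# Hypothetical staging directory with many small daily files. Merging them
# means the NameNode tracks one file instead of hundreds of tiny ones.
small_files = sorted(glob.glob("/data/staging/events-*.log"))

with open("/data/staging/events-merged.log", "wb") as merged:
    for path in small_files:
        with open(path, "rb") as part:
            shutil.copyfileobj(part, merged)   # stream-copy, no full read into memory

# Afterwards upload the merged file, e.g. with `hdfs dfs -put`.
```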
Created 02-11-2016 10:07 AM
Hi:
In which cases is it better to have a 256MB block size?
Created 02-11-2016 10:25 AM
@Benjamin Leonhardi Nicely explained!!! Thanks!
Created 02-11-2016 06:53 PM
(part 1 of 2) Just a quick comment on replication: It is completely separate from issues of block size and file size. It has been established (there's a white paper somewhere) that replication = 3 makes it astronomically unlikely that you'll ever actually lose data within a single site due to hardware failure, as long as
Created 02-12-2016 06:06 PM
I need to find that paper sometime. I actually had that question from a couple of customers. I was sure 3x is safe, but I didn't have any data to back me up.
Created 02-12-2016 07:21 PM
@Benjamin Leonhardi, please see this post: https://community.hortonworks.com/articles/16719/hdfs-data-durability-and-availability-with-replica..... Please up-vote it if you like it 🙂
Created 02-11-2016 06:54 PM
(part 2 of 2) Increasing replication to 4 or more only makes sense in DR situations, where you want, say, 2 copies at each of two sites, but then you should be using Falcon or similar to maintain the DR copies, not HDFS replication. On the other hand, replication = 2 definitely leaves your data vulnerable to "it can't happen to me" scenarios -- it only takes the wrong two disk drives to fail at the same time, and at some point they will. So decreasing replication below 3 only makes sense if it is low value data or you're using RAID or other highly reliable storage to protect your data.
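To make the 2-versus-3 intuition concrete, here is a small back-of-the-envelope sketch (my own illustration, not taken from the white paper). It assumes replicas are placed uniformly at random across disks, which is a simplification of HDFS's rack-aware placement, and the cluster size and failure count are made-up example numbers:

```python
from math import comb

def p_block_lost(total_disks, failed_disks, replication):
    """Probability that one block has *all* of its replicas on the failed disks,
    assuming uniformly random replica placement (a simplification of HDFS)."""
    if failed_disks < replication:
        return 0.0
    return comb(failed_disks, replication) / comb(total_disks, replication)

# Made-up example: 200 disks, 2 of them die before re-replication catches up.
for rep in (2, 3):
    print(f"replication={rep}: P(one block lost) = {p_block_lost(200, 2, rep):.2e}")
```

Even the tiny per-block probability at replication = 2 adds up once you have millions of blocks, whereas with replication = 3 you need three overlapping failures before any block is at risk.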
Created 02-12-2016 12:43 PM
Hi:
I don't understand why:
"With 256MB blocks you need only 5 tasks, but each takes 10 + 1 = 11 minutes, so the job is slower. Here 128MB blocks are faster."