If I have a 100gig data set and the same data set is hit concurrently is one of the options to increase the replication factor to support high concurrency hits? I have long understood this to be true but can't specifically articulate clearly why? Any details would be appreciated.
The Apache HDFS documentation states To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. I would guess that increasing the number of replicas increases the chances that the replica will reside close to the reader. Probably simplistic, but a logical guess.
Yes, that's correct. Additionally, more replicas increases the likelihood of a YARN container executing on the same node local to a replica. This would cause the HDFS client to invoke short-circuit read, almost completely bypassing TCP traffic with the DataNode process. It's a trade-off though, because increasing replication factor consumes more storage, which reduces the cluster's overall "logical" capacity. As usual, this kind of tuning benefits from experimentation and benchmarking to find out what makes the most sense for a particular workload.
Not sure how the effect would be:
Normally commonly needed files like libraries are stored with a high replication level so every task easily has access to them.
However Yarn actively puts tasks on local nodes. In other words you do not have to increase the replication level to make tasks local since tasks are placed TO the data by the system. Increasing the replication level might actually reduce the ability of the Operating System to cache the data in the OS read buffer. ( I.e. if two tasks run on the same node data will be read from the Linux FS buffer and not from disc while increasing the number of copies makes this much less likely.
On the other hand you give Yarn/Tez more possibilities to place and reuse containers so that might be good.
In the end you would have to try it out to see how the effects turn out.
Your usecase is something that would be very suited for LLAP by the way. It keeps an ORC cache in memory.
Thanks @Chris Nauroth
the NameNode can tell clients like MapReduce the locations of hosts that contain a replica of the block. MapReduce uses that information in its scheduling decisions, so if there are many MapReduce jobs that all need to read the same block, then having more replicas increases the chances for all of those concurrent tasks to be scheduled to a host containing a replica (and therefore getting a short-circuit read)