Created 08-30-2017 04:15 AM
Created 08-30-2017 12:28 PM
Data locality means moving computation rather than moving data to save the bandwidth.
This minimizes network congestion and increases the overall throughput of the system.
Created 08-30-2017 01:17 PM
As mentioned by @Laiba Khan data locality refers to moving compute to data which is typically faster than moving data to compute.
In Hadoop, data is divided into blocks and distributed across multiple servers (nodes). Additionally, it is replicated (typically 3 copies total) across these nodes. Thus, subsets of a dataset are distributed across nodes. When a map-reduce or Tez job is started, a container with code is distributed across the cluster nodes. These containers operate on the data in parallel and usually grab data blocks that are stored on the same node, thus achieving parallel processing with data locality. This results in fast overall execution of the full data set distributed across multiple nodes. This is key to operating on large volumes of data ... parallel processing is one component and processing data stored locally is another. Processing data that has to move across the network (no data locality) is slower.
Note that in cloud computing it is often advantageous NOT to have data locality. Local disks in the cloud are ephemeral ... if the (virtual) server is destroyed all data sitting on it are destroyed. Thus, putting data on local disks means you lose it when you spin down a cluster. One of the advantages to cloud is paying for servers only when you use them. Thus it is common to have scenarios when you spin up a cluster, do some processing and then spin it down (e.g. running a report or training a model in data science). In this scenario you would want your data stored on non-local data like AWS S3 object storage which is very inexpensive. This data persists separately from your cluster so only your compute is ephemeral. When you spin up a cluster, it reads from the permanent non-local storage and perhaps writes to it. You lose data locality but you gain the ability to pay for your cluster only when you use it. Compute on non-local data in this scenario is slower than local but not extremely so, especially when you scale out your cluster (more nodes) to increase the parallel processing aspect.