Created on 07-02-2016 06:17 PM - edited 08-17-2019 11:49 AM
Rack Awareness:
Rack awareness is knowledge of the cluster topology,
or more specifically of how the data nodes are distributed across the
racks of a Hadoop cluster. This knowledge matters because of the
assumption that data nodes collocated in the same rack have more
bandwidth and lower latency between them, whereas two data nodes in separate racks have comparatively
less bandwidth and higher latency.
The main purposes of rack awareness are:
Increasing the availability of data blocks
Better cluster performance
Let us assume the cluster has 9 Data Nodes with replication
factor 3.
Let us also assume that there are 3 physical racks where
these machines are placed:
Rack1: DN1;DN2;DN3
Rack2: DN4;DN5;DN6
Rack3: DN7;DN8;DN9
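To make the layout above concrete, here is a minimal sketch of a topology script, which is one common way Hadoop learns which rack a host belongs to (the script path is set through the `net.topology.script.file.name` property in core-site.xml). The host names DN1 to DN9 and the rack paths `/rack1` to `/rack3` are assumptions taken from this example; a real script would map the cluster's actual hostnames or IP addresses.

```python
#!/usr/bin/env python3
"""Hypothetical topology script for the 9-node example cluster."""
import sys

# Assumed host-to-rack mapping, mirroring the example layout above.
RACK_OF = {
    "DN1": "/rack1", "DN2": "/rack1", "DN3": "/rack1",
    "DN4": "/rack2", "DN5": "/rack2", "DN6": "/rack2",
    "DN7": "/rack3", "DN8": "/rack3", "DN9": "/rack3",
}

# Hosts that are not in the mapping fall back to the default rack.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_OF.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts as arguments and reads the
    # whitespace-separated rack paths from standard output.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Called with `DN1 DN5` as arguments, this sketch prints `/rack1 /rack2`; any host it does not know is reported as `/default-rack`.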
The following diagram depicts an example block placement
when HDFS and YARN are not rack aware:
What happens if Rack1 goes down?
The data in Block1 could be lost, because all of its replicas may sit in Rack1.
Without rack awareness, the entire cluster is
treated as if it were placed in a single rack, default-rack.
The following diagram depicts an example block placement
when HDFS and YARN are rack aware:
What happens if Rack1 goes down? We still have
the block replicas on data nodes in the other racks.
So rack awareness clearly increases data availability. The HDFS balancer and the decommissioning of data
nodes are also rack-aware operations.
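The availability argument can be checked with a few lines of Python. This is only a sketch: the two placements are hypothetical, chosen for illustration, and the rack-aware one follows the pattern of the default HDFS placement policy (one replica on the writer's node, the other two on different nodes of one other rack).

```python
# Rack layout from the example above.
RACKS = {
    "Rack1": ["DN1", "DN2", "DN3"],
    "Rack2": ["DN4", "DN5", "DN6"],
    "Rack3": ["DN7", "DN8", "DN9"],
}


def surviving_replicas(replica_nodes, failed_rack, racks=RACKS):
    """Return the replicas that are still reachable after failed_rack goes down."""
    lost_nodes = set(racks[failed_rack])
    return [node for node in replica_nodes if node not in lost_nodes]


# Hypothetical placements of Block1 (replication factor 3).
not_rack_aware = ["DN1", "DN2", "DN3"]  # all three replicas ended up in Rack1
rack_aware = ["DN1", "DN4", "DN5"]      # one replica in Rack1, two in Rack2

print(surviving_replicas(not_rack_aware, "Rack1"))  # []
print(surviving_replicas(rack_aware, "Rack1"))      # ['DN4', 'DN5']
```

With the first placement no replica of Block1 survives the loss of Rack1, while the rack-aware placement still leaves two replicas in Rack2.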
What about performance?
Faster replication.
Under the default HDFS placement policy, two of the three replicas are placed within the same rack, so the copy between them uses the higher
bandwidth and lower latency available inside a rack, which makes replication faster.
If YARN is unable to create a container on the
same data node where the queried data is located, it tries to create the
container on a data node within the same rack. This performs better
because of the higher bandwidth and lower latency between data nodes inside the
same rack.
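The container fallback described above can be sketched as a simple preference order. This is not YARN's scheduler code; the function names `locality` and `pick_node` are hypothetical, and the labels only mirror the usual YARN terminology of node-local, rack-local and off-switch placement.

```python
# Assumed host-to-rack mapping for the 9-node example cluster.
RACK_OF = {
    "DN1": "Rack1", "DN2": "Rack1", "DN3": "Rack1",
    "DN4": "Rack2", "DN5": "Rack2", "DN6": "Rack2",
    "DN7": "Rack3", "DN8": "Rack3", "DN9": "Rack3",
}

# Lower number means a better (closer) placement.
PREFERENCE = {"node-local": 0, "rack-local": 1, "off-switch": 2}


def locality(data_node, container_node, rack_of=RACK_OF):
    """Classify how close a container would run to the data it reads."""
    if data_node == container_node:
        return "node-local"
    if rack_of[data_node] == rack_of[container_node]:
        return "rack-local"
    return "off-switch"


def pick_node(data_node, free_nodes, rack_of=RACK_OF):
    """Pick the free node with the best locality; raises ValueError if none is free."""
    return min(free_nodes, key=lambda n: PREFERENCE[locality(data_node, n, rack_of)])


# The data lives on DN1, which is busy; DN3 shares its rack, so it wins.
print(pick_node("DN1", ["DN7", "DN3", "DN5"]))  # DN3
```

Without rack information every node other than DN1 would look equally distant, so the choice between DN3, DN5 and DN7 would be arbitrary.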
Series 2:
How can you set up Rack Awareness through
Ambari within a few minutes?