What do you mean by rack awareness in HDFS?
Rack awareness means knowing the cluster topology, or more specifically how the DataNodes are distributed across the racks of a Hadoop cluster. This knowledge matters because of the assumption that DataNodes collocated in the same rack have more bandwidth and lower latency between them, whereas two DataNodes in separate racks have comparatively less bandwidth and higher latency.
The following article provides a very detailed description of Rack Awareness. https://community.hortonworks.com/articles/43057/rack-awareness-1.html
The following link explains how it can be configured in HDP: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_configuri...
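To make the bandwidth assumption above concrete, here is a minimal Python sketch of a hop-count distance between two nodes in a rack tree. It follows the convention Hadoop documents for network distance (0 for the same node, 2 for two nodes in the same rack, 4 for nodes in different racks), but the function name `distance` and paths such as `/rack1/dn1` are illustrative assumptions, not Hadoop's API.

```python
def distance(a: str, b: str) -> int:
    """Count the hops from each node up to their closest common ancestor.

    Nodes are written as tree paths, e.g. '/rack1/dn1' (hypothetical names).
    """
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")

    # Length of the shared prefix (the common ancestors below the root).
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1

    # Hops up from a to the common ancestor, plus hops down to b.
    return (len(path_a) - common) + (len(path_b) - common)


same_node = distance("/rack1/dn1", "/rack1/dn1")
same_rack = distance("/rack1/dn1", "/rack1/dn2")
other_rack = distance("/rack1/dn1", "/rack2/dn3")
```

A smaller distance is taken as a proxy for more bandwidth and lower latency, which is why same-rack nodes are preferred.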
Rack awareness is the knowledge of the network structure (topology), i.e. the location of the different DataNodes across the Hadoop cluster. When reading or writing data in HDFS, the NameNode chooses a DataNode in the same rack or, if none is available, at least in a nearby rack. To do this, the NameNode maintains the rack ID of each DataNode. This process of choosing nearby DataNodes based on rack ID is called rack awareness. By default, Hadoop assumes that all DataNodes belong to the same rack.
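One common way to give the NameNode those rack IDs is a topology script, referenced by the `net.topology.script.file.name` property in core-site.xml. Hadoop calls the script with one or more host names or IP addresses as arguments and reads back one rack ID per argument. The following is only a sketch of such a script: the addresses and the `RACK_BY_HOST` table are hypothetical, and a real cluster would typically load the mapping from a file.

```python
#!/usr/bin/env python3
import sys

# Rack used for unknown hosts; matches Hadoop's default of a single rack.
DEFAULT_RACK = "/default-rack"

# Hypothetical host-to-rack table, for illustration only.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}


def resolve(hosts):
    """Return one rack ID per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print the rack IDs separated by whitespace, one per argument.
    print(" ".join(resolve(sys.argv[1:])))
```

Any host missing from the table falls back to `/default-rack`, which mirrors the default behaviour described above where every DataNode is treated as being in the same rack.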
Rack awareness is important for the following reasons:
• It ensures high data availability and reliability.
• It makes better use of network bandwidth by keeping traffic within a rack where possible.
• It increases cluster performance.
• It helps to recover data if a rack failure occurs. If the rack ID information is known, a backup node holding a replica on another rack can easily be located in case of rack failure.
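The rack failure point can be illustrated with a simplified sketch of HDFS's default placement for a replication factor of 3: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The code below is an assumption-laden toy, not Hadoop's implementation: it picks nodes deterministically, ignores load and free space, and assumes at least two racks with two nodes on the remote rack. The names `place_replicas` and `survivors` are hypothetical.

```python
def place_replicas(writer, racks):
    """Pick three (rack, node) targets for one block.

    writer: (rack_id, node) of the DataNode writing the block.
    racks:  dict mapping rack_id -> list of node names.
    """
    local_rack, local_node = writer

    # 1st replica: the writer's own node.
    first = (local_rack, local_node)

    # 2nd replica: a node on some other rack (first one in sorted order here).
    remote_rack = next(r for r in sorted(racks) if r != local_rack)
    second = (remote_rack, racks[remote_rack][0])

    # 3rd replica: a different node on the same remote rack.
    third = (remote_rack, racks[remote_rack][1])

    return [first, second, third]


def survivors(replicas, failed_rack):
    """Replicas still reachable after an entire rack goes down."""
    return [r for r in replicas if r[0] != failed_rack]


racks = {"/rack1": ["dn1", "dn2"], "/rack2": ["dn3", "dn4"]}
replicas = place_replicas(("/rack1", "dn1"), racks)
after_rack1_fails = survivors(replicas, "/rack1")
after_rack2_fails = survivors(replicas, "/rack2")
```

Whichever single rack fails, at least one replica remains on the other rack, so the block can still be read and re-replicated.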