Normaly, most softwares on top of HDFS reads only one replication in one time, so the memory will be charged with the real size of the data, not size*replication_factor. I think YARN service that concerned by the resource management take this role, and always the most free and fast datanode will be chosed to pull his data to memory to be treated by impalad.
Hey @Ravikiran ,
The answer is a bit nuanced.
When planning a query, Impala needs and uses the full directory and file layout information from the NameNode for all the tables used in the query. However, this information comes to the impalad indirectly.
1. All this information is initially collected by the catalogd. Catalogd queries HMS (the Hive Metastore) for the SQL schema and for the files/directories making up the SQL tables and partitions. Catalogd then queries the NameNode for the file names in these directories, and for the block layout and replication information for the files. Optionally, if you have security configured, catalogd also talks to Sentry to collect authorization information for the tables.
All this information is then replicated from the catalogd to the individual impalad daemons, because query planning happens on the impalads. When planning a query, the impalad considers where individual blocks of the HDFS files are located, because it tries to take advantage of local short-circuit reads when it plans the scans of those tables: the query planner tries to schedule scans on the data nodes containing those blocks, so that the scanner can read the blocks through the local file system.
All this information can require a lot of memory, so there are various solutions that help reducing this memory requirement. Starting from Impala 2.9.0 (available in CDH5.12.0) you can separate the roles of coordinator and executor for impala daemons: coordinators talk to clients, plan queries and keep track of parts of running queries (query fragments); executors just talk to coordinators and other executors (never to clients) and do the heavy lifting during query processing. Catalog information (SQL schema, file layout, etc) is needed only on the coordinators, so this role separation allow makes more memory available on executor nodes. It also reduces network loads, because the catalogd needs to send catalog updates only to the coordinator nodes, not to all the impalads.
This page describes coordinator/executor separation in more details: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/impala_dedicated_coordinator.html
Hope this helps,