New Contributor
Posts: 1
Registered: ‎04-25-2019

Impala - Table block loading in memory


Hi,

 
When a query executes in Impala, the NameNode sends details of the files and blocks to the impalad daemon, which loads them into memory.
 
Does the NameNode send details of all three HDFS replicas of the table's blocks (i.e., the default replication factor of 3) to the impalad? If not, which HDFS replica is sent to Impala for processing?
 
Regards
Expert Contributor
Posts: 136
Registered: ‎07-17-2017

Re: Impala - Table block loading in memory

Hi @Ravikiran

Normally, most software on top of HDFS reads only one replica at a time, so memory is charged with the real size of the data, not size × replication_factor. I think the YARN service, which handles resource management, takes this role, and the freest and fastest datanode is usually chosen to pull its data into memory for processing by impalad.

Good luck.

Cloudera Employee
Posts: 4
Registered: ‎11-03-2016

Re: Impala - Table block loading in memory


Hey @Ravikiran ,

 

The answer is a bit nuanced.

When planning a query, Impala needs the full directory and file layout information from the NameNode for all the tables used in the query. However, this information reaches the impalad indirectly.

1. All this information is initially collected by the catalogd. Catalogd queries HMS (the Hive Metastore) for the SQL schema and for the files/directories making up the SQL tables and partitions. Catalogd then queries the NameNode for the file names in these directories, and for the block layout and replication information of the files. Optionally, if you have security configured, catalogd also talks to Sentry to collect authorization information for the tables.
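As a hedged illustration of this catalogd metadata path, the Impala statements below tell catalogd to re-read the file and block information it caches (the table name `sales_db.orders` is a made-up example):

```shell
# Hypothetical table name; run from any node with impala-shell installed.
# REFRESH asks catalogd to re-read the file/block metadata for one table
# from the NameNode (incremental, relatively cheap):
impala-shell -q "REFRESH sales_db.orders"

# INVALIDATE METADATA discards the cached schema and file metadata entirely,
# so catalogd reloads it from HMS and the NameNode on next access:
impala-shell -q "INVALIDATE METADATA sales_db.orders"
```

Both statements exercise the same catalogd-to-HMS/NameNode flow described above; they require a running cluster, so they are shown for illustration only.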

2. All this information is then replicated from the catalogd to the individual impalad daemons, because query planning happens on the impalads. When planning a query, the impalad considers where the individual blocks of the HDFS files are located, because it tries to take advantage of local short-circuit reads: the query planner tries to schedule scans on the data nodes containing those blocks, so that the scanner can read the blocks through the local file system.
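To see the block-to-datanode layout that the planner uses for this locality-aware scheduling, you can inspect a table's files directly with `hdfs fsck` (the warehouse path below is a made-up example; yours will differ):

```shell
# Hypothetical table directory; prints each block of each file together with
# the list of datanodes holding its replicas -- the same layout information
# the Impala planner consults when assigning scans to hosts.
hdfs fsck /user/hive/warehouse/sales_db.db/orders -files -blocks -locations
```

This requires a running HDFS cluster, so it is shown for illustration only.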

 

All this information can require a lot of memory, so there are various features that help reduce this memory requirement. Starting with Impala 2.9.0 (available in CDH 5.12.0), you can separate the coordinator and executor roles of the Impala daemons: coordinators talk to clients, plan queries, and keep track of the parts of running queries (query fragments); executors talk only to coordinators and other executors (never to clients) and do the heavy lifting during query processing. Catalog information (SQL schema, file layout, etc.) is needed only on the coordinators, so this role separation makes more memory available on executor nodes. It also reduces network load, because the catalogd needs to send catalog updates only to the coordinator nodes, not to all the impalads.
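A minimal sketch of the role separation described above, using the impalad startup flags from the dedicated-coordinator documentation (other flags your deployment needs are omitted):

```shell
# On designated coordinator nodes: accept clients and plan queries,
# but do not execute query fragments.
impalad --is_coordinator=true --is_executor=false

# On executor-only nodes: no client connections, no catalog cache;
# only execute query fragments assigned by coordinators.
impalad --is_coordinator=false --is_executor=true
```

With this split, only the coordinator nodes hold the replicated catalog metadata, which is the memory saving the post describes.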

 

This page describes coordinator/executor separation in more detail: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/impala_dedicated_coordinator.html

 

Hope this helps,

 

   - Laszlo