Member since: 01-19-2017
Posts: 3676
Kudos Received: 632
Solutions: 372
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 482 | 06-04-2025 11:36 PM |
| | 1011 | 03-23-2025 05:23 AM |
| | 536 | 03-17-2025 10:18 AM |
| | 1989 | 03-05-2025 01:34 PM |
| | 1257 | 03-03-2025 01:09 PM |
07-02-2021
03:03 AM
@mike_bronson7 Waiting for your response with the logs.
07-02-2021
02:51 AM
@dooby There is a Jira for this; see the solution in SPARK-32536: https://issues.apache.org/jira/browse/SPARK-32536
07-01-2021
01:12 PM
@Faizan123 The NameNode [master] and DataNode [slave] are part of HDFS, the storage layer, while the ResourceManager [master] and NodeManager [slave] are part of YARN, the resource negotiator. HDFS and YARN normally work together, but they are quite independent in design and architecture; it is their slave processes, the DataNode and the NodeManager, that run together on the compute nodes. At a high level, the NameNode and ResourceManager are the two master processes, the latter being the "brain" of Hadoop. Below is a standard layout of a Hadoop cluster, though we could easily have added a second ResourceManager for HA. On the 12 compute nodes the NodeManager and DataNode are co-located for localized processing; it would be illogical to separate the DN and NM onto different nodes.

The NodeManager is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster: it updates the ResourceManager (RM) with the status of jobs running on the node, oversees the containers' life-cycle management, monitors the resource usage (memory, CPU) of individual containers, tracks node health, manages logs, and provides auxiliary services that may be exploited by different YARN applications.

DataNodes store data in a Hadoop cluster; DataNode is also the name of the daemon that manages the data. File data is replicated on multiple DataNodes for reliability and so that localized computation can be executed near the data. That is why the DN and NM are co-located on the same VM/host. It would be very interesting to see a screenshot of the roles co-located with your data nodes. Hope that gives you a clearer picture.
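As a quick sanity check of that co-location, you could compare the `jps` output per worker host. This is only a sketch: the process IDs and host name are made up, since the real output depends on your cluster.

```shell
# Simulated `jps` output from one worker host; on a real cluster you
# would run something like:  ssh worker01 jps
jps_output="18231 DataNode
18502 NodeManager
19044 Jps"

# Both daemons should appear on every compute node
if echo "$jps_output" | grep -q "DataNode" && echo "$jps_output" | grep -q "NodeManager"; then
  echo "DataNode and NodeManager are co-located: OK"
fi
```

If either daemon is missing from a worker, that node is serving only one of the two layers and you lose data locality for containers scheduled there.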
07-01-2021
12:12 PM
@harsh8 Any updates? Please let me know if you still need help. Happy hadooping!
07-01-2021
11:23 AM
1 Kudo
@drgenious First, Impala shares metadata [data about data] with the Hive Metastore (HMS). Impala uses HDFS caching to provide performance and scalability benefits in production environments where Impala queries and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. Data read from the HDFS cache also avoids the overhead of checksumming and memory-to-memory copying involved when using data from the Linux OS cache.

That said, when you restart Impala you discard all of the cached metadata [table locations, permissions, query execution plans, statistics] that makes it efficient, which explains why your queries are so slow after a restart. Impala is very efficient when it reads data that is pinned in memory through HDFS caching: it takes advantage of the HDFS API and reads the data from memory rather than from disk, whether the data files are pinned using Impala DDL statements or using the command-line mechanism where you specify HDFS paths. There is no better source of Impala information than Cloudera, so I urge you to take the time to read the documentation below to pin the option in your memory 🙂

Using HDFS Caching with Impala
Configuring HDFS Caching for Impala

There are two other options you should think of that are far less expensive than restarting Impala (I can't imagine you have more than 70 data nodes):

INVALIDATE METADATA is an asynchronous operation that simply discards the loaded metadata from the catalog and coordinator caches. After that operation, the catalog and all the Impala coordinators only know about the existence of databases and tables and nothing more; metadata loading for tables is triggered by any subsequent queries.

REFRESH reloads the metadata synchronously. REFRESH is more lightweight than doing a full metadata load after a table has been invalidated. However, REFRESH cannot detect changes in block locations triggered by operations like the HDFS balancer, hence causing remote reads during query execution with negative performance implications.

The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, prefer REFRESH over INVALIDATE METADATA when possible. INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive clients such as SparkSQL:

- Metadata of existing tables changes.
- New tables are added, and Impala will use the tables.
- The SERVER or DATABASE level Sentry privileges are changed.
- Block metadata changes, but the files remain the same (HDFS rebalance).
- UDF jars change.
- Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.

No INVALIDATE METADATA is needed when the changes are made by impalad. I hope that explains why, and gives you options to use rather than a warm restart of Impala. If you know which table you want to query, run this beforehand with a qualified db.table name; it has saved me time with my data scientists, and encapsulating it in their scripts is a good thing:

INVALIDATE METADATA [[db_name.]table_name]

Recomputing the statistics is another solution:

COMPUTE STATS <table_name>;

The COMPUTE STATS statement gathers information about the volume and distribution of data in a table and all associated columns and partitions. The information is stored in the Hive metastore database and used by Impala to help optimize queries. Hope that enlightens you.
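To make the lightweight-first approach concrete, here is a hedged sketch of the sequence in impala-shell; the database and table names (`sales_db.transactions`) are made up for illustration:

```sql
-- Prefer REFRESH when data changed in a table Impala already knows about
REFRESH sales_db.transactions;

-- Fall back to a table-qualified INVALIDATE METADATA only when the change
-- happened outside Impala (e.g. via Hive or SparkSQL)
INVALIDATE METADATA sales_db.transactions;

-- Recompute statistics so the planner has fresh row and column stats
COMPUTE STATS sales_db.transactions;
```

Qualifying the table name keeps the operation scoped to one table instead of invalidating the entire catalog.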
07-01-2021
04:44 AM
@mike_bronson7 Can you share the files below from /var/log/hadoop-yarn/yarn? hadoop-yarn-resourcemanager-{hostname}.log and hadoop-yarn-resourcemanager-{hostname}.out. Happy hadooping!
06-30-2021
03:49 PM
1 Kudo
@dmharshit It's difficult to explain in 3 minutes, but the capacity scheduler in YARN allows multi-tenancy of the Hadoop cluster, where multiple users can share a large cluster. When every company runs its own private cluster, the result is poor resource utilization: the cluster may provide enough resources to meet peak demand, but that peak may not occur very often, leaving resources idle the rest of the time. Sharing clusters among companies is therefore a more cost-effective idea. However, companies are concerned about sharing a cluster because they worry they may not get enough resources at the time of peak utilization. The CapacityScheduler in YARN mitigates that concern by giving each company capacity guarantees.

Capacity scheduler functionality in YARN: the capacity scheduler in Hadoop works on the concept of queues. Each department gets its own dedicated queue with a percentage of the total cluster capacity for its own use. For example, if two departments share the cluster, one department may be given 60% of the cluster capacity and the other 40%. On top of that, to provide further control and predictability over the sharing of resources, the CapacityScheduler supports hierarchical queues: a company can further divide its allocated cluster capacity into separate sub-queues for separate sets of users within a department.

The capacity scheduler is also flexible and allows the allocation of free resources to any queue beyond its configured capacity. This provides elasticity for the companies in a cost-effective manner: when the queue to which those resources actually belong has increased demand, the resources are allocated back to it as they are released from other queues. This is a fantastic write-up: YARN the Capacity Scheduler.

The maximum capacity is an elastic-like capacity that allows queues to make use of resources that are not being used to fill minimum-capacity demand in other queues. Child queues, as in the figure above, inherit the resources of their parent queue. For example, within the Preference branch, the Low leaf queue gets 20% of Preference's 20% minimum capacity, while the High leaf gets 80% of that 20%. Minimum capacity always has to add up to 100% across all the leaves under a parent. I didn't have the opportunity tonight to build a cluster to mirror the above setup and share the capacity scheduler config to give you a better understanding.
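As a hedged sketch of what that config could look like, here is a capacity-scheduler.xml fragment mirroring the Preference/Low/High example; the queue names and the 40% maximum-capacity value are illustrative assumptions, while the property keys are the standard CapacityScheduler ones:

```xml
<!-- capacity-scheduler.xml sketch: queue names are illustrative only -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,preference</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.preference.capacity</name>
  <value>20</value> <!-- Preference's guaranteed minimum share -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.preference.maximum-capacity</name>
  <value>40</value> <!-- elastic ceiling when other queues are idle -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.preference.queues</name>
  <value>low,high</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.preference.low.capacity</name>
  <value>20</value> <!-- 20% of Preference's 20% -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.preference.high.capacity</name>
  <value>80</value> <!-- 80% of Preference's 20% -->
</property>
```

Note that sibling capacities (default + preference, and low + high) each sum to 100%, which the scheduler requires at every level of the hierarchy.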
06-30-2021
08:48 AM
@mike_bronson7 Here you go: how to determine YARN and MapReduce memory configuration settings. Happy hadooping!
06-29-2021
08:11 AM
1 Kudo
@noamsh88 Check out these Cloudera APIs: http://cloudera.github.io/cm_api/apidocs/v18/path__cm_license.html Hope that helps.
06-29-2021
07:57 AM
2 Kudos
@mike_bronson7 Migrating from 190 to 220 nodes adds 30 DataNodes and NodeManagers. Theoretically, the ResourceManager is responsible for resource management and consists of two components, the scheduler and the application manager:

The scheduler allocates resources: it has extensive information about an application's resource needs, which allows it to make scheduling decisions across all applications in the cluster. Essentially, an application can make specific resource requests, via the YARN application master, to satisfy its resource needs. The scheduler responds to a resource request by granting a container, which satisfies the requirements laid out by the application master in the initial resource request.

The application manager accepts job submissions, negotiates the first container to execute the application-specific application master, and restarts the application master container on failure.

Node managers: a node manager is a per-machine or per-VM framework agent responsible for managing the resources available on a single node. It monitors resource usage for containers and reports to the scheduler within the resource manager. You can have multiple node managers; just ensure you have the required memory reserved for the host OS functionality.

Just like NameNodes, Resource Managers need to be high-spec servers. If you can stick to some basics like the table below, your 2 RMs can handle 200 nodes with ease. Take into consideration the NN, ZK, and HBase memory configs. I remember running 300 data nodes/node managers in a project two years ago with exactly this setup. Hope that helps.
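For the per-node side of that reservation, here is a hedged yarn-site.xml sketch; the values assume a hypothetical 64 GB / 16-core worker with roughly 8 GB and 2 cores held back for the OS, DataNode, and other daemons, so adjust them to your actual hardware:

```xml
<!-- yarn-site.xml sketch: values are illustrative for a 64 GB, 16-core worker -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value> <!-- 56 GB usable by containers; ~8 GB reserved for host -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>14</value> <!-- 2 cores reserved for OS and co-located daemons -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value> <!-- smallest container the RM will grant -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>57344</value> <!-- largest single container, capped at node memory -->
</property>
```

Keeping the per-node reservation consistent across all 220 workers is what lets the scheduler reason accurately about cluster-wide headroom.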