06-10-2023
03:46 AM
Summary
Cloudera Manager is the best-in-class holistic interface that provides end-to-end system management and key enterprise features, giving you granular visibility into and control over every part of an enterprise data hub. From time to time, you’ll need to increase the logging level when troubleshooting issues within Cloudera Manager.
Change your Cloudera Manager Server logging level
The Cloudera Manager Server itself must be configured from the command line on the Cloudera Manager Server host.
Log in as root to your Cloudera Manager Server:
mbush@mbush-MBP16 CDSW % ssh root@<CM SERVER FQDN>
Check the Cloudera Manager Server log4j file:
[root@<CM SERVER FQDN> ~]# cat /etc/cloudera-scm-server/log4j.properties
# Copyright (c) 2012 Cloudera, Inc. All rights reserved.
#
# !!!!! IMPORTANT !!!!!
# The Cloudera Manager server finds its log file by querying log4j. It
# assumes that the first file appender in this file is the server log.
# See LogUtil.getServerLogFile() for more details.
#
# Define some default values that can be overridden by system properties
cmf.root.logger=INFO,CONSOLE
cmf.log.dir=.
cmf.log.file=cmf-server.log
cmf.perf.log.file=cmf-server-perf.log
cmf.jetty.log.file=cmf-server-nio.log
..
..
..
The key setting that controls the logging level of the Cloudera Manager Server is the cmf.root.logger line shown above.
This can be done by amending the cmf.root.logger parameter and restarting the CM Server:
cmf.root.logger=DEBUG,CONSOLE
#RESTART THE CLOUDERA MANAGER SERVICE TO COMMIT THE AMENDMENT
[root@<CM SERVER FQDN> ~]# systemctl restart cloudera-scm-server
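If you prefer to script the change, here is a minimal sketch (it assumes the stock log4j.properties layout shown above; adjust if your file differs):
#SWITCH THE ROOT LOGGER FROM INFO TO DEBUG IN PLACE (A .bak BACKUP IS KEPT)
[root@<CM SERVER FQDN> ~]# sed -i.bak 's/^cmf.root.logger=INFO,CONSOLE/cmf.root.logger=DEBUG,CONSOLE/' /etc/cloudera-scm-server/log4j.properties
[root@<CM SERVER FQDN> ~]# systemctl restart cloudera-scm-server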
If you would like to know more about controlling further elements, refer to another very helpful Cloudera Community page - How to enable debug logging for Cloudera Manager server. That page also goes into more detail about how to amend the logging level at the CM Server binary level (useful if you cannot get the CM Server to start at all).
06-10-2023
03:41 AM
Summary
Cloudera Manager is the best-in-class holistic interface that provides end-to-end cluster management and key enterprise features, giving you granular visibility into and control over every part of an open data lakehouse. The optimization steps below complement Cloudera’s Optimize the Cloudera Manager Server page.
Investigation & Resolution
Monitor your Cloudera Manager Server Heap
We have provided a useful suite of Cloudera Manager dashboards in the blog Deploy your Cloudera Manager Dashboards. The dashboard called “MB - MGMT Cluster - JVM GC Sizing” includes a set of charts that focus on the Cloudera Manager Server heap, enabling you to easily visualize abnormal heap characteristics over, for example, a 6-hour time-series window.
Tune your Cloudera Manager Server Heap
The Cloudera Manager Server itself must be configured from the command line on the Cloudera Manager Server host.
Log in as root to your Cloudera Manager Server:
mbush@mbush-MBP16 CDSW % ssh root@<CM SERVER FQDN>
Check the Cloudera Manager Server config file:
[root@<CM SERVER FQDN> ~]# cat /etc/default/cloudera-scm-server
#
# Specify any command line arguments for the Cloudera SCM Server here.
#
CMF_SERVER_ARGS=""
#
# Locate the JDBC driver jar file.
#
# The default value is the default system mysql driver on RHEL/CentOS/Ubuntu
# and the standard, documented location for where to put the oracle jar in CM
# deployments.
#
export CMF_JDBC_DRIVER_JAR="/usr/share/java/mysql-connector-java.jar:/usr/share/java/oracle-connector-java.jar:/usr/share/java/postgresql-connector-java.jar"
#
# You can override JAVA_HOME here if your java is not on the normal search path
# export JAVA_HOME=/usr/java/default
#
# Java Options.
#
# Default value sets Java maximum heap size to 2GB, and Java maximum permanent
# generation size to 256MB.
#
export CMF_JAVA_OPTS="-Xmx4G -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
The key line for tuning the Cloudera Manager Server heap is the CMF_JAVA_OPTS export shown above.
Above, you can see the default settings that we get with any vanilla deployment of the Cloudera Manager service. These parameters will need to be tuned as your Cloudera CDP cluster becomes larger and busier, and that tuning is particularly important when any serious use of the Cloudera Manager API is introduced.
We recommend that you suitably raise the overall heap size based on your cluster size and CM API needs, disable adaptive heap sizing (by setting -Xms equal to -Xmx), and control the JVM heap ratio (-XX:NewRatio).
This can be done by amending the CMF_JAVA_OPTS and restarting the CM Server:
export CMF_JAVA_OPTS="-Xms16G -Xmx16G -XX:NewRatio=2 -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
#RESTART THE CLOUDERA MANAGER SERVICE TO COMMIT THE AMENDMENT
[root@<CM SERVER FQDN> ~]# systemctl restart cloudera-scm-server
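After the restart, it is worth confirming that the new JVM options were picked up. One way to check is sketched below; the grep pattern simply lists the heap flags of every running java process, so identify the CM Server process among them:
#CONFIRM THE CM SERVER IS RUNNING AND THAT THE NEW -Xms/-Xmx VALUES ARE IN EFFECT
[root@<CM SERVER FQDN> ~]# systemctl status cloudera-scm-server
[root@<CM SERVER FQDN> ~]# ps -ef | grep java | grep -o '\-Xm[sx][0-9]*[GgMm]' | sort | uniq -c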
06-10-2023
03:36 AM
Summary
After the upgrade from CDH to CDP, fundamental instability was observed within the Ranger Audits UI (within the Ranger Admin Service), and Infra-Solr Roles were constantly exhibiting API liveness errors.
Total audits daily count: 136,553,808
Sample screenshot of Infra-Solr Health check errors
Sample screenshot of specific Infra-Solr Server Health check errors
Investigation
Initial analysis of the number of daily audits within the Ranger service confirmed that there were as many as 1B audits per day. With only 2 Infra-Solr servers, the default configuration produced by a CDH to CDP upgrade needed to be tuned to follow best practices for the Infra-Solr ranger_audits collection:
Reduce Ranger Audits verbosity (see this complementary document: Ranger Audit Verbosity).
Assess the server design. The count of Infra-Solr servers matters when deciding how the ranger_audits collection should be built. A single Solr Server is not recommended, as it is not resilient. Plan for at least 2 replicas per shard for any collection; this allows the ranger_audits collection to be split into 6 shards with 2 replicas each while still following Solr best-practice guidelines.
Resolution
The following public documentation will assist with a deeper understanding of how you might choose to align with best practices given the hardware you have available and the volume of audits being recorded within the service - Calculating Infra Solr resource needs.
Configure the following 3 parameters within the Ranger Service (within CM) according to best practices. The example below is for a cluster of 3 Infra-Solr servers, with 3 shards configured for the ranger_audits collection, 2 replicas per shard, and the maximum number of shards limited to 6 (the product of the first two parameters):
Configure the TTL (Time To Live) for audits that are propagated into the Infra-Solr ranger_audits collection. This retention period should be defined by the business (for instance, 25 days). Note that TTL only affects audit visibility within the Ranger UI; all audits remain accessible within HDFS.
Ranger - Delete ranger_audits collection
Ensure all Solr Servers are healthy and available. Then, in order to restructure it, fully delete the ranger_audits collection and monitor the status (example below).
NOTE - the date of 8Mar2022 in the example below is included for audit-trail purposes - it’s the date on which the full collection deletion occurred.
DELETE RANGER AUDITS COLLECTION
http://<Infra-Solr-Server>:18983/solr/admin/collections?action=DELETE&name=ranger_audits&async=del_ranger_audits8Mar2022
REQUEST THE STATUS OF AN ASYNC CALL
http://<Infra-Solr-Server>:18983/solr/admin/collections?action=REQUESTSTATUS&requestid=del_ranger_audits8Mar2022
An example of a successful delete command issued to the URL:
{
"responseHeader":{
"status":0,
"QTime":9},
"requestid":"del_ranger_audits20Apr2022"}
Restart the Ranger Admin service. The restart will recreate the ranger_audits collection based on the parameters defined earlier.
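As a sanity check after the restart, you can confirm that the collection was recreated with the expected shard and replica layout. These are standard Solr Collections API calls, shown against the same placeholder host and port used above:
VERIFY THE COLLECTION EXISTS
http://<Infra-Solr-Server>:18983/solr/admin/collections?action=LIST
CHECK THE SHARD / REPLICA LAYOUT
http://<Infra-Solr-Server>:18983/solr/admin/collections?action=CLUSTERSTATUS&collection=ranger_audits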
06-10-2023
03:30 AM
Summary
Infra-Solr service exhibits fundamental stability issues after upgrading CDH to CDP.
Sample screenshot of Infra-Solr Health check errors
Sample screenshot of specific Infra-Solr Server Health check errors
Investigation
The Infra-Solr service hosts the ranger_audits collection which is used to display cluster audit information within the Ranger Admin UI. Perform preliminary analysis using Ranger Admin UI - Audits for a single day as demonstrated below. [NOTE: these sample screenshots were taken after resolving the issues; your audit counts will likely be much higher].
Total audits daily count: 136,553,808
Total Impala audits daily count: 5,901,146
Total hbaseregional audits daily count: 1,178,831
Total hbaseregional (access type scanneropen) audits daily count: 0
(due to the complete exclusion of these events)
Total hdfs audits daily count: 128,681,418
Total hdfs (access type liststatus) audits daily count: 0
(due to the complete exclusion of these events)
Assemble and analyze audit counts. The actual pre-resolution values for this case study were:
Total number of Ranger audits - 705,875,710
Application - Impala - 6,719,878
Application - hbaseRegional - 389,896,166
Application - hbaseRegional; Access Type - scannerOpen - 261,735,436
Application - hdfs - 308,644,209
Application - hdfs; Access Type - listStatus - 212,728,345
The total count of Ranger audits (~700M per day) is excessively voluminous. Audit verbosity is a primary contributing factor to Infra-Solr service instability because Ranger audits are stored within an Infra-Solr collection (ranger_audits) and presented within the Ranger Admin UI. The ranger_audits collection was overwhelming the Infra-Solr Servers, leading to Web Server Status Unknown / API Liveness check failures.
To reduce audit verbosity, identify meaningful and meaningless events using the Infra-Solr API.
URL examples for reference only:
Query by date/time range
http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=evtTime:[2022-02-16T00:00:00.000Z+TO+2022-02-16T11:59:59.000Z]&sort=evtTime+desc
select all: oldest
http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+asc&rows=1000
select all: newest
http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1000
Curl examples for reference only:
> Query by date/time range && number of rows to capture (important)
> -g required to disable globbing of the date range
> This is verbose
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=evtTime:[2022-02-17T00:00:00.000Z+TO+2022-02-17T11:59:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows.text
> This is the above query, narrowed down to fewer fields (only those you want to see)
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Cenforcer%2Cagent%2Crepo%2CreqUser%2Cresource%2Caction&q=evtTime:[2022-02-17T00:00:00.000Z+TO+2022-02-17T11:59:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows.text
> This is the above query, narrowed down to even fewer fields
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T00:00:00.000Z+TO+2022-02-17T11:59:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows.text
> Select all: oldest
curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+asc&rows=1000" > RangerAuditSolrOutput17Feb22.text
> Select all: newest
curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1000" > RangerAuditSolrOutput17Feb22.text
In this case study, 48 curl commands were executed to get a balanced picture over a 24-hour period, pulling 100,000 audit events every 30 minutes (a scripted version of this loop is sketched after the examples below).
NOTE: The Infra-Solr server must render the output, and 100,000+ events can easily crash a 30GB Infra-Solr Server; do not pull more than that for a single time interval.
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T23:30:00.000Z+TO+2022-02-17T23:59:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_2330-2359.text
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T23:00:00.000Z+TO+2022-02-17T23:29:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_2300-2329.text
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T22:30:00.000Z+TO+2022-02-17T22:59:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_2230-2259.text
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T22:00:00.000Z+TO+2022-02-17T22:29:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_2200-2229.text
..
REPEAT THE COMMANDS WITH RELEVANT EXAMPLES
..
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T01:30:00.000Z+TO+2022-02-17T01:59:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_0130-0159.text
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T01:00:00.000Z+TO+2022-02-17T01:29:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_0100-0129.text
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T00:30:00.000Z+TO+2022-02-17T00:59:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_0030-0059.text
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[2022-02-17T00:00:00.000Z+TO+2022-02-17T00:29:59.999Z]&rows=100000&sort=evtTime+desc" > RangerAuditSolrOutput17Feb22_100000Rows_0000-0029.text
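Rather than typing 48 commands by hand, a small loop can generate the same half-hour windows. This is a minimal bash sketch; it assumes GNU date and reuses the host, field list, and row limit from the examples above, and the output file names are illustrative:
#!/bin/bash
# Pull 100,000 audit events per 30-minute window across 2022-02-17 (48 windows in total)
DAY_START=$(date -u -d "2022-02-17T00:00:00Z" +%s)
for i in $(seq 0 47); do
  WINDOW_START=$(( DAY_START + i * 1800 ))
  START=$(date -u -d "@${WINDOW_START}" +%Y-%m-%dT%H:%M:%S.000Z)
  END=$(date -u -d "@$(( WINDOW_START + 1799 ))" +%Y-%m-%dT%H:%M:%S.999Z)
  curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[${START}+TO+${END}]&rows=100000&sort=evtTime+desc" > "RangerAuditSolrOutput17Feb22_100000Rows_window${i}.text"
done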
The 48 output files were simply parsed to ascertain the most frequent Ranger audit access types (see the example below when creating your own):
grep access RangerAuditSolrOutput17Feb22* | sort -rn | uniq -c | sort -rn | awk -F ' ' '{sum+=$1;}END{print sum;}'
egrep "listStatus|scannerOpen" RangerAuditSolrOutput17Feb22* | sort -rn | uniq -c | sort -rn | awk -F ' ' '{sum+=$1;}END{print sum;}'
This example groups audit types by category to assist you in selecting what is meaningful and what is not:
grep access RangerAuditSolrOutput17Feb22_100000Rows.text | sort -rn | uniq -c | sort -rn
50517 "access":"listStatus",
23782 "access":"scannerOpen",
14559 "access":"get",
5193 "access":"put",
2081 "access":"open",
1884 "access":"delete",
1394 "access":"WRITE",
336 "access":"rename",
126 "access":"contentSummary",
84 "access":"checkAndPut",
26 "access":"mkdirs",
6 "access":"compactSelection",
5 "access":"flush",
4 "access":"getAclStatus",
3 "access":"compact",
In this case study, up to 1B audit events were being recorded per day, with 65-70% coming from HDFS listStatus and HBase scannerOpen operations. Such pure metadata-operation events were of no value to DevOps; nevertheless, we verified that they were also of no value to the business before excluding them. Retain the ‘get’, ‘put’, ‘open’, ‘delete’, and other key audits.
Assess the Infra-Solr & ranger_audits collection design - the Infra-Solr server count and the shard and replica counts play an important role in stability. This complementary document covers those assessment steps: Ranger - Rebuild ranger_audits.
Resolution
Tune Ranger to exclude unwanted event collection.
Edit the cm_hdfs service configuration:
Exclude the ‘listStatus’ audit type from the ‘Audit Filter’ section:
Edit the cm_hbase service configuration:
Exclude the ‘scannerOpen’ audit type from the ‘Audit Filter’ section:
Excluding these low-value events provided 3 benefits:
The stability and manageability of Infra-Solr and the ranger_audits collection were greatly improved.
Infra-Solr and the ranger_audits collection required only 30-35% of the resources to perform the same tasks.
Ranger audit history required only 30-35% of the HDFS disk space when writing to /ranger/….
06-10-2023
03:15 AM
Summary
It is always a good idea to review your Kudu Rebalancer settings so that all hardware is optimally utilized when Kudu Rebalancing activities are being performed.
Investigation
Kudu Configuration
Balancer configuration properties
Although the general kudu default parameters have not proven to adversely impact Kudu Rebalancing operations, the following property change is recommended to speed up that process.
Property                | Default | Cloudera Chosen Value
------------------------+---------+----------------------
rb_max_moves_per_server | 5       | 10
Avoid Landmines
Some key notes before performing the rebalancing activities after setting up the services/disks:
Never run both the HDFS & Kudu Rebalancers at the same time
The contention between the two may cause issues
Perform the rebalancing activities in the order of Kudu first, HDFS second
This is because the Kudu rebalancer is unable to take disk capacity utilization into account
Performing Kudu Rebalancing Activities
We recommend that you perform these actions from within CM to provide full visibility into the Rebalancer status as well as when the action has started and finished.
Kudu
Go to CM - Kudu - Actions - Run Kudu Rebalancer Tool
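If you prefer to run the rebalancer from the command line instead (for example, to pass the tuned move limit explicitly), the sketch below uses the standard kudu CLI; verify the flag name against your Kudu version before relying on it:
#RUN THE KUDU REBALANCER BY HAND WITH THE TUNED MOVE LIMIT (MASTER FQDNs ARE PLACEHOLDERS)
sudo -u kudu kudu cluster rebalance <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> -max_moves_per_server=10 2>&1 | tee kudu-rebalance.out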
06-10-2023
03:10 AM
Summary
Are you having issues with too many queries being handled by a single Impala Coordinator?
Does this eventually lead to OOM scenarios?
Let’s say you have 3 Impala Coordinators within your cluster and notice that queries skew onto one of them and overwhelm it.
Note how one of the Impala Coordinators in the example above has 73 running queries while the other 2 have relatively few.
Investigation
Source IP Persistence
To ascertain why any Impala Coordinator can skew the number of running queries that are active on it, look at the way the proxy is set up to handle incoming queries.
‘Source IP Persistence’ means configuring the proxy so that sessions from the same IP address always go to the same coordinator. This setting is required when setting up high availability with Hue. It is also required to avoid the Hue message ‘results have expired’, which appears when a query is submitted to the cluster through one coordinator but the results are fetched via a different coordinator/Hue Server.
Example HAProxy Configuration for Source IP Persistence
The public docs for setting up HAProxy for Impala - Configuring Load Balancer for Impala.
Example setup of Hue-Impala connectivity within /etc/haproxy/haproxy.cfg as follows:
listen impala-hue :21052
    mode tcp
    stats enable
    balance source
    timeout connect 5000ms
    timeout queue 5000ms
    timeout client 3600000ms
    timeout server 3600000ms
    # Impala Nodes
    server impala-coordinator-001.fqdn impala-coordinator-001.fqdn:21050 check
    server impala-coordinator-002.fqdn impala-coordinator-002.fqdn:21050 check
    server impala-coordinator-003.fqdn impala-coordinator-003.fqdn:21050 check
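After amending haproxy.cfg, you can validate and apply the change without dropping existing connections. These are standard HAProxy/systemd commands; the configuration path may differ on your distribution:
#VALIDATE THE CONFIGURATION FILE, THEN RELOAD HAPROXY
haproxy -c -f /etc/haproxy/haproxy.cfg
systemctl reload haproxy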
Now let’s review what can impact the overall connection count into an Impala Coordinator: Hue, Hive & Impala timeout settings.
Example Timeout Settings
The following settings might mimic what you have currently set within your Hue, Hive & Impala services.
Hue
Hive
Impala
Proposed Timeout Settings
Whilst the actual settings will vary from cluster to cluster, we recommend moving away from the defaults and setting all of the idle parameters to 2 hours across the board in all 3 services: Hue, Hive & Impala.
This is an initial goal: introduce timeouts whilst monitoring the user experience. The ultimate best practice in this area is to work toward having:
Idle Query Timeouts of 300 seconds (or 5 minutes)
Idle Session Timeouts of 600 seconds (or 10 minutes)
NOTE - all of the parameters discussed here relate to ‘idle’ sessions and queries; in other words, the user has to have left the session or query in an idle state before the idle parameters kick in. No active session or query will be affected by this change in service behavior.
Resolution
Hue
Steps to perform:
Go to CM - Hue - Configuration
Search for “Auto Logout Timeout”
Change to 2 hours
Restart Hue Service
Hive
Steps to perform:
Go to CM - Hive - Configuration
Search for “Idle Operation Timeout”
Change to 300 seconds
Search for “Idle Session Timeout”
Change to 600 seconds
Restart Hive Service
Hive on Tez
Steps to perform:
Go to CM - Hive on Tez - Configuration
Search for “Idle Operation Timeout”
Change to 300 seconds
Search for “Idle Session Timeout”
Change to 600 seconds
Restart Hive on Tez Service
Impala
Steps to perform:
Go to CM - Impala - Configuration
Search for “Idle Query Timeout”
Change to 300 seconds
Search for “Idle Session Timeout”
Change to 600 seconds
Restart Impala Service
06-10-2023
03:04 AM
Summary
After you experience a disk failure on a worker node and then replace the disk, you’ll need to ensure that data is suitably rebalanced within the Kudu service at the local (single-node) level.
Investigation & Resolution
Purging a Tablet Server
There isn’t currently a method to rebalance the replicas on a single Tablet Server disk array. This means that we need to empty the node and reintroduce it so that it can be used again from scratch. We begin by quiescing the Tablet Server.
Quiesce the Tablet Server
Quiesce essentially means to stop the Tablet Server from hosting any leaders in order to:
Make other replicas on live Tablet Servers become the leaders
Prevent this Tablet Server from becoming a leader for any other reason
Allow this Tablet Server to be read from (the replicas that are still present)
Check Quiesce Status
sudo -u kudu kudu tserver quiesce status <Worker-Node-FQDN>
Quiescing | Tablet Leaders | Active Scanners
-----------+----------------+-----------------
true | 0 | 0
Quiesce Start
sudo -u kudu kudu tserver quiesce start <Worker-Node-FQDN>
Put the Tablet Server into Maintenance Mode
Maintenance Mode stops the Tablet Server from being used completely. The maintenance mode commands require you to retrieve the UUID of the Tablet Server first. We can get this information from a tserver list command:
sudo -u kudu kudu tserver list <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN>
An example that then targets the server you want to work on
sudo -u kudu kudu tserver list <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> | grep <Worker-Node-FQDN>
5e103ac84707495e843a4553ac622f20 | <Worker-Node-FQDN>:7050
Put the Tablet Server into Maintenance Mode
sudo -u kudu kudu tserver state enter_maintenance <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> 5e103ac84707495e843a4553ac622f20
Exit the Tablet Server from Maintenance Mode
sudo -u kudu kudu tserver state exit_maintenance <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> 5e103ac84707495e843a4553ac622f20
Run ksck to check the status of Kudu Service / TS to be purged
This will confirm the status of both Quiesce and Maintenance Mode for every Tablet Server in the cluster (in our example, <Worker-Node-FQDN>):
sudo -u kudu kudu cluster ksck <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> 2>&1 | tee ksck.out
The above command writes the ksck output to both the terminal and a file called ‘ksck.out’, allowing us to review the information live and keep a record in the file. Taking our example of purging <Worker-Node-FQDN>, the following information is key:
Tablet Server Summary
This is a list of all Tablet Servers in the cluster. We’ve focused on just <Worker-Node-FQDN> and the surrounding Tablet Servers for illustrative purposes. Notice the row with UUID 5e103ac84707495e843a4553ac622f20 - <Worker-Node-FQDN> is quiescing and has no leaders running on it.
Tablet Server Summary
UUID | Address | Status | Location | Quiescing | Tablet Leaders | Active Scanners
----------------------------------+---------------------------------+---------+-------------+-----------+----------------+-----------------
…
59e6ca5107754c24b649ee9c9acfccfb | <Worker-Node-FQDN>:7050 | HEALTHY | /CabinetE01 | false | 47 | 0
5e103ac84707495e843a4553ac622f20 | <Worker-Node-FQDN>:7050 | HEALTHY | /CabinetA08 | true | 0 | 0
5edf82f0516b4897b3a7991a7e67d71c | <Worker-Node-FQDN>:7050 | HEALTHY | /CabinetA07 | false | 1452 | 0
…
Tablet Server State (maintenance mode)
This section shows that the TS is in maintenance mode.
Tablet Server States
Server | State
----------------------------------+------------------
5e103ac84707495e843a4553ac622f20 | MAINTENANCE_MODE
Purge the Tablet Server
The following command instructs kudu to ignore the <Worker-Node-FQDN> node AND move replicas away from it:
sudo -u kudu /tmp/kudu cluster rebalance <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> -ignored_tservers=5e103ac84707495e843a4553ac622f20 -move_replicas_from_ignored_tservers
Again, importantly, the Tablet Server has to have been successfully quiesced and put into maintenance mode to avoid any issues with the Kudu service.
A simple break in VPN or shell terminal will kill the rebalance command. This won't affect Kudu, but it will stop the process. In order to work around this and retain information during the process, use the following command to output the rebalance status into the active terminal session as well as a file:
sudo -u kudu /tmp/kudu cluster rebalance <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> -ignored_tservers=5e103ac84707495e843a4553ac622f20 -move_replicas_from_ignored_tservers 2>&1 | tee <Worker-Node-FQDN>-rebalance.out &
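An alternative sketch, if you would rather detach the rebalance from the login session entirely, is to launch it with nohup so that a VPN or terminal drop cannot interrupt it (output is still captured in the .out file):
sudo -u kudu nohup /tmp/kudu cluster rebalance <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> -ignored_tservers=5e103ac84707495e843a4553ac622f20 -move_replicas_from_ignored_tservers > <Worker-Node-FQDN>-rebalance.out 2>&1 &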
Re-introduce the Tablet Server
After the Kudu Tablet Server has been purged, it’s time to reintroduce it into the Kudu service so that it can be used again.
Exit the Tablet Server from Maintenance Mode
sudo -u kudu kudu tserver state exit_maintenance <Master-Node1-FQDN>,<Master-Node2-FQDN>,<Master-Node3-FQDN> 5e103ac84707495e843a4553ac622f20
Unquiesce the Tablet Server
sudo -u kudu kudu tserver quiesce stop <Worker-Node-FQDN>
Rebalance the Kudu Service
We now have a Kudu Tablet Server that has been quiesced and purged. It’s time to rebalance the Kudu service and share the Tablets back onto the recently purged Kudu Tablet Server.
Go to CM - Kudu - Actions - Run Kudu Rebalancer Tool:
06-10-2023
03:01 AM
Summary
When you have experienced a disk failure on a worker node and have had the disk replaced, you’ll need to ensure that data is suitably rebalanced within the HDFS service at the local (single-node) level.
Investigation
HDFS Disk Balancer - Explained
This is an area that already has a great Blog written around it:
How-to: Use the New HDFS Intra-DataNode Disk Balancer in Apache Hadoop
Please read through the blog and follow the guidance to verify that you have already set up the HDFS service to be able to perform this necessary action.
Resolution
HDFS Disk Balancer - Execution
Let’s go through the process of performing an HDFS Intra-DataNode Disk Rebalancing process.
Obtain a local HDFS DataNode Kerberos Ticket
cd /var/run/cloudera-scm-agent/process/`ls -larth /var/run/cloudera-scm-agent/process | grep -i hdfs-DATANODE | tail -1 | awk '{print $9}'`
kinit -kt hdfs.keytab hdfs/`hostname -f`@<ClusterDomain>
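A quick klist confirms that the DataNode ticket was obtained before proceeding:
klist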
Create a Disk Balancer Plan
hdfs diskbalancer -plan `hostname -f` -bandwidth 100 -thresholdPercentage 5
Example of a successful creation of a disk balancer plan:
hdfs diskbalancer -plan `hostname -f` -bandwidth 100 -thresholdPercentage 5
INFO balancer.NameNodeConnector: getBlocks calls for hdfs://nameservice1 will be rate-limited to 20 per second
INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
INFO block.BlockTokenSecretManager: Setting block keys
INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
INFO planner.GreedyPlanner: Starting plan for Node : <Worker-Node-FQDN>:9867
INFO planner.GreedyPlanner: Disk Volume set 76c137f0-5d0c-4de3-b166-5c0ac29b77d1 Type : DISK plan completed.
INFO planner.GreedyPlanner: Compute Plan for Node : <Worker-Node-FQDN>:9867 took 46 ms
INFO command.Command: Writing plan to:
INFO command.Command: /system/diskbalancer/2023-Mar-13-02-50-35/<Worker-Node-FQDN>.plan.json
Writing plan to:
/system/diskbalancer/2023-Mar-13-02-50-35/<Worker-Node-FQDN>.plan.json
Execute a Disk Balancer Plan
hdfs diskbalancer -execute /system/diskbalancer/2023-Mar-13-02-50-35/<Worker-Node-FQDN>.plan.json
Example of a successful execution of a disk balancer plan:
hdfs diskbalancer -execute /system/diskbalancer/2023-Mar-13-02-50-35/<Worker-Node-FQDN>.plan.json
INFO command.Command: Executing "execute plan" command
Query a running Disk Balancer Plan
hdfs diskbalancer -query `hostname -f`
Example of querying a running disk balancer plan:
hdfs diskbalancer -query `hostname -f`
INFO command.Command: Executing "query plan" command.
Plan File: /system/diskbalancer/2023-Mar-13-02-50-35/<Worker-Node-FQDN>.plan.json
Plan ID: 9b0d03edee9d4285cfea5fe13247d8e23cb4557d
Result: PLAN_UNDER_PROGRESS
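If you prefer not to re-run the query by hand, a minimal polling sketch (it simply greps the query output shown above until the plan is no longer reported as in progress) is:
# Poll the disk balancer status every 60 seconds until the plan completes
while hdfs diskbalancer -query `hostname -f` 2>&1 | grep -q PLAN_UNDER_PROGRESS; do sleep 60; done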
Cancel a running Disk Balancer Plan (if required)
hdfs diskbalancer -cancel /system/diskbalancer/2023-Mar-13-02-50-35/<Worker-Node-FQDN>.plan.json
Example of cancelling a running disk balancer plan:
hdfs diskbalancer -cancel /system/diskbalancer/2023-Mar-13-02-50-35/<Worker-Node-FQDN>.plan.json
INFO command.Command: Executing "Cancel plan" command.
HDFS Disk Balancer - No Rebalancing Required Example
The following example is what you will see if you attempt to run the HDFS local disk balancer on a node that doesn’t require any rebalancing to occur:
hdfs diskbalancer -plan `hostname -f` -bandwidth 100 -thresholdPercentage 5
INFO balancer.NameNodeConnector: getBlocks calls for hdfs://nameservice1 will be rate-limited to 20 per second
INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
INFO block.BlockTokenSecretManager: Setting block keys
INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
INFO planner.GreedyPlanner: Starting plan for Node : <Worker-Node-FQDN>:9867
INFO planner.GreedyPlanner: Compute Plan for Node : <Worker-Node-FQDN>:9867 took 36 ms
INFO command.Command: No plan generated. DiskBalancing not needed for node: <Worker-Node-FQDN> threshold used: 5.0
No plan generated. DiskBalancing not needed for node: <Worker-Node-FQDN> threshold used: 5.0
06-10-2023
02:59 AM
Summary
It is expected that you will experience worker node data disk failures whilst managing your CDP cluster. This blog takes you through the steps that you should take to gracefully replace the failed worker node disks with the least disruption to your CDP cluster.
Investigation
Cloudera Manager Notification
One easy method to identify that you have experienced a disk failure within your cluster is with the Cloudera Manager UI. You will see the following type of error:
Cloudera Manager will also track multiple disk failures:
HDFS NameNode - DataNode Volume Failures
The failed disks within your cluster can also be observed from within the HDFS NameNode UI:
This is also useful to quickly identify exactly which storage locations have failed.
Confirming from the Command Line
Taking the last example from HDFS NameNode - DataNode Volume Failures, we can see that /data/20 & /data/6 are both failed directories.
The following interaction from the Command Line on the worker node will also confirm the disk issue:
[root@<WorkerNode> ~]# ls -larth /data/20
ls: cannot access /data/20: Input/output error
[root@<WorkerNode> ~]# ls -larth /data/6
ls: cannot access /data/6: Input/output error
[root@<WorkerNode> ~]# ls -larth /data/1
total 0
drwxr-xr-x. 26 root root 237 Sep 30 02:54 ..
drwxr-xr-x. 3 root root 20 Oct 1 06:45 kudu
drwxr-xr-x. 3 root root 16 Oct 1 06:46 dfs
drwxr-xr-x. 3 root root 16 Oct 1 06:47 yarn
drwxr-xr-x. 3 root root 29 Oct 1 06:48 impala
drwxr-xr-x. 2 impala impala 6 Oct 1 06:48 cores
drwxr-xr-x. 7 root root 68 Oct 1 06:48 .
Resolution
Replace a disk on a Worker Node
You will have a number of roles that are running on any single worker node host. This is an example of a worker node that is showing a failed disk:
Decommission the Worker Node
As there are multiple roles running on a worker node, it’s best to use the decommissioning process to gracefully remove the worker node from running services. This can be found by navigating to the host within Cloudera Manager and using “Actions > Begin Maintenance”
It will then take you to the following page:
Click “Begin Maintenance” and wait for the process to complete.
Expect this process to take hours on a busy cluster. The time the process takes to complete is dependent upon:
The number of regions that the HBase RegionServer is hosting
The number of blocks that the HDFS DataNode is hosting
The number of tablets that the Kudu TabletServer is hosting
Replace and Configure the disks
Once the worker node is fully decommissioned, the disks are ready to be replaced and configured physically within your datacenter by your infrastructure team.
Every cluster is going to have its own internal processes to configure the newly replaced disks. Let’s go through an example of how this work can be verified for reference.
List the attached block devices
[root@<WorkerNode> ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 3.7T 0 disk /data/1
sdb 8:16 0 3.7T 0 disk /data/2
sdc 8:32 0 3.7T 0 disk /data/3
sdd 8:48 0 3.7T 0 disk /data/4
sde 8:64 0 3.7T 0 disk /data/5
sdf 8:80 0 3.7T 0 disk /data/6
sdg 8:96 0 3.7T 0 disk /data/7
sdh 8:112 0 3.7T 0 disk /data/8
sdi 8:128 0 3.7T 0 disk /data/9
sdj 8:144 0 3.7T 0 disk /data/10
sdk 8:160 0 3.7T 0 disk /data/11
sdl 8:176 0 3.7T 0 disk /data/12
sdm 8:192 0 3.7T 0 disk /data/13
sdn 8:208 0 3.7T 0 disk /data/14
sdo 8:224 0 3.7T 0 disk /data/15
sdp 8:240 0 3.7T 0 disk /data/16
sdq 65:0 0 3.7T 0 disk /data/17
sdr 65:16 0 3.7T 0 disk /data/18
sds 65:32 0 3.7T 0 disk /data/19
sdt 65:48 0 3.7T 0 disk /data/20
sdu 65:64 0 3.7T 0 disk /data/21
sdv 65:80 0 3.7T 0 disk /data/22
sdw 65:96 0 3.7T 0 disk /data/23
sdx 65:112 0 3.7T 0 disk /data/24
sdy 65:128 0 1.8T 0 disk
├─sdy1 65:129 0 1G 0 part /boot
├─sdy2 65:130 0 20G 0 part [SWAP]
└─sdy3 65:131 0 1.7T 0 part
├─vg01-root 253:0 0 500G 0 lvm /
├─vg01-kuduwal 253:1 0 100G 0 lvm /kuduwal
├─vg01-home 253:2 0 50G 0 lvm /home
└─vg01-var 253:3 0 100G 0 lvm /var
List the IDs of the block devices
[root@<WorkerNode> ~]# blkid
/dev/sdy1: UUID="4b2f1296-460c-4cbc-8aca-923c9309d4fe" TYPE="xfs"
/dev/sdy2: UUID="af9c4c79-21b9-4d02-9453-ede88b920c1f" TYPE="swap"
/dev/sdy3: UUID="j9n4QD-60xB-rqpQ-Ck3y-s2m0-FdSo-IGWrN9" TYPE="LVM2_member"
/dev/sdb: UUID="4865e719-e77c-4d1e-b1e0-80ae1d0d6e82" TYPE="xfs"
/dev/sdc: UUID="59ae0b91-3cfc-4c53-a02f-e20bdf0ac209" TYPE="xfs"
/dev/sdd: UUID="b80473e0-bce8-413c-9740-934e8ed7006e" TYPE="xfs"
/dev/sda: UUID="684e32c8-eeb2-4215-b861-880543b1f96b" TYPE="xfs"
/dev/sdg: UUID="0f0d12ac-7d93-4c76-9f5c-ac6b43f2eaff" TYPE="xfs"
/dev/sde: UUID="06c0e908-dd67-4a42-8615-7b7335a7e0f6" TYPE="xfs"
/dev/sdf: UUID="9346fa04-dc1a-4dcc-8233-a5cb65495998" TYPE="xfs"
/dev/sdn: UUID="8f05d1dd-94d1-4376-9409-d5683ad4c225" TYPE="xfs"
/dev/sdo: UUID="5e0413d1-0b82-4ec1-b3f9-bb072db39071" TYPE="xfs"
/dev/sdh: UUID="08063201-f252-49dd-8402-042afbea78a2" TYPE="xfs"
/dev/sdl: UUID="1e5ace85-f93c-46f7-bf65-353f774cfeaa" TYPE="xfs"
/dev/sdk: UUID="195967b5-a1a0-43bb-9a33-9cf7a36fdcb6" TYPE="xfs"
/dev/sdq: UUID="db81b056-587e-47a6-844e-2d952278324b" TYPE="xfs"
/dev/sdr: UUID="45b4cf68-6f10-4dc7-8128-c2006e7aba5d" TYPE="xfs"
/dev/sds: UUID="a8e591e9-33c8-478a-b580-aeac9ad4cf44" TYPE="xfs"
/dev/sdi: UUID="a0187ae0-7598-44c4-805c-ef253dea6e7a" TYPE="xfs"
/dev/sdm: UUID="720836d8-ddd6-406d-a33f-f1b92f9b40d5" TYPE="xfs"
/dev/sdv: UUID="df4bdd58-e8d2-4bdb-8255-b9c7fcfe8999" TYPE="xfs"
/dev/sdw: UUID="701f3516-03bc-461b-930c-ab34d0b417d7" TYPE="xfs"
/dev/sdu: UUID="5e1bd2f3-8ccc-4ba1-a0f7-bb55c8246d72" TYPE="xfs"
/dev/sdj: UUID="264b85f8-9740-418b-a811-20666a305caa" TYPE="xfs"
/dev/sdt: UUID="53f2f06e-71e9-4796-86a3-2212c0f652ea" TYPE="xfs"
/dev/sdp: UUID="e6b984c0-6d85-4df2-9a7d-cc1c87238c49" TYPE="xfs"
/dev/mapper/vg01-root: UUID="18bc42fe-dbfd-4005-8e13-6f5d2272d9a7" TYPE="xfs"
/dev/sdx: UUID="53e4023f-583a-4219-bfd2-1a94e15f34ef" TYPE="xfs"
/dev/mapper/vg01-kuduwal: UUID="a1441e2f-718b-42eb-b398-28ce20ee50ad" TYPE="xfs"
/dev/mapper/vg01-home: UUID="fbc8e522-64da-4cc3-87b6-89ea83fb0aa0" TYPE="xfs"
/dev/mapper/vg01-var: UUID="93b1537f-a1a9-4616-b79a-cab9a1e39bf1" TYPE="xfs"
View the /etc/fstab
[root@<WorkerNode> ~]# cat /etc/fstab
/dev/mapper/vg01-root / xfs defaults 0 0
UUID=4b2f1296-460c-4cbc-8aca-923c9309d4fe /boot xfs defaults 0 0
/dev/mapper/vg01-home /home xfs defaults 0 0
/dev/mapper/vg01-kuduwal /kuduwal xfs defaults 0 0
/dev/mapper/vg01-var /var xfs defaults 0 0
UUID=af9c4c79-21b9-4d02-9453-ede88b920c1f swap swap defaults 0 0
UUID=684e32c8-eeb2-4215-b861-880543b1f96b /data/1 xfs noatime,nodiratime 0 0
UUID=4865e719-e77c-4d1e-b1e0-80ae1d0d6e82 /data/2 xfs noatime,nodiratime 0 0
UUID=59ae0b91-3cfc-4c53-a02f-e20bdf0ac209 /data/3 xfs noatime,nodiratime 0 0
UUID=b80473e0-bce8-413c-9740-934e8ed7006e /data/4 xfs noatime,nodiratime 0 0
UUID=06c0e908-dd67-4a42-8615-7b7335a7e0f6 /data/5 xfs noatime,nodiratime 0 0
UUID=9346fa04-dc1a-4dcc-8233-a5cb65495998 /data/6 xfs noatime,nodiratime 0 0
UUID=0f0d12ac-7d93-4c76-9f5c-ac6b43f2eaff /data/7 xfs noatime,nodiratime 0 0
UUID=08063201-f252-49dd-8402-042afbea78a2 /data/8 xfs noatime,nodiratime 0 0
UUID=a0187ae0-7598-44c4-805c-ef253dea6e7a /data/9 xfs noatime,nodiratime 0 0
UUID=264b85f8-9740-418b-a811-20666a305caa /data/10 xfs noatime,nodiratime 0 0
UUID=195967b5-a1a0-43bb-9a33-9cf7a36fdcb6 /data/11 xfs noatime,nodiratime 0 0
UUID=1e5ace85-f93c-46f7-bf65-353f774cfeaa /data/12 xfs noatime,nodiratime 0 0
UUID=720836d8-ddd6-406d-a33f-f1b92f9b40d5 /data/13 xfs noatime,nodiratime 0 0
UUID=8f05d1dd-94d1-4376-9409-d5683ad4c225 /data/14 xfs noatime,nodiratime 0 0
UUID=5e0413d1-0b82-4ec1-b3f9-bb072db39071 /data/15 xfs noatime,nodiratime 0 0
UUID=e6b984c0-6d85-4df2-9a7d-cc1c87238c49 /data/16 xfs noatime,nodiratime 0 0
UUID=db81b056-587e-47a6-844e-2d952278324b /data/17 xfs noatime,nodiratime 0 0
UUID=45b4cf68-6f10-4dc7-8128-c2006e7aba5d /data/18 xfs noatime,nodiratime 0 0
UUID=a8e591e9-33c8-478a-b580-aeac9ad4cf44 /data/19 xfs noatime,nodiratime 0 0
UUID=53f2f06e-71e9-4796-86a3-2212c0f652ea /data/20 xfs noatime,nodiratime 0 0
UUID=5e1bd2f3-8ccc-4ba1-a0f7-bb55c8246d72 /data/21 xfs noatime,nodiratime 0 0
UUID=df4bdd58-e8d2-4bdb-8255-b9c7fcfe8999 /data/22 xfs noatime,nodiratime 0 0
UUID=701f3516-03bc-461b-930c-ab34d0b417d7 /data/23 xfs noatime,nodiratime 0 0
UUID=53e4023f-583a-4219-bfd2-1a94e15f34ef /data/24 xfs noatime,nodiratime 0 0
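For reference, here is a minimal sketch of preparing a single replacement disk so that it matches the xfs / mount-by-UUID convention shown above. The device name and mount point are illustrative (a replacement for /data/20, assumed here to come back as /dev/sdt); follow your own infrastructure standards:
#CREATE THE FILESYSTEM ON THE REPLACEMENT DEVICE
[root@<WorkerNode> ~]# mkfs.xfs /dev/sdt
#CAPTURE ITS NEW UUID AND ADD A MATCHING /etc/fstab ENTRY
[root@<WorkerNode> ~]# blkid /dev/sdt
[root@<WorkerNode> ~]# echo "UUID=<new-uuid> /data/20 xfs noatime,nodiratime 0 0" >> /etc/fstab
#RECREATE THE MOUNT POINT IF NEEDED, MOUNT EVERYTHING DECLARED IN fstab, AND CONFIRM
[root@<WorkerNode> ~]# mkdir -p /data/20
[root@<WorkerNode> ~]# mount -a
[root@<WorkerNode> ~]# df -h /data/20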
Recommission the Worker Node
Once the disk(s) have been suitably replaced, it’s time to use the recommissioning process to gracefully reintroduce the worker node back into the cluster. This can be found by navigating to the host within Cloudera Manager and using “Actions > End Maintenance”.
After the node has completed its recommission cycle, follow the guidance in the next sections to perform local disk rebalancing where appropriate.
Address local disk HDFS Balancing
Most clusters utilize HDFS. This service has a local disk balancer that you can make use of. Please find some helpful guidance within the following - Rebalance your HDFS Disks (single node)
Address local disk Kudu Balancing
If you are running Kudu within your cluster, you will need to rebalance the existing Kudu data on the local disks of the worker node. Please find some helpful guidance within the following - Rebalance your Kudu Disks (single node)
06-10-2023
02:56 AM
Summary
Within the blog Rebalance your mixed HDFS & Kudu Services, we demonstrated how to properly review and set up a mixed HDFS / Kudu shared services cluster.
Now it is time to review a method that allows you to confirm the distribution of data across your HDFS & Kudu services at the disk level of each and every worker node.
Investigation
Commands to check the balance of HDFS & Kudu
Log in as root to each worker node that is part of the HDFS and Kudu service, and perform the following commands.
Check overall disk capacity status:
[root@<Worker-Node> ~]# df -h /data/* | sed 1d | sort
/dev/sdb 1.9T 1.4T 478G 75% /data/1
/dev/sdc 1.9T 1.3T 560G 70% /data/2
/dev/sdd 1.9T 1.4T 513G 73% /data/3
/dev/sde 1.9T 1.4T 489G 74% /data/4
/dev/sdf 1.9T 1.4T 464G 76% /data/5
/dev/sdg 1.9T 1.4T 513G 73% /data/6
/dev/sdh 1.9T 1.4T 525G 72% /data/7
/dev/sdi 1.9T 1.4T 466G 76% /data/8
/dev/sdj 1.9T 1.3T 538G 72% /data/9
/dev/sdk 1.9T 1.5T 418G 78% /data/10
/dev/sdl 1.9T 1.3T 617G 67% /data/11
/dev/sdm 1.9T 1.3T 572G 70% /data/12
/dev/sdn 1.9T 1.4T 474G 75% /data/13
/dev/sdo 1.9T 1.3T 534G 72% /data/14
/dev/sdp 1.9T 1.4T 468G 75% /data/15
/dev/sdq 1.9T 1.4T 470G 75% /data/16
/dev/sdr 1.9T 1.4T 466G 75% /data/17
/dev/sds 1.9T 1.4T 468G 75% /data/18
/dev/sdt 1.9T 1.4T 473G 75% /data/19
/dev/sdu 1.9T 1.4T 474G 75% /data/20
/dev/sdv 1.9T 1.4T 467G 75% /data/21
/dev/sdw 1.9T 1.4T 474G 75% /data/22
/dev/sdx 1.9T 1.4T 473G 75% /data/23
/dev/sdy 1.9T 1.4T 477G 75% /data/24
Check overall HDFS disk capacity status:
[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/dfs | sort -t/ -k3,3n
606G /data/1/dfs
612G /data/2/dfs
608G /data/3/dfs
609G /data/4/dfs
610G /data/5/dfs
619G /data/6/dfs
613G /data/7/dfs
634G /data/8/dfs
590G /data/9/dfs
681G /data/10/dfs
618G /data/11/dfs
621G /data/12/dfs
1.2T /data/13/dfs
1.1T /data/14/dfs
1.2T /data/15/dfs
1.2T /data/16/dfs
1.2T /data/17/dfs
1.2T /data/18/dfs
1.2T /data/19/dfs
1.2T /data/20/dfs
1.2T /data/21/dfs
1.2T /data/22/dfs
1.2T /data/23/dfs
1.2T /data/24/dfs
Check overall Kudu disk capacity status:
[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/kudu | sort -t/ -k3,3n
745G /data/1/kudu
691G /data/2/kudu
741G /data/3/kudu
765G /data/4/kudu
788G /data/5/kudu
730G /data/6/kudu
725G /data/7/kudu
763G /data/8/kudu
734G /data/9/kudu
768G /data/10/kudu
628G /data/11/kudu
669G /data/12/kudu
205G /data/13/kudu
204G /data/14/kudu
205G /data/15/kudu
208G /data/16/kudu
209G /data/17/kudu
205G /data/18/kudu
204G /data/19/kudu
204G /data/20/kudu
206G /data/21/kudu
203G /data/22/kudu
194G /data/23/kudu
200G /data/24/kudu
Now collate all of the information retrieved from the Worker Node into an easy-to-read format so that you can readily observe out-of-balance characteristics at the Worker Node layer (a small scripted helper is sketched below).
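Here is a minimal bash sketch that prints the per-disk df usage alongside the dfs and kudu du totals, assuming the /data/N layout shown above:
# Print per-disk usage with the dfs and kudu footprints side by side
for d in /data/*; do
  printf "%-10s use:%-5s dfs:%-6s kudu:%-6s\n" "$d" \
    "$(df -h "$d" | awk 'NR==2 {print $5}')" \
    "$(du -sh "$d/dfs" 2>/dev/null | awk '{print $1}')" \
    "$(du -sh "$d/kudu" 2>/dev/null | awk '{print $1}')"
done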
Worker Node Balance Example
Taking the output from Commands to check the balance of HDFS & Kudu, here is an example of how you might collate the information into a format that makes it easy to notice data balance issues at the Worker Node level.
Worker-Node
Disk      | df -h (Size Used Avail Use%) | du -h dfs | du -h kudu
----------+------------------------------+-----------+-----------
/data/1   | 1.9T 1.7T 164G 92%           | 414G      | 1.3T
/data/2   | 1.9T 1.5T 395G 79%           | 499G      | 970G
/data/3   | 1.9T 1.5T 351G 82%           | 487G      | 1022G
/data/4   | 1.9T 1.5T 338G 82%           | 493G      | 1.1T
/data/5   | 1.9T 1.5T 352G 82%           | 486G      | 1.1T
/data/6   | 1.9T 1.5T 337G 82%           | 498G      | 1.1T
/data/7   | 1.9T 1.5T 337G 82%           | 485G      | 1.1T
/data/8   | 1.9T 1.5T 350G 82%           | 494G      | 1018G
/data/9   | 1.9T 1.5T 339G 82%           | 475G      | 1.1T
/data/10  | 1.9T 1.5T 391G 80%           | 487G      | 985G
/data/11  | 1.9T 1.5T 338G 82%           | 487G      | 1.1T
/data/12  | 1.9T 1.6T 320G 83%           | 475G      | 1.1T
/data/13  | 1.9T 1.2T 688G 64%           | 1.2T      | 353M
/data/14  | 1.9T 1.2T 679G 64%           | 1.2T      | 8.5G
/data/15  | 1.9T 1.2T 674G 64%           | 1.2T      | 13G
/data/16  | 1.9T 1.2T 678G 64%           | 1.2T      | 8.0G
/data/17  | 1.9T 1.2T 686G 64%           | 1.2T      | 8.0K
/data/18  | 1.9T 1.2T 680G 64%           | 1.2T      | 5.4G
/data/19  | 1.9T 1.2T 694G 63%           | 1.2T      | 33M
/data/20  | 1.9T 1.2T 688G 64%           | 1.2T      | 8.0K
/data/21  | 1.9T 1.2T 689G 64%           | 1.2T      | 8.0K
/data/22  | 1.9T 1.2T 686G 64%           | 1.2T      | 129M
/data/23  | 1.9T 1.2T 679G 64%           | 1.2T      | 7.4G
/data/24  | 1.9T 1.2T 684G 64%           | 1.2T      | 33M
If you aligned the mixed HDFS & Kudu configuration some time after the cluster was originally deployed, you are likely to encounter node-level disk capacity issues with the Kudu Rebalance command.
This is because the Kudu Rebalancer is currently unaware of both total disk capacity and currently used disk capacity.
Analyze the Data Distribution
Within the Worker Node Balance Example, we can see 24 disks and a fundamental imbalance between them.
All 24 disks in the example are configured within HDFS and Kudu, but the HDFS & Kudu configuration alignment happened after the cluster had been used for many years.
Note how out of sync they are:
Disks 1-12 are far more utilized than Disks 13-24. This can happen due to:
An extra 12 disks having been added to the node at some point
The Kudu Tablet Server Role Group configuration having been applied later than the node was deployed into HDFS / Kudu
Disk 1 is at 92%:
If left unchecked, every service in the cluster that uses the data disks will be affected when this disk reaches 100%
Depending on how you monitor disk-level utilization per node, the overall capacity of the node will not reflect that this single disk is nearly full
There are other scenarios that can cause a similar imbalance, such as failed disks being replaced while the HDFS and Kudu rebalancing activities remain focused only at the service level.
Resolution
Whether it is due to a later alignment of the HDFS or Kudu disk configuration, or simply a cluster that has had countless disks replaced over time with no local disk balancing applied afterward, it’s time to illustrate how to handle these issues.
There are several blogs that can help you with this:
Replace your failed Worker Node disks
Rebalance your HDFS Disks (single node)
Rebalance your Kudu Disks (single node)