
When performing a Rolling or Express Upgrade, failures can naturally happen because large clusters are bound to have problematic hosts.

Here are 10 easy tips to prevent, diagnose and fix errors.

Before upgrading the stack ...

1. Always upgrade Ambari to the most recent version, even if it's a dot release.

Often, there are fixes and optimizations that make the stack upgrade smoother.

2. Ensure all services are up, service checks are passing, there are no critical alerts, etc.

This helps ensure that the cluster is fully operational and helps to isolate any failures.

3. Pre-install the bits and make sure all hosts have enough disk space. Then confirm on every host that the new version is present. E.g.,

hdp-select versions | grep 2.5.0.0 | sort | tail -1
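To repeat that check across many hosts, a small wrapper along these lines may help. This is only a sketch: the helper names has_version and check_hosts, the hosts file (one hostname per line), and passwordless ssh access are assumptions for illustration, not part of Ambari or hdp-select.

```shell
# Hypothetical helper: reads `hdp-select versions` output on stdin and
# succeeds if any line contains the given version string.
has_version() {
  # -F treats the version as a fixed string, so the dots are literal.
  grep -Fq -- "$1"
}

# Hypothetical helper: prints OK or MISSING for every host in a file.
# $1 = version to look for, $2 = file with one hostname per line (assumed).
check_hosts() {
  version=$1
  hosts_file=$2
  while read -r host; do
    [ -n "$host" ] || continue
    # ssh -n keeps ssh from swallowing the rest of the hosts file.
    if ssh -n "$host" hdp-select versions | has_version "$version"; then
      echo "OK $host"
    else
      echo "MISSING $host"
    fi
  done < "$hosts_file"
}
```

For instance, `check_hosts 2.5.0.0 hosts.txt | grep MISSING` would list only the hosts where the bits still need to be installed.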

4. Do not ignore warnings. Starting in Ambari 2.2.2, a flag in the ambari.properties file allows users to bypass PreCheck errors. Make sure it is either not present or set to false:

stack.upgrade.bypass.prechecks=false
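One way to verify this before starting the upgrade is a quick grep of the properties file. The function name bypass_status is hypothetical, and the file location in the usage note is an assumption about a default install; adjust it for your environment.

```shell
# Hypothetical helper: prints "enabled" if the bypass flag is explicitly
# set to true in the given properties file, otherwise prints "safe"
# (flag absent, or set to anything other than true).
bypass_status() {
  props=$1
  # -E: extended regex, -i: ignore case, -q: quiet (exit status only)
  if grep -Eiq '^[[:space:]]*stack\.upgrade\.bypass\.prechecks[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$props"; then
    echo enabled
  else
    echo safe
  fi
}
```

Assuming the file lives at /etc/ambari-server/conf/ambari.properties, running `bypass_status /etc/ambari-server/conf/ambari.properties` on the Ambari server should print "safe" before you proceed.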

5. Take a backup of the Ambari database, using the command that matches your database (PostgreSQL or MySQL). E.g.,

pg_dump -U ambari ambari > /tmp/ambari_bk.psql
mysqldump -u ambari ambari > /tmp/ambari_bk.mysql

In the middle of Upgrade ...

6. Rolling Upgrade will pause after 30% of the DataNodes have been upgraded. This allows the customer to run additional jobs and ensure that the partial upgrade is still healthy.

7. If a failure occurs, click on "Retry" and make sure that all other dependent services and masters are up.

Often, a retry will work if the previous command failed due to a timeout, a network glitch, a host that went down and came back up, etc. Capture any logs from both the component that failed and the ambari-agent; the agent logs are at /var/lib/ambari-agent/data/output-*.txt and /var/lib/ambari-agent/data/errors-*.txt
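Capturing the agent files can be scripted so nothing is lost before the next retry. The sketch below is an illustration only: the function name collect_agent_logs and the default archive path are assumptions, and it relies on a tar that accepts -T (GNU tar and bsdtar do).

```shell
# Hypothetical helper: bundles the ambari-agent output-*.txt and
# errors-*.txt files into one gzip-compressed tar archive.
# $1 = agent data directory (defaults to the path named above)
# $2 = archive to create (assumed example path; use an absolute path)
collect_agent_logs() {
  data_dir=${1:-/var/lib/ambari-agent/data}
  archive=${2:-/tmp/ambari-agent-logs.tar.gz}
  list=$(mktemp) || return 1
  # Collect only the two file patterns, without descending into subdirectories.
  find "$data_dir" -maxdepth 1 -type f \
    \( -name 'output-*.txt' -o -name 'errors-*.txt' \) > "$list"
  status=1
  if [ -s "$list" ]; then
    # -T reads the member names from the list file.
    tar -czf "$archive" -T "$list" && status=0
  else
    echo "no agent logs found in $data_dir" >&2
  fi
  rm -f "$list"
  return "$status"
}
```

Run it on the host where the command failed, then attach the archive together with the failed component's own logs.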

8. If the failure requires changing configs or restarting a component on a host, then click on the "Pause" button. This will temporarily suspend the Upgrade/Downgrade and allow the user to change configs, execute other commands, such as restarting services, running service checks, etc. Once done, click on the "Resume" button.

CAUTION: do not ever add or move hosts, add or delete services, enable High Availability, or change topology while the upgrade is in progress.

If you cannot Finalize ...

9. Identify the problematic hosts and components. In Ambari 2.0 - 2.2, you can run the following queries against the Ambari database:

SELECT repo_version_id, version, display_name FROM repo_version;


-- The state for your version may be in UPGRADING, UPGRADED.
-- UPGRADING: some component on a host is still not on the newer version
-- UPGRADED: all components on all hosts are on the newer version
SELECT version, state FROM cluster_version cv JOIN repo_version rv ON cv.repo_version_id = rv.repo_version_id ORDER BY version DESC;


-- Find how many hosts are in each state
SELECT version, state, COUNT(*) FROM host_version hv JOIN repo_version rv ON hv.repo_version_id = rv.repo_version_id GROUP BY version, state ORDER BY version DESC, state;


-- Find components on hosts still not on the newer version
SELECT service_name, component_name, version, host_name FROM hostcomponentstate hcs JOIN hosts h ON hcs.host_id = h.host_id WHERE service_name NOT IN ('AMBARI_METRICS', 'KERBEROS') and component_name NOT IN ('ZKFC') ORDER BY version, service_name, component_name, host_name;

On these hosts, do the following:

1. hdp-select set all <new_version>

2. Restart any components still on the older version (you may have to click on the "Pause" button first).

Once all hosts are on the newer version, then the Cluster Version status should transition to UPGRADED; this will allow you to Finalize the upgrade.
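If the component query returns many rows, the list of hosts that still need attention can be extracted with a short filter. This is a sketch under stated assumptions: the function name stale_hosts is hypothetical, and it expects the query result as pipe-separated rows in the column order service_name, component_name, version, host_name.

```shell
# Hypothetical helper: reads rows of the form
#   service_name|component_name|version|host_name
# on stdin and prints each host that has at least one component whose
# version differs from the target version given as $1.
stale_hosts() {
  # Appending "" forces a string comparison rather than a numeric one.
  awk -F'|' -v target="$1" '($3 "") != (target "") { print $4 }' | sort -u
}
```

On PostgreSQL, rows in that shape can be produced with psql's unaligned, tuples-only output, for example `psql -U ambari -d ambari -At -F '|' -c "<component query above>" | stale_hosts <new_version>`, where the database name and user are assumed to match the backup example and <new_version> is the full version string stored in the table.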

10. If you still run into problems, gather all of the logs and the results of the SQL queries, then email either Hortonworks Support or the mailing list of the component that failed.

Here's another useful query, which lists every upgrade task with its group, item text, and status.

Postgres:
SELECT u.upgrade_id, u.direction, u.from_version, u.to_version, hrc.request_id, hrc.task_id, substr(g.group_title, 1, 30), substr(i.item_text, 1, 30), hrc.status
FROM upgrade_group g JOIN upgrade u ON g.upgrade_id = u.upgrade_id  
JOIN upgrade_item i ON i.upgrade_group_id = g.upgrade_group_id  
JOIN host_role_command hrc ON hrc.stage_id = i.stage_id AND hrc.request_id = u.request_id 
ORDER BY hrc.task_id;


MySQL:
SELECT u.upgrade_id, u.direction, u.from_version, u.to_version, hrc.request_id, hrc.task_id, left(g.group_title, 30), left(i.item_text, 30), hrc.status
FROM upgrade_group g JOIN upgrade u ON g.upgrade_id = u.upgrade_id  
JOIN upgrade_item i ON i.upgrade_group_id = g.upgrade_group_id  
JOIN host_role_command hrc ON hrc.stage_id = i.stage_id AND hrc.request_id = u.request_id 
ORDER BY hrc.task_id;

Have fun upgrading.
