I had Cloudera Manager (CMS), the Cloudera Management Service (MGMT) and the database (MariaDB) running on a single VM. In order to enable HA for CMS, I moved MGMT and the database to independent servers. I shut down the scm-agent on the CMS server and have it running on the MGMT server. I also deployed HAProxy (v1.8) for load balancing, as advised in the documentation, and migrated the databases from the current CMS server to the new DB server.
Ref# https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_cm_ha_hosts.html
The DB and NFS mounts for the management service directories are served by a single host.
After following the doc exactly, the management roles start up fine in the UI without errors; however, their status is always "Unknown Health".
At the top, it shows this message:
The status of the services is "?"
The logs on the management service VM have the errors below:
cloudera-scm-agent LOG Errors:
HOSTMONITOR LOG Errors:
The HAProxy server has 2 IPs (all the IPs used by all the servers are reachable; no firewall of any kind is involved). Each IP has a DNS A record registered, for CMSserver and MGMTserver respectively.
On the management server, the file /etc/cloudera-scm-agent/config.ini has
server_host=CMSserver (the DNS name reserved on the proxy server for the cloudera manager)
listening_hostname=MGMTserver (the DNS name reserved on the proxy for the management service host)
On the management server, /etc/hosts also has an entry for its local IP pointing to the MGMTserver name (the proxy host), as per the documentation.
I ran an Inspect Host and its error is as follows:
Additionally, the detailed logs can't be accessed in the UI. The error below is thrown:
I also tried dropping the amon database that was migrated from the old setup, assuming some data corruption. No luck.
Running the Python one-liner below on the management service host also gives me the proxy server's DNS name as expected, which is what is used in config.ini for listening_hostname.
python -c 'import socket; print socket.getfqdn(),socket.gethostbyname(socket.getfqdn())'
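The one-liner above uses the Python 2 print statement. As a rough sketch of the same check for Python 3, the snippet below prints the FQDN and its resolved IP and compares the FQDN with the listening_hostname value. The helper name listening_hostname_matches is hypothetical, and "MGMTserver" is only the placeholder name used in this post.

```python
import socket


def listening_hostname_matches(listening_hostname, fqdn=None):
    """Return True if the local FQDN equals listening_hostname (case-insensitive).

    fqdn can be passed in explicitly; otherwise socket.getfqdn() is used.
    """
    if fqdn is None:
        fqdn = socket.getfqdn()
    return fqdn.lower() == listening_hostname.lower()


if __name__ == "__main__":
    fqdn = socket.getfqdn()
    try:
        # Same two values the original one-liner prints.
        print(fqdn, socket.gethostbyname(fqdn))
    except socket.gaierror as exc:
        print(fqdn, "does not resolve:", exc)
    # "MGMTserver" is the placeholder from config.ini in this post.
    print(listening_hostname_matches("MGMTserver", fqdn))
```

This only confirms what the host reports about itself; it says nothing about whether the proxy forwards traffic for that name.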
The HAProxy config is below.
frontend cmf
    bind *:7180
    mode tcp
    option tcplog
    default_backend cmf
backend cmf
    server cmfhttp1 cloudera-managerA.sg.com:7180 check
    server cmfhttp2 cloudera-managerB.sg.com:7180 check

frontend cmfavro
    bind *:7182
    mode tcp
    option tcplog
    default_backend cmfavro
backend cmfavro
    server cmfavro1 cloudera-managerA.sg.com:7182 check
    server cmfavro2 cloudera-managerB.sg.com:7182 check

frontend mgmt1
    bind *:5678
    mode tcp
    option tcplog
backend mgmt1
    server mgmt1a management-serverA.sg.com check
    server mgmt1b management-serverB.sg.com check

frontend mgmt2
    bind *:7184
    mode tcp
    option tcplog
backend mgmt2
    server mgmt2a management-serverA.sg.com check
    server mgmt2b management-serverB.sg.com check

frontend mgmt3
    bind *:7185
    mode tcp
    option tcplog
backend mgmt3
    server mgmt3a management-serverA.sg.com check
    server mgmt3b management-serverB.sg.com check

frontend mgmt4
    bind *:7186
    mode tcp
    option tcplog
backend mgmt4
    server mgmt4a management-serverA.sg.com check
    server mgmt4b management-serverB.sg.com check

frontend mgmt5
    bind *:7187
    mode tcp
    option tcplog
backend mgmt5
    server mgmt5a management-serverA.sg.com check
    server mgmt5b management-serverB.sg.com check

frontend mgmt6
    bind *:8083
    mode tcp
    option tcplog
backend mgmt6
    server mgmt6a management-serverA.sg.com check
    server mgmt6b management-serverB.sg.com check

frontend mgmt7
    bind *:8084
    mode tcp
    option tcplog
backend mgmt7
    server mgmt7a management-serverA.sg.com check
    server mgmt7b management-serverB.sg.com check

frontend mgmt8
    bind *:8086
    mode tcp
    option tcplog
backend mgmt8
    server mgmt8a management-serverA.sg.com check
    server mgmt8b management-serverB.sg.com check

frontend mgmt9
    bind *:8087
    mode tcp
    option tcplog
backend mgmt9
    server mgmt9a management-serverA.sg.com check
    server mgmt9b management-serverB.sg.com check

frontend mgmt10
    bind *:8091
    mode tcp
    option tcplog
backend mgmt10
    server mgmt10a management-serverA.sg.com check
    server mgmt10b management-serverB.sg.com check

frontend mgmt-agent
    bind *:9000
    mode tcp
    option tcplog
backend mgmt-agent
    server mgmt-agenta management-serverA.sg.com check
    server mgmt-agentb management-serverB.sg.com check

frontend mgmt11
    bind *:9994
    mode tcp
    option tcplog
backend mgmt11
    server mgmt11a management-serverA.sg.com check
    server mgmt11b management-serverB.sg.com check

frontend mgmt12
    bind *:9995
    mode tcp
    option tcplog
backend mgmt12
    server mgmt12a management-serverA.sg.com check
    server mgmt12b management-serverB.sg.com check

frontend mgmt13
    bind *:9996
    mode tcp
    option tcplog
backend mgmt13
    server mgmt13a management-serverA.sg.com check
    server mgmt13b management-serverB.sg.com check

frontend mgmt14
    bind *:9997
    mode tcp
    option tcplog
backend mgmt14
    server mgmt14a management-serverA.sg.com check
    server mgmt14b management-serverB.sg.com check

frontend mgmt15
    bind *:9998
    mode tcp
    option tcplog
backend mgmt15
    server mgmt15a management-serverA.sg.com check
    server mgmt15b management-serverB.sg.com check

frontend mgmt16
    bind *:9999
    mode tcp
    option tcplog
backend mgmt16
    server mgmt16a management-serverA.sg.com check
    server mgmt16b management-serverB.sg.com check

frontend mgmt17
    bind *:10101
    mode tcp
    option tcplog
backend mgmt17
    server mgmt17a management-serverA.sg.com check
    server mgmt17b management-serverB.sg.com check
I have been struggling with this for 4 days; any help would be much appreciated.
Created 05-04-2020 04:55 PM
I managed to fix this. It was a faulty HAProxy config: the frontend sections for the management services were missing the default_backend directive. The issue has thus been resolved.
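For anyone hitting the same problem, here is a minimal sketch of how this kind of omission could be caught before reloading HAProxy. It scans a config for frontend sections that have neither a default_backend nor a use_backend line. The function name frontends_missing_backend is hypothetical, the parsing is deliberately simplistic (it treats any "#" as the start of a comment), and the FIXED stanza assumes the backend is named after its frontend, as in the cmf stanzas of the posted config; the exact lines added in the real fix are not shown in this thread.

```python
# Keywords that start a new section in an HAProxy config file.
SECTION_KEYWORDS = {
    "global", "defaults", "frontend", "backend", "listen",
    "userlist", "peers", "resolvers", "mailers",
}


def frontends_missing_backend(config_text):
    """Return names of frontend sections with no default_backend/use_backend."""
    missing = []
    current = None        # frontend currently being scanned, if any
    has_backend = False   # whether that frontend routes to a backend
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # naive comment stripping
        if not line:
            continue
        words = line.split()
        keyword = words[0]
        if keyword in SECTION_KEYWORDS:
            # A new section closes the previous frontend, if there was one.
            if current is not None and not has_backend:
                missing.append(current)
            is_frontend = keyword == "frontend" and len(words) > 1
            current = words[1] if is_frontend else None
            has_backend = False
        elif current is not None and keyword in ("default_backend", "use_backend"):
            has_backend = True
    if current is not None and not has_backend:
        missing.append(current)
    return missing


# Excerpt of the config as posted in the question.
FAULTY = """
frontend cmf
    bind *:7180
    mode tcp
    option tcplog
    default_backend cmf
backend cmf
    server cmfhttp1 cloudera-managerA.sg.com:7180 check
    server cmfhttp2 cloudera-managerB.sg.com:7180 check
frontend mgmt1
    bind *:5678
    mode tcp
    option tcplog
backend mgmt1
    server mgmt1a management-serverA.sg.com check
    server mgmt1b management-serverB.sg.com check
"""

# Assumed shape of a corrected management stanza.
FIXED = """
frontend mgmt1
    bind *:5678
    mode tcp
    option tcplog
    default_backend mgmt1
backend mgmt1
    server mgmt1a management-serverA.sg.com check
    server mgmt1b management-serverB.sg.com check
"""

if __name__ == "__main__":
    print(frontends_missing_backend(FAULTY))  # ['mgmt1']
    print(frontends_missing_backend(FIXED))   # []
```

Run against the full config from the question, a check like this would be expected to flag every mgmt frontend, since only cmf and cmfavro carry a default_backend line.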
Created 04-05-2020 05:21 PM
Hello Everyone,
Can someone please provide some advice about the fix?
Thanks,
Mithun.
Created 04-05-2020 10:28 PM
On the MGMT server, I set listening_hostname to the server's own hostname instead of the LB name set on the HAProxy server, and it works fine. I suspect this has to do with the HAProxy config; however, I have done exactly as dictated in the Cloudera documentation. I am not sure what is missing.