Cloudera Management Service Unknown Health after separating Management Service and DB to independent server

I had Cloudera Manager (CMS), the Management Service (MGMT), and the database (MariaDB) running on a single VM. To enable HA for CMS, I moved the Management Service and the database to independent servers. I shut down the scm-agent on the CMS server and now run it on the MGMT server. I also deployed HAProxy (v1.8) for load balancing, as advised in the documentation, and migrated the databases from the old CMS server to the new DB server.

Ref# https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_cm_ha_hosts.html

 

The DB and NFS mounts for the management service directories are served by a single host.

After following the documentation exactly, the Management Service roles start up fine in the UI without errors; however, their status is always "Unknown Health".

At the top of the page it shows this message:

Mithun119_0-1585661261929.png

The status of the services is "?":

Mithun119_1-1585661892197.png

The logs on the Management Service VM have the errors below:

cloudera-scm-agent LOG Errors:

 

[30/Mar/2020 16:16:35 +0000] 10723 DnsResolutionMonitor throttling_logger WARNING hostname sgsg2s214 differs from the canonical name cloudera-sg-mgmt-it-simulation.sg.flowtraders.local
[30/Mar/2020 16:17:05 +0000] 10723 MonitorDaemon-Reporter firehoses INFO Creating a connection to the ACTIVITYMONITOR.
[30/Mar/2020 16:17:05 +0000] 10723 MonitorDaemon-Reporter firehoses INFO Creating a connection to the SERVICEMONITOR.
[30/Mar/2020 16:17:05 +0000] 10723 MonitorDaemon-Reporter firehoses INFO Creating a connection to the HOSTMONITOR.
[30/Mar/2020 16:17:05 +0000] 10723 MonitorDaemon-Reporter throttling_logger ERROR Error sending messages to firehose: mgmt-HOSTMONITOR-ea8a1f6943cbb1b40cd5fd8bdb7e1e51
Traceback (most recent call last):
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/firehose.py", line 121, in _send
self._port)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 469, in __init__
self.conn.connect()
File "/usr/lib64/python2.7/httplib.py", line 824, in connect
self.timeout, self.source_address)
File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
raise err
error: [Errno 111] Connection refused

HOSTMONITOR LOG Errors:

 

 

2020-03-31 20:55:55,805 INFO com.cloudera.cmon.tstore.leveldb.LDBPartitionManager: Opening partition LDBPartitionMetadataWrapper{tableName=ts_subject, partitionName=ts_subject_2020-03-30T13:52:40.108Z, startTime=2020-03-30T13:52:40.108Z, endTime=null, version=9, state=CLOSED}
2020-03-31 20:55:55,981 WARN com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher: Failed to send messages to SMON.
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy23.writeStatusRecords(Unknown Source)
at com.cloudera.cmon.firehose.BasicFirehoseClient.writeStatusRecords(BasicFirehoseClient.java:75)
at com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher.processRecords(HMONToSMONHostSubjectRecordPublisher.java:107)
at com.cloudera.cmon.tstore.leveldb.LDBSubjectRecordStore.write(LDBSubjectRecordStore.java:399)
at com.cloudera.cmon.kaiser.HMONTestRunner.runHostTestsForSession(HMONTestRunner.java:86)
at com.cloudera.cmon.kaiser.HMONTestRunner.runTestsForSession(HMONTestRunner.java:66)
at com.cloudera.cmon.kaiser.BaseTestRunner.runTestsOnAllSubjects(BaseTestRunner.java:143)
at com.cloudera.cmon.kaiser.KaiserService$KaiserServiceRunner.run(KaiserService.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:104)
... 9 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1334)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1309)
at org.apache.avro.ipc.HttpTransceiver.writeBuffers(HttpTransceiver.java:77)
at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:58)
at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:72)
at org.apache.avro.ipc.Requestor.request(Requestor.java:147)
at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
... 9 more
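
Both traces above end in "Connection refused", which means a TCP connect was actively rejected at the address the client resolved. A minimal sketch for probing that from any host, assuming Python 3 is available there; the script name, the function names `probe` and `probe_all`, and the choice of host and ports are illustrative and not part of Cloudera's tooling. The ports to test would come from the `bind` lines of the HAProxy config further down in this post.

```python
import socket
import sys


def probe(host, port, timeout=3.0):
    """Return None if a TCP connection succeeds, else the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:  # covers refused, timed out, and DNS failures
        return str(exc)


def probe_all(host, ports, timeout=3.0):
    """Map each port to None (reachable) or an error string."""
    return {port: probe(host, port, timeout) for port in ports}


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: probe_ports.py HOST PORT [PORT ...]")
    else:
        host = sys.argv[1]
        results = probe_all(host, [int(p) for p in sys.argv[2:]])
        for port, err in sorted(results.items()):
            status = "open" if err is None else "FAILED: " + err
            print("%s:%d %s" % (host, port, status))
```

Running it once against the load-balanced name (MGMTserver in this post) and once against the Management Service host's own hostname should show whether the refusal happens at the proxy or at the host itself.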

The HAProxy host has two IPs (all the IPs used by the servers are reachable; no firewall of any kind is involved), and each IP has a DNS A record registered, one for CMSserver and one for MGMTserver.

 

 

On the management server, the file /etc/cloudera-scm-agent/config.ini has

server_host=CMSserver (the DNS name reserved on the proxy server for the cloudera manager)

listening_hostname=MGMTserver (the DNS name reserved on the proxy for the management service host)

On the management server, /etc/hosts also has an entry mapping its local IP to the MGMTserver name (the proxy host name), as per the documentation.

 

I ran an Inspect Host, and it reported the following error:

Mithun119_2-1585661972123.png

Additionally, the detailed logs can't be accessed from the UI. The error below is thrown:

Mithun119_3-1585662686769.png

I also tried dropping the amon database that was migrated from the old setup, suspecting data corruption. No luck.

Running the Python one-liner below on the Management Service host also returns the proxy server's DNS name, as expected, which is what is used in config.ini for listening_hostname.

python -c 'import socket; print socket.getfqdn(),socket.gethostbyname(socket.getfqdn())'
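
The one-liner above uses Python 2 syntax and raises an exception if the name does not resolve. A slightly fuller sketch for Python 3 that also does the reverse lookup, so a forward/reverse mismatch like the one in the DnsResolutionMonitor warning can be seen directly; the function name `resolution_report` and the dictionary keys are made up for illustration.

```python
import socket
import sys


def resolution_report(name=""):
    """Collect forward and reverse lookups for `name` (local host if empty)."""
    report = {
        "hostname": socket.gethostname(),   # short name of this machine
        "fqdn": socket.getfqdn(name),       # canonical name per the resolver
        "ip": None,
        "reverse": None,
    }
    try:
        report["ip"] = socket.gethostbyname(report["fqdn"])
    except OSError as exc:
        report["error"] = "forward lookup failed: %s" % exc
        return report
    try:
        # gethostbyaddr returns (primary_name, aliases, addresses)
        report["reverse"] = socket.gethostbyaddr(report["ip"])[0]
    except OSError as exc:
        report["error"] = "reverse lookup failed: %s" % exc
    return report


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else ""
    for key, value in resolution_report(target).items():
        print("%-8s %s" % (key, value))
```

With no argument it reports on the local host; passing a name such as the value of listening_hostname checks that name instead.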

 

The HAproxy config is below.

frontend cmf
bind *:7180
mode tcp
option tcplog
default_backend cmf

backend cmf
server cmfhttp1 cloudera-managerA.sg.com:7180 check
server cmfhttp2 cloudera-managerB.sg.com:7180 check

frontend cmfavro
bind *:7182
mode tcp
option tcplog
default_backend cmfavro

backend cmfavro
server cmfavro1 cloudera-managerA.sg.com:7182 check
server cmfavro2 cloudera-managerB.sg.com:7182 check

frontend mgmt1
bind *:5678
mode tcp
option tcplog

backend mgmt1
server mgmt1a management-serverA.sg.com check
server mgmt1b management-serverB.sg.com check

frontend mgmt2
bind *:7184
mode tcp
option tcplog

backend mgmt2
server mgmt2a management-serverA.sg.com check
server mgmt2b management-serverB.sg.com check

frontend mgmt3
bind *:7185
mode tcp
option tcplog

backend mgmt3
server mgmt3a management-serverA.sg.com check
server mgmt3b management-serverB.sg.com check

frontend mgmt4
bind *:7186
mode tcp
option tcplog

backend mgmt4
server mgmt4a management-serverA.sg.com check
server mgmt4b management-serverB.sg.com check

frontend mgmt5
bind *:7187
mode tcp
option tcplog

backend mgmt5
server mgmt5a management-serverA.sg.com check
server mgmt5b management-serverB.sg.com check

frontend mgmt6
bind *:8083
mode tcp
option tcplog

backend mgmt6
server mgmt6a management-serverA.sg.com check
server mgmt6b management-serverB.sg.com check

frontend mgmt7
bind *:8084
mode tcp
option tcplog

backend mgmt7
server mgmt7a management-serverA.sg.com check
server mgmt7b management-serverB.sg.com check

frontend mgmt8
bind *:8086
mode tcp
option tcplog

backend mgmt8
server mgmt8a management-serverA.sg.com check
server mgmt8b management-serverB.sg.com check

frontend mgmt9
bind *:8087
mode tcp
option tcplog

backend mgmt9
server mgmt9a management-serverA.sg.com check
server mgmt9b management-serverB.sg.com check

frontend mgmt10
bind *:8091
mode tcp
option tcplog

backend mgmt10
server mgmt10a management-serverA.sg.com check
server mgmt10b management-serverB.sg.com check

frontend mgmt-agent
bind *:9000
mode tcp
option tcplog

backend mgmt-agent
server mgmt-agenta management-serverA.sg.com check
server mgmt-agentb management-serverB.sg.com check

frontend mgmt11
bind *:9994
mode tcp
option tcplog

backend mgmt11
server mgmt11a management-serverA.sg.com check
server mgmt11b management-serverB.sg.com check

frontend mgmt12
bind *:9995
mode tcp
option tcplog

backend mgmt12
server mgmt12a management-serverA.sg.com check
server mgmt12b management-serverB.sg.com check

frontend mgmt13
bind *:9996
mode tcp
option tcplog

backend mgmt13
server mgmt13a management-serverA.sg.com check
server mgmt13b management-serverB.sg.com check

frontend mgmt14
bind *:9997
mode tcp
option tcplog

backend mgmt14
server mgmt14a management-serverA.sg.com check
server mgmt14b management-serverB.sg.com check

frontend mgmt15
bind *:9998
mode tcp
option tcplog

backend mgmt15
server mgmt15a management-serverA.sg.com check
server mgmt15b management-serverB.sg.com check

frontend mgmt16
bind *:9999
mode tcp
option tcplog

backend mgmt16
server mgmt16a management-serverA.sg.com check
server mgmt16b management-serverB.sg.com check

frontend mgmt17
bind *:10101
mode tcp
option tcplog

backend mgmt17
server mgmt17a management-serverA.sg.com check
server mgmt17b management-serverB.sg.com check

I have been struggling with this for 4 days; any help would be much appreciated.

 

1 ACCEPTED SOLUTION

I managed to fix this. The HAProxy config was faulty: the frontends for the management services were missing the default_backend directive. The issue is now resolved.
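
The kind of omission described here can be caught mechanically. The sketch below is a deliberately simple line scanner, not a full HAProxy parser: it assumes one directive per line, treats only a handful of section keywords as section starts, and does not account for frontends that route with `use_backend` instead. The function name and the embedded sample (two sections copied from the config in the question) are illustrative.

```python
# Keywords that start a new section in an HAProxy config (subset, assumed
# sufficient for configs shaped like the one in this thread).
SECTION_KEYWORDS = {"global", "defaults", "frontend", "backend", "listen"}


def frontends_missing_default_backend(config_text):
    """Return names of frontend sections that have no default_backend line."""
    missing = []
    current = None          # name of the frontend being scanned, if any
    has_default = False
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        words = line.split()
        if words[0] in SECTION_KEYWORDS:
            # A new section closes the previous frontend, if there was one.
            if current is not None and not has_default:
                missing.append(current)
            is_frontend = words[0] == "frontend" and len(words) > 1
            current = words[1] if is_frontend else None
            has_default = False
        elif current is not None and words[0] == "default_backend":
            has_default = True
    if current is not None and not has_default:
        missing.append(current)
    return missing


SAMPLE = """\
frontend cmf
bind *:7180
mode tcp
option tcplog
default_backend cmf

backend cmf
server cmfhttp1 cloudera-managerA.sg.com:7180 check

frontend mgmt1
bind *:5678
mode tcp
option tcplog

backend mgmt1
server mgmt1a management-serverA.sg.com check
"""

if __name__ == "__main__":
    print(frontends_missing_default_backend(SAMPLE))  # prints ['mgmt1']
```

For each name it reports, the fix would be a line such as `default_backend mgmt1` inside that frontend, mirroring what the cmf and cmfavro frontends already have.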

3 REPLIES

Hello Everyone,

Can someone please offer some advice on how to fix this?

 

Thanks,

Mithun.

On the MGMT server, I set listening_hostname to the server's own hostname instead of the LB name configured on the HAProxy server, and it works fine. I suspect this has to do with the HAProxy config; however, I have done exactly as the Cloudera documentation dictates. Not sure what is missing.
