Created 03-27-2017 12:20 PM
Hi all! I'm rebuilding my sandbox cluster to use an external MySQL database, and I think I'm following Cloudera's step-by-step instructions. I created the various databases, including one for the management service itself, on my MySQL server outside the cluster. Then I wiped out all the server and agent software in my cluster so I could do a fresh yum install (I'm using CentOS). On the manager server, I reinstalled the cloudera-manager-server software, and then used the scm_prepare_database.sh script to set up the connection to the external db; I got "Success".
Then I fired up cloudera-scm-server, waited for it to come fully online, logged into the web UI, and was prompted to go through the usual steps. It successfully installed the agents on all 4 of my CentOS nodes, but when it tried to distribute the parcels, it complained that all the hosts had bad health. I clicked out to the main desktop page to look at the hosts, and sure enough, all of them had "unknown" health. I assumed that's because there's no management service set up yet, so I went to do that, but it refused to test the connection to the database because my manager server's health is bad:
Unable to test database connection for host not in good state.
When I check the server log, I get more or less the same message:
2017-03-27 15:14:04,466 INFO 266434700@scm-web-16:com.cloudera.cmf.model.DbCommand: Command null(RepMgrTestDatabaseConnection) has completed. finalstate:FINISHED, success:false, msg:Unable to test database connection for host not in good state.
I don't think it's the database connection because I can connect from my manager server using the mysql command line client with my user and password. Seems like it won't build the management service because it can't test the database; it can't test the database because the manager server has "bad" health; all the hosts have "unknown" health because there's no management service tracking them. Argh! Any idea how I break this circle? I don't think I screwed up the initial SCM database set up because when I connect to the database I see a bunch of tables in that db that must've been created by Cloudera, since I didn't do it.
Other details:
These are all VMware VMs, running CentOS 6.8. I'm attempting to install CMS 5.10 and CDH 5.10.
Created 03-28-2017 10:14 AM
Hello,
Thanks for reaching out to the community.
After reading the scenario, I think it is quite possible that some leftover files on the sandbox were not cleaned up before the reinstall. That could prevent the agent nodes from heartbeating to the CM server successfully. By default, each agent sends a heartbeat to the CM server to report its health. Right now, this is broken on your cluster.
You are not in a "circle", since the Cloudera Management Service has nothing to do with this issue. Once you resolve the problem of the agents not being able to talk to the CM server, you should be able to install the Cloudera Management Service easily.
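One quick way to rule out a plain network problem is to check, from an agent host, whether the CM server's heartbeat port accepts TCP connections. The sketch below is only an illustration: the port 7182 is the usual agent-to-server default and is an assumption here (the thread does not mention it), and the hostname in the usage comment is hypothetical. A successful connection does not prove that heartbeats are accepted, only that the port is reachable.

```python
import socket


def can_reach(host, port=7182, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: 7182 is the CM server's agent heartbeat port (the common
    default); adjust it if your agent config.ini says otherwise.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage from an agent host:
#   print(can_reach("cm-server.example.com"))
```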
The question right now is how to resolve the agent issue. To find out what went wrong, we need to look at:
1) The CM agent log, which is located on each agent host. By default, the path is /var/log/cloudera-scm-agent/cloudera-scm-agent.log
2) The CM server log, which is located on the CM host. By default, the path is /var/log/cloudera-scm-server/cloudera-scm-server.log
Please post the above information so we can look into this issue together.
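If the logs are large, a small script can pull out just the lines worth posting. This is a minimal sketch, assuming the default log paths given above and assuming that lines containing the word ERROR or FATAL are the interesting ones; the function names are made up for illustration.

```python
import re
from pathlib import Path

# Default log locations quoted in this post.
AGENT_LOG = Path("/var/log/cloudera-scm-agent/cloudera-scm-agent.log")
SERVER_LOG = Path("/var/log/cloudera-scm-server/cloudera-scm-server.log")

# Assumption: a line matters if it contains ERROR or FATAL as a whole word.
LEVEL_RE = re.compile(r"\b(ERROR|FATAL)\b")


def interesting_lines(lines, limit=20):
    """Return the last `limit` lines that mention ERROR or FATAL."""
    hits = [line.rstrip("\n") for line in lines if LEVEL_RE.search(line)]
    return hits[-limit:]


def scan(path, limit=20):
    """Scan one log file; return an empty list if the file does not exist."""
    if not path.exists():
        return []
    with path.open(errors="replace") as fh:
        return interesting_lines(fh, limit)


if __name__ == "__main__":
    for log in (AGENT_LOG, SERVER_LOG):
        print(f"== {log} ==")
        for line in scan(log):
            print(line)
```

Run it on the agent host for the agent log and on the CM host for the server log; reading files under /var/log may require root.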
Thanks,
Li
Cloudera support
Li Wang, Technical Solution Manager
Created 03-28-2017 01:06 PM
Problem solved! You pointed me in the right direction. A check of the agent log showed this error:
[28/Mar/2017 11:28:09 +0000] 7731 MainThread agent ERROR Error, CM server guid updated, expected 240da00c-05c4-4053-b8a1-5ba957dfab5f, received 46d4b8a7-c2ac-4eae-8ce6-758d94046a26
When I googled it, it said I should wipe out /var/lib/cloudera-scm-agent/cm_guid. Did that, and now things seem to be working fine. Thanks!
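For anyone who hits the same thing, here is a rough sketch of the check and the fix in Python. It assumes the log message format and the cm_guid path shown above. Renaming the file to a .bak copy instead of deleting it is my own choice, not something the fix requires, and the function names are made up. I also assume the agent has to be restarted afterward to pick up the change.

```python
import re
import shutil
from pathlib import Path

# Path of the stale GUID file mentioned above.
CM_GUID = Path("/var/lib/cloudera-scm-agent/cm_guid")

# Matches the "CM server guid updated" message from the agent log.
GUID_RE = re.compile(
    r"CM server guid updated, expected (?P<expected>[0-9a-f-]{36}), "
    r"received (?P<received>[0-9a-f-]{36})"
)


def find_guid_mismatch(line):
    """Return (expected, received) if the line reports a GUID mismatch, else None."""
    m = GUID_RE.search(line)
    return (m.group("expected"), m.group("received")) if m else None


def set_aside(path=CM_GUID):
    """Rename the GUID file to <name>.bak (assumption: keep a backup rather than delete).

    Returns the backup path, or None if there was nothing to move.
    """
    if not path.exists():
        return None
    backup = path.with_name(path.name + ".bak")
    shutil.move(str(path), str(backup))
    return backup
```

Call find_guid_mismatch on each agent log line; if any line matches, run set_aside() as root on that host.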
Created 03-28-2017 01:34 PM
You are very welcome! Glad to hear the issue got resolved.
Cheers,
Li
Li Wang, Technical Solution Manager