06-14-2017 02:57 PM
Hi, we have a Hadoop cluster in AWS managed by Cloudera Manager. Our cluster consists of about 50 "static" nodes running the services below:
* FailOver Controller
* Thrift Server
* JobHistory Server
We recently started using Spot nodes to deploy more YARN NodeManager nodes and scale in or out based on running job demand and capacity. When Spot instances (YARN NodeManager) are terminated or new nodes are added to the cluster, we see a weird issue where the ZooKeeper service goes down, affecting the whole cluster. Basically the whole cluster becomes unhealthy and Cloudera Manager indicates that the ZooKeeper Server ID is unset. We have checked the ZooKeeper logs and they do not contain any errors or indicate any malfunction; the problem seems to be related to Cloudera Manager unsetting the ZooKeeper IDs.
When this happens, the step we take to fix it is to set the ZooKeeper ID manually in Cloudera Manager for each node. The ID config boxes are empty and we just set the ID to 1, 2, 3 according to the node number.
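The manual fix above could in principle be scripted against the Cloudera Manager REST API instead of the UI. The following is only a minimal sketch, not something we have run in production: the host, cluster name, service name and role names are hypothetical placeholders, API version `v12` is assumed for CM 5.7, and `serverId` is assumed to be the API name of the ZooKeeper Server ID property. It only builds the requests; actually sending them would also require authentication.

```python
import json
import urllib.request
from urllib.parse import quote

# Hypothetical CM endpoint; "v12" is assumed to match CM 5.7.
API_BASE = "http://cm-host.example.com:7180/api/v12"


def build_server_id_request(cluster, service, role_name, server_id, api_base=API_BASE):
    """Build (but do not send) a PUT that sets one role's ZooKeeper Server ID."""
    url = "{}/clusters/{}/services/{}/roles/{}/config".format(
        api_base,
        quote(cluster, safe=""),
        quote(service, safe=""),
        quote(role_name, safe=""),
    )
    # "serverId" is assumed to be the config key behind the "ZooKeeper Server ID" box.
    payload = {"items": [{"name": "serverId", "value": str(server_id)}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical role names, listed in node order so the IDs come out as 1, 2, 3.
ZK_ROLES = ["zookeeper-SERVER-node1", "zookeeper-SERVER-node2", "zookeeper-SERVER-node3"]

requests_to_send = [
    build_server_id_request("Cluster 1", "zookeeper", role, server_id)
    for server_id, role in enumerate(ZK_ROLES, start=1)
]
```

Each prepared request could then be sent with `urllib.request.urlopen` after adding credentials. The IDs must match what each ZooKeeper server already has on disk, so the role order in the list matters.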
Another thing we noticed is that the Cloudera Manager server spikes in CPU utilization when the Spot nodes are being added or removed.
Version: Cloudera Express 5.7.2 (#17 built by jenkins on 20160722-1347 git: 1ac5976e8ad8f16506c2db236aee83141915c44d)
Java VM Name: Java HotSpot(TM) 64-Bit Server VM
Java VM Vendor: Oracle Corporation
Java Version: 1.8.0_101
Current parcel version we have deployed in the cluster is -> 5.7.2-1.cdh5.7.2.p0.18 .
This is causing serious issues in our production cluster and we are considering moving to EMR. If anyone has had a similar experience or could point out where to look in order to troubleshoot and fix this, I would be very grateful.
Thanks in advance.
06-15-2017 01:37 PM
06-16-2017 02:43 PM
Thanks for the answer mbigelow.
>> If it is CM, you should be able to see the changes made and potentially who is making them by viewing the configuration history. It is possible that you have found a bug
This is definitely not a change made by a user account. I have enabled audit collection but I'm not sure where to look. Can you give some directions?
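One place I could imagine checking is the audit endpoint of the Cloudera Manager REST API, restricted to the time window of an incident. This is only a sketch under assumptions: the host is a placeholder, `v12` is assumed for CM 5.7, and the filter `service==zookeeper` assumes the service is actually named `zookeeper` in our cluster. It only builds the URL and does not perform the request.

```python
from urllib.parse import urlencode

# Hypothetical CM endpoint; "v12" is assumed to match CM 5.7.
API_BASE = "http://cm-host.example.com:7180/api/v12"


def build_audit_url(start_time, end_time, query="service==zookeeper",
                    max_results=100, api_base=API_BASE):
    """Build a GET URL for audit events between two ISO 8601 timestamps."""
    params = {
        "startTime": start_time,    # e.g. shortly before the Spot nodes were removed
        "endTime": end_time,        # e.g. shortly after the Server IDs went missing
        "query": query,             # assumed filter; adjust to the real service name
        "maxResults": max_results,
    }
    return "{}/audits?{}".format(api_base, urlencode(params))


audit_url = build_audit_url("2017-06-14T14:00:00.000Z", "2017-06-14T15:00:00.000Z")
```

Fetching `audit_url` with valid credentials should return a JSON list of audit events for that window, which could then be compared against the times the Spot nodes joined or left.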
>> If you need to scale CM you could look at offloading the CMS services (specifically amon and rmon as they use an external DB) to another host.
I couldn't find much information about offloading CMS services (and I believe you meant *rman* instead of rmon), but our Cloudera Manager is running on an EC2 instance and the DB is hosted separately in an RDS instance (PostgreSQL 9.4.7 / db.m4.large), if that's what you mean. If that's not the case, can you provide some links?
06-16-2017 04:01 PM