I'm attempting to build a new CDH 5.6.0 cluster with Spark (1.5.0+cdh5.6.0+113) on Ubuntu 14.04 LTS.
Cloudera Manager installs fine and detects the hosts correctly; the only error appears when we get to the Role Assignment page. The exact error we're getting is:
2016-04-04 22:04:47,181 INFO 2078186713@scm-web-41:com.cloudera.server.web.common.JFrameException: Exception report generated accessing http://cloudera-mgr.domain.com:7180/cmf/clusters/3/express-add-services/update Exception executing consequence for rule "Compute hiveserver2_spark_executor_cores" in com.cloudera.cmf.rules: org.drools.RuntimeDroolsException: java.lang.ArithmeticException: / by zero
This seems to imply that the number of cores on the selected nodes is zero.
If I navigate to /cmf/hardware/hosts on CM, I can verify that all of the nodes are sending a proper heartbeat, and that they all have a non-zero number of cores listed. The exact hardware on the nodes contains two Intel(R) Xeon(R) CPU E5-2660 @ 2.20GHz, which shows up as Cores: 16 (32 w/ Hyperthreading).
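For anyone who would rather check the core counts in bulk than read them off the hosts page, here is a minimal sketch. It assumes the Cloudera Manager REST API exposes a host list (something like `/api/v11/hosts?view=full` on CM 5.6) whose entries carry `hostname`, `numCores` and `numPhysicalCores` fields; the API version, the field names, and the sample hostnames below are assumptions, so verify them against your own CM instance. The function only inspects an already-downloaded JSON payload.

```python
import json


def hosts_with_suspect_cores(hosts_payload):
    """Return hostnames whose core counts are missing or zero.

    `hosts_payload` is the parsed JSON of a host listing, assumed to look
    like {"items": [{"hostname": ..., "numCores": ..., ...}, ...]}.
    """
    suspects = []
    for host in hosts_payload.get("items", []):
        cores = host.get("numCores")
        physical = host.get("numPhysicalCores")
        # A missing field (None) and an explicit 0 are both suspicious.
        if not cores or not physical:
            suspects.append(host.get("hostname", "<unknown>"))
    return suspects


# Hypothetical payload for illustration only; not real output from CM.
sample = json.loads("""
{"items": [
  {"hostname": "node01.example.com", "numCores": 32, "numPhysicalCores": 16},
  {"hostname": "node02.example.com", "numCores": 32}
]}
""")

print(hosts_with_suspect_cores(sample))
```

In my case every host showed a non-zero core count, so a check like this would come back empty; it only rules out the obvious explanation.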
Has anyone run into this type of problem, or does anyone know how to get past it?
In case anyone else runs into this issue: the problem was isolated to a handful of Hosts in the new cluster that were selected to run the NodeManager (YARN) Role, which by default simply mirrors the DataNode assignment.
To get past this error message in Cloudera Manager, I had to deselect the offending Hosts from the NodeManager Role assignment only. It took a bit of trial and error, but eventually we discovered the Host the install wizard didn't like (even though it had a non-zero number of Cores). Omitting it and clicking Continue got us past the error message.
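The trial and error can be shortened by halving the candidate list each time instead of removing Hosts one by one. The sketch below is only an illustration of that idea, assuming a subset of Hosts fails exactly when it contains at least one offending Host. The `passes` callback is hypothetical: in practice it stands for you selecting that subset for NodeManager in the wizard and seeing whether Continue works, and here it is simulated with made-up hostnames.

```python
def find_bad_hosts(hosts, passes):
    """Return the hosts that make `passes` fail, by recursive halving.

    `passes(subset)` must return True when the wizard accepts `subset`.
    Assumes a subset fails if and only if it contains a bad host.
    """
    if not hosts or passes(hosts):
        return []                      # nothing offending in this group
    if len(hosts) == 1:
        return list(hosts)             # narrowed down to a single host
    mid = len(hosts) // 2
    return find_bad_hosts(hosts[:mid], passes) + find_bad_hosts(hosts[mid:], passes)


# Simulated example with hypothetical hostnames.
all_hosts = ["node%02d" % i for i in range(1, 9)]
bad = {"node03", "node07"}


def simulated_wizard(subset):
    # Stand-in for the manual check in the Role Assignment page.
    return not (set(subset) & bad)


print(find_bad_hosts(all_hosts, simulated_wizard))
```

With a single offending Host among n candidates this needs on the order of log2(n) wizard attempts rather than up to n.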
After the rest of the services were installed and activated, I could go back into the YARN instance and add the Hosts that triggered the error during the installation wizard as NodeManagers, and they started without issue. It's been a few hours, and I still have Green icons next to all the services, so I'm assuming everything is fine now.
I'm not sure why NodeManager would trigger a divide by zero for hiveserver2_spark_executor_cores, nor how it determined there were zero cores when CM reports a non-zero core count for all of the Hosts.