Created on 07-22-2019 07:44 AM - edited 09-16-2022 08:52 AM
Ran into an interesting scenario in Director that I have not solved yet but will (hopefully) today.
I created a cluster via the Python SDK, and when I attempt to repair a node in any instance group I get "A template must be specified". There is nothing in the logs, and this instance group contains only gateway roles for HDFS and YARN.
This is Director v6.1 and there are no "templates" in the templates tab since everything was programmatically generated.
Screenshot for reference:
Created 07-22-2019 08:13 AM
Created on 07-22-2019 08:40 AM - edited 07-22-2019 08:54 AM
There is nothing in the logs related to this at all, and Director seems to flag the instance group that consists solely of gateways when I try to repair any group.
Interestingly, you can click "view template" and it shows all the expected information above the error. Is there specific logging I can turn on for this in Director that wouldn't flood the logs?
Created 07-22-2019 10:36 AM
Upgraded to Director v6.2.1 and see the same behavior when repairing / shrinking / growing any of these clusters created in a similar fashion.
Turned on full debug at the root level in logback. I am wading through a ton of logs at the moment, hoping something stands out.
Created 07-22-2019 02:59 PM
I strongly suspect that Director accepted a ClusterTemplate with a VirtualInstanceGroup made up of VirtualInstances whose InstanceTemplates have names longer than 40 characters (41, to be exact, for the one group that is failing).
I discovered the following a moment ago while building a smaller PoC that creates only the InstanceTemplates, so that they would appear in Director and I could add them manually to see whether the issue was the same:
(The name must have a length of 2-40 characters, the first and last of which must be alphanumeric. The rest may include space, underscore, and hyphen.)"
I will trim the names to 40 characters or fewer and deploy a new cluster to see whether the issue persists.
However, I do believe it is a bug that Director can get into this state!
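For anyone who wants to catch this before submitting a ClusterTemplate, here is a minimal sketch of a client-side check based only on the rule quoted in the error message above. The function names are made up for illustration and are not part of the Director SDK, and treating "alphanumeric" as ASCII letters and digits is an assumption on my part.

```python
import re

# Rule quoted in the Director error message: 2-40 characters, first and last
# alphanumeric, the rest may include space, underscore, and hyphen.
# Assumption: "alphanumeric" means ASCII letters and digits only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9 _-]{0,38}[A-Za-z0-9]")


def is_valid_template_name(name: str) -> bool:
    """Return True if the name satisfies the quoted 2-40 character rule."""
    return NAME_PATTERN.fullmatch(name) is not None


def find_invalid_names(names):
    """Return the names that would fail the rule, preserving input order."""
    return [n for n in names if not is_valid_template_name(n)]
```

Running `find_invalid_names` over every InstanceTemplate name before building the ClusterTemplate would flag a 41-character name like the one in the failing group.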
Created 07-22-2019 03:39 PM
Yep that works fine! So I am explicitly creating the templates first so Director can validate that it likes them before creating the DeploymentTemplate or the ClusterTemplates.
The issue was that the names of the InstanceTemplates were longer than 40 characters, yet Director still accepted the ClusterTemplate that used those InstanceTemplates.
Cheers!
Created 07-22-2019 04:32 PM
Created on 07-24-2019 09:04 AM - edited 07-24-2019 09:04 AM
@Mike Wilson On that note I managed to find another issue with the validator.
If you create a VirtualInstanceGroup similar to the python example here: https://github.com/cloudera/director-sdk/blob/master/python-client-samples/cluster.py#L159
If you then change the name in the VirtualInstanceGroup to masters.1 (that is, add an invalid character such as a "."), the cluster is created successfully, but afterwards the user cannot repair nodes, grow, shrink, or clone the cluster in the UI.
The UI just grays out the Continue button and gives the user no feedback, either in the logs or in the UI.
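As a workaround until the validator catches this, names can be cleaned up before they are handed to the SDK. The sketch below assumes that VirtualInstanceGroup names follow the same rule as the InstanceTemplate names (2-40 characters, alphanumeric at both ends, with space, underscore, and hyphen allowed in between); I have not confirmed that. `sanitize_name` is a hypothetical helper, not an SDK function.

```python
import re

MAX_NAME_LENGTH = 40  # upper bound from the quoted InstanceTemplate rule


def sanitize_name(name: str) -> str:
    """Rewrite a name so it fits the assumed 2-40 character rule."""
    # Replace anything other than ASCII alphanumerics, space, underscore,
    # or hyphen with a hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9 _-]", "-", name)
    # Truncate first, then strip non-alphanumerics from both ends so the
    # result still starts and ends with a letter or digit.
    cleaned = cleaned[:MAX_NAME_LENGTH]
    cleaned = re.sub(r"^[^A-Za-z0-9]+|[^A-Za-z0-9]+$", "", cleaned)
    if len(cleaned) < 2:
        raise ValueError(f"cannot derive a valid name from {name!r}")
    return cleaned
```

With this, `sanitize_name("masters.1")` gives `masters-1`. Note that truncation can make two long names collide, so the results should still be checked for uniqueness.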
Cheers!