Created 07-11-2018 09:07 AM
Hello!
Not sure if this is the right place but...
We use Streamsets to load data into a series of databases within our HDFS cluster. However, each time the cluster is restarted, the pipelines all drop into "START_ERROR" state when Streamsets starts - I assume because it's trying to start multiple pipelines on a single Streamsets host at the same time.
Is there a way of getting Cloudera to run a script before it stops the Streamsets service? We already have the script, as we use it to stop the pipelines ahead of doing any batch processing on the data. Currently we run it manually (it is just a series of curl calls into the Streamsets API).
We are running CDH 5.9.0 with Cloudera Manager 5.9 currently.
Any advice would be gratefully received.
Thanks
Ben
Created 07-11-2018 11:10 AM
Based on the information provided, I believe the information you seek would best be provided by StreamSets as they build the parcel and CSD. When you restart, Cloudera Manager will signal the StreamSets service to restart, but the handling of that action is done at the CSD level which is created by the vendor (not Cloudera).
I think this is where they handle their questions:
https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user
Created 07-11-2018 11:22 PM
Created 07-12-2018 08:47 AM
Cloudera Manager's steps for starting servers are based on internal dependencies, so there isn't a configuration in Cloudera Manager that could be used to change it. I suspect that the StreamSets fix may have been around dependencies, in the CSD, but that's just a guess.
The only alternative I can see at this time is to bring the services up one at a time.
You could do that with the API and wrap it in a script.
The REST API is documented here: https://cloudera.github.io/cm_api/apidocs/v14/
For instance, you could write a shell script that executes a series of "curl" commands that start the services.
One complication is that you need to wait until each service is really running before starting the next one, but that can be accomplished by querying the service through the API and parsing out the status. If a service is "STARTED", start the next one:
"serviceState" : "STARTED",
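As a rough illustration of the start-and-wait loop described above, here is a minimal Python sketch using only the standard library. The host name, port, credentials, cluster name and service names are placeholders I made up, not values from this thread. It assumes the v14 REST endpoints for reading a service (GET on the service) and starting it (POST to commands/start), so check them against the API docs for your Cloudera Manager version before relying on it.

```python
import base64
import json
import time
import urllib.request
from urllib.parse import quote

# Placeholder values (assumptions, not from the thread): adjust to your setup.
CM_BASE = "http://cm-host.example.com:7180/api/v14"
CM_USER = "admin"
CM_PASSWORD = "admin"


def cm_request(path, method="GET"):
    """Call the Cloudera Manager REST API and return the decoded JSON body."""
    data = b"" if method == "POST" else None
    req = urllib.request.Request(CM_BASE + path, data=data, method=method)
    token = base64.b64encode(f"{CM_USER}:{CM_PASSWORD}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def start_in_order(cluster, services, request=cm_request,
                   poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Start each service in turn, waiting for STARTED before the next one."""
    for service in services:
        path = f"/clusters/{quote(cluster)}/services/{quote(service)}"
        # Ask Cloudera Manager to start this service.
        request(path + "/commands/start", method="POST")
        waited = 0
        # Poll the service and parse out "serviceState" until it is STARTED.
        while request(path)["serviceState"] != "STARTED":
            if waited >= timeout_seconds:
                raise TimeoutError(f"{service} did not reach STARTED")
            sleep(poll_seconds)
            waited += poll_seconds


# Example call (hypothetical cluster and service names):
# start_in_order("cluster", ["zookeeper", "hdfs", "streamsets"])
```

The request function is passed in as a parameter so the loop can be tried against a stub before pointing it at a real cluster. The same two calls translate directly to a pair of curl commands if you prefer to stay in shell.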
It might be something to play with if you are stuck in this state long term. API clients are available for Java and Python, and you can find usage examples here too:
https://cloudera.github.io/cm_api/