About Gopinath

Gopinath · ‎01-27-2025

I did some testing on my test workspace. Below are the steps I followed:

1) Suspended the workspace.
2) In the AWS console, under EC2 Auto Scaling groups, set the min and desired count for the liftie infra node group to 0.
3) After the update completed, checked the workspace details in the CDP management console: the Platform infra count had gone to 0.
4) In the EC2 Auto Scaling group, set the desired and min count for the liftie infra node group back to 2 and, once the update completed, validated the same in the workspace details.
5) Resumed the workspace and confirmed that a session launched successfully.

Observations: I was indeed able to scale all the AWS instances down to 0.

Suggestions: I would not suggest this on PROD or any critical environment [not sure what the effect could be on workspaces that run at production scale]. For non-prod, test it thoroughly and "implement at your own risk".

Alternatives: Have you considered backup and restore operations? When a workspace is not required, instead of suspending it and manually scaling down the nodes, you could back up the workspace and restore the backup when required. Since you are planning a complete shutdown, it presumably won't be a prod workspace. Backup/restore could be an alternative for a non-prod workspace, and it is also a cost-effective option.
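
For anyone who prefers to script the scale-down and scale-up steps rather than click through the console, here is a minimal sketch. It assumes a boto3 Auto Scaling client is passed in; the function name and the group name in the usage note are hypothetical, and the same "at your own risk" caveat applies.

```python
from typing import Any, Dict


def scale_node_group(client: Any, group_name: str, count: int) -> Dict[str, Any]:
    """Set both the minimum size and the desired capacity of one Auto Scaling group.

    `client` is expected to behave like boto3.client("autoscaling").
    """
    if count < 0:
        raise ValueError("count must be zero or positive")
    params = {
        "AutoScalingGroupName": group_name,
        "MinSize": count,
        "DesiredCapacity": count,
    }
    # update_auto_scaling_group is the boto3 call behind the console's "Edit" dialog.
    client.update_auto_scaling_group(**params)
    return params
```

Usage would look like `scale_node_group(boto3.client("autoscaling"), "<liftie-infra-group-name>", 0)` after suspending the workspace, and the same call with `2` before resuming it. The group name placeholder has to be looked up in your own account.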

Gopinath · ‎01-27-2025

In general, making changes directly in the cloud provider console is not recommended, as it can cause the cloud instances and the CDP control plane to go out of sync. In the case of CML, a liftie/Platform infra instance group with 2 m5.large nodes and an m5.2xlarge in the CAI Infra instance group keep running even when the workspace is in suspended status. Stopping those nodes manually is not tested and could lead to unexpected behavior.

Gopinath · 01-16-2025

"cml could not fetch the image metadata" indicates that the runtime manager could not fetch the image details. Check:
#1 whether the regcred secret is in place and valid [1]
#2 whether the network route to the registry is open
For "http:server gave response to HTTPS client": do the protocols [TLS/non-TLS] of the CML workspace and the registry match?
Checking the runtime manager pods may give more insight.
[1] https://docs.cloudera.com/machine-learning/cloud/runtimes/topics/ml-add-docker-registry-credentials-runtimes.html

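As a rough way to do check #1 on the regcred secret, the sketch below inspects a base64-encoded `.dockerconfigjson` value, which is how Kubernetes stores docker-registry secrets. The function name and the registry host in the usage note are hypothetical, and it only checks the shape of the secret, not whether the registry accepts the credentials.

```python
import base64
import json
from typing import List


def check_dockerconfig(encoded: str, registry: str) -> List[str]:
    """Return a list of problems found in a base64-encoded .dockerconfigjson value."""
    try:
        config = json.loads(base64.b64decode(encoded))
    except ValueError as exc:  # covers bad base64, bad UTF-8 and bad JSON
        return [f"payload is not valid base64-encoded JSON: {exc}"]
    auths = config.get("auths") if isinstance(config, dict) else None
    if not isinstance(auths, dict):
        return ["missing 'auths' object"]
    entry = auths.get(registry)
    if not isinstance(entry, dict):
        return [f"no entry for registry {registry!r}"]
    problems = []
    has_auth = bool(entry.get("auth"))
    has_user_pass = bool(entry.get("username")) and bool(entry.get("password"))
    if not (has_auth or has_user_pass):
        problems.append("entry has neither 'auth' nor 'username'/'password'")
    return problems
```

The encoded value can be read with `kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}'` in the workspace namespace, then passed in as `check_dockerconfig(value, "registry.example.com")`. An empty list means the secret at least has the expected structure.
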
Gopinath · ‎01-14-2025

@MID_ACN there is no straightforward way to do it.

Option #1: Forking [through the UI]
https://community.cloudera.com/t5/Internal/How-to-change-the-owner-of-the-project-in-CML/ta-p/360168

Option #2: Update the projects table [requires CLI access to the underlying pod]

Step 1: connect to the Postgres DB
kubectl exec -it $(kubectl get pods -l role=db -o jsonpath='{.items[*].metadata.name}') -- psql -P pager=off --expanded -U sense

Step 2: get the user IDs and the project ID
select id from users where username='<old_owner>';
select id from users where username='<new_owner>';
select id,creator_id,user_id from projects where name='<project_name>';

Step 3: update the projects table
select id,name,user_id,creator_id from projects where id='<project_id>'; -- validate before update
update projects set user_id='<new_owner_id>', creator_id='<new_owner_id>' where id='<project_id>'; -- update
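
The "validate before update" idea in Option #2 can also be enforced by running the update inside a transaction that is rolled back unless exactly one row changes. The sketch below shows the pattern with Python's built-in sqlite3 as a stand-in for the Postgres DB. The function name is hypothetical, and it assumes the project key column is `id`, as in the project lookup query above.

```python
import sqlite3


def change_owner(conn: sqlite3.Connection, project_id: int, new_owner_id: int) -> None:
    """Set user_id and creator_id for one project; roll back unless exactly one row matched."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE projects SET user_id = ?, creator_id = ? WHERE id = ?",
            (new_owner_id, new_owner_id, project_id),
        )
        if cur.rowcount != 1:
            raise LookupError(f"expected to update 1 row, matched {cur.rowcount}")


# Stand-in table with only the columns mentioned in the steps above.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, user_id INTEGER, creator_id INTEGER)")
demo.execute("INSERT INTO projects VALUES (7, 'demo-project', 1, 1)")
demo.commit()
change_owner(demo, project_id=7, new_owner_id=2)
```

In psql the equivalent is to wrap the update in `begin;` and `commit;` and issue `rollback;` instead if the reported row count is not 1.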

Gopinath · ‎01-03-2025

You may consider building a custom PBJ runtime image that meets your requirements.

Gopinath · ‎01-03-2025

1) When is V1 getting deprecated?
It already is: https://docs.cloudera.com/machine-learning/cloud/jobs-pipelines/topics/ml-rest-apis.html
-- The Jobs API is now deprecated. See Cloudera AI API v2 and API v2 usage for the successor API. --
Is it recommended to use the V1 API now? No.

2) Which API method is to be used in the Version 2 API call to start the job?
https://docs.cloudera.com/machine-learning/1.5.4/rest-api-reference/index.html#api-CMLService-createJobRun
The names createJob and createJobRun may be confusing; the link above points to createJobRun.
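
To make the createJobRun call concrete, here is a minimal sketch that only builds the HTTP request. It assumes the operation is exposed as `POST /api/v2/projects/{project_id}/jobs/{job_id}/runs` and authenticated with a bearer API key; please verify both against the linked reference for your version. The function name and all argument values are placeholders.

```python
import json
import urllib.request


def build_job_run_request(
    base_url: str, api_key: str, project_id: str, job_id: str
) -> urllib.request.Request:
    """Build (but do not send) a request that would start a run of an existing job."""
    # Assumed path for createJobRun; check the API v2 reference of your workspace.
    url = f"{base_url.rstrip('/')}/api/v2/projects/{project_id}/jobs/{job_id}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty body: run the job as configured
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(...)` would send it. Keeping the build step separate makes it easy to print and inspect the URL first.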

Gopinath · ‎11-12-2024

@paulfg An application can get stuck in "starting" status when the underlying script has not finished executing, or when the process flow does not consider it to have completed what it was supposed to. Please check/share the "Application logs" pane for any errors or concerning events.

Gopinath · ‎08-18-2024

PySpark 3.5.2 supports Python >= 3.8 and <= 3.11. Ref: https://pypi.org/project/pyspark/3.5.2/
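
A quick way to confirm that the interpreter in a session falls inside that range is sketched below. The function name is hypothetical, and the bounds are simply the ones quoted above.

```python
import sys
from typing import Tuple


def python_ok_for_pyspark_352(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) is within the 3.8 to 3.11 range quoted above."""
    return (3, 8) <= version <= (3, 11)
```

Calling `python_ok_for_pyspark_352()` with no argument checks the running interpreter.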

Gopinath · ‎07-12-2024

Exit code -1 could be due to any corner case at runtime. Also, the terminology being used [CDSW app/service/job] is confusing. What is the workload? [Is it a CDSW job or a CDSW application?]

Gopinath · ‎06-27-2024

@littlecong The files need to be uploaded to the individual project. As of now there is no documented provision to share contents between the projects.

Last Visited	‎03-13-2025 03:05 AM

Member Since	‎08-22-2018 07:50 PM
Posts	79
Kudos received	10

Cloudera Community

Re: Stop all AWS instances

Re: How to shared LLM Model files between CML Proj...

Re: CML Python Package Installation Security

Re: [CDSW] LDAP User Settings in Grafana

Re: Stop all AWS instances

Re: Stop all AWS instances

Re: cml runtime catalog

Re: CDSW change owner Projects

Re: Clean package environment in CML

Re: CML API call to start the AIML job

Re: CML Application Status

Re: Spark Python Supportability Matrix

Re: CDSW application service going down continuous...

Re: How to shared LLM Model files between CML Proj...