The file "cdswCluster.runs.jsonlines" in the output contents of "cdsw logs" contains session information of all projects. But there should be a regular archiving operation, that is, it may only keep a fixed number of records/ records within a fixed time range. I can't confirm this now.

I have worked out how to query this information directly with the curl command:

On the CDSW Master node:

# CDSW_NAMESPACE, protocol, and SERVICE_ACCOUNT_SECRET are set earlier in
# "cdsw-dump-metrics.sh" (the Comments section below shows where
# SERVICE_ACCOUNT_SECRET comes from); protocol is http, or https when TLS
# is enabled for CDSW.
authHeader="Service-Authorization: Basic $(printf cdsw-metrics-service:$SERVICE_ACCOUNT_SECRET | base64 -w 0)"
domain=$(kubectl get secrets internal-secrets --namespace=${CDSW_NAMESPACE} -o jsonpath="{.data.domain}" | base64 -d)
apiUrl="${protocol}://${domain}/api/v1"
# List the names of the available metrics datasets...
datasets_json=$(curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/names")
if [ 0 -eq "$?" ]; then
  # ...then dump each dataset to a .jsonlines file in the current directory.
  datasets=$(echo "$datasets_json" | python2 -c "import sys, json; print '\n'.join(json.load(sys.stdin))")
  for dataset_name in $datasets; do
    curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/values/${dataset_name}" >$(pwd)/${dataset_name}.jsonlines
  done
fi

# ls -lht
total 8.0K
-rw-r--r-- 1 root root 1.4K May 27 06:46 cdswCluster.runs.jsonlines

# cat cdswCluster.runs.jsonlines
{"id":"ho78kjgjdqrxzllx","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T14:28:47.368Z","schedulingAt":"2021-05-26T14:28:47.368Z","startingAt":"2021-05-26T14:29:15.020Z","stoppingAt":null,"runningAt":"2021-05-26T14:29:15.020Z","finishedAt":"2021-05-26T14:44:18.067Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"oxjrcug0ogbuk335","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T13:38:39.098Z","schedulingAt":"2021-05-26T13:38:39.098Z","startingAt":"2021-05-26T13:47:14.359Z","stoppingAt":null,"runningAt":"2021-05-26T13:47:14.359Z","finishedAt":"2021-05-26T14:02:18.068Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"0x2fm8tztifwh9cv","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T04:02:06.319Z","schedulingAt":"2021-05-26T04:02:06.319Z","startingAt":"2021-05-26T04:02:06.319Z","stoppingAt":"2021-05-26T04:04:18.687Z","runningAt":"2021-05-26T04:02:06.319Z","finishedAt":"2021-05-26T04:04:18.946Z","exitCode":-1,"killedByTimeout":false,"killedByUser":true}
# echo $authHeader
Service-Authorization: Basic Y2Rzdy1tZXRyaWNzLXNlcnZpY2U6NVJ4UlRaSFdVc2FENkhFZHowVGhiWGZRNnBZU1EzbjI2N2wxVlFLRw==
# echo $domain
host-10-17-102-138.coe.cloudera.com
# echo $apiUrl
http://host-10-17-102-138.coe.cloudera.com/api/v1
# echo $datasets_json
["cdswCluster.runs"]
# which python2
/usr/bin/python2

The above is extracted from the "cdsw-dump-metrics.sh" script used by CDSW.
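
If you would rather make the same query from Python, here is a minimal sketch using the requests package. It is an illustration only: the environment variable names SERVICE_ACCOUNT_SECRET and DOMAIN are my own choices and are assumed to be exported beforehand.

#!/usr/bin/env python
# Minimal sketch: the same metrics query in Python. Assumes the
# "requests" package is installed and that SERVICE_ACCOUNT_SECRET and
# DOMAIN are exported in the environment (both names are illustrative).
import base64
import os

import requests

secret = os.environ["SERVICE_ACCOUNT_SECRET"]
domain = os.environ["DOMAIN"]

token = base64.b64encode(("cdsw-metrics-service:" + secret).encode()).decode()
headers = {"Service-Authorization": "Basic " + token}
api_url = "http://" + domain + "/api/v1"  # use https if TLS is enabled

# List the dataset names, then dump each one to a .jsonlines file.
resp = requests.get(api_url + "/metrics/datasets/names", headers=headers)
resp.raise_for_status()
for name in resp.json():
    data = requests.get(api_url + "/metrics/datasets/values/" + name, headers=headers)
    data.raise_for_status()
    with open(name + ".jsonlines", "w") as f:
        f.write(data.text)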
You can adapt this to your own needs with a simple script that runs the query once a day and merges each result with the previous one, removing duplicates. Because each line is a self-contained JSON record, deduplication is easy with Python's json module, and you can make sure the session information for every month is collected completely; a sketch of the merge follows.
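
Here is one way to do that merge, deduplicating on the "id" field of each run record. The file names are illustrative; any existing archive file works.

#!/usr/bin/env python
# Sketch of the daily merge: fold today's dump into an accumulated
# archive, deduplicating on the "id" field. File names are illustrative.
import json

merged = {}
for path in ["archive.jsonlines", "cdswCluster.runs.jsonlines"]:
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    record = json.loads(line)
                    merged[record["id"]] = record  # today's dump wins on conflict
    except IOError:
        pass  # first run: the archive does not exist yet

with open("archive.jsonlines", "w") as f:
    for record in merged.values():
        f.write(json.dumps(record) + "\n")

Scheduled once a day (with cron, for example), this keeps a complete archive even if the metrics service itself only retains recent runs.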

Comments

Adding the origin of SERVICE_ACCOUNT_SECRET, also taken from "cdsw-dump-metrics.sh":

if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
  # Attempt to get from kubectl.  Generally, this only works from the
  # kubernetes master node.
  SERVICE_ACCOUNT_SECRET=$(kubectl get secret internal-secrets --namespace=${CDSW_NAMESPACE} -o jsonpath="{.data['service\.account\.secret']}" | base64 -d)
fi
if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
  # die_with_error is a helper function defined elsewhere in the script.
  die_with_error 2 "Unable to get service account credentials.  Provide SERVICE_ACCOUNT_SECRET or run on master node."
fi