The file "cdswCluster.runs.jsonlines" in the output of "cdsw logs" contains session information for all projects. However, this data is probably archived or pruned on a regular basis, so the file may keep only a fixed number of records or only the records from a fixed time range. I have not been able to confirm this.

I have worked out how to query this information with curl:

On the CDSW Master node:

 

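# Build the service-account auth header (SERVICE_ACCOUNT_SECRET must already be set).
authHeader="Service-Authorization: Basic $(printf '%s' "cdsw-metrics-service:${SERVICE_ACCOUNT_SECRET}" | base64 -w 0)"
# Read the CDSW domain from the Kubernetes secret.
domain=$(kubectl get secrets internal-secrets --namespace="${CDSW_NAMESPACE}" -o jsonpath="{.data.domain}" | base64 -d)
# protocol is not set in this excerpt; the sample output below uses http (set it to https if TLS is enabled).
protocol="${protocol:-http}"
apiUrl="${protocol}://${domain}/api/v1"
# List the available metrics datasets, then dump each one to <name>.jsonlines in the current directory.
if datasets_json=$(curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/names"); then
  datasets=$(echo "$datasets_json" | python2 -c "import sys, json; print '\n'.join(json.load(sys.stdin))")
  for dataset_name in $datasets; do
    curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/values/${dataset_name}" >"$(pwd)/${dataset_name}.jsonlines"
  done
fi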

 

# ls -lht
total 8.0K
-rw-r--r-- 1 root root 1.4K May 27 06:46 cdswCluster.runs.jsonlines

# cat cdswCluster.runs.jsonlines
{"id":"ho78kjgjdqrxzllx","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T14:28:47.368Z","schedulingAt":"2021-05-26T14:28:47.368Z","startingAt":"2021-05-26T14:29:15.020Z","stoppingAt":null,"runningAt":"2021-05-26T14:29:15.020Z","finishedAt":"2021-05-26T14:44:18.067Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"oxjrcug0ogbuk335","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T13:38:39.098Z","schedulingAt":"2021-05-26T13:38:39.098Z","startingAt":"2021-05-26T13:47:14.359Z","stoppingAt":null,"runningAt":"2021-05-26T13:47:14.359Z","finishedAt":"2021-05-26T14:02:18.068Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"0x2fm8tztifwh9cv","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T04:02:06.319Z","schedulingAt":"2021-05-26T04:02:06.319Z","startingAt":"2021-05-26T04:02:06.319Z","stoppingAt":"2021-05-26T04:04:18.687Z","runningAt":"2021-05-26T04:02:06.319Z","finishedAt":"2021-05-26T04:04:18.946Z","exitCode":-1,"killedByTimeout":false,"killedByUser":true}
# echo $authHeader
Service-Authorization: Basic Y2Rzdy1tZXRyaWNzLXNlcnZpY2U6NVJ4UlRaSFdVc2FENkhFZHowVGhiWGZRNnBZU1EzbjI2N2wxVlFLRw==
# echo $domain
host-10-17-102-138.coe.cloudera.com
# echo $apiUrl
http://host-10-17-102-138.coe.cloudera.com/api/v1
# echo $datasets_json
["cdswCluster.runs"]
# which python2
/usr/bin/python2

The commands above are extracted from the script "cdsw-dump-metrics.sh" used by CDSW.
You can adapt them to your own needs and write a simple script that runs the query once a day and merges each result with the previous ones, removing duplicates. Because every record is a JSON object, deduplicating with Python's json module should be straightforward, and this way the information for each month is collected completely.
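
The daily merge described above could be sketched roughly as follows. This is only an illustration under some assumptions: that the "id" field uniquely identifies a session, that every record has a "createdAt" timestamp (as in the sample output), and that the archive file name is free to choose. The helper names load_records and merge_runs are hypothetical and not part of CDSW.

```python
import json
from pathlib import Path


def load_records(path):
    """Read a JSON Lines file into a dict keyed by each record's "id".

    Assumption: "id" uniquely identifies a session. A missing file is
    treated as an empty archive.
    """
    records = {}
    p = Path(path)
    if not p.exists():
        return records
    with p.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            records[record["id"]] = record
    return records


def merge_runs(archive_path, latest_path):
    """Merge the latest dump into the archive, keeping one record per id.

    Returns the number of records in the archive after the merge.
    """
    merged = load_records(archive_path)
    # Records from the latest dump win, so a session that was still running
    # in an earlier dump is replaced by its finished version.
    merged.update(load_records(latest_path))
    # ISO 8601 UTC timestamps sort correctly as plain strings.
    ordered = sorted(merged.values(), key=lambda r: r["createdAt"])
    with Path(archive_path).open("w", encoding="utf-8") as fh:
        for record in ordered:
            fh.write(json.dumps(record, separators=(",", ":")) + "\n")
    return len(ordered)
```

After each daily dump, a call such as merge_runs("cdswCluster.runs.archive.jsonlines", "cdswCluster.runs.jsonlines") would fold the new records into the archive; the archive file name here is just an example.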

Comments

Adding where SERVICE_ACCOUNT_SECRET comes from:

 

if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
  # Attempt to get from kubectl.  Generally, this only works from the
  # kubernetes master node.
  SERVICE_ACCOUNT_SECRET=$(kubectl get secret internal-secrets --namespace=${CDSW_NAMESPACE} -o jsonpath="{.data['service\.account\.secret']}" | base64 -d)
fi
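# Note: die_with_error is not defined in this excerpt; it is presumably a helper from the original script.
if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
  die_with_error 2 "Unable to get service account credentials.  Provide SERVICE_ACCOUNT_SECRET or run on master node."
fi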