Created on 05-28-2021 02:02 AM - edited on 04-21-2026 02:02 AM by GrazittiAPI
The file "cdswCluster.runs.jsonlines" in the output of "cdsw logs" contains session information for all projects. However, this data appears to be archived on a regular schedule, so it may retain only a fixed number of records, or only records within a fixed time window; I have not been able to confirm the exact retention policy.
Here is how to query this information directly with curl:
On the CDSW Master node:
# Assumes SERVICE_ACCOUNT_SECRET and CDSW_NAMESPACE are already set
# (see the excerpt from cdsw-dump-metrics.sh at the end of this article).
authHeader="Service-Authorization: Basic $(printf cdsw-metrics-service:$SERVICE_ACCOUNT_SECRET | base64 -w 0)"
domain=$(kubectl get secrets internal-secrets --namespace=${CDSW_NAMESPACE} -o jsonpath="{.data.domain}" | base64 -d)
# Set protocol to "http" or "https" to match your CDSW deployment.
protocol="${protocol:-http}"
apiUrl="${protocol}://${domain}/api/v1"
# List the available metrics datasets, then dump each one to a .jsonlines file.
datasets_json=$(curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/names")
if [ 0 -eq "$?" ]; then
  datasets=$(echo "$datasets_json" | python2 -c "import sys, json; print '\n'.join(json.load(sys.stdin))")
  for dataset_name in $datasets; do
    curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/values/${dataset_name}" > $(pwd)/${dataset_name}.jsonlines
  done
fi
# ls -lht
total 8.0K
-rw-r--r-- 1 root root 1.4K May 27 06:46 cdswCluster.runs.jsonlines
# cat cdswCluster.runs.jsonlines
{"id":"ho78kjgjdqrxzllx","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T14:28:47.368Z","schedulingAt":"2021-05-26T14:28:47.368Z","startingAt":"2021-05-26T14:29:15.020Z","stoppingAt":null,"runningAt":"2021-05-26T14:29:15.020Z","finishedAt":"2021-05-26T14:44:18.067Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"oxjrcug0ogbuk335","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T13:38:39.098Z","schedulingAt":"2021-05-26T13:38:39.098Z","startingAt":"2021-05-26T13:47:14.359Z","stoppingAt":null,"runningAt":"2021-05-26T13:47:14.359Z","finishedAt":"2021-05-26T14:02:18.068Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"0x2fm8tztifwh9cv","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T04:02:06.319Z","schedulingAt":"2021-05-26T04:02:06.319Z","startingAt":"2021-05-26T04:02:06.319Z","stoppingAt":"2021-05-26T04:04:18.687Z","runningAt":"2021-05-26T04:02:06.319Z","finishedAt":"2021-05-26T04:04:18.946Z","exitCode":-1,"killedByTimeout":false,"killedByUser":true}
The commands above are adapted from the "cdsw-dump-metrics.sh" script shipped with CDSW. You can adapt them to your own needs and run a simple script once a day, merging each day's output with the previous result and removing duplicates. Because the records are JSON, Python's json module makes it easy to deduplicate them (each record carries a unique "id"), so you can be sure that each month's session information is collected completely.
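As a sketch of that daily merge-and-deduplicate step (file names, paths, and the function name are illustrative, not part of CDSW), a small Python script might look like this. It keys records by their "id" field, so re-downloading the same runs each day is harmless:

```python
import json


def merge_jsonlines(existing_path, new_path, out_path):
    """Merge two .jsonlines files of run records, deduplicating by 'id'.

    Returns the number of unique records written to out_path.
    """
    records = {}
    for path in (existing_path, new_path):
        try:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    rec = json.loads(line)
                    # Later files win, so a freshly queried copy of a record
                    # replaces the archived one.
                    records[rec["id"]] = rec
        except FileNotFoundError:
            # First run: the accumulated archive may not exist yet.
            continue
    with open(out_path, "w") as f:
        for rec in records.values():
            f.write(json.dumps(rec) + "\n")
    return len(records)
```

Running this once per day against the newly dumped cdswCluster.runs.jsonlines and the accumulated archive keeps a complete, duplicate-free history even though CDSW itself only retains recent records.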
For reference, cdsw-dump-metrics.sh obtains SERVICE_ACCOUNT_SECRET as follows:
if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
  # Attempt to get it from kubectl. Generally, this only works on the
  # Kubernetes master node.
  SERVICE_ACCOUNT_SECRET=$(kubectl get secret internal-secrets --namespace=${CDSW_NAMESPACE} -o jsonpath="{.data['service\.account\.secret']}" | base64 -d)
fi
if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
  die_with_error 2 "Unable to get service account credentials. Provide SERVICE_ACCOUNT_SECRET or run on master node."
fi