The file "cdswCluster.runs.jsonlines" in the output contents of "cdsw logs" contains session information of all projects. But there should be a regular archiving operation, that is, it may only keep a fixed number of records/ records within a fixed time range. I can't confirm this now.
Here is how to query this information directly with curl:
On the CDSW Master node:
authHeader="Service-Authorization: Basic $(printf cdsw-metrics-service:$SERVICE_ACCOUNT_SECRET | base64 -w 0)"
domain=$(kubectl get secrets internal-secrets --namespace=${CDSW_NAMESPACE} -o jsonpath="{.data.domain}" | base64 -d)
apiUrl="${protocol}://${domain}/api/v1"
datasets_json=$(curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/names")
if [ 0 -eq "$?" ]; then
datasets=$(echo "$datasets_json" | python2 -c "import sys, json; print '\n'.join(json.load(sys.stdin))")
for dataset_name in $datasets; do
curl -sSf -H "$authHeader" "${apiUrl}/metrics/datasets/values/${dataset_name}" >$(pwd)/${dataset_name}.jsonlines
done
fi
# ls -lht
total 8.0K
-rw-r--r-- 1 root root 1.4K May 27 06:46 cdswCluster.runs.jsonlines
# cat cdswCluster.runs.jsonlines
{"id":"ho78kjgjdqrxzllx","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T14:28:47.368Z","schedulingAt":"2021-05-26T14:28:47.368Z","startingAt":"2021-05-26T14:29:15.020Z","stoppingAt":null,"runningAt":"2021-05-26T14:29:15.020Z","finishedAt":"2021-05-26T14:44:18.067Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"oxjrcug0ogbuk335","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T13:38:39.098Z","schedulingAt":"2021-05-26T13:38:39.098Z","startingAt":"2021-05-26T13:47:14.359Z","stoppingAt":null,"runningAt":"2021-05-26T13:47:14.359Z","finishedAt":"2021-05-26T14:02:18.068Z","exitCode":34,"killedByTimeout":false,"killedByUser":false}
{"id":"0x2fm8tztifwh9cv","kernel":"scala","engineImageId":16,"cpu":1,"memory":2,"gpuCount":0,"creatorId":1,"creatorType":"user","isJob":false,"isShared":false,"createdAt":"2021-05-26T04:02:06.319Z","schedulingAt":"2021-05-26T04:02:06.319Z","startingAt":"2021-05-26T04:02:06.319Z","stoppingAt":"2021-05-26T04:04:18.687Z","runningAt":"2021-05-26T04:02:06.319Z","finishedAt":"2021-05-26T04:04:18.946Z","exitCode":-1,"killedByTimeout":false,"killedByUser":true}
# echo $authHeader
Service-Authorization: Basic Y2Rzdy1tZXRyaWNzLXNlcnZpY2U6NVJ4UlRaSFdVc2FENkhFZHowVGhiWGZRNnBZU1EzbjI2N2wxVlFLRw==
# echo $domain
host-10-17-102-138.coe.cloudera.com
# echo $apiUrl
http://host-10-17-102-138.coe.cloudera.com/api/v1
# echo $datasets_json
["cdswCluster.runs"]
# which python2
/usr/bin/python2
The commands above are extracted from the script "cdsw-dump-metrics.sh" that ships with CDSW.
Based on this, you can write a simple script to suit your own needs that queries once a day and merges each day's result with the previous one, removing duplicates. Because the data is JSON, deduplicating the records is easy with Python's json package (each record carries a unique "id"), so you can make sure that each month's session information is collected completely. A minimal merge sketch follows.
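Below is a minimal sketch of such a merge, assuming only what the sample records above show: each line is one JSON object with a unique "id" and a "createdAt" timestamp. The script name merge_runs.py and the archive file names are hypothetical.

#!/usr/bin/env python2
# merge_runs.py (hypothetical name): merge one or more jsonlines dumps into a
# deduplicated stream on stdout. Files are processed in argument order, and
# later files win, so a run that was still in progress in yesterday's dump is
# replaced by its finished record from today's dump.
import json
import sys

merged = {}
for path in sys.argv[1:]:  # e.g. the archive first, then today's dump
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            merged[record["id"]] = record  # same id: keep the newer record

# Emit the merged records in chronological order.
for record in sorted(merged.values(), key=lambda r: r["createdAt"]):
    sys.stdout.write(json.dumps(record) + "\n")

A daily cron job could then run the dump, followed by something like:
python2 merge_runs.py runs-archive.jsonlines cdswCluster.runs.jsonlines > runs-archive.tmp && mv runs-archive.tmp runs-archive.jsonlines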
Update (08-19-2021): adding the origin of SERVICE_ACCOUNT_SECRET, also from "cdsw-dump-metrics.sh":
if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
    # Attempt to get it from kubectl. Generally, this only works from the
    # Kubernetes master node.
    SERVICE_ACCOUNT_SECRET=$(kubectl get secret internal-secrets --namespace=${CDSW_NAMESPACE} -o jsonpath="{.data['service\.account\.secret']}" | base64 -d)
fi
if [ -z "$SERVICE_ACCOUNT_SECRET" ]; then
    die_with_error 2 "Unable to get service account credentials. Provide SERVICE_ACCOUNT_SECRET or run on master node."
fi