Member since: 06-05-2019
Posts: 126
Kudos Received: 133
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1327 | 12-17-2016 08:30 PM
 | 955 | 08-08-2016 07:20 PM
 | 1956 | 08-08-2016 03:13 PM
 | 1938 | 08-04-2016 02:49 PM
 | 1768 | 08-03-2016 06:29 PM
12-01-2023
12:44 PM
Shaun goes into great depth on the capabilities of CML and DataViz in this blog, using a Python model/script. Here, I'll write the script in R instead of the sample Python script Shaun shares.
While an R script with full predictive-model capabilities is an option, one isn't required: I'll demonstrate deploying a simple R script in CML that adds two columns together.
Step 1: Write your R script
As explained above, this script adds two columns together (assuming both columns are of type integer); within DataViz, we'll pass those two integer columns. The cml_model wrapper is necessary with the PBJ runtime, as explained here.
Notice the "json" variable defines what we'll return to DataViz, which expects the "numbers_added" column
inputs[1] and inputs[2] represent the two integer columns being passed

library(jsonlite)  # assumption: jsonlite supplies fromJSON() and is available in the PBJ R runtime
library(cml)

add_numbers <- cml_model(function(args) {
  # Skeleton of the response DataViz expects: two columns, no rows yet
  json <-
    '{
      "data": {
        "colnames": [
          "numbers_added","tester"
        ],
        "coltypes": [
          "INT","STRING"
        ],
        "rows": [
        ]
      }
    }
    '
  mydf <- fromJSON(json)
  # Loop over the incoming rows; each row carries the two integers to add
  for(row in 1:nrow(args$data$rows)) {
    inputs <- args$data$rows[row,]
    together <- matrix(list(inputs[1]+inputs[2]))
    mydf$data$rows <- rbind(mydf$data$rows, together)
  }
  # Force the row values to integer so they match the declared INT coltype
  mode(mydf$data$rows) <- "integer"
  return(mydf)
})
Step 2: Deploy your R script as a Model in CML
Notice PBJ as the runtime
Step 3: Testing your deployed R script/model via CML and/or Postman
Below is the JSON input and output:
input:
{
"data": {
"colnames": [
"week",
"depttime"
],
"rows": [
[
1,
7
],
[
2,
8
],
[
11,
55
]
]
}
}
output:
{
"data": {
"colnames": [
"numbers_added",
"tester"
],
"coltypes": [
"INT",
"STRING"
],
"rows": [
[
8
],
[
10
],
[
66
]
]
}
}
*The second column "tester" is included because DataViz requires more than one column to be returned
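If you'd prefer to test from Python rather than Postman, a minimal sketch like this should work (assuming the standard CML model-service envelope of "accessKey" plus "request"; the URL and access key below are the placeholders used later in this article):

import requests

# Placeholders -- substitute your own model endpoint and access key
MODEL_URL = "https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model"
ACCESS_KEY = "mxyouraccesskey724mext3y8"

payload = {
    "accessKey": ACCESS_KEY,
    "request": {
        "data": {
            "colnames": ["week", "depttime"],
            "rows": [[1, 7], [2, 8], [11, 55]]
        }
    }
}

# POST the JSON input from above; the response should contain
# numbers_added values 8, 10, and 66
r = requests.post(MODEL_URL, json=payload)
print(r.json())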
Step 4: Build a Dataset in DataViz using your deployed model
Using my deployed model's URL and access key, I'll plug them into the built-in function within DataViz, passing two columns: https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model
{"accessKey":"mxyouraccesskey724mext3y8"}
Within DataViz, I'll upload my CSV file called "people_numbers.csv":
name,number1,number2
ryan,1,2
nicole,3,4
rupert,5,6
nigel,7,8
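For this file, the deployed model should return numbers_added values of 3 (1+2), 7 (3+4), 11 (5+6), and 15 (7+8), which we'll see in the dashboard later.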
Creating a new table based on the CSV
Dataset is created from the people_numbers table:
Step 5: Modify the Dataset with a new column called "numbers_added", which will call the CML R model
Clone one of the measure columns (number1 or number2)
Edit the "Copy of number#" column
Change the column name to "numbers_added"
Go to the expression and enter the following then click "APPLY": cviz_rest('{"url":"https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model","accessKey":"withyouraccesskey","colnames":["number1","number2"],"response_colname":"numbers_added"}')
Validate that "numbers_added" is defined as a Measure (Mes) of type #, then click "SAVE"
Step 6: Add a dashboard with your dataset
At the end of step 5, click the "NEW DASHBOARD" button
The new column "numbers_added" shows the results from the CML R model deployed in step 2.
We're able to create visuals based on the dimensions and measures we choose
That's it! Feel free to import your own R script into CML as a model, and build charts/graphs within DataViz!
11-01-2023
12:04 PM
It's all about simplicity and cohesion. Of the multiple services within CDP, today I'll focus on finding actively running queries in CDW (Impala) from CML. With the three steps below, you'll be able to see the progress of actively running queries:
Step 1: Find the coordinator URL within CDW (Impala)
Within CDW, go to your Virtual Warehouse and select "Edit":
Within the Virtual Warehouse, go to the "WEB UI" page:
Copy the Coordinator Web UI address. In my example, I'll remove the "https://" prefix, leaving "coordinator-web-default-impala.dw-go01-demo-aws.ylcu-atmi.cloudera.site"
Step 2: Since we're connecting to CDW (Impala) from within CML, I'll set my project's Environment Variables to include my username/password
Setting the WORKLOAD_PASSWORD to my workload password
Setting the variable WORKLOAD_USER to my username
Step 3: Within my CML Notebook (in my case PBJ), I'll copy the following code, replacing the coordinator with the coordinator from Step 1:
import os
import requests
import pandas as pd
from tabulate import tabulate

# Impala coordinator(s) from Step 1, with the https:// prefix removed
ic = ['coordinator-web-default-impala.dw-go01-demo-aws.ylcu-atmi.cloudera.site']

for c in ic:
    # The /queries?json endpoint returns the coordinator's query list,
    # authenticated with the workload credentials set in Step 2
    r = requests.get('https://{}/queries?json'.format(c),
                     auth=(os.environ["WORKLOAD_USER"], os.environ["WORKLOAD_PASSWORD"]))
    running_queries = r.json()['in_flight_queries']
    if len(running_queries) > 0:
        df = pd.DataFrame(running_queries)
        print(tabulate(df[['progress','query_id','stmt','executing','start_time']],
                       headers='keys', tablefmt='psql'))
I can add more columns if necessary such as 'stmt_type','resource_pool','state','default_db','effective_user'.
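For example (a sketch; the exact fields available depend on your Impala version), the tabulate call could become:

print(tabulate(df[['progress','query_id','stmt','stmt_type','resource_pool',
                   'state','default_db','effective_user','executing','start_time']],
               headers='keys', tablefmt='psql'))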
It's just that easy!
07-27-2023
01:34 PM
Description
Repo Name: cml_vscode_with_python310
Repo Info: VS Code using Python 3.10, instead of the existing VS Code example that uses Python 3.7.
Repo URL: https://github.com/ryancicak/cml_vscode_with_python310
07-27-2023
10:57 AM
I recently received a question: "We'd like to have Python and R installed in the docker image for a package. Is it possible to have such a customized docker image with Python >= 3.7, R >= 4.0, and CUDA == 11.7?"

I'd like to go through how I accomplished this using pbj-workbench-r4.1-standard.Dockerfile. This runtime has the essentials (Python and R) except for CUDA (NVIDIA GPU), so we'll use pbj-workbench-r4.1-standard.Dockerfile as a base image. It's important to note that CML requires a runtime kernel that is either R or Python; it can't be both. I'll go through three different iterations of this image:

Iteration 1 - Since this image already has R and Python installed, we'll change the runtime NOT to use R
ryancicak/pbj_r-and-python:v9 -> Python 3.10 (GitHub Tag: python/3.10/standard_dockerhub_v9)
https://github.com/ryancicak/pbj_r-and-python/blob/python/3.10/standard_dockerhub_v9/Dockerfile

Iteration 2 - Go back to using R (as we did initially, before removing R as the runtime kernel in Iteration 1)
ryancicak/pbj_r-and-python:v10 -> R 4.1 (GitHub Tag: r/4.1/standard_dockerhub_v10)
https://github.com/ryancicak/pbj_r-and-python/blob/r/4.1/standard_dockerhub_v10/Dockerfile

Iteration 3 - Searching the nvidia/cuda Docker Hub tags, I found version 11.7.1, similar to what https://github.com/cloudera/ml-runtimes/blob/public-runtimes/pbj-workbench-python3.10-cuda.Dockerfile uses. I changed:

# Copyright 2022 Cloudera. All Rights Reserved.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
RUN apt-key del 7fa2af80 && apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

to

# Copyright 2022 Cloudera. All Rights Reserved.
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04
RUN apt-key del 7fa2af80 && apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

*Notice "-cudnn8-devel-ubuntu20.04" is identical, as is the "RUN" line that follows
*Validating CUDA 11.7 is running

If we compare the changes between iterations 1 and 2, the difference is ever so slight when comparing tags on the Dockerfile. Again, we're only changing the runtime kernel, since the image has both R and Python installed. Feel free to try these different iterations by adding the Docker Hub tags as new Runtimes within CML:

Iteration 1 - ryancicak/pbj_r-and-python:v9
Iteration 2 - ryancicak/pbj_r-and-python:v10
Iteration 3 - ryancicak/pbj_r-and-python:v11

While the versioning is different in CML, they're all basically the same docker image, as asked in the original question above. Shout-out to @amarinovszki for his collaboration! Thanks Árpád!
05-31-2023
07:29 AM
Cloudera making hard stuff easy again! Great article Ryan!!
05-09-2023
02:05 PM
In a recent release, CDW is now fully integrated with RAZ. This means that once you provision an environment in CDP with RAZ enabled, adding an S3 bucket into CDW follows the same principles as the other CDP services (CML, CDE, COD, CDF, and DataHub).
After initially receiving a 403 Forbidden error within Hue, I was able to add a new S3 bucket called "ryancicak" in two steps:
Step 1 - Since I used the -datalake-admin-role for RAZ, I modified the "datalake-admin-s3-policy" to include my new bucket "ryancicak" with these two lines:
"arn:aws:s3:::ryancicak", "arn:aws:s3:::ryancicak/*"
*Notice I also changed my cicak-cdp-stack entry to cover the root of the bucket rather than a subdirectory: instead of "cicak-cdp-stack/my-data", I use "cicak-cdp-stack". It's important to make this modification as well.
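For reference, the Resource section of that policy ends up looking roughly like this (a sketch of just the Resource array, not the full policy document):

"Resource": [
    "arn:aws:s3:::cicak-cdp-stack",
    "arn:aws:s3:::cicak-cdp-stack/*",
    "arn:aws:s3:::ryancicak",
    "arn:aws:s3:::ryancicak/*"
]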
Step 2 - I added a new policy in Ranger under cm_s3 for my bucket "ryancicak", for the users rcicak (me), and Hive.
DONE - When re-running the CREATE TABLE statement using the bucket "ryancicak", no more errors! The table "rupert" is now reading/writing to the "ryancicak" bucket. Since I created my Ranger (RAZ) policy in step 2, I'm able to access this bucket from the other CDP services (CML, CDE, COD, CDF, and DataHub).
Note: I used the default "-datalake-admin-role" created by the CloudFormation script in the AWS Quickstart, where I named the stack "cicak-cdp-stack".
That was EASY:
03-14-2023
02:56 PM
1 Kudo
Special shout-out to @zoram, who provided these instructions. This article will take you through the steps (executing each step with a screenshot). It covers an edge case that involves processing large amounts of non-tabular data, such as images. The data lake is the preferred storage for CML, as object storage can scale to billions of objects with tabular data. If you qualify for this edge case (non-tabular datasets), please delete the dataset from EFS/NFS as soon as possible after processing, since large amounts of data on EFS/NFS negatively affect backup and recovery times.

When it comes to non-tabular data, you may not want to use object storage with standard readers such as Boto3 (for AWS), which add latency to each read. If we move to faster I/O such as EFS, which is already used by your CML workspace, we can load the data into your existing CML workspace's EFS. It's a simple process that we'll take you through:

Step 1 - When dealing with a large number of files, switch your CML workspace to provisioned throughput mode from the AWS EFS console. Increase this to what you're willing to pay for (I/O) and keep it on provisioned throughput until you're done processing the files in EFS.

Step 2 - Log in to the CML UI and launch a session in the project where you'd like to process files within your workspace's EFS.

Step 3 - Within your session, launch "Terminal Access".

Step 4 - Within your terminal session, run the following command and write down the output, as you'll need it later. You can stop your session afterwards (it's only needed for this step).

df -h | grep home

Step 5 - Using kubectl, you'll determine which node is running ds-vfs. In order to access kubectl on your laptop, you'll need to download the Kubeconfig for the workspace you're accessing. Run the following command:

kubectl get pods -n mlx -o wide | grep ds-vfs

(In case you need to download the Kubeconfig on the workspace):

Step 6 - If you do not have a security policy in place to SSH into the node you found in step 5, you'll want to add one. After you're done with these steps, feel free to remove this security policy (for SSH into the ds-vfs node).

Step 7 - SSH into the node from step 5.

Step 8 - Determine the EFS mount point, and then sudo su (so you're root moving forward):

cat /proc/mounts | grep projects-share | grep -v grafana
sudo su

My output:

127.0.0.1:/ /var/lib/kubelet/pods/eb4b15b6-631a-428b-92df-e0d31074e7f9/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20081,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/d3c5b66c-c84a-40cc-bed9-79fb95f3a7ad/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20246,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/1ff3a121-e7fb-4475-974f-92446f65a773/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20316,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/96980148-9630-4f5e-90b6-05970eacdf1f/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20439,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0

Step 9 - Go to the projects file system (you'll pull the last piece, e.g. projects/0/1, from your output in step 4). In my case:

cd /var/lib/kubelet/pods/eb4b15b6-631a-428b-92df-e0d31074e7f9/volumes/kubernetes.io~csi/projects-share/mount/projects/0/1

Step 10 - Create a new folder that you'll use to load your files:

mkdir images

Step 11 - Validate the directory permissions are set correctly. Files and directories must be owned by 8536:8536.

ls -l | grep images

output:

drwxr-sr-x 2 8536 8536 6144 Mar 13 21:01 images

IMPORTANT: If you're loading MANY files, it's beyond important to add this new directory to the .gitignore, so these files don't get committed to git!

Step 12 - Copy your files from S3 into the new directory from step 10. You'll likely want to figure out S3 authn/authz for the copy. It's a best practice to script this copy and run it with nohup as a background command, so the copy doesn't get terminated if the SSH session times out; see the sketch below.

Step 13 - Your files are now available in the new directory. IMPORTANT: It's general best practice not to load all your files into one directory. Instead, create multiple subdirectories and load, say, 10k files per subdirectory. Within your code (say Python), you can iterate over the subdirectories and process the files. This way you aren't overwhelming your code by loading all files at once!

Warning: Do the cost estimate of EFS vs. S3 before going into sticker shock on using higher-cost storage within AWS. Again, after you're done processing your data within EFS/NFS, please remember to delete it, since large amounts of data on EFS/NFS negatively affect backup and recovery times.
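For steps 12 and 13, here's a rough sketch of what that scripted copy might look like (the bucket and prefix names are hypothetical, and it assumes AWS credentials are available on the node), batching roughly 10k files per subdirectory:

import os
import boto3

# Hypothetical names -- replace with your bucket/prefix and the EFS path from steps 9-10
BUCKET = "my-images-bucket"
PREFIX = "raw-images/"
TARGET = "/var/lib/kubelet/pods/<your-pod-id>/volumes/kubernetes.io~csi/projects-share/mount/projects/0/1/images"
BATCH = 10000  # files per subdirectory, per step 13

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        fname = os.path.basename(obj["Key"])
        if not fname:  # skip "directory" placeholder keys
            continue
        # Start a new subdirectory every BATCH files
        subdir = os.path.join(TARGET, "batch_%04d" % (count // BATCH))
        os.makedirs(subdir, exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], os.path.join(subdir, fname))
        count += 1

print("copied %d files" % count)

Run it with nohup (e.g. nohup python copy_images.py &) so it survives SSH timeouts, and remember to chown the copied files to 8536:8536 afterwards since you're running as root (see step 11).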
03-01-2023
04:14 AM
Nice and Quick! Excellent!
02-03-2023
12:10 PM
Throughout my seven years working with Cloudera/Hortonworks, I'm always learning new things. One thing I've learned from the Cloudera/Hortonworks merger was how amazing CDSW/CML is as a product. CML isn't only for Data Scientists; it's for anyone who needs an IDE (Integrated Development Environment). Coming from a development background working with Eclipse and IntelliJ, you become dependent on a solid IDE. In the past, I've used IDEs to develop applications, and then a build process would eventually deploy to a run environment. This is where CML shines: you're able to run your applications within CDP at scale, with enormous amounts of data. Anyone using CML is over the moon running their applications/projects within CML because of the simplicity, as I'll demonstrate below.

The tagline of CML is "BYOL" (Bring Your Own Libraries), meaning ALL libraries are welcome (including those outside the Cloudera ecosystem). This differentiates Cloudera from others such as native Azure, which is HIGHLY dependent on all things Microsoft (unless the application owner created something native-Azure-specific). I'll demonstrate how easy it is to deploy a third-party framework such as Django ("The web framework for perfectionists with deadlines"), where the installation/run feels like you're on your local laptop rather than a highly scalable IDE that runs anywhere. Remember that CML runs ANYWHERE: within cloud providers such as Azure, AWS, or GCP, and on-premises. Cloudera abstracts out the complexities. Using CML, we'll go from installing Django to running Django in a matter of minutes.

Step 1: Find the read-only URL to run your embedded application (Django)

import os
url=os.environ["CDSW_ENGINE_ID"]+"."+os.environ["CDSW_DOMAIN"]
print("http://read-only-%s"%url)

Step 2: Install Django

!pip install django

Step 3: Create a Django project as instructed here

!django-admin startproject mysite
cd mysite

Step 4: Modify the settings.py file, adding the 'read-only-%s' value from step 1 and localhost

ALLOWED_HOSTS = ['localhost','read-only-yourhostnamefromstep1']

Step 5: Run Django

!python manage.py runserver localhost:$CDSW_READONLY_PORT

That's IT! If you'd like to access your Django page, navigate to the URL in step 1! Nothing specialized for CML; as we say, BYOL!
10-04-2021
09:39 PM
Hello @RyanCicak, I'm trying this flow but it doesn't work for me. This is my flow. What should I do? Thanks