08-12-2024
01:37 PM
In this article, I'll walk you through a Flink application I developed to process real-time data and write the output to HDFS within the Cloudera Data Platform (CDP). But first, let's discuss why Flink is a powerful choice for streaming analytics and how Cloudera's platform can help you make the most of it.

Why Flink?

Flink excels in scenarios where low-latency processing and real-time analytics are critical. Compared to Spark, Flink often shines in streaming use cases due to its advanced event-time processing and lower latency. However, Spark remains a robust choice for batch processing and when integrating with existing Spark-based pipelines. This flexibility is where Cloudera stands out as the obvious choice for streaming analytics: Cloudera supports both Flink and Spark, giving you the power to choose the right tool for your specific use case. Beyond just tooling, Cloudera's hybrid platform also allows you to reduce your cloud bill by running applications on-premises, while maintaining the flexibility to run those same applications in the cloud. This makes Cloudera an ideal choice for developers who need a platform that adapts to both on-premises and cloud environments seamlessly.

Application Overview

Now, let's dive into the Flink application itself, designed for real-time data processing with three key stages:

1. Reading Data from a Socket: The application starts by connecting to a socket on localhost:10010, continuously streaming in text data line by line. In Flink, this represents the "source" stage of the application. Since the data is read from a single socket connection, the parallelism for this stage is inherently set to 1. This means that while you can configure parallelism when running your application, it won't impact the source stage, because reading from the socket is done only once.

2. Processing Data Using Time Windows: Once the data is ingested, it moves to the "transformation" stage. Here, the application splits the data into individual words, counts each one, and aggregates these counts over a 5-second time window. This stage takes full advantage of Flink's parallel processing capabilities, allowing you to scale the transformations by configuring parallelism as needed.

3. Writing Output to HDFS: Finally, the "target" stage writes the processed results to HDFS. One of the major benefits of running this application within Cloudera CDP is that Flink is integrated via the Cloudera Streaming Analytics (CSA) service. This integration means you don't need to worry about configuring connections to HDFS, even with Kerberos enabled out of the box. CDP handles all these configurations for you, making it easier to securely write data to HDFS without additional setup.

How to Run This Application in Cloudera CDP

Running this Flink application in Cloudera CDP is straightforward. Here's how you do it:

1. Set Up Your Maven Project: Ensure your Maven project is configured correctly. Use the pom.xml provided earlier to manage dependencies and build the application. The code can be found in GitHub.

2. Build the Application: Use Maven to build your application into a single JAR:

mvn clean package

It's important to note that in your pom.xml, the dependencies are marked as provided. This is crucial because Cloudera CDP already loads these dependencies out of the box. By marking them as provided, you ensure that they are not included in the JAR, avoiding potential conflicts or unnecessary duplication.
3. Upload the JAR to Cloudera CDP: Upload the generated JAR file to your HDFS or S3 storage in Cloudera CDP. Make sure to note the path where you upload the JAR.

4. Run the Flink Application: Execute the following command to run your Flink application on YARN in Cloudera CDP:

flink run-application -t yarn-application -p 1 -ynm PortToHDFSFlinkApp PortToHDFSFlinkApp-1.0-SNAPSHOT.jar

Here's a breakdown of the command:

-t yarn-application: Specifies that the application should run as a YARN application.
-p 1: Sets the parallelism to 1, ensuring that the source stage runs with a single parallel instance. This is critical since the socket connection is inherently single-threaded.
-ynm PortToHDFSFlinkApp: Names the application, making it easier to identify in the YARN resource manager.
-s hdfs:///path/to/savepoints/savepoint-xxxx: (Optional) Specifies the path to the savepoint from which the job should resume.

5. Interact with the Application: Once the application is launched within CDP, you can access the Flink UI to find the node where the source is running on port 10010. After identifying the correct node, log in to that node and interact with the application using the following command:

nc -l 10010

This command starts a listener on port 10010, allowing you to type words directly into the terminal. Each word you type, followed by pressing enter/return, is processed by the Flink application in real time. (A small Python stand-in for nc is sketched at the end of this article.) This is a simple yet powerful way to test the application's functionality and observe how data flows from the source, through the transformation stage, and finally to the HDFS target.

6. Monitor the Job: While the job is running, you can monitor its progress through the Flink dashboard available in Cloudera CDP. This dashboard provides valuable insights into the job's performance, including task execution details and resource usage.

Conclusion

By leveraging Cloudera CDP's integration of Flink through Cloudera Streaming Analytics, you can easily deploy and manage complex streaming applications without worrying about the underlying configurations, like connecting to HDFS in a Kerberized environment. This PaaS setup simplifies deployment, allowing you to focus on processing and analyzing your data efficiently. With Cloudera's support for both Flink and Spark, you get the best of both worlds in streaming analytics. Whether you're leveraging Flink for real-time data processing or Spark for batch jobs, Cloudera guides you to the right tools for your needs and ensures you can implement them with ease.
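As referenced in step 5, here is a minimal sketch (my own test helper, not part of the article's Maven project) of a tiny Python stand-in for nc -l 10010: it listens on port 10010 and streams a few newline-delimited test words to the Flink socket source once it connects. The test words and the one-second pacing are arbitrary choices for illustration.

import socket
import time

# Listen on the same port the Flink socket source connects to (see step 5).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 10010))
    server.listen(1)
    conn, _ = server.accept()  # the Flink source connects here
    with conn:
        for word in ["hello", "flink", "hello", "cloudera"]:
            conn.sendall((word + "\n").encode("utf-8"))  # newline-delimited, like typing into nc
            time.sleep(1)  # spread the words across the 5-second windows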
06-25-2024
08:53 AM
In this article, we will guide you through detailed, step-by-step instructions on how administrators can create a custom runtime image for notebooks in Cloudera Machine Learning (CML), complete with custom extensions for VsCode. We'll also provide a fully functional runtime image as an example, which you can integrate seamlessly into your CML environment.

Advantages: The following steps enable administrators to tailor a VsCode notebook by incorporating all the necessary extensions for end-users.

Disadvantages: End-users are unable to permanently install extensions directly within a session; this must be done through the custom runtime image. Consequently, while users can add extensions during an active session, these will be removed once the session ends. However, the extensions included in the runtime image will persist.

Custom docker runtime image: rcicakcloudera/vscodeextensions:latest

Everything done below is emulated in the custom image above.

Step 1: Write the Dockerfile

Below, you'll notice the installation of an extension named "redhat.vscode-debug-adapter-apache-camel." To install additional extensions, simply continue appending them to the list.

FROM docker.repository.cloudera.com/cloudera/cdsw/ml-runtime-jupyterlab-python3.9-standard:2022.04.1-b6
RUN apt update && apt upgrade -y && apt clean && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL https://code-server.dev/install.sh | sh -s -- --version 4.2.0
RUN printf "#!/bin/bash\n/usr/bin/code-server --auth=none --extensions-dir=/usr/bin/custom_extensions --bind-addr=127.0.0.1:8090 --disable-telemetry" > /usr/local/bin/vscode
RUN chmod +x /usr/local/bin/vscode
RUN rm -f /usr/local/bin/ml-runtime-editor
RUN ln -s /usr/local/bin/vscode /usr/local/bin/ml-runtime-editor
USER cdsw
RUN mkdir /usr/bin/custom_extensions
RUN code-server --extensions-dir /usr/bin/custom_extensions --install-extension redhat.vscode-debug-adapter-apache-camel
RUN code-server --list-extensions
# Override Runtime label and environment variables metadata
ENV ML_RUNTIME_EDITOR="VsCode" \
ML_RUNTIME_EDITION="v4.2.0" \
ML_RUNTIME_SHORT_VERSION="1.0" \
ML_RUNTIME_MAINTENANCE_VERSION="1" \
ML_RUNTIME_FULL_VERSION="1.0.1" \
ML_RUNTIME_DESCRIPTION="This runtime includes VsCode editor"
LABEL com.cloudera.ml.runtime.editor=$ML_RUNTIME_EDITOR \
com.cloudera.ml.runtime.edition=$ML_RUNTIME_EDITION \
com.cloudera.ml.runtime.full-version=$ML_RUNTIME_FULL_VERSION \
com.cloudera.ml.runtime.short-version=$ML_RUNTIME_SHORT_VERSION \
com.cloudera.ml.runtime.maintenance-version=$ML_RUNTIME_MAINTENANCE_VERSION \
com.cloudera.ml.runtime.description=$ML_RUNTIME_DESCRIPTION

Step 2: Build the Dockerfile, Tag, and then Push

docker build -t d .
docker tag 0812eb88e2aa rcicakcloudera/vscodeextensions:latest
docker push rcicakcloudera/vscodeextensions:latest

Step 3: Specify the Docker image within a CML Runtime

Step 4: Use the VsCode Runtime

As mentioned previously, only the CML administrator creating this runtime can permanently add extensions. End-users, on the other hand, can only add new extensions temporarily during an active session. If you want to enable end-users to also have the capability to permanently install extensions, consider the following option. Be mindful of these caveats when creating a custom runtime that allows both administrators and end-users to install extensions permanently. Shout-out to @aakulov (Oleksandr Akulov) for coming up with this!

a) The initial launch of your custom runtime in a project will require additional time as the extensions are installed in real time.

b) A race condition may occur if two users simultaneously initiate their first session using your custom runtime in the same project.

RUN printf "#!/bin/bash\n/usr/bin/code-server --auth=none --bind-addr=127.0.0.1:8090 --disable-telemetry" > /usr/local/bin/vscode
RUN printf "code-server --install-extension redhat.vscode-debug-adapter-apache-camel && /usr/local/bin/vscode" > /usr/local/bin/vscodemod
RUN chmod +x /usr/local/bin/vscode
RUN chmod +x /usr/local/bin/vscodemod
RUN rm -f /usr/local/bin/ml-runtime-editor
RUN ln -s /usr/local/bin/vscodemod /usr/local/bin/ml-runtime-editor

You'll group all the extensions together in a single location prior to launching the VsCode notebook. The key to this setup is the symlink between vscode and ml-runtime-editor. In our approach, we insert the extension installation before vscode is initiated. This is achieved through a custom file named vscodemod, which handles the installation of the extensions and subsequently triggers the launch of vscode.

The purpose of this article is to address the challenges associated with the non-persistent filesystem in the runtime. Any changes made during a session are lost once it ends, as everything reverts to the runtime's original state upon launching a new session. Additionally, the /home/cdsw directory, which is mounted to EFS, cannot be pre-populated from within the runtime image; as a result, any files added there within the Docker runtime will be erased when a session begins. To circumvent this, administrators have two options: they can permanently add extensions by relocating the extension directory to a specified area within the runtime (outside of the /home/cdsw directory), or they can opt for real-time installation of extensions directly into the /home/cdsw directory, which remains persistent thanks to its EFS mounting.

Big shout-outs to @pauldefusco (Paul de Fusco) and @amarinovszki (Arpad Marinovszki) for all your help hashing this solution out!
12-01-2023
12:44 PM
Shaun goes into great depth on all the capabilities of CML and DataViz in this blog, using a Python model or script. Here, I'll write the script in R instead of the sample Python script Shaun shares.
An R script with predictive model capabilities isn't required; instead, I'll demonstrate deploying a simple R script in CML that adds two columns together.
Step 1: Write your R script
As explained above, this script adds two columns together (assuming both columns are of type integer). Within DataViz, we'll pass two columns (of type integer). The cml_model wrapper is necessary with the PBJ runtime, as explained here.
Notice that the "json" object is what we'll return to DataViz, which expects the "numbers_added" column.
input[1] and input[2] represent the two integer columns being passed.

library(cml)
library(jsonlite)  # provides fromJSON()
add_numbers <- cml_model(function(args) {
json <-
'{
"data": {
"colnames": [
"numbers_added","tester"
],
"coltypes": [
"INT","STRING"
],
"rows": [
]
}
}
'
mydf <- fromJSON(json)
for (row in 1:nrow(args$data$rows)) {
    inputs <- args$data$rows[row, ]
    together <- matrix(list(inputs[1] + inputs[2]))
    mydf$data$rows <- rbind(mydf$data$rows, together)
}
mode(mydf$data$rows) <- "integer"
return(mydf)
})
Step 2: Deploy your R script as a Model in CML
Notice PBJ as the runtime
Step 3: Testing your deployed R script/model via CML and/or Postman
Below is the JSON input and output:

input:

{
"data": {
"colnames": [
"week",
"depttime"
],
"rows": [
[
1,
7
],
[
2,
8
],
[
11,
55
]
]
}
}
output:
{
"data": {
"colnames": [
"numbers_added",
"tester"
],
"coltypes": [
"INT",
"STRING"
],
"rows": [
[
8
],
[
10
],
[
66
]
]
}
}
*The second column "tester" is required because DataViz expects more than one column to be returned.
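For reference, here is a minimal Python sketch of calling the deployed model yourself, as in Step 3 (assumptions: the model URL and access key below are the placeholders from this article, and the endpoint accepts a JSON body of the form {"accessKey": ..., "request": ...} with the input shown above).

import requests

MODEL_URL = "https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model"  # placeholder from this article
ACCESS_KEY = "mxyouraccesskey724mext3y8"  # placeholder access key

# Assumed payload shape: the Step 3 input wrapped in a "request" field alongside the access key.
payload = {
    "accessKey": ACCESS_KEY,
    "request": {
        "data": {
            "colnames": ["week", "depttime"],
            "rows": [[1, 7], [2, 8], [11, 55]],
        }
    },
}

r = requests.post(MODEL_URL, json=payload)
r.raise_for_status()
print(r.json())  # the model's reply wraps the "numbers_added"/"tester" structure shown above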
Step 4: Build a Dataset in DataViz using your deployed model
Using my deployed model's URL and key, I'll plug this into the built-in function within DataViz, passing two columns: https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model
{"accessKey":"mxyouraccesskey724mext3y8"}
Within DataViz, I'll upload my CSV file called "people_numbers.csv":

name,number1,number2
ryan,1,2
nicole,3,4
rupert,5,6
nigel,7,8
Creating a new table based on the CSV
Dataset is created from the people_numbers table:
Step 5: Modify the Dataset with a new column called "numbers_added", which will call the CML R model
Clone one of the measure columns (number1 or number2)
Edit the "Copy of number#" column
Change the column name to "numbers_added"
Go to the expression, enter the following, and then click "APPLY":

cviz_rest('{"url":"https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model","accessKey":"withyouraccesskey","colnames":["number1","number2"],"response_colname":"numbers_added"}')
Validate that numbers_added is defined as a Measure (Mes) with a numeric (#) type, then click "SAVE"
Step 6: Add a dashboard with your dataset
At the end of step 5, click the "NEW DASHBOARD" button
The new column "numbers_added" shows the results from the CML R model called in step 5.
We're able to create visuals based on the dimensions and measures we choose
That's it! Feel free to import your own R script into CML as a model, and build charts/graphs within DataViz!
11-01-2023
12:04 PM
It's all about simplicity and cohesion. Of the multiple services within CDP, today I'll focus on finding actively running queries in CDW (Impala) from CML. With the three steps below, you'll be able to see the progress of actively running queries:
Step 1: Find the coordinator URL within CDW (Impala)
Within CDW, go to your Virtual Warehouse and select "Edit":
Within the Virtual Warehouse, go to the "WEB UI" page:
Copy the Coordinator Web UI address; in my example, I'll remove the https:// prefix: "coordinator-web-default-impala.dw-go01-demo-aws.ylcu-atmi.cloudera.site"
Step 2: Since we're connecting to CDW (Impala) from within CML, I'll set my project's Environment Variables to include my username/password
Setting the WORKLOAD_PASSWORD to my workload password
Setting the variable WORKLOAD_USER to my username
Step 3: Within my CML Notebook (in my case PBJ), I'll copy the following code, replacing the coordinator with the coordinator from Step 1:
import os
import requests
import pandas as pd
from tabulate import tabulate
ic = ['coordinator-web-default-impala.dw-go01-demo-aws.ylcu-atmi.cloudera.site']
for c in ic:
    r = requests.get('https://{}/queries?json'.format(c), auth=(os.environ["WORKLOAD_USER"], os.environ["WORKLOAD_PASSWORD"]))
    running_queries = r.json()['in_flight_queries']
    if len(running_queries) > 0:
        df = pd.DataFrame(running_queries)
        print(tabulate(df[['progress','query_id','stmt','executing','start_time']], headers='keys', tablefmt='psql'))
I can add more columns if necessary such as 'stmt_type','resource_pool','state','default_db','effective_user'.
It's just that easy!
07-27-2023
01:34 PM
Description
Repo Name: cml_vscode_with_python310
Repo Info: VsCode using Python 3.10, instead of the existing VsCode example using Python 3.7.
Repo URL: https://github.com/ryancicak/cml_vscode_with_python310
07-27-2023
10:57 AM
I recently received a question: "We'd like to have Python and R installed in the docker image for a package. Is it possible to have such a customized docker image with Python >= 3.7, R >= 4.0, and CUDA == 11.7?"

I'd like to go through how I accomplished this using the pbj-workbench-r4.1-standard.Dockerfile. This runtime has the essentials (Python and R) except for CUDA (NVIDIA GPU). Therefore, we'll use pbj-workbench-r4.1-standard.Dockerfile as a base image. It's important to note that CML requires a runtime kernel that is either R or Python; it can't be both.

I'll go through three different iterations of this image:

Iteration 1 - Since this image already has R and Python installed, we'll change the runtime NOT to use R
ryancicak/pbj_r-and-python:v9 -> Python 3.10 (GitHub Tag: python/3.10/standard_dockerhub_v9)
https://github.com/ryancicak/pbj_r-and-python/blob/python/3.10/standard_dockerhub_v9/Dockerfile

Iteration 2 - Go back to using R (as we did initially, before removing R as the runtime kernel in Iteration 1)
ryancicak/pbj_r-and-python:v10 -> R 4.1 (GitHub Tag: r/4.1/standard_dockerhub_v10)
https://github.com/ryancicak/pbj_r-and-python/blob/r/4.1/standard_dockerhub_v10/Dockerfile

Iteration 3 - Searching the nvidia/cuda Docker Hub tags, I found version 11.7.1, with similarities to https://github.com/cloudera/ml-runtimes/blob/public-runtimes/pbj-workbench-python3.10-cuda.Dockerfile. I changed:

# Copyright 2022 Cloudera. All Rights Reserved.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
RUN apt-key del 7fa2af80 && apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

to:

# Copyright 2022 Cloudera. All Rights Reserved.
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04
RUN apt-key del 7fa2af80 && apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

*Notice that "-cudnn8-devel-ubuntu20.04" is identical, as is the following "RUN" line.

*Validating that CUDA 11.7 is running (a small Python check is sketched at the end of this article).

If we compare the changes between Iteration 1 and Iteration 2, the difference is ever so slight when comparing the tags on the Dockerfile. Again, we're only changing the runtime kernel, since the image has both R and Python installed.

Feel free to try these different iterations by adding the Docker Hub tags as new Runtimes within CML:

Iteration 1 - ryancicak/pbj_r-and-python:v9
Iteration 2 - ryancicak/pbj_r-and-python:v10
Iteration 3 - ryancicak/pbj_r-and-python:v11

While the versioning is different in CML, they're all basically using the same docker image, as asked in the original question above.

Shout-out to @amarinovszki for his collaboration! Thanks Árpád!
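As referenced above, here is a minimal Python sketch for checking the CUDA toolkit version from inside a session (assumption: nvcc is on the PATH in the -devel CUDA base image; on a GPU-backed session, nvidia-smi would be an alternative check).

import subprocess

# Print the CUDA compiler version bundled in the image; expect a "release 11.7" line for Iteration 3.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)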
05-30-2023
12:10 PM
Have you stumbled upon existing articles that only cover reading from a Hive-managed table using R within CML, when you really want to read/write to an external table (not a Hive-managed table)? Then you've come to the right place!
There are complexities to reading/writing a Hive-managed table since ACID is involved, which is what HWC (Hive Warehouse Connector) solves. In 2023, if you're looking for ACID capabilities, using the Iceberg table format instead of Hive-managed tables is recommended anyway; Iceberg is fully supported throughout CDP.
Let's get on with the steps to easily read/write to an external table using R within CML, again without the need for HWC!
Step 1: Create two external tables and insert records
*In this case, I used Cloudera Data Warehouse with an Impala Virtual Warehouse
create database cloudera_r;
create external table cloudera_r.people_table (id int, name string) location 's3a://ryancicak/people_table';
insert into cloudera_r.people_table values (0,'ryan'), (1, 'paul'), (2, 'oleksandr'), (3, 'oliver'), (4,'jd');
create external table cloudera_r.people_new_table (id int, name string) location 's3a://ryancicak/people_new_table';
insert into cloudera_r.people_new_table values (5,'danielle'), (6,'nilesh');
Step 2: Modify the code from the article above (stripping out HWC). I'm able to run the code, and I'll include a snippet below so you can copy it.
Since the Hive Metastore contains the S3 location from Impala in Step 1, we do not need to specify the S3 location for Spark to read/write.
We need spark.yarn.access.hadoopFileSystems to contain the S3 location where our table lives, so Spark has access to read/write. If you modify spark.yarn.access.hadoopFileSystems, it's best to start a new session.
#CONNECT TO DATA LAKE USING SPARK
#Run this once
#install.packages("sparklyr")
#Add library
library(sparklyr)
library(dplyr)
library(DBI)
#Add spark_config()
spark_config()
config <- spark_config()
#Add configuration
config$spark.security.credentials.hiveserver2.enabled="false"
config$spark.datasource.hive.warehouse.read.via.llap="false"
config$spark.sql.hive.hwc.execution.mode="spark"
config$spark.yarn.access.hadoopFileSystems="s3a://ryancicak/"
#config$spark.datasource.hive.warehouse.read.jdbc.mode="spark"
#Connect with spark
sc <- spark_connect(config = config)
#Change database if necessary
spark_session(sc) %>% invoke("sql", "USE cloudera_r")
peopleframe <- tbl(sc, sql("select * from cloudera_r.people_new_table"))
print(peopleframe)
#spark_write_csv(peopleframe,"s3a://ryancicak/testing_people.csv", header=TRUE)
spark_write_table(peopleframe, "cloudera_r.people_table", mode = "append")
#Read and show table
intDf1 <- sparklyr::spark_read_table(sc, 'people_table')
sparklyr::sdf_collect(intDf1)
peopleframe <- tbl(sc, sql("select * from cloudera_r.people_table"))
print(peopleframe)
Notice that the first print shows the two rows, Danielle and Nilesh. The second print shows people_table after we wrote the two new rows: instead of the five initial records, we now see seven (since we appended).
I do print the same table twice - just showing two different ways to read from a table (I prefer the SQL option).
If you do not set the correct bucket name (the S3 bucket where you're reading/writing tables), you'll receive an error. In my case, I receive a RAZ 401 error (since I have RAZ enabled). If I change the bucket after already running with one in my session, I'll still get the error; I need to stop and then start a new session to pick up the correct bucket.
I can add multiple S3 locations, separated by commas.
Notice that I changed my hadoopFileSystems to "differents3bucket" and receive the access error. Even if I change back to "ryancicak", I still receive an access error until I start a new session.
@pauldefusco added some more examples to his GitHub repo that may be helpful: SparklyR_CML_Example.
Shout-out to @pauldefusco, @ozarate, @aakulov, and @jagadeesan for helping with these steps!
05-09-2023
02:05 PM
In a recent release, CDW is now fully integrated with RAZ. This means once you provision an environment in CDP with RAZ enabled, adding an S3 bucket into CDW follows the same principles as the services within CDP (CML, CDE, COD, CDF, and DataHub).
I was able to add a new S3 bucket called "ryancicak" in two steps, after receiving the 403 Forbidden error within Hue:
Two steps:
Step 1 - Since I used the -datalake-admin-role for RAZ, I modified the "datalake-admin-s3-policy" to include my new bucket "ryancicak" with the two lines:
"arn:aws:s3:::ryancicak", "arn:aws:s3:::ryancicak/*"
*Notice that I also removed the subdirectory from my cicak-cdp-stack entries so they cover the root of the bucket; this modification is important as well. Instead of "cicak-cdp-stack/my-data", I state "cicak-cdp-stack".
Step 2 - I added a new policy in Ranger under cm_s3 for my bucket "ryancicak", for the users rcicak (me), and Hive.
DONE - When re-running the CREATE TABLE statement using the bucket "ryancicak", there are no more errors! The table "rupert" is now reading/writing to the "ryancicak" bucket. Since I have my Ranger (RAZ) policy created in step 2, I'm able to access this bucket across CDP (CML, CDE, COD, CDF, and DataHub).
Note: I used the default "-datalake-admin-role" that was created through the CloudFormation script in the AWS Quickstart. In my case, I named the stack "cicak-cdp-stack" in the CloudFormation script.
That was EASY!
03-14-2023
02:56 PM
Special shout-out to @zoram, who provided these instructions. This article will take you through the steps (executing each step with a screenshot). It covers an edge case that involves processing large amounts of non-tabular data such as images. The data lake is the preferred storage for CML, as object storage can scale to billions of objects with tabular data. If you qualify for this edge case (non-tabular datasets), please delete the dataset from EFS/NFS as soon as possible after processing, since large amounts of data on EFS/NFS negatively affect backup and recovery times.

When it comes to non-tabular data, you may not want to use object storage with standard readers such as Boto3 (for AWS), which add latency to each read. If we move to faster I/O such as EFS, which is already used by your CML workspace, we can load data into your existing CML workspace EFS. It's a simple process that we'll take you through:

Step 1 - When dealing with a large number of files, switch your CML workspace's EFS to provisioned throughput mode from the AWS EFS console. Increase this to what you're willing to pay for (I/O) and keep it in provisioned throughput mode until you're done processing the files in EFS.

Step 2 - Log in to the CML UI and launch a session in the project where you'd like to process files within your workspace's EFS.

Step 3 - Within your session, launch "Terminal Access".

Step 4 - Within your terminal session, run the following command and write down the output, as you'll need it later. You can stop your session afterwards (it's only needed for this step).

df -h | grep home

Step 5 - Using kubectl, determine which node is running ds-vfs. In order to access kubectl from your laptop, you'll need to download the Kubeconfig for the workspace you're accessing. Run the following command:

kubectl get pods -n mlx -o wide | grep ds-vfs

(In case you need to download the Kubeconfig on the workspace):

Step 6 - If you do not have a security policy in place allowing SSH to the node you found in step 5, add one (to SSH into the ds-vfs node). After you're done with these steps, feel free to remove this security policy.

Step 7 - SSH into the node identified in step 5.

Step 8 - Determine the EFS mount point, and then sudo su (so you're root moving forward):

cat /proc/mounts | grep projects-share | grep -v grafana
sudo su

My output:

127.0.0.1:/ /var/lib/kubelet/pods/eb4b15b6-631a-428b-92df-e0d31074e7f9/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20081,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/d3c5b66c-c84a-40cc-bed9-79fb95f3a7ad/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20246,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/1ff3a121-e7fb-4475-974f-92446f65a773/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20316,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/96980148-9630-4f5e-90b6-05970eacdf1f/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20439,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0

Step 9 - Go to the project's file system; in my case (you'll pull the last piece from your output in step 4):

cd /var/lib/kubelet/pods/eb4b15b6-631a-428b-92df-e0d31074e7f9/volumes/kubernetes.io~csi/projects-share/mount/projects/0/1

Step 10 - Create a new folder that you'll use to load your files:

mkdir images

Step 11 - Validate that the directory permissions are set correctly. Files and directories must be owned by 8536:8536.

ls -l | grep images

output:

drwxr-sr-x 2 8536 8536 6144 Mar 13 21:01 images

IMPORTANT: If you're loading MANY files, it's beyond important to add this new directory to the project's .gitignore, so these files don't get committed to Git!

Step 12 - Copy your files from S3 into the new directory created in step 10. You'll likely want to figure out S3 authn/authz for the copy. It's a best practice to script this copy and run it with nohup as a background command, so the copy doesn't get terminated if the SSH session times out.

Step 13 - Your files are now available in the new directory. IMPORTANT: It's general best practice not to load all your files into one directory. Instead, create multiple subdirectories and load, say, 10k files per subdirectory. Within your code (say, Python), you can iterate over the subdirectories and process the files, as sketched below. This way, you aren't overwhelming your code by loading all files at once!

Warning: Do the cost estimate on using EFS vs. S3 before going into sticker shock on using higher-cost storage within AWS. Again, after you're done processing your data within EFS/NFS, please remember to delete this data, since large amounts of data on EFS/NFS negatively affect backup and recovery times.
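Here is a minimal Python sketch of the batching pattern from step 13 (assumptions: the images directory created in step 10 shows up at /home/cdsw/images inside a session, it holds one subdirectory per batch, and process_file is a hypothetical placeholder for your own processing logic).

import os

BASE_DIR = "/home/cdsw/images"  # assumed session-side path of the directory created in step 10

def process_file(path):
    # hypothetical placeholder: replace with your actual image processing
    print("processing", path)

for subdir in sorted(os.listdir(BASE_DIR)):
    subdir_path = os.path.join(BASE_DIR, subdir)
    if not os.path.isdir(subdir_path):
        continue
    # handle one subdirectory (one batch of ~10k files) at a time instead of loading everything at once
    for name in sorted(os.listdir(subdir_path)):
        process_file(os.path.join(subdir_path, name))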
02-28-2023
12:58 PM
If you've ever wanted to automate CDP, the CDP CLI is the tool to use. From pulling Virtual Warehouse details to spinning up a new Virtual Warehouse, the CDP CLI is a powerful tool. And it isn't limited to the Data Warehouse service; the CDP CLI covers every piece of CDP.
CML is an IDE for all things. Working on a project from within CML should be a simple and straightforward process. If you're creating an application that will interact with CDP in one way or another, the CDP CLI can be installed in less than sixty seconds. Once installed within your CML Project, you'll have the full power of the CDP CLI at your fingertips.
Step 1: Within your session, launch Terminal Access
Step 2: Within your session terminal, type pip install cdpcli and press enter
pip install cdpcli
Step 3: Generate an API Access Key (if you don't already have one)
https://docs.cloudera.com/cdp-public-cloud/cloud/cli/topics/mc-cli-generating-an-api-access-key.html
You'll receive a cdp_access_key_id and cdp_private_key
Step 4: Within your session Terminal, type cdp configure
cdp configure
You'll input your CDP Access Key ID and CDP Private Key; you can leave the CDP Region [None] and CDP Endpoint URL [None] as they are.
That's it! Now you can run all the CDP CLI commands within your CML Project! As I said, you're done in sixty seconds!
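To sanity-check the setup, here is a minimal Python sketch that shells out to the freshly configured CLI from a CML session (assumptions: cdp configure has already been run as above, the CLI prints JSON by default, and cdp iam get-user is used only as a simple call to confirm the credentials work).

import json
import subprocess

# Call the CDP CLI from Python and parse its JSON output.
result = subprocess.run(["cdp", "iam", "get-user"], capture_output=True, text=True, check=True)
print(json.loads(result.stdout))  # parsed JSON describing the current CDP user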