Member since 06-05-2019

128 Posts
133 Kudos Received
11 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2335 | 12-17-2016 08:30 PM |
|  | 1774 | 08-08-2016 07:20 PM |
|  | 2880 | 08-08-2016 03:13 PM |
|  | 3034 | 08-04-2016 02:49 PM |
|  | 2887 | 08-03-2016 06:29 PM |
08-12-2024 01:37 PM
In this article, I'll walk you through a Flink application I developed to process real-time data and write the output to HDFS within the Cloudera Data Platform (CDP). But first, let's discuss why Flink is a powerful choice for streaming analytics and how Cloudera's platform can help you make the most of it.

Why Flink?

Flink excels in scenarios where low-latency processing and real-time analytics are critical. Compared to Spark, Flink often shines in streaming use cases due to its advanced event-time processing and lower latency. However, Spark remains a robust choice for batch processing and when integrating with existing Spark-based pipelines.

This flexibility is where Cloudera stands out as the obvious choice for streaming analytics. Cloudera supports both Flink and Spark, giving you the power to choose the right tool for your specific use case. Beyond just tooling, Cloudera's hybrid platform also allows you to reduce your cloud bill by running applications on-premises, while maintaining the flexibility to run those same applications in the cloud. This makes Cloudera an ideal choice for developers who need a platform that adapts seamlessly to both on-premises and cloud environments.

Application Overview

Now, let's dive into the Flink application itself, designed for real-time data processing in three key stages:

Reading data from a socket: The application starts by connecting to a socket on localhost:10010, continuously streaming in text data line by line. In Flink, this is the "source" stage of the application. Since the data is read from a single socket connection, the parallelism for this stage is inherently 1. You can still configure parallelism when running the application, but it won't affect the source stage, because the socket is read only once.

Processing data using time windows: Once the data is ingested, it moves to the "transformation" stage. Here, the application splits the data into individual words, counts each one, and aggregates these counts over a 5-second time window. This stage takes full advantage of Flink's parallel processing capabilities, allowing you to scale the transformations by configuring parallelism as needed.

Writing output to HDFS: Finally, the "target" stage writes the processed results to HDFS. One of the major benefits of running this application within Cloudera CDP is that Flink is integrated via the Cloudera Streaming Analytics (CSA) service. You don't need to worry about configuring connections to HDFS, even with Kerberos enabled out of the box; CDP handles these configurations for you, making it easy to securely write data to HDFS without additional setup.

How to Run This Application in Cloudera CDP

Running this Flink application in Cloudera CDP is straightforward. Here's how you do it:

1. Set Up Your Maven Project

Ensure your Maven project is configured correctly. Use the pom.xml provided earlier to manage dependencies and build the application. Code can be found in GitHub.

2. Build the Application

Use Maven to build your application into a single JAR:

mvn clean package

It's important to note that in your pom.xml, the dependencies are marked as provided. This is crucial because Cloudera CDP already loads these dependencies out of the box. By marking them as provided, you ensure that they are not included in the JAR, avoiding any potential conflicts or unnecessary duplication.

3. Upload the JAR to Cloudera CDP

Upload the generated JAR file to your HDFS or S3 storage in Cloudera CDP. Make sure to note the path where you upload the JAR.

4. Run the Flink Application

Execute the following command to run your Flink application on YARN in Cloudera CDP:

flink run-application -t yarn-application -p 1 -ynm PortToHDFSFlinkApp PortToHDFSFlinkApp-1.0-SNAPSHOT.jar

Here's a breakdown of the command:

-t yarn-application: Runs the application as a YARN application.
-p 1: Sets the parallelism to 1, ensuring that the source stage runs with a single parallel instance. This is critical since the socket connection is inherently single-threaded.
-ynm PortToHDFSFlinkApp: Names the application, making it easier to identify in the YARN ResourceManager.
-s hdfs:///path/to/savepoints/savepoint-xxxx: Specifies the path to the savepoint from which the job should resume (optional).

5. Interact with the Application

Once the application is launched within CDP, you can access the Flink UI to find the node where the source is running on port 10010. After identifying the correct node, you can interact with the application by logging into that node and using the following command:

nc -l 10010

This command starts a listener on port 10010, allowing you to type words directly into the terminal. Each word you type, followed by pressing enter/return, is processed by the Flink application in real time. This is a simple yet powerful way to test the application's functionality and observe how data flows from the source, through the transformation stage, and finally to the HDFS target.
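If you'd prefer to script the interaction instead of typing into nc, the small Python stand-in below does the same job as nc -l 10010: it listens on the port and forwards whatever you type on stdin to the connected Flink source. This is only a sketch under the article's assumptions (port 10010, run on the node the source connects to); it isn't part of the GitHub project.

# feed_socket.py - a hypothetical stand-in for "nc -l 10010".
# Listens on port 10010 and forwards stdin lines to the Flink socket source.
import socket
import sys

HOST, PORT = "0.0.0.0", 10010  # the port the Flink source connects to

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen(1)
    print("Waiting for the Flink source to connect on port", PORT)
    conn, _ = server.accept()
    with conn:
        print("Connected. Type words, one per line (Ctrl-D to stop).")
        for line in sys.stdin:
            conn.sendall(line.encode("utf-8"))

Just like with nc, each line you send ends up in the 5-second window counts written to HDFS.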
6. Monitor the Job

While the job is running, you can monitor its progress through the Flink dashboard available in Cloudera CDP. This dashboard provides valuable insights into the job's performance, including task execution details and resource usage.

Conclusion

By leveraging Cloudera CDP's integration of Flink through Cloudera Streaming Analytics, you can easily deploy and manage complex streaming applications without worrying about the underlying configurations, like connecting to HDFS in a Kerberized environment. This PaaS setup simplifies deployment, allowing you to focus on processing and analyzing your data efficiently.

With Cloudera's support for both Flink and Spark, you get the best of both worlds in streaming analytics. Whether you're leveraging Flink for real-time data processing or Spark for batch jobs, Cloudera guides you to the right tools for your needs and ensures you can implement them with ease.
06-25-2024 08:53 AM
In this article, we will guide you through detailed, step-by-step instructions on how administrators can create a custom runtime image for notebooks in Cloudera Machine Learning (CML), complete with custom extensions for VsCode. We'll also provide a fully functional runtime image as an example, which you can integrate seamlessly into your CML environment.

Advantages: The following steps enable administrators to tailor a VsCode notebook by incorporating all the necessary extensions for end-users.

Disadvantages: End-users are unable to permanently install extensions directly within a session; this must be done through the custom runtime image. Consequently, while users can add extensions during an active session, these will be removed once the session ends. However, the extensions included in the runtime image will persist.

Custom docker runtime image:

rcicakcloudera/vscodeextensions:latest

Everything described below is already captured in the custom image above.

Step 1: Write the Dockerfile

Below, you'll notice the installation of an extension named "redhat.vscode-debug-adapter-apache-camel". To install additional extensions, simply continue appending them to the list.

FROM docker.repository.cloudera.com/cloudera/cdsw/ml-runtime-jupyterlab-python3.9-standard:2022.04.1-b6
RUN apt update && apt upgrade -y && apt clean && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL https://code-server.dev/install.sh | sh -s -- --version 4.2.0
RUN printf "#!/bin/bash\n/usr/bin/code-server --auth=none --extensions-dir=/usr/bin/custom_extensions --bind-addr=127.0.0.1:8090 --disable-telemetry" > /usr/local/bin/vscode
RUN chmod +x /usr/local/bin/vscode
RUN rm -f /usr/local/bin/ml-runtime-editor
RUN ln -s /usr/local/bin/vscode /usr/local/bin/ml-runtime-editor
USER cdsw
RUN mkdir /usr/bin/custom_extensions
RUN code-server --extensions-dir /usr/bin/custom_extensions --install-extension redhat.vscode-debug-adapter-apache-camel
RUN code-server --list-extensions
# Override Runtime label and environment variables metadata
ENV ML_RUNTIME_EDITOR="VsCode" \
    ML_RUNTIME_EDITION="v4.2.0" \
    ML_RUNTIME_SHORT_VERSION="1.0" \
    ML_RUNTIME_MAINTENANCE_VERSION="1" \
    ML_RUNTIME_FULL_VERSION="1.0.1" \
    ML_RUNTIME_DESCRIPTION="This runtime includes VsCode editor"
LABEL com.cloudera.ml.runtime.editor=$ML_RUNTIME_EDITOR \
      com.cloudera.ml.runtime.edition=$ML_RUNTIME_EDITION \
      com.cloudera.ml.runtime.full-version=$ML_RUNTIME_FULL_VERSION \
      com.cloudera.ml.runtime.short-version=$ML_RUNTIME_SHORT_VERSION \
      com.cloudera.ml.runtime.maintenance-version=$ML_RUNTIME_MAINTENANCE_VERSION \
      com.cloudera.ml.runtime.description=$ML_RUNTIME_DESCRIPTION

Step 2: Build the Dockerfile, Tag, and then Push

docker build -t d .
docker tag 0812eb88e2aa rcicakcloudera/vscodeextensions:latest
docker push rcicakcloudera/vscodeextensions:latest

Step 3: Specify the Docker image within a CML Runtime

Step 4: Use the VsCode Runtime

As mentioned previously, only the CML administrator creating this runtime can permanently add extensions. End-users, on the other hand, can only add new extensions temporarily during an active session. If you want to enable end-users to also have the capability to permanently install extensions, consider the following option, and be mindful of these caveats when creating a custom runtime that allows both administrators and end-users to install extensions permanently. Shout-out to @aakulov (Oleksandr Akulov) for coming up with this!

a) The initial launch of your custom runtime in a project will require additional time, as the extensions are installed in real time.

b) A race condition may occur if two users simultaneously initiate their first session using your custom runtime in the same project.

RUN printf "#!/bin/bash\n/usr/bin/code-server --auth=none --bind-addr=127.0.0.1:8090 --disable-telemetry" > /usr/local/bin/vscode
RUN printf "code-server --install-extension redhat.vscode-debug-adapter-apache-camel && /usr/local/bin/vscode" > /usr/local/bin/vscodemod
RUN chmod +x /usr/local/bin/vscode
RUN chmod +x /usr/local/bin/vscodemod
RUN rm -f /usr/local/bin/ml-runtime-editor
RUN ln -s /usr/local/bin/vscodemod /usr/local/bin/ml-runtime-editor

With this variant, you group all the extensions together in a single location prior to launching the VsCode notebook. The key to this setup is the symlink between vscode and ml-runtime-editor. In our approach, we insert the extension installation before vscode is initiated. This is achieved through a custom file named vscodemod, which handles the installation of the extensions and subsequently triggers the launch of vscode.

The purpose of this article is to address the challenges associated with the non-persistent filesystem in the runtime. Any changes made during a session are lost once it ends, as everything reverts to the runtime's original state upon launching a new session. Additionally, the /home/cdsw directory, which is mounted from EFS, cannot be pre-populated from the runtime image; any files added there while building the Docker image will be gone when a session begins. To circumvent this, administrators have two options: they can permanently add extensions by relocating the extension directory to a specified area within the runtime (outside of the /home/cdsw directory), or they can opt for real-time installation of extensions directly into the /home/cdsw directory, which remains persistent thanks to its EFS mounting.

Big shoutouts to @pauldefusco (Paul de Fusco) and @amarinovszki (Arpad Marinovszki) for all your help hashing this solution out!
						
					
12-01-2023 12:44 PM
Shaun goes into great depth on all the capabilities of CML and DataViz in this blog, using a Python model or script. Here, I'll write the script in R instead of the sample Python script Shaun shares.

The R script doesn't need to have predictive model capabilities. While a predictive model is certainly an option, I'll demonstrate deploying a simple R script in CML that adds two columns together.

Step 1: Write your R script

As explained above, this script adds two columns together (assuming both columns are of type integer). Within DataViz, we'll pass two columns (of type integer). The cml_model wrapper is necessary with the PBJ runtime, as explained here.

Notice the "json" is what we'll be returning to DataViz, where we expect the "numbers_added" column. inputs[1] and inputs[2] represent the two integer columns being passed.

library(cml)
library(jsonlite)  # provides fromJSON (import assumed; not shown in the original snippet)

add_numbers <- cml_model(function(args) {
  # Template for the response DataViz expects back
  json <-
'{
    "data": {
        "colnames": [
            "numbers_added","tester"
        ],
        "coltypes": [
            "INT","STRING"
        ],
        "rows": [
        ]
    }
}
'

  mydf <- fromJSON(json)

  # Add the two incoming columns for every row passed in by DataViz
  for(row in 1:nrow(args$data$rows)) {
    inputs <- args$data$rows[row,]
    together <- matrix(list(inputs[1] + inputs[2]))
    mydf$data$rows <- rbind(mydf$data$rows, together)
  }

  mode(mydf$data$rows) <- "integer"

  return(mydf)
})
 
   
   
 Step 2: Deploy your R script as a Model in CML 
 
 Notice PBJ as the runtime   
 
 Step 3: Testing your deployed R script/model via CML and/or Postman 
 
Below are the JSON input and output.

Input:

{
  "data": {
    "colnames": [
      "week",
      "depttime"
    ],
    "rows": [
      [
        1,
        7
      ],
      [
        2,
        8
      ],
      [
        11,
        55
      ]
    ]
  }
} 
Output:
 {
    "data": {
        "colnames": [
            "numbers_added",
            "tester"
        ],
        "coltypes": [
            "INT",
            "STRING"
        ],
        "rows": [
            [
                8
            ],
            [
                10
            ],
            [
                66
            ]
        ]
    }
} 
      
*The second column "tester" is required because DataViz expects more than one column to be returned.
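If you'd rather test from code than Postman, here's a rough Python sketch of the same Step 3 call. The URL and access key are placeholders (copy the real values from the model's page in CML), and it assumes the usual CML model request wrapper of accessKey plus request.

# test_add_numbers.py - hypothetical test client for the deployed R model
import requests

MODEL_URL = "https://modelservice.<your-cml-domain>/model"  # placeholder URL
ACCESS_KEY = "<your-access-key>"                            # placeholder key

payload = {
    "accessKey": ACCESS_KEY,
    "request": {                     # same JSON body as the Postman test above
        "data": {
            "colnames": ["week", "depttime"],
            "rows": [[1, 7], [2, 8], [11, 55]],
        }
    },
}

resp = requests.post(MODEL_URL, json=payload)
resp.raise_for_status()
print(resp.json())  # the returned body should contain the "numbers_added" rows shown above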
 
 
 Step 4: Build a Dataset in DataViz using your deployed model 
 
 Using my deployed model's URL and key, I'll plug this into the built-in function within DataViz, passing two columns:  https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model
{"accessKey":"mxyouraccesskey724mext3y8"}  
Within DataViz, I'll upload my CSV file called "people_numbers.csv":

name,number1,number2
ryan,1,2
nicole,3,4
rupert,5,6
nigel,7,8  
 
   
   
 
 Creating a new table based on the CSV         
 Dataset is created from the people_numbers table:     
 
Step 5: Modify the Dataset with a new column called "numbers_added", which will call the CML R model

Clone one of the measure columns (number1 or number2).
Edit the "Copy of number#" column.
Change the column name to "numbers_added".
Go to the expression, enter the following, and then click "APPLY":

cviz_rest('{"url":"https://modelservice.ml-b74f8940-b97.go01-dem.ylcu-atmi.cloudera.site/model","accessKey":"withyouraccesskey","colnames":["number1","number2"],"response_colname":"numbers_added"}')

Validate that "numbers_added" is defined as a Measure (Mes) of type #, then click "SAVE".

Step 6: Add a dashboard with your dataset

At the end of Step 5, click the "NEW DASHBOARD" button.
The new column "numbers_added" shows the results returned by the CML R model we deployed in Step 2.
We're able to create visuals based on the dimensions and measures we choose.

That's it! Feel free to import your own R script into CML as a model, and build charts/graphs within DataViz!
						
					
11-01-2023 12:04 PM
It's all about simplicity and cohesion. Of the multiple services within CDP, today I'll focus on finding actively running queries in CDW (Impala) from CML. With the three steps below, you'll be able to see the progress of actively running queries:
 Step 1: Find the coordinator URL within CDW (Impala) 
 
 Within CDW, go to your Virtual Warehouse and select "Edit":   
 Within the Virtual Warehouse, go to the "WEB UI" page:   
 Copy the Coordinator Web UI address, in my example I'll remove the https:// "coordinator-web-default-impala.dw-go01-demo-aws.ylcu-atmi.cloudera.site"      
 
 Step 2: Since we're connecting to CDW (Impala) from within CML, I'll set my project's Environment Variables to include my username/password 
 
 Setting the WORKLOAD_PASSWORD to my workload password 
 Setting the variable WORKLOAD_USER to my username   
 
 Step 3: Within my CML Notebook (in my case PBJ), I'll copy the following code, replacing the coordinator with the coordinator from Step 1: 
   
import os
import requests
import pandas as pd
from tabulate import tabulate

# Impala coordinator Web UI address(es) copied from Step 1 (https:// removed)
ic = ['coordinator-web-default-impala.dw-go01-demo-aws.ylcu-atmi.cloudera.site']

for c in ic:
    # Query the coordinator's /queries?json endpoint, authenticated with the
    # workload credentials set as environment variables in Step 2
    r = requests.get('https://{}/queries?json'.format(c),
                     auth=(os.environ["WORKLOAD_USER"], os.environ["WORKLOAD_PASSWORD"]))
    running_queries = r.json()['in_flight_queries']

    if len(running_queries) > 0:
        df = pd.DataFrame(running_queries)
        # Print the in-flight queries and their progress as a readable table
        print(tabulate(df[['progress','query_id','stmt','executing','start_time']],
                       headers='keys', tablefmt='psql'))
   
    
 I can add more columns if necessary such as 'stmt_type','resource_pool','state','default_db','effective_user'.  
 It's just that easy! 
						
					
07-27-2023 01:34 PM
Description

Repo Name: cml_vscode_with_python310
Repo Info: VsCode using Python 3.10, instead of the existing VsCode example using Python 3.7.
Repo URL: https://github.com/ryancicak/cml_vscode_with_python310
						
					
07-27-2023 10:57 AM
I recently received the question: "We'd like to have Python and R installed in the docker image for a package. Is it possible to have such customized docker with python >=3.7, R>=4.0 and CUDA==11.7?"

I'd like to go through how I accomplished this using pbj-workbench-r4.1-standard.Dockerfile. This runtime has the essentials (Python and R) except for CUDA (NVIDIA GPU), so we'll use pbj-workbench-r4.1-standard.Dockerfile as the base image. It's important to note that CML requires a runtime kernel that is either R or Python; it can't be both. I'll go through three different iterations of this image:

Iteration 1 - Since this image already has R and Python installed, we'll change the runtime NOT to use R.
ryancicak/pbj_r-and-python:v9 -> Python 3.10 (GitHub Tag: python/3.10/standard_dockerhub_v9)
https://github.com/ryancicak/pbj_r-and-python/blob/python/3.10/standard_dockerhub_v9/Dockerfile

Iteration 2 - Go back to using R as the runtime kernel (as we did initially, before removing R in Iteration 1).
ryancicak/pbj_r-and-python:v10 -> R 4.1 (GitHub Tag: r/4.1/standard_dockerhub_v10)
https://github.com/ryancicak/pbj_r-and-python/blob/r/4.1/standard_dockerhub_v10/Dockerfile

Iteration 3 - Searching the nvidia/cuda Docker Hub tags, I found version 11.7.1, so I changed the following lines from https://github.com/cloudera/ml-runtimes/blob/public-runtimes/pbj-workbench-python3.10-cuda.Dockerfile:

# Copyright 2022 Cloudera. All Rights Reserved.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
RUN apt-key del 7fa2af80 && apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

to

# Copyright 2022 Cloudera. All Rights Reserved.
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04
RUN apt-key del 7fa2af80 && apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

*Notice "-cudnn8-devel-ubuntu20.04" is identical, as is the following "RUN" line.

*Validating CUDA 11.7 is running (a small Python check is sketched at the end of this article).

If we compare the changes between Iteration 1 and Iteration 2, the difference is ever so slight when comparing the tags on the Dockerfile. Again, we're only changing the runtime kernel, since the image has both R and Python installed. Feel free to try these different iterations by adding the Docker Hub tags as new Runtimes within CML:

Iteration 1 - ryancicak/pbj_r-and-python:v9
Iteration 2 - ryancicak/pbj_r-and-python:v10
Iteration 3 - ryancicak/pbj_r-and-python:v11

While the versioning is different in CML, they're all basically using the same docker image, as asked in the original question above.

Shout-out to @amarinovszki for his collaboration! Thanks Árpád!
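For the validation step above, here's one hedged way to confirm the CUDA version from a session that uses these runtimes. It assumes nvcc is on the PATH (the -devel CUDA base images include the toolkit) and that the session was launched with a GPU attached so nvidia-smi can see a device.

# check_cuda.py - quick CUDA sanity check inside a CML session (assumptions noted above)
import subprocess

def run(cmd):
    """Run a command and return its output, or the error text if it isn't available."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as err:
        return "{} not available: {}".format(cmd[0], err)

print(run(["nvcc", "--version"]))   # CUDA toolkit version baked into the image
print(run(["nvidia-smi"]))          # driver / GPU visibility for the session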
						
					
05-31-2023 07:29 AM
Cloudera making hard stuff easy again! Great article, Ryan!!
						
					
05-09-2023 02:05 PM
							 
In a recent release, CDW became fully integrated with RAZ. This means that once you provision an environment in CDP with RAZ enabled, adding an S3 bucket into CDW follows the same principles as the other services within CDP (CML, CDE, COD, CDF, and DataHub).

I was able to add a new S3 bucket called "ryancicak" in two steps, after receiving a 403 Forbidden error within Hue:

Step 1 - Since I used the -datalake-admin-role for RAZ, I modified the "datalake-admin-s3-policy" to include my new bucket "ryancicak" with the two lines:

"arn:aws:s3:::ryancicak",
"arn:aws:s3:::ryancicak/*"

*Notice I also removed the subdirectory from my cicak-cdp-stack entry so it covers the root of the bucket. It's important to make this modification as well: instead of "cicak-cdp-stack/my-data", I state "cicak-cdp-stack".

Step 2 - I added a new policy in Ranger under cm_s3 for my bucket "ryancicak", for the users rcicak (me) and Hive.

DONE - When re-running the CREATE TABLE using the bucket "ryancicak", there are no more errors! The table "rupert" is now reading/writing to the "ryancicak" bucket. Since I created the Ranger (RAZ) policy in Step 2, I'm able to access this bucket across CDP (CML, CDE, COD, CDF, and DataHub).

Note: I used the default "-datalake-admin-role" that was created through the CloudFormation script in the AWS Quickstart. In my case, I named the stack "cicak-cdp-stack" in the CloudFormation script.

That was EASY:
    
 
						
					
03-14-2023 02:56 PM
1 Kudo
Special shout-out to @zoram, who provided these instructions. This article will take you through these steps (executing each step with a screenshot). This article covers an edge case that involves processing large amounts of non-tabular data such as images. The data lake is the preferred storage for CML, as object storage can scale to billions of objects with tabular data. If you qualify for this edge case (non-tabular datasets), please delete the dataset from EFS/NFS as soon as possible (after processing), since large amounts of data on EFS/NFS negatively affect backup and recovery times.

When it comes to non-tabular data, you may not want to use object storage with standard readers such as Boto3 (for AWS), which add latency to each read. If we move to faster I/O such as EFS, which is already used by your CML workspace, we can load the data onto your existing CML workspace EFS. It's a simple process that we'll take you through:

Step 1 - When dealing with a large number of files, switch your CML workspace to provisioned throughput mode from the AWS EFS console. Increase this to what you're willing to pay for (I/O) and keep it on provisioned throughput until you're done processing the files in EFS.

Step 2 - Log in to the CML UI and launch a session in the project where you'd like to process files within your workspace's EFS.

Step 3 - Within your session, launch "Terminal Access".

Step 4 - Within your terminal session, run the following command and write down the output, as you'll need it later. You can stop your session afterwards (it's only needed for this step).

df -h | grep home

Step 5 - Using kubectl, you'll determine which node is running ds-vfs. In order to access kubectl on your laptop, you'll need to download the Kubeconfig for the workspace you're accessing. Run the following command:

kubectl get pods -n mlx -o wide | grep ds-vfs

(In case you need it, this is where you download the Kubeconfig for the workspace.)

Step 6 - In case you do not have a security policy in place to access SSH on the node you found in Step 5, you'll want to add a security policy (to SSH into the node). After you're done with these steps, feel free to remove this security policy (to SSH into the ds-vfs node).

Step 7 - SSH into the node from Steps 5 & 6.

Step 8 - Determine the EFS mount point, and then sudo su (so you're root moving forward):

cat /proc/mounts | grep projects-share | grep -v grafana
sudo su

My output:
127.0.0.1:/ /var/lib/kubelet/pods/eb4b15b6-631a-428b-92df-e0d31074e7f9/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20081,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/d3c5b66c-c84a-40cc-bed9-79fb95f3a7ad/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20246,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/1ff3a121-e7fb-4475-974f-92446f65a773/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20316,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
127.0.0.1:/ /var/lib/kubelet/pods/96980148-9630-4f5e-90b6-05970eacdf1f/volumes/kubernetes.io~csi/projects-share/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20439,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0

Step 9 - Go to the project's filesystem; in my case (you'll pull the last piece from your output in Step 4):

cd /var/lib/kubelet/pods/eb4b15b6-631a-428b-92df-e0d31074e7f9/volumes/kubernetes.io~csi/projects-share/mount/projects/0/1

Step 10 - Create a new folder that you'll use to load your files:

mkdir images

Step 11 - Validate the directory permissions are set correctly. Files and directories must be owned by 8536:8536.

ls -l | grep images

Output:

drwxr-sr-x 2 8536 8536   6144 Mar 13 21:01 images

IMPORTANT: If you're loading MANY files, it's beyond important to add this new directory to .gitignore, so these files don't get committed to Git!

Step 12 - Copy your files from S3 into the new directory from Step 10. You'll likely want to figure out S3 authn/authz for the copy. It's a best practice to script this copy and run it with nohup as a background command, so the copy doesn't get terminated if the SSH session times out.

Step 13 - Your files are now available in the new directory.

IMPORTANT: It's a general best practice not to load all your files into one directory. Instead, create multiple subdirectories and load, say, 10k files per subdirectory. Within your code (say Python), you can iterate over the subdirectories and process the files so you aren't overwhelming your code by loading everything at once (a small sketch follows at the end of this article).

Warning: Do the cost estimate on using EFS vs. S3 before going into sticker shock on using higher-cost storage within AWS. Again, after you're done processing your data within EFS/NFS, please remember to delete this data, since large amounts of data on EFS/NFS negatively affect backup and recovery times.
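To go with the subdirectory advice in Step 13, here's a minimal Python sketch of processing one subdirectory at a time. The /home/cdsw/images path and process_file() are assumptions, standing in for wherever the images landed in your project and for your own processing logic.

# iterate_batches.py - process the copied files one subdirectory ("batch") at a time
# Assumes the images directory created in Step 10 shows up under /home/cdsw in a session.
import os

IMAGE_ROOT = "/home/cdsw/images"   # hypothetical location; adjust to your project

def process_file(path):
    # Placeholder for your real per-file processing
    print("processing", path)

for subdir in sorted(os.listdir(IMAGE_ROOT)):
    batch_dir = os.path.join(IMAGE_ROOT, subdir)
    if not os.path.isdir(batch_dir):
        continue
    # Handle one ~10k-file subdirectory at a time instead of loading everything at once
    for name in sorted(os.listdir(batch_dir)):
        process_file(os.path.join(batch_dir, name))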
						
					
03-01-2023 04:14 AM
Nice and Quick! Excellent!
						
					