Member since 07-10-2018 · 63 Posts · 71 Kudos Received · 0 Solutions
11-14-2019 08:14 AM · 2 Kudos
CDP has recently been introduced to the market, and we all want to learn about it, so I decided to contribute a bit.
In this series of tutorials, I want to explain the basics of CDP Data Hub and provide some automation to ease the setup. The CDP control plane user interface provides a thorough workflow that automates the creation of the main elements that ensure security, governance, and scalability.
These main elements are:
CDP Environment: Cloud hosted (in your cloud) resources for CDP deployment
CDP Data Lake: Host of CDP SDX, the shared service layer providing all security and governance
CDP Data Hub cluster(s): where users run their workloads
Note: CDP offers a lot more than these basic elements, and a lot more ways to configure them; this is a 101 tutorial and will not address all that CDP has to offer. For more information, visit the CDP documentation or product page.
Instead, this tutorial series will teach you:
How to create a CDP environment in AWS with minimal requirements
How to create a datalake from an existing environment
How to launch a CDP Data Hub cluster via CLI
Happy scripting!
11-14-2019 07:48 AM
Introduction
Continuing my previous articles on creating a CDP AWS environment and a CDP data lake, this tutorial teaches you how to automate the creation of a simple Data Engineering Data Hub cluster. You'll notice that once a data lake is set up, launching Data Hub clusters is very easy!
The cluster generated has the following properties:
Template: CDP 1.1 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie
Nodes:
1 Master m5.2xlarge
3 Workers m5.2xlarge
Here is the TL;DR: go to my GitHub and run the scripts as instructed.
Automation scripts
Step 1: Create Data Hub Cluster
cdp_create_dh_de.sh <prefix>
Step 2: Verify periodically until cluster status is AVAILABLE
cdp_describe_dh_de.sh <prefix>
That's it!
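If you want to script the status check as well, here is a minimal polling sketch, assuming cdp_describe_dh_de.sh prints the cluster status in its output (adjust the grep to the actual output, and the prefix value to the one you used in step 1):

# Poll the Data Hub cluster every 2 minutes until it reports AVAILABLE
PREFIX=myprefix   # placeholder: replace with your prefix
until cdp_describe_dh_de.sh "$PREFIX" | grep -q AVAILABLE; do
  echo "Cluster not available yet, checking again in 2 minutes..."
  sleep 120
done
echo "Data Hub cluster is AVAILABLE"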
11-14-2019 05:44 AM
Introduction
Continuing my previous article on creating a CDP AWS environment, this tutorial teaches you how to automate the creation of a data lake, including:
Setting up the proper roles and IDBroker mappings associated with your user and environment
Creating a data lake
Syncing users to FreeIPA
Here is the TL;DR: go to my GitHub and run the scripts as instructed.
Automation scripts
Step 1: Create IAM and launch the Data Lake
Create roles and mappings in your existing environment:
cdp_create_group_iam.sh <base_dir> <prefix> <region>
Create the data lake:
cdp_create_datalake.sh <base_dir> <prefix>
Step 2: Verify periodically until the data lake status is RUNNING
cdp_describe_dl.sh <prefix>
Step 3: Sync FreeIPA users
cdp_sync_users.sh <base_dir> <prefix>
Conclusion
This was a short and sweet tutorial; more fun to come playing with Data Hub clusters and experiences!
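As a post-script, the whole sequence above can be chained unattended. This is only a minimal wrapper sketch, assuming the scripts are on your PATH and that cdp_describe_dl.sh prints the data lake status (adjust the grep to its actual output):

#!/bin/bash
# Hypothetical wrapper chaining the data lake steps end to end
BASE_DIR=$1
PREFIX=$2
REGION=$3

cdp_create_group_iam.sh "$BASE_DIR" "$PREFIX" "$REGION"   # Step 1a: roles and IDBroker mappings
cdp_create_datalake.sh "$BASE_DIR" "$PREFIX"              # Step 1b: data lake creation

# Step 2: wait until the data lake reports RUNNING
until cdp_describe_dl.sh "$PREFIX" | grep -q RUNNING; do
  echo "Data lake not running yet, checking again in 2 minutes..."
  sleep 120
done

cdp_sync_users.sh "$BASE_DIR" "$PREFIX"                   # Step 3: sync FreeIPA users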
11-13-2019 04:21 PM · 4 Kudos
Introduction
Cloudera Data Platform (CDP) for public cloud has an amazing admin UI that walks you through thorough wizards for setting up environments, data lakes, Data Hub clusters, and experiences. Details of the AWS prerequisites can be found in the CDP official documentation, but why use the easy button when you can leverage the AWS and CDP CLIs and do it the hard way? My thoughts exactly.
Here is the TL;DR: go to my GitHub and run the scripts as instructed.
AWS minimal requirements for a CDP environment
Even through the CLI, a lot of the configuration of an environment can be automated, including:
Network (VPCs, routing, gateways, etc.)
Security groups
FreeIPA instance(s)
However, you will need to set up this minimal set of elements to get an environment ready:
In AWS
Public/private keys (not automated/demonstrated here)
S3 bucket
AWS roles
AWS policies
In CDP
Credential (not automated/demonstrated here)
To better understand how the roles, policies, and bucket interact, you can refer to this diagram:
In this tutorial, however, I'm only going to worry about the datalake_admin_role and the ranger_audit_role (the other roles are optional).
Automation scripts
Step 1: Pre-Requisites
AWS
Install and configure the AWS CLI with your account: link
Create a public/private key pair: link
CDP
Create a CDP credential for AWS: link
Install and configure the CDP CLI: link
Local computer
Clone my GitHub repository:
git clone https://github.com/paulvid/cdp_create_env_aws.git
Step 2: Running the scripts
Create the AWS S3 bucket:
aws_create_bucket.sh <base_dir> <prefix> <region>
Purge AWS policies and roles (optional):
aws_purge_roles_policies.sh <base_dir> <prefix>
Create AWS policies:
aws_create_policies.sh <base_dir> <prefix>
Create AWS roles:
aws_create_roles.sh <base_dir> <prefix> <bucket>
Create the CDP environment:
cdp_create_env.sh <base_dir> <prefix> <credential> <region> <key>
Step 3: Verify periodically until the environment status is AVAILABLE
cdp_describe_env.sh <prefix>
Conclusion
Obviously, this is just getting us started. I plan on publishing much more about creating data lakes, Data Hub clusters, and much more. Stay tuned!
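One more optional sketch: before running cdp_create_env.sh, you can sanity-check that the S3 bucket and the two mandatory roles actually exist. The bucket and role names below are placeholders (the real names depend on what my scripts build from your prefix), so adjust them accordingly:

#!/bin/bash
# Pre-flight check before creating the CDP environment
PREFIX=$1
BUCKET="${PREFIX}-cdp-bucket"   # placeholder: use the bucket name aws_create_bucket.sh created

if aws s3api head-bucket --bucket "$BUCKET" 2>/dev/null; then
  echo "Bucket $BUCKET exists"
else
  echo "Bucket $BUCKET is missing"
fi

# placeholder role names: use the names aws_create_roles.sh created
for role in "${PREFIX}-datalake-admin-role" "${PREFIX}-ranger-audit-role"; do
  if aws iam get-role --role-name "$role" > /dev/null 2>&1; then
    echo "Role $role exists"
  else
    echo "Role $role is missing"
  fi
done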
10-24-2019 07:30 AM
@pvidal @yuvapraveen_k I have the same issue as above.
NiFi 1.9.2, NiFi Registry 0.5, RHEL 7.XX
providers.xml has not been changed apart from the steps above:
flow storage directory = /opt/nifi-registry-0.5.0/gitdir
remote to push = origin
remote access user = user
remote access password = userpassword
Other things in the conf that were there to begin with:
flow storage directory = ./flow_storage
extension bundle storage directory = ./extension_bundles
06-27-2019 02:17 PM · 2 Kudos
Introduction
Let's jump into tutorial 2 of my AI to Edge series! This tutorial details the creation of a NiFi flow executing the ONNX model we trained in my last article. More precisely, we will try to feed it 3 handwritten digits and predict their values.
Note: as always, all code/files referenced in this tutorial can be found on my GitHub, here.
Agenda
The flow is divided into the following sections:
Section 1: Listening to a folder for new png files
Section 2: Resizing these images to 28x28 (the size used to train our model)
Section 3: Converting these images to CSV (the format used to train our model)
Section 4: Running our predictive model
Section 1: Listening to a folder for new png files
Step 1: Set up a variable for the root folder
This will be useful when we deploy the flow as a MiNiFi flow. Go to your variables and create the following:
Name: root_folder
Value: location of your download of my GitHub repository
Step 2: List files in the folder
Create a ListFile processor and modify the following properties:
Input Directory: ${root_folder}NIFI/png/original/
File Filter: [^\.].*.png
Step 3: Fetch files in the folder
Create a FetchFile processor with default parameters.
Note: The List/Fetch paradigm is very powerful because it allows us to continuously look for new images without reprocessing all of them. ListFile is a stateful processor; if you're unfamiliar with the concept, I encourage you to read about it on this community.
Section 2: Resizing these images to 28x28
Step 1: Resize the image
Create a ResizeImage processor and modify the following properties:
Image Width (in pixels): 28
Image Height (in pixels): 28
Step 2: Enter output attributes for the resized images
Create an UpdateAttribute processor, aimed at defining the folder and filename of the resized images, by adding the following properties to the processor:
filedirectory: ${root_folder}NIFI/png/resized/
filename: resized_${filename}
Section 3: Converting these images to CSV
Step 1: Save the modified image
Create a PutFile processor and modify the following property to store the converted image in the resized folder:
Directory: ${filedirectory}
Step 2: Execute a Python script to convert images to CSV
In this step, we will create an ExecuteStreamCommand processor that runs the convertImg.sh Python script. The script takes the resized image file, converts it to grayscale, and turns it into an inverted CSV to match the input of our model. Below is the script itself:
#!/usr/bin/env python3
import sys
import pandas as pd
from PIL import Image

# Build the 784 column names expected by the model (pixel0 .. pixel783)
columnNames = list()
for i in range(784):
    columnNames.append('pixel' + str(i))
train_data = pd.DataFrame(columns=columnNames)

# Load the resized 28x28 image and convert it to grayscale
img_name = sys.argv[1]
img = Image.open(img_name)
img = img.convert('LA')
rawData = img.load()

# Flatten the image into a list of 784 pixel values
data = []
for y in range(28):
    for x in range(28):
        data.append(rawData[x, y][0])

# Invert the pixel values (MNIST digits are light on dark) and store them as a single row
train_data.loc[0] = [255 - data[k] for k in range(784)]

# Write the row out as CSV
csvFile = sys.argv[2]
print(csvFile)
train_data.to_csv(csvFile, index=False)
As you can see, it expects two arguments:
Location of the resized image (img_name = sys.argv[1])
Location of the target CSV (csvFile = sys.argv[2])
Thus, you will modify the following properties in the ExecuteStreamCommand processor:
Command Arguments: ${root_folder}NIFI/png/resized/${filename};${root_folder}NIFI/csv/${filename}.csv
Command Path: ${root_folder}NIFI/convertImg.sh
Section 4: Running our predictive model
Step 1: Enter input attributes for model execution
Create an UpdateAttribute processor, aimed at defining the locations of the CSV file and the ONNX model, by adding the following properties to the processor:
filename: ${root_folder}NIFI/csv/${filename}.csv
onnxModel: ${root_folder}NOTEBOOKS/model.onnx
Step 2: Use Python to run the model with onnxruntime
In this step, we will create an ExecuteStreamCommand processor that runs the runModel.sh Python script. The script takes the CSV version of the image and runs the ONNX model created in the last tutorial with this CSV as input. Below is the script itself:
#!/usr/bin/env python3
import sys
import numpy
import pandas as pd
import onnxruntime as rt

# Load the CSV produced by convertImg.sh and reshape it to the model's input shape (1, 28, 28, 1)
test = pd.read_csv(sys.argv[1])
X_test = test.values.astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)

# Load the ONNX model and run inference
session = rt.InferenceSession(sys.argv[2])
input_name = session.get_inputs()[0].name
label_name = session.get_outputs()[0].name
prediction = session.run([label_name], {input_name: X_test})[0]

# The model outputs one score per digit (0-9); the predicted digit is the index of the highest score
number = int(numpy.argmax(prediction[0]))
print(number)
As you can see, it expects two arguments:
Location of the CSV (test = pd.read_csv(sys.argv[1]))
Location of the ONNX model (session = rt.InferenceSession(sys.argv[2]))
Thus, you will modify the following properties in the ExecuteStreamCommand processor:
Command Arguments: ${filename};${onnxModel}
Command Path: ${root_folder}NIFI/runModel.sh
Results
If you run the flow against the images in my GitHub, you will see 3 output flowfiles, each predicting the value of a handwritten digit.
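If you want to test the two scripts outside NiFi before wiring up the processors, you can call them directly from a shell. The file names below are only illustrative placeholders; point them at an actual resized png from your checkout:

# Run the conversion and the model by hand (paths are placeholders)
cd "$ROOT_FOLDER/NIFI"                               # $ROOT_FOLDER: wherever you cloned the repository
./convertImg.sh png/resized/resized_digit.png csv/digit.csv
./runModel.sh csv/digit.csv ../NOTEBOOKS/model.onnx   # prints the predicted digit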
06-06-2019 01:57 PM · 1 Kudo
Introduction
Time for tutorial 1 of a series detailing how to go from AI to Edge!
Note: all code/files referenced in this tutorial can be found on my GitHub, here.
Agenda
This tutorial is divided into the following sections:
Section 1: Create a custom Docker container running Jupyter for CDSW
Section 2: Automate the Jupyter launch in a CDSW project
Section 3: Train and save a model reading the MNIST database
Section 1: Create a custom Docker container running Jupyter for CDSW
This is fairly straightforward to implement, as it is detailed in the official documentation.
Note: make sure that Docker is signed in with your Docker Hub username/password (not email), otherwise the docker push will not work.
Step 1: Create a repository in Docker Hub
Go to Docker Hub, sign in with your account, and create a new repository.
Step 2: Create a custom Dockerfile
Go to a folder on your computer and create the following Docker file (saving it as Dockerfile):
FROM docker.repository.cloudera.com/cdsw/engine:7
RUN pip3 install --upgrade pip
RUN pip3 install keras
RUN pip3 install tensorflow
RUN pip3 install sklearn
RUN pip3 install jupyter
RUN pip3 install 'prompt-toolkit==1.0.15'
RUN pip3 install onnxruntime
RUN pip3 install keras2onnx
Step 3: Build the container
Run the following command in the folder where the Dockerfile has been saved:
docker build -t YOUR_USER/YOUR_REPO:YOUR_TAG . -f Dockerfile
Step 4: Publish it to Docker Hub
Run the following command on your computer:
docker push YOUR_USER/YOUR_REPO:YOUR_TAG
Section 2: Automate the Jupyter launch in a CDSW project
Step 1: Create a shell script to run Jupyter
In CDSW 1.5, you can't add a CMD or an ENTRYPOINT to your Dockerfile. Therefore, you will need to add a .bashrc file to your CDSW project, with the following code:
# Count jupyter processes (grep matches itself, so 2 means a notebook is already running)
processes=`ps -ef | grep jupyter | wc -l`
if (( $processes == 2 )) ; then
    echo "Jupyter is already running!"
elif (( $processes == 1 )) ; then
    # No notebook running yet: start Jupyter on the port exposed by the CDSW session
    jupyter notebook --no-browser --ip=0.0.0.0 --port=8080 --NotebookApp.token=
else
    echo "Invalid number of processes, relaunch your session!"
fi
Save this file to a GitHub repository.
Step 2: Add the custom engine to CDSW
In the CDSW configuration, use the Docker Hub image you created as your default engine.
Step 3: Create a project in CDSW with the .bashrc
In CDSW, create a new project using the GitHub repository you just created.
Note: You could create a blank project and add the .bashrc file to it manually, but using the repository automates it.
Step 4: Launch a CDSW session with Jupyter
In your project, open the workbench and launch a session with your custom engine. Run terminal access and Jupyter will launch; you will then see it in the nine-dots menu, allowing you to open Jupyter.
Section 3: Train and save a model reading the MNIST database
The model training is very well explained in the original Kaggle article that can be found here. A revised version of this notebook can be found on my GitHub. The main thing added to the notebook is the publishing of the model:
# Convert into ONNX format with onnxmltools
import keras2onnx
import onnx

onnx_model = keras2onnx.convert_keras(model, model.name)
temp_model_file = 'model.onnx'
onnx.save_model(onnx_model, temp_model_file)
After the notebook runs, you should see the model.onnx file created.
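Optionally, as a quick sanity check (not part of the original notebook), you can confirm the saved file loads correctly with onnxruntime from a terminal session:

# Load model.onnx and print the input tensor name and shape the NiFi flow will have to match
python3 -c "import onnxruntime as rt; i = rt.InferenceSession('model.onnx').get_inputs()[0]; print(i.name, i.shape)"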
06-06-2019 01:16 PM · 2 Kudos
Introduction
Machine Learning and Artificial Intelligence frameworks are numerous, and their impact on the future of computer science needs no emphasis. However, deploying these models can be complex and fairly manual without the appropriate multi-function ecosystem, especially when deploying them to the edge. Luckily, Cloudera's comprehensive data management suite makes this endeavor very easy.
During this series, I will present how to create a Deep Learning model trained to read digits from the MNIST database and deploy it to the edge. This article is an introduction to the architecture and prerequisites for the tutorials; it refers to sub-articles that anyone can follow to understand how to take an AI model and operationalize it at the edge.
Architecture overview
My hybrid cloud platform is comprised of three main functions:
Cloudera Data Science Workbench: the data science hub used to train and save the model, leveraging tools like TensorFlow, Jupyter, and ONNX.
Cloudera Flow Management: leverages NiFi to develop a flow reading an image and running the ONNX model.
Cloudera Edge Management: allows for the deployment of MiNiFi flows to edge devices.
Pre-Requisites
To run this tutorial, I used the following main elements of the Cloudera stack:
CDSW 1.5
HDF 3.3
HDP 3.1
CEM 1.0
Implementation Tutorials
The implementation is detailed in the following tutorial articles:
Part 1: CDSW model training using a custom Docker image with Jupyter, with the model saved using ONNX
Part 2: NiFi flow creation to parse new images and run the model
Part 3: Flow deployment to MiNiFi using CEM
05-19-2019 04:46 PM
The above was originally posted in the Community Help track. On Sun May 19 16:41 UTC 2019, the HCC moderation staff moved it to the Data Ingestion & Streaming track. The Community Help Track is intended for questions about using the HCC site itself.
05-19-2019 04:55 PM
The above was originally posted in the Community Help track. On Sun May 19 16:49 UTC 2019, the HCC moderation staff moved it to the Cloud & Operations Track. The Community Help Track is intended for questions about using the HCC site itself.