05-05-2020
06:11 PM
4 Kudos
This article explains how to set up the Hive Warehouse Connector (HWC) in CDP Public Cloud (tested with CDP Public Cloud Data Hub runtime 7.1).
Note: you must have admin privileges on your Data Hub cluster to perform this configuration.
Step 1: Whitelist the path to the HWC jar
In your CDP Datahub, open Cloudera Manager (CM) via the link in the management console:
In CM, go to your cluster > Livy > Configuration, and search for livy-conf/livy.conf:
Add the following safety valve to the configuration file:
livy.file.local-dir-whitelist=/path_for_hwc/
In our example, we are using the /tmp/ folder.
Restart the Livy service via CM to propagate the configuration.
Step 2: Copy the HWC jar to the whitelisted location
Find the hostname of the node where Livy is installed (master3 here):
Connect to the node using your user/workload password, e.g.: ssh pvidal@viz-data-engineering-master3.viz-cdp.a465-9q4k.cloudera.site
Find the HWC jar:
[pvidal@viz-data-engineering-master3 /]$ find / -name "*hive-warehouse-connector*" 2>/dev/null
./opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
./opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Copy it and add the right permissions:
cp /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar /tmp
chmod a+rw /tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Step 3: Add jar path to Zeppelin Livy interpreter
From your management console, open Zeppelin:
Go to the top right, and configure your Interpreters:
Edit the livy interpreter and add the following properties:
HWC Jar Location
Name: livy.spark.jars
Value: file:///[LOCATION_OF_YOUR_HWC_JAR]
Hive JDBC URL
Name: livy.spark.sql.hive.hiveserver2.jdbc.url
Value: [JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
(You can find the JDBC URL in your Data Hub management console.)
Read via LLAP
Name: livy.spark.datasource.hive.warehouse.read.via.llap
Value: false
JDBC mode
Name: livy.spark.datasource.hive.warehouse.read.jdbc.mode
Value: client
Staging Dir
Name: livy.spark.datasource.hive.warehouse.load.staging.dir
Value: /tmp
Metastore URI
Name: livy.spark.datasource.hive.warehouse.metastoreUri
Value: [VALUE_FROM_HIVE_SITE_XML]
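For reference, with the /tmp location used in Step 2, the assembled properties look roughly like the sketch below; the JDBC URL, user, password, and metastore host are placeholders you must replace with your own values, and the thrift://host:9083 form is simply the typical metastore URI format found in hive-site.xml:
livy.spark.jars=file:///tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
livy.spark.sql.hive.hiveserver2.jdbc.url=[JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
livy.spark.datasource.hive.warehouse.read.via.llap=false
livy.spark.datasource.hive.warehouse.read.jdbc.mode=client
livy.spark.datasource.hive.warehouse.load.staging.dir=/tmp
livy.spark.datasource.hive.warehouse.metastoreUri=thrift://[metastore-host]:9083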
You can download hive-site.xml from CM, by going to Your Cluster > Hive > Download Client Configuration:
Save your configuration, and restart your interpreter.
Step 4: Code away
Here is a simple example of Spark reading from a CSV and writing to a Hive table using HWC:
// Read raw location data
val locationDf = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true")).csv("s3a://viz-cdp-bucket/raw/locations.csv")
locationDf.printSchema()

// Setup HWC session
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()

// Create database and save dataset to table
hive.executeUpdate("CREATE DATABASE worldwidebank")
hive.setDatabase("worldwidebank")
locationDf.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "locations").save()

// Query data
val ds = hive.sql("select * from locations limit 10")
ds.show()
11-14-2019
08:14 AM
2 Kudos
CDP has recently been introduced to the market, and we all want to learn about it, so I decided to contribute a bit.
In this series of tutorials, I want to explain the basics of CDP Data Hub and provide some automation to ease the process. Indeed, the CDP control plane user interface provides a thorough workflow that automates the creation of the main elements ensuring security, governance, and scalability.
As depicted above, these main elements are:
CDP Environment: Cloud hosted (in your cloud) resources for CDP deployment
CDP Data Lake: Host of CDP SDX, the shared service layer providing all security and governance
CDP Data Hub cluster(s): where users run their workloads
Note: CDP offers a lot more than these basic elements, and a lot more ways to configure them; this is a 101 tutorial and will not address all that CDP has to offer. For more information, visit the CDP documentation or product page.
Instead, this tutorial series will teach you:
How to create a CDP environment in AWS with minimal requirements
How to create a datalake from an existing environment
How to launch a CDP Data Hub cluster via CLI
Happy scripting!
11-14-2019
07:48 AM
Introduction
Continuing my previous articles on creating a CDP AWS environment and a CDP data lake, this tutorial teaches you how to automate the creation of a simple data engineering Data Hub cluster. You'll notice that once a data lake is set up, launching Data Hub clusters is very easy!
The cluster generated has the following properties:
Template: CDP 1.1 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie
Nodes:
1 Master m5.2xlarge
3 Workers m5.2xlarge
Here is the TL;DR: go to my github and run the scripts as instructed.
Automation scripts
Step 1: Create Data Hub Cluster
cdp_create_dh_de.sh <prefix>
Step 2: Verify periodically until cluster status is AVAILABLE
cdp_describe_dh_de.sh <prefix>
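For reference, here is roughly what such a describe-and-poll wrapper could look like. This is a minimal sketch, not the actual script from the repository: it takes the cluster name rather than the prefix, assumes jq is installed, and assumes the describe-cluster JSON exposes the status under .cluster.status (check the output of your CDP CLI version).
#!/bin/bash
# Minimal polling sketch: wait until the Data Hub cluster reports AVAILABLE.
# Assumptions: jq installed, status exposed at .cluster.status in the CLI output.
CLUSTER_NAME=$1
while true; do
  STATUS=$(cdp datahub describe-cluster --cluster-name "${CLUSTER_NAME}" | jq -r '.cluster.status')
  echo "$(date '+%H:%M:%S') cluster status: ${STATUS}"
  if [ "${STATUS}" = "AVAILABLE" ]; then
    break
  fi
  sleep 60
done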
That's it!
11-14-2019
05:44 AM
Introduction
Continuing my previous article on creating a CDP AWS environment, this tutorial teaches you how to automate the creation of a datalake, including:
Setting up the proper roles and IDBroker mappings associated with your user and environment
Creating a datalake
Syncing users to FreeIPA
Here is the TL;DR: go to my github and run the scripts as instructed.
Automation scripts
Step 1: Create IAM and launch Data Lake
Create roles and mapping in your existing environment: cdp_create_group_iam.sh <base_dir> <prefix> <region>
Create datalake: cdp_create_datalake.sh <base_dir> <prefix>
Step 2: Verify periodically until datalake status is RUNNING
cdp_describe_dl.sh <prefix>
Step 3: Sync FreeIPA users
cdp_sync_users.sh <base_dir> <prefix>
Conclusion
This was a short and sweet tutorial; more fun to come playing with data lake clusters and experiences!
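For the curious, the FreeIPA user sync in Step 3 essentially wraps a single CDP CLI call, roughly like the sketch below; the flag name is an assumption worth verifying against cdp environments sync-all-users help for your CLI version:
# Sketch only: sync all CDP users/groups to the environment's FreeIPA instance
cdp environments sync-all-users --environment-names <your-environment-name>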
11-13-2019
04:21 PM
4 Kudos
Introduction
Cloudera Data Platform (CDP) for public cloud has an amazing admin UI that drives you through thorough wizards for setting up environments, data lakes, data hub clusters and experiences. Details of the AWS pre-requisites can be found in the CDP official documentation, but why use the easy button when you can leverage the AWS and CDP CLIs and do it the hard way? My thoughts exactly.
Here is the TL;DR: go to my github and run the scripts as instructed.
AWS minimal requirements for a CDP environment
Even through the CLI, a lot of the configuration of an environment can be automated, including:
Network (VPCs, routing, gateways, etc.)
Security Groups
FreeIPA instance(s)
However, you will need to set up this minimal set of elements to get an environment ready:
In AWS:
Public/Private keys (not automated/demonstrated here)
S3 Bucket
AWS Roles
AWS Policies
In CDP:
Credential (not automated/demonstrated here)
To better understand how the roles, policies and bucket interact, you can refer to this diagram:
In this tutorial, however, I'm only going to worry about the datalake_admin_role and ranger_audit_role (the other roles are optional).
Automation scripts
Step 1: Pre-Requisites
AWS
Install and configure the AWS CLI with your account: link
Create a public/private key pair: link
CDP
Create a CDP credential for AWS: link
Install and configure the CDP CLI: link
Local computer
Clone my github repository: git clone https://github.com/paulvid/cdp_create_env_aws.git
Step 2: Running the scripts
Create AWS S3 bucket: aws_create_bucket.sh <base_dir> <prefix> <region>
Purge AWS policies and roles (optional): aws_purge_roles_policies.sh <base_dir> <prefix>
Create AWS policies: aws_create_policies.sh <base_dir> <prefix>
Create AWS roles: aws_create_roles.sh <base_dir> <prefix> <bucket>
Create CDP environment: cdp_create_env.sh <base_dir> <prefix> <credential> <region> <key>
Step 3: Verify periodically until environment status is AVAILABLE
cdp_describe_env.sh <prefix>
Conclusion
Obviously, this is just getting us started. I plan on publishing much more about creating data lakes, data hub clusters, and more. Stay tuned!
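To give an idea of how thin these wrappers are, the S3 bucket step boils down to something like the sketch below; the bucket name is a placeholder of my own choosing, and note that for us-east-1 the --create-bucket-configuration option must be omitted:
# Sketch of the bucket-creation step (bucket name is a placeholder)
aws s3api create-bucket \
  --bucket "<prefix>-cdp-bucket" \
  --region "<region>" \
  --create-bucket-configuration LocationConstraint="<region>"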
09-04-2019
08:48 AM
@yuvapraveen_k What's your configuration exactly (if you could paste it without your github key, that'd be great)? Are you using Apache Nifi/Nifi-Registry or are you using HDF? Which versions are you running?
06-27-2019
02:17 PM
2 Kudos
Introduction
Let's jump into tutorial 2 of my AI to Edge series! This tutorial details the creation of a Nifi flow executing the ONNX model we trained in my last article. More precisely, we will try to feed it these 3 handwritten digits and predict their value:
Note: as always, all code/files referenced in this tutorial can be found on my github, here.
Agenda
Below is an overview of the flow. As you can see, the flow is divided into the following sections:
Section 1: Listening to a folder for new png files
Section 2: Resizing these images to 28x28 (the size used to train our model)
Section 3: Converting these images to CSV (the format used to train our model)
Section 4: Running our predictive model
Section 1: Listening to a folder for new png files
Step 1: Set up a variable for the root folder
This will be useful when we later deploy the flow to MiNiFi. Go to your variables and create the following:
Name: root_folder
Value: location of your download of my github
Step 2: List files in folder
Create a ListFile processor and modify the following properties:
Input Directory: ${root_folder}NIFI/png/original/
File Filter: [^\.].*.png
Step 3: Fetch files in folder
Create a FetchFile processor with default parameters.
Note: The List/Fetch paradigm is very powerful because it allows us to continuously look for new images without reprocessing all of them. ListFile is a stateful processor; if you're unfamiliar with the concept, I encourage you to read about it on this community.
Section 2: Resizing these images to 28x28
Step 1: Resize the image
Create a ResizeImage processor and modify the following properties:
Image Width (in pixels): 28
Image Height (in pixels): 28
Step 2: Enter output attributes for resized images
Create an UpdateAttribute processor, aimed at defining the folder and filename of the resized images, by adding the following properties to the processor:
filedirectory: ${root_folder}NIFI/png/resized/
filename: resized_${filename}
Section 3: Converting these images to CSV
Step 1: Save the modified image
Create a PutFile processor and modify the following property to store the converted image in the resized folder:
Directory: ${filedirectory}
Step 2: Execute a python script to convert images to CSV
In this step, we will create an ExecuteStreamCommand processor that will run the convertImg.sh python script. The script takes the resized image file, converts it to grayscale, and converts it into an inverted CSV to match the input of our model. Below is the script itself:
#!/usr/bin/env python3
import os,png,array
import pandas as pd
import time
import sys
from PIL import Image
from PIL import ImageOps
columnNames = list()
for i in range(784):
    pixel = 'pixel'
    pixel += str(i)
    columnNames.append(pixel)
train_data = pd.DataFrame(columns = columnNames)
start_time = time.time()
img_name = sys.argv[1]
img = Image.open(img_name)
img = img.convert('LA')
rawData = img.load()
data = []
for y in range(28):
    for x in range(28):
        data.append(rawData[x, y][0])
# build a single CSV row, inverting the grayscale values (255 - pixel) to match the training data
train_data.loc[0] = [255 - data[k] for k in range(784)]
csvFile = sys.argv[2]
print(csvFile)
train_data.to_csv(csvFile, index=False)
As you can see, it expects two arguments:
The location of the resized image (img_name = sys.argv[1])
The location of the target CSV (csvFile = sys.argv[2])
Thus, you will modify the following properties in the ExecuteStreamCommand processor:
Command Arguments: ${root_folder}NIFI/png/resized/${filename};${root_folder}NIFI/csv/${filename}.csv
Command Path: ${root_folder}NIFI/convertImg.sh
Section 4: Running our predictive model
Step 1: Enter input attributes for model execution
Create an UpdateAttribute processor, aimed at defining the locations of the CSV file and the ONNX model, by adding the following properties to the processor:
filename: ${root_folder}NIFI/csv/${filename}.csv
onnxModel: ${root_folder}NOTEBOOKS/model.onnx
Step 2: Use python to run the model with onnxruntime
In this step, we will create an ExecuteStreamCommand processor that will run the runModel.sh python script. The script takes the CSV version of the image and runs the ONNX model created in the last tutorial with this CSV as input. Below is the script itself:
#!/usr/bin/env python3
import onnxruntime as rt
import onnx as ox
import numpy
import pandas as pd
import shutil
import sys
test=pd.read_csv(sys.argv[1])
X_test = test.values.astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28,1)
session = rt.InferenceSession(sys.argv[2])
input_name = session.get_inputs()[0].name
label_name = session.get_outputs()[0].name
prediction = session.run([label_name], {input_name: X_test.astype(numpy.float32)})[0]
number = 0
for i in range(0, 10):  # check all 10 digit classes
    if prediction[0][i] == 1.0:
        number = i
print(number)
As you can see, it expects two arguments:
The location of the CSV (test = pd.read_csv(sys.argv[1]))
The location of the ONNX model (session = rt.InferenceSession(sys.argv[2]))
Thus, you will modify the following properties in the ExecuteStreamCommand processor:
Command Arguments: ${filename};${onnxModel}
Command Path: ${root_folder}NIFI/runModel.sh
Results
If you run the flow against the images in my github, you will see 3 output flowfiles predicting the value of the handwritten digits, as shown below:
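If you want to sanity-check the two scripts outside Nifi, you can call them directly from the root of the repository; the sample file names below are placeholders, so substitute a real resized png produced by the flow:
# placeholders: substitute an actual resized image and its target CSV path
python3 NIFI/convertImg.sh NIFI/png/resized/resized_sample.png NIFI/csv/resized_sample.csv
python3 NIFI/runModel.sh NIFI/csv/resized_sample.csv NOTEBOOKS/model.onnx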
06-06-2019
01:57 PM
1 Kudo
Introduction
Time for tutorial 1 of a series detailing how to go from AI to Edge!
Note: all code/files referenced in this tutorial can be found on my github, here.
Agenda
This tutorial is divided into the following sections:
Section 1: Create a custom Docker container running Jupyter for CDSW
Section 2: Automate Jupyter launch in a CDSW project
Section 3: Train and save a model reading the MNIST database
Section 1: Create a custom Docker container running Jupyter for CDSW
This is fairly straightforward to implement, as detailed in the official documentation.
Note: make sure that Docker is signed in with your Dockerhub username/password (not email), otherwise the docker push will not work.
Step 1: Create a repository in docker hub
Go to docker hub and sign in with your account. Create a new repository as follows:
You should see something like this:
Step 2: Create a custom docker file
Go to a folder on your computer and create this docker file (saving it as Dockerfile):
FROM docker.repository.cloudera.com/cdsw/engine:7
RUN pip3 install --upgrade pip
RUN pip3 install keras
RUN pip3 install tensorflow
RUN pip3 install sklearn
RUN pip3 install jupyter
RUN pip3 install 'prompt-toolkit==1.0.15'
RUN pip3 install onnxruntime
RUN pip3 install keras2onnx
Step 3: Build the container
Run the following command in the folder where the file has been saved:
docker build -t YOUR_USER/YOUR_REPO:YOUR_TAG . -f Dockerfile
Step 4: Publish it to docker hub
Run the following command on your computer:
docker push YOUR_USER/YOUR_REPO:YOUR_TAG
Section 2: Automate Jupyter launch in a CDSW project
Step 1: Create a shell script to run Jupyter
In CDSW 1.5, you can't add a CMD or an ENTRYPOINT to your docker file. Therefore, you will need to add a .bashrc file to your CDSW project, with the following code:
processes=`ps -ef | grep jupyter | wc -l`
if (( $processes == 2 )) ; then
echo "Jupyter is already running!"
elif (( $processes == 1 )) ; then
jupyter notebook --no-browser --ip=0.0.0.0 --port=8080 --NotebookApp.token=
else
echo "Invalid number of processes, relaunch your session!"
fi
Save this file to a github repository.
Step 2: Add the custom engine to CDSW
In the CDSW config, use the docker hub image you created as your default engine:
Step 3: Create a project in CDSW with .bashrc
In CDSW, create a new project using the github repository you just created:
Note: You could also create a blank project and add the .bashrc file to it manually, but this automates it.
Step 4: Launch a CDSW session with Jupyter
In your project, open the workbench and launch a session with your custom engine. Run terminal access and Jupyter will launch. You will then see the following on your 9 dots, allowing you to run Jupyter:
Section 3: Train and save a model reading the MNIST database
The model training is very well explained in the original Kaggle article that can be found here. A reviewed version of this notebook can be found on my github. The main thing that was added to the notebook is the publishing of the model:
# Convert the Keras model into ONNX format with keras2onnx
import keras2onnx
onnx_model = keras2onnx.convert_keras(model, model.name)
import onnx
temp_model_file = 'model.onnx'
onnx.save_model(onnx_model, temp_model_file)
After the notebook runs, you should see the model.onnx file created.
06-06-2019
01:16 PM
2 Kudos
Introduction
Machine Learning and Artificial Intelligence frameworks are numerous, and their impact on the future of computer science needs no emphasis. However, deploying these models can be complex and fairly manual without the appropriate multi-function ecosystem, especially when deploying them to the edge. Luckily, Cloudera's comprehensive data management suite makes this endeavor very easy. During this series, I will present how to create a Deep Learning model trained to read digits from the MNIST database and deploy it to the edge. This article is an introduction to the architecture and the pre-requisites necessary for this tutorial. It refers to sub-articles, each a tutorial that anyone can follow, showing how to take an AI model and operationalize it at the edge.
Architecture overview
The figure below gives a highlight of my hybrid cloud platform. As you can see, it is comprised of three main functions:
Cloudera Data Science Workbench: the data science hub used to train and save the model, leveraging tools like Tensorflow, Jupyter and ONNX.
Cloudera Flow Management: leverages Nifi to develop a flow reading an image and running the ONNX model.
Cloudera Edge Management: allows for the deployment of MiNiFi flows to the edge.
Pre-Requisites
To run this tutorial I used the following main elements of the Cloudera Stack:
CDSW 1.5
HDF 3.3
HDP 3.1
CEM 1.0
Implementation Tutorials
The implementation will be detailed in the following upcoming tutorial articles:
Part 1: CDSW model training using a custom docker image with Jupyter, saving the model using ONNX
Part 2: Nifi flow creation to parse new images and run the model
Part 3: Flow deployment to MiNiFi using CEM
05-15-2019
07:10 PM
4 Kudos
Introduction
CFM is out for CDH and that's awesome. While the install documentation is very thorough, I thought I'd share a screenshot-based tutorial, avoiding some of the gotchas of the installation.
Pre-Requisites
Healthy CDH 6.2 cluster
Admin privileges on the cluster and its nodes
wget installed
Open JDK 1.8 installed (see doc)
Zookeeper service installed (see doc)
Step 1: Get the CFM CSD Files
Connect to your Cloudera Manager node as root, and enter the following:
cd /opt/cloudera/csd
wget http://archive.cloudera.com/CFM/csd/1.0.0.0/NIFI-1.9.0.1.0.0.0-90.jar
wget http://archive.cloudera.com/CFM/csd/1.0.0.0/NIFICA-1.9.0.1.0.0.0-90.jar
wget http://archive.cloudera.com/CFM/csd/1.0.0.0/NIFIREGISTRY-0.3.0.1.0.0.0-90.jar
chown cloudera-scm:cloudera-scm NIFI*.jar
chmod 644 NIFI*.jar
service cloudera-scm-server restart
After the restart, log in to CM and restart the Cloudera Management Service:
Wait for the restart to finish:
Step 2: Get the CFM Parcel
Go to the parcels configuration (parcel icon, top right):
In the parcel configuration, add http://archive.cloudera.com/CFM/parcels/1.0.0.0/ to the parcel repository list by clicking on the "+" icon:
Click on the "Check for New Parcels" button:
You should see a CFM parcel; click on "Download":
Step 3: Distribute and activate the parcel
After the download completes, distribute and activate the parcel.
Step 4: Install services
Go to your CM main page, and click on "Add Service":
Select the Nifi service:
Select the host onto which you want to install Nifi:
Click on next and wait for the service to install.
Step 5: Profit
You can now go to your Nifi URL.
Bonus: Changing the election timeout
By default, the nifi.cluster.flow.election.max.wait.time property is set to 5 minutes, which can be confusing because it gives rise to a log message like "Cluster is still in the process of voting on the appropriate Data Flow". You can modify the property by adding it to the nifi.properties.xml in CM (see the example after this section). Restart your Nifi instance via CM, and voilà!
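For example, to lower the wait to one minute, the entry would look roughly like this; the time-unit notation shown is the usual NiFi period format, but double-check it against your NiFi version's documentation:
nifi.cluster.flow.election.max.wait.time=1 min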