Processing real-time unstructured data with GenAI using Cloudera and IBM watsonx.ai

Anshul_Gupta · ‎10-26-2023

Table of Contents

Overview
Design
Design Explanation
Implementation
Prerequisites
Step #1 - Setup Cloudera DataFlow (CDF)
Step #2 - Setup Cloudera Data Warehouse (CDW)
Step #3 - Execute

Overview

Cloudera and IBM have partnered to create industry-leading, enterprise-grade data and AI services using open-source ecosystems—all designed to achieve faster data and analytics at scale.

This article shows an end-to-end flow to process real-time unstructured data with GenAI using Cloudera's DataFlow and Data Warehouse, and IBM's watsonx.ai.

Design

Design Explanation

As documents arrive in the AWS S3 bucket, NiFi prepares the input for each of the watsonx.ai models.
NiFi calls a watsonx.ai model (granite-13b-instruct-v1) to extract the key fields in the document. See the sample IBM watsonx.ai prompt below.
NiFi calls a watsonx.ai model (granite-13b-chat-v1) to summarize the information in the document. See the sample IBM watsonx.ai prompt below.
NiFi calls a watsonx.ai model (granite-13b-instruct-v1) to generate an email, containing all the necessary information, for the user who submitted the document. See the sample IBM watsonx.ai prompt below.
Using the generated response, NiFi prepares and sends the email to the user.
NiFi stores the responses from all model invocations in the AWS S3 bucket, where they are read through a Hive table in Data Warehouse.
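The three model calls described above can be sketched as request bodies, one per task. This is a minimal illustration, not the flow's actual configuration: the model names come from this article, but the "ibm/" prefix, the JSON field names (model_id, input, parameters, project_id), the instruction wording, and the helper names are all assumptions. For simplicity, every task here receives the raw document text.

```python
import json

# Model names are taken from the article; the "ibm/" prefix is an assumption.
TASKS = {
    "extract": ("ibm/granite-13b-instruct-v1",
                "Extract the key fields from the document as JSON."),
    "summarize": ("ibm/granite-13b-chat-v1",
                  "Summarize the document in three sentences."),
    "email": ("ibm/granite-13b-instruct-v1",
              "Write an email acknowledging receipt of the document."),
}


def build_payload(model_id, instruction, document_text, project_id,
                  max_new_tokens=300):
    """Build one text-generation request body (field names are assumptions)."""
    return {
        "model_id": model_id,
        "input": f"{instruction}\n\nDocument:\n{document_text}\n\nAnswer:",
        "parameters": {
            "decoding_method": "greedy",
            "max_new_tokens": max_new_tokens,
        },
        "project_id": project_id,
    }


def build_all_payloads(document_text, project_id):
    """Return one request body per task, keyed by task name."""
    return {
        name: build_payload(model_id, instruction, document_text, project_id)
        for name, (model_id, instruction) in TASKS.items()
    }


if __name__ == "__main__":
    payloads = build_all_payloads("Claim 123: water damage ...", "my-project-id")
    print(json.dumps(payloads["summarize"], indent=2))
```

Keeping the instruction text separate from the document text makes it easy to tune one prompt without touching the other two.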

Implementation

Prerequisites

A Cloudera Data Platform (CDP) Public Cloud environment on Amazon Web Services (AWS). If you don't have an existing environment, follow the CDP/AWS Quick Start Guide to set one up.
An IBM Cloud account.
An IAM token from IBM:
1. Create an IBM Cloud API key.
2. Get a token - curl -X POST 'https://iam.cloud.ibm.com/identity/token' -H 'Content-Type: application/x-www-form-urlencoded' -d 'grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=<YOUR_API_KEY>'
  Note that this token expires in one hour.
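For scripted setups, the same token request can be sketched in Python using only the standard library. The endpoint, grant type, and one-hour lifetime come from the curl command and note above; the helper names, the five-minute refresh margin, and the reliance on the access_token and expires_in response fields are assumptions of this sketch.

```python
import json
import time
import urllib.parse
import urllib.request

IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_token_request(api_key):
    """Build the same POST request as the curl command, without sending it."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("utf-8")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_token_response(raw_json, now=None):
    """Return (token, expires_at); falls back to one hour if no lifetime is given."""
    payload = json.loads(raw_json)
    now = time.time() if now is None else now
    return payload["access_token"], now + int(payload.get("expires_in", 3600))


def needs_refresh(expires_at, now=None, margin_seconds=300):
    """True once the token is within margin_seconds of expiring."""
    now = time.time() if now is None else now
    return now >= expires_at - margin_seconds


def fetch_token(api_key):
    """Send the request; requires network access and a valid API key."""
    with urllib.request.urlopen(build_token_request(api_key), timeout=30) as resp:
        return parse_token_response(resp.read().decode("utf-8"))
```

Because the token is short-lived, a long-running flow needs a fresh value supplied to its watsonx_bearer_token parameter before the old one expires.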

Step #1 - Setup Cloudera DataFlow (CDF)

Go to the CDF user interface, and ensure the CDF service is enabled in your CDP environment.
In the Catalog, import the following flow definition - Cloudera_Watsonx_Flow.json
Select the imported flow, click Deploy, select the Target Environment, and begin the deployment process.
During deployment, you are prompted for the following parameters, which this NiFi flow requires in order to function:
- s3_access_key - Ensure that the AWS IAM user you're using has "AmazonS3FullAccess" permissions. See "Understanding and getting your AWS credentials" for help if required.
- s3_secret_access_key - Same instructions as s3_access_key.
- s3_bucket - AWS S3 bucket name, e.g. iceberg-presto
- s3_input_path - Subdirectory in the AWS S3 bucket in which you're staging your input data, e.g. data/claim
- s3_output_path - Subdirectory in the AWS S3 bucket in which you're storing your output data, e.g. data/watsonx_response
- watsonx_model_url - IBM watsonx.ai model URL, e.g. https://us-south.ml.cloud.ibm.com/ml/v1-beta/generation/text?version=2023-05-29
- watsonx_bearer_token - The IBM IAM token that you retrieved earlier in the prerequisites.
The Extra Small NiFi node size is sufficient for this data ingestion.
After the deployment completes, the flow appears in the Dashboard.
All NiFi flow parameters can be updated from Deployment Manager while the flow is running. As soon as you Apply Changes, running processors affected by the parameter changes are restarted automatically.
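Before starting a deployment, it can help to collect the seven parameters in one place and check that none is empty. The following is a minimal sketch of that idea: the parameter names and example values come from the list above, while the FlowParameters class, its methods, and the s3a:// URI scheme are assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FlowParameters:
    """The deployment parameters listed above, gathered for a pre-flight check."""
    s3_access_key: str
    s3_secret_access_key: str
    s3_bucket: str
    s3_input_path: str
    s3_output_path: str
    watsonx_model_url: str
    watsonx_bearer_token: str

    def missing(self):
        """Names of parameters that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def _uri(self, path):
        # The s3a:// scheme is an assumption; adjust it to your environment.
        return f"s3a://{self.s3_bucket}/{path.strip('/')}"

    def input_uri(self):
        return self._uri(self.s3_input_path)

    def output_uri(self):
        return self._uri(self.s3_output_path)


if __name__ == "__main__":
    params = FlowParameters(
        s3_access_key="<ACCESS_KEY>",
        s3_secret_access_key="<SECRET_KEY>",
        s3_bucket="iceberg-presto",
        s3_input_path="data/claim",
        s3_output_path="data/watsonx_response",
        watsonx_model_url="https://us-south.ml.cloud.ibm.com/ml/v1-beta/generation/text?version=2023-05-29",
        watsonx_bearer_token="",
    )
    print("Missing:", params.missing())
    print("Output location:", params.output_uri())
```

The output URI derived here is the same location that the Data Warehouse table needs to point at.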

Step #2 - Setup Cloudera Data Warehouse (CDW)

Go to the CDW user interface. Ensure the CDW service is activated in your CDP environment, and that a Database Catalog and a Virtual Warehouse compute cluster are available for use.
In the Hue editor, execute query.sql. This query creates an external table that points to your S3 bucket's output path. Change the AWS S3 location in the query before executing it.
After the query executes successfully, the model_response table appears under the default database.
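If you prefer to script the location change rather than edit the file by hand, something like the following could work. It is only a sketch: the contents of query.sql are not reproduced in this article, so the sample statement, its single column, and the set_table_location helper are hypothetical; only the table name and the example bucket and path come from the steps above.

```python
import re

# Matches the LOCATION clause of a CREATE EXTERNAL TABLE statement.
LOCATION_PATTERN = re.compile(r"(LOCATION\s+)'[^']*'", re.IGNORECASE)

# Hypothetical stand-in for query.sql; the real file's schema may differ.
SAMPLE_SQL = """CREATE EXTERNAL TABLE IF NOT EXISTS default.model_response (
  response STRING
)
LOCATION 's3a://<bucket>/<output_path>';
"""


def set_table_location(sql_text, new_location):
    """Replace the single LOCATION clause in sql_text with new_location."""
    if "'" in new_location:
        raise ValueError("location must not contain a single quote")
    updated, count = LOCATION_PATTERN.subn(
        lambda match: f"{match.group(1)}'{new_location}'", sql_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one LOCATION clause, found {count}")
    return updated


if __name__ == "__main__":
    print(set_table_location(
        SAMPLE_SQL, "s3a://iceberg-presto/data/watsonx_response"
    ))
```

Failing when the clause is missing or appears more than once avoids silently running a query that still points at the wrong path.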

Step #3 - Execute

Ensure your Cloudera Watsonx Flow is started. If it's not, start it from CDF Dashboard >> Deployment Manager >> Action >> Start Flow.
Drop files into your S3 bucket's input path. A couple of sample input files are provided in the assets directory for reference.
After a few seconds, the output appears in your S3 bucket's output path.
You can also query the table in Hue - SELECT * FROM default.model_response;
Finally, a notification email is sent to the user, acknowledging receipt of the document.

A demo recording is available on IBM Media Center.

Enjoy!

moekraft · ‎09-20-2024

Anshul, I am looking forward to continuing this story at IBM TechXchange 2024 as we talk about Process unstructured data in real-time with GenAI while ensuring compliance with AI regulations
