03-05-2026
12:58 PM
Cloudera AI Code Assistant Development Using CursorAI and CAI Inference Service

As agencies build out their AI Factories, one of the biggest challenges is integrating complex models into secure application development workflows. By leveraging LLMs hosted on Cloudera AI Inference Service, organizations can work directly in IDE software development tools, such as CursorAI, to get AI-assisted coding without the risk of IP leakage, governance policy violations, or data sovereignty issues. This solution is enterprise-ready, developer-friendly, and integrates seamlessly.

Why

AI-assisted coding tools powered by LLMs have completely changed the game. Organizations adopting them in their software development practice are seeing order-of-magnitude increases in productivity from their teams, reduced time for features to get to market, and increased team satisfaction. In other words, this shift is redefining how teams are structured and staffed, while accelerating the delivery of high-demand features.

Naturally, there are risks associated with all new methodologies, and this is no exception. By using publicly hosted LLMs for code development, organizations run the risk of pushing secure or sensitive proprietary information as context in a prompt to the models. This could expose information about the enterprise's customers or users, and it can include the application code itself, which nefarious actors could later mine for zero-day exploits or other black-hat activities. By hosting LLMs on Cloudera AI Inference Service instead, organizations keep AI-assisted coding inside their own environment and avoid these risks.
Developers and practitioners get the speed and fluency of AI-native tooling without compromising on enterprise compliance, IP protection, or data sovereignty.

What

By requiring software development teams to use only LLMs hosted on Cloudera AI Inference Service, organizations can leverage AI coding assistants in air-gapped environments, meet compliance requirements, control cost, and adhere to security policies. CursorAI is a powerful framework for working with Cloudera-hosted AI models, providing seamless integration with agent coding assistants for building intelligent applications. This library offers a unified interface to deploy, manage, and query models with enterprise-grade security and scalability. Cloudera's coding agent helps you interact with and manage your AI workflows. It provides:

Natural language interface to work with models
Automated model selection and optimization
Integration with Cloudera's data platform
Secure access to enterprise AI code assistant capabilities

How

For this technical blog, I will be using the CursorAI IDE because it offers agent-based code assist that can be configured to use models with custom endpoints. Please feel free to read more about their product offering. Other IDE options are available and will follow a very similar process.

Who

This blog is written for several audiences: technical practitioners, mission/product owners, and decision makers. Technical practitioners, the builders of the software products, can reproduce the results by following the detailed instructions below to dramatically speed up the software development process. Product/mission owners can leverage this document to meet their timelines for new product features and releases, including bug fixes. In other words, keep their customers happy.
Decision makers can leverage this document to architect AI code assist solutions in private environments, allowing their teams to take advantage of the latest technical innovations while maintaining enterprise security and data sovereignty.

Getting Started

To follow this blog, you will need an instance of Cloudera AI with an LLM hosted in Cloudera AI Inference Service, and the ability to access or generate an API key for the models hosted there. From there, create a new project in the CAI Workbench using the following GitHub repo to leverage the prebuilt testing and validation scripts. The GitHub repo can be found here.

What You're Working With:

Cursor – AI-powered IDE that can use custom, OpenAI-compatible endpoints for chat and completions.
Cloudera AI Platform – Hosts embedding and LLM endpoints (e.g. NVIDIA Nemotron, E5 embeddings) in your environment.
ClouderaAgent – Python agent in this repo that talks to those Cloudera endpoints: embeddings for semantic search and (optionally) an LLM for RAG and chat-style generation.

This guide focuses on using the ClouderaAgent in your code (embeddings, LLM, RAG) and on pointing Cursor at your Cloudera LLM so you can code with Cursor while all model traffic stays on Cloudera.

Prerequisites

Cloudera AI Platform access
Cloudera endpoint URLs (for embeddings and/or LLM)
API key (JWT token) from Cloudera
Cursor IDE installed (optional, for agent window integration)

Installation

# 1. Clone the repository
git clone https://github.com/BrooksIan/Cloudera-Inference-With-CursorAI.git
cd Cloudera-Inference-With-CursorAI
# 2. Create virtual environment
python3 -m venv venv
# 3. Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# 4. Install dependencies
pip install -r requirements.txt

Configuration

The user will need to create or modify a configuration file, which contains the environment variables used by the project.

Configuration Files

The ClouderaAgent reads:

Embeddings: configs/config.json (or env vars).
LLM (optional): configs/config-llm.json or an llm_endpoint section in config.json.

Embedding config (configs/config.json)

{
  "endpoint": {
    "base_url": "https://YOUR-CLOUDERA-SITE/namespaces/serving-default/endpoints/YOUR-EMBEDDING-ENDPOINT/v1"
  },
  "models": {
    "query_model": "nvidia/nv-embedqa-e5-v5-query",
    "passage_model": "nvidia/nv-embedqa-e5-v5-passage"
  },
  "api_key": "YOUR_JWT_OR_API_KEY",
  "embedding_dim": 1024
}

LLM config (configs/config-llm.json)

{
  "llm_endpoint": {
    "base_url": "https://YOUR-CLOUDERA-SITE/namespaces/serving-default/endpoints/YOUR-LLM-ENDPOINT/v1",
    "model": "nvidia/llama-3.3-nemotron-super-49b-v1"
  },
  "api_key": "YOUR_JWT_OR_API_KEY"
}

Use the exact endpoint paths and model IDs from your Cloudera AI Platform deployment. Prefer environment variables or a secrets manager over committing real keys.

Pointing Cursor AI at Cloudera LLM

So that Cursor's AI features (chat, completions) use your Cloudera-hosted LLM instead of third-party APIs, in Cursor go to Settings → Features → AI (or similar) and use a single custom OpenAI-compatible endpoint. Set:

Base URL: same as your LLM base_url in config-llm.json (e.g. https://.../v1).
API Key: same JWT/API key that works for that endpoint.
Model: same model as in config (e.g. nvidia/llama-3.3-nemotron-super-49b-v1).

Then Cursor's model traffic goes to Cloudera AI Platform only. For project-scoped settings and locking Cursor to Cloudera-only traffic, see:

Workspace / project-specific: CURSOR_QUICK_START.md (workspace profile, create_cursor_workspace.py).
Cursor-only network rules: CURSOR_ONLY_ENFORCEMENT.md.

Validation Test of Configuration

In the CursorAI agent window, enter the following prompt:

Verify ClouderaAgent Configuration with models hosted in Cloudera AI

Using the ClouderaAgent in Code

Create the Agent

from agents import create_cloudera_agent
# Embeddings + LLM (if config-llm.json or llm in config.json exists)
agent = create_cloudera_agent(use_llm=True)
# Embeddings only (no LLM)
agent = create_cloudera_agent(use_llm=False)

With explicit paths:

agent = create_cloudera_agent(
    config_file="configs/config.json",
    llm_config_file="configs/config-llm.json",
    use_llm=True
)

The agent is a ClouderaAgent instance: it has an embedding client, an in-memory vector store, and optionally an LLM client.

Using Embeddings Only (Semantic Search)

Use the agent's embedding client for query and passage vectors, or use the built-in vector store.

Direct embeddings:

from agents import create_cloudera_agent
agent = create_cloudera_agent(use_llm=False)
# Query embedding (e.g. for search)
query_vec = agent.embedding_client.embed_query("How do I deploy a model?")
# Passage embedding (e.g. for indexing documents)
passage_vec = agent.embedding_client.embed_passage(
    "To deploy a model, use the Cloudera AI Platform UI or API."
)
# Batch passages
texts = ["Document one...", "Document two..."]
vecs = agent.embedding_client.embed_batch(texts, use_passage=True)

Search with the built-in vector store:

agent.add_knowledge([
    "Cloudera AI Platform supports NVIDIA Nemotron and E5 models.",
    "You can deploy models via the UI or the API.",
])
results = agent.retrieve_context("How do I deploy a model?", top_k=3)
for r in results:
    print(r["text"], r["similarity"])

ClouderaAgent + embeddings = semantic search over your own text, all via Cloudera-hosted models.

Test Script And Expected Output

Run the following test script:

$ .venv/bin/python tests/test_llm_agent.py
======================================================================
Comprehensive LLM Agent Test
======================================================================
Creating Cloudera agent...
✅ Agent created successfully
LLM Configuration:
Model: nvidia/llama-3.3-nemotron-super-49b-v1.5
Base URL: https://<YOUR MODEL DOMAIN>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1
Test 1: Simple LLM Query (No Context)
----------------------------------------------------------------------
Query: What is machine learning? Answer in 2-3 sentences.
⏳ Sending request...
✅ Response received in 2.77 seconds
Answer:
<think>
Okay, the user is asking, "What is machine learning?" and wants a concise answer in 2-3 sentences. Let me start by recalling the basic definition. Machine learning is a subset of AI, right? It involves algorithms that allow computers to learn from data.
Wait, I should make sure to mention that it's a part of artificial intelligence. The key point is that instead of explicit programming, the system learns patterns from data. I need to highlight that it's used
Test 2: LLM with RAG (Context from Knowledge Base)
----------------------------------------------------------------------
Adding knowledge base documents...
✅ Added 4 documents
Query: What is RAG and how does it work?
⏳ Sending request with context...
✅ Response received in 3.66 seconds
Answer:
<think>
Okay, let's tackle the question about RAG. The user wants to know what RAG is and how it works. The context provided has two parts.
First, Context 1 says RAG combines semantic search with language generation. So, I need to explain that RAG stands for Retrieval Augmented Generation. The key here is that it's a combination of two things: retrieval and generation. The retrieval part must be about finding relevant information, and the generation part is about creating text, probably using a language model.
Then Context 2 mentions vector embeddings converting text into numerical representations for similarity search. This seems related to the retrieval part. Vector embeddings are used to represent text in a way that allows for measuring similarity between different
Context used:
[Context 1]: RAG (Retrieval Augmented Generation) combines semantic search with language generation.
[Context 2]: Vector embeddings convert text into numerical representations for similarity search....
Test 3: Different Temperature Settings
----------------------------------------------------------------------
Temperature: 0.1
Query: Write a creative one-sentence story about AI.
⏳ Sending request...
✅ Response (1.23s): <think>
Okay, the user wants a one-sentence story about AI. Let me think. It needs to be creative an...
Temperature: 0.7
Query: Write a creative one-sentence story about AI.
⏳ Sending request...
✅ Response (1.27s): <think>
Okay, the user wants a one-sentence story about AI. Let me think. It needs to be creative an...
Temperature: 0.9
Query: Write a creative one-sentence story about AI.
⏳ Sending request...
✅ Response (1.33s): <think>
Okay, the user wants a one-sentence story about AI. Let me think. It needs to be creative an...
Test 4: Multi-turn Conversation
----------------------------------------------------------------------
Simulating a conversation...
User: What is Python?
⏳ Sending request...
Assistant (1.25s): <think>
Okay, the user is asking, "What is Python?" Let me start by breaking down what I know about Python. First, I should define it in simple terms....
User: Can you give me a code example?
⏳ Sending request...
Assistant (2.44s): <think>
Okay, the user is asking for a code example. But they didn't specify the programming language or the problem they want to solve. I need to ask...
======================================================================
Test Summary
======================================================================
✅ PASS: Simple LLM
✅ PASS: LLM with RAG
✅ PASS: Temperature Settings
✅ PASS: Conversation
Total: 4/4 tests passed
✅ All LLM agent tests passed!

Using LLM Only (No RAG)

When the agent is created with `use_llm=True`, you can call the LLM directly (no retrieval):

from agents import create_cloudera_agent
agent = create_cloudera_agent(use_llm=True)
result = agent.answer_with_llm(
    "Explain in one sentence what RAG is.",
    use_context=False,
    temperature=0.3,
    max_tokens=200,
)
print(result["answer"])
print(result["model"])  # e.g. nvidia/llama-3.3-nemotron-super-49b-v1

Test Script And Expected Output

$ .venv/bin/python hello_world/hello_world_simple.py
======================================================================
Hello World - Using Cloudera Agents
======================================================================
🔧 Initializing Cloudera agent...
✅ Agent initialized successfully!
💬 Asking LLM to say 'Hello, World!'...
⏳ This may take 30-60 seconds...
🤖 Response from Cloudera LLM:
----------------------------------------------------------------------
<think>
Okay, the user wants me to say 'Hello, World!' in a friendly way. Let me think about how to approach this.
First, the standard "Hello, World!" is straightforward, but making it friendly might require adding some
----------------------------------------------------------------------
✅ Hello World script completed successfully!

Using RAG (Retrieval + LLM)

RAG = retrieve relevant chunks from your knowledge base, then send them plus the user question to the LLM.

Step 1 – Add knowledge (documents):

agent.add_knowledge([
    "Cloudera AI Platform provides embedding and LLM endpoints.",
    "Cursor can use a custom OpenAI-compatible endpoint for chat.",
    "The ClouderaAgent uses config.json and config-llm.json for endpoints.",
])

Step 2 – Ask with context (RAG):

result = agent.answer_with_llm(
    "How do I configure Cursor to use Cloudera?",
    top_k=3,
    use_context=True,
    temperature=0.5,
)
print(result["answer"])
# Optional: inspect retrieved context
for ctx in result["context"]:
    print(ctx["text"], ctx["similarity"])

ClouderaAgent + embeddings + LLM = RAG pipeline: Cloudera embeddings for retrieval, Cloudera LLM for the answer. There is an example script and output listed below under the section Test Embeddings.

Validation - LLMs Hosted on Cloudera AI Inference Service UI

End-to-End Example: Small RAG Script

#!/usr/bin/env python3
"""Minimal RAG example using ClouderaAgent and Cloudera-hosted models."""
import sys
from pathlib import Path

# Ensure project root is on path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from agents import create_cloudera_agent

def main():
    # 1. Create agent (reads configs/config.json + configs/config-llm.json)
    agent = create_cloudera_agent(use_llm=True)

    # 2. Add knowledge
    agent.add_knowledge([
        "Cloudera AI Platform hosts embedding and LLM models.",
        "ClouderaAgent uses OpenAI-compatible APIs for embeddings and chat.",
        "Cursor IDE can use a custom endpoint to talk to Cloudera LLMs.",
    ])

    # 3. Ask with RAG
    result = agent.answer_with_llm(
        "What is ClouderaAgent and how does it relate to Cursor?",
        top_k=2,
        use_context=True,
        temperature=0.4,
    )
    print(result["answer"])
    return 0

if __name__ == "__main__":
    sys.exit(main())

LLM code-writing tests (uses ClouderaAgent with real endpoint):

$ uv run python tests/test_code_writing_manual.py --prompt "Write a function that calculates prime numbers"
Initializing ClouderaAgent...
✓ ClouderaAgent initialized successfully
Model: nvidia/llama-3.3-nemotron-super-49b-v1.5
Endpoint: https://<YOUR DOMAIN>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1
======================================================================
Code Generation Test (ClouderaAgent)
======================================================================
Prompt: Write a function that calculates prime numbers
### ✅ Function: `primes_up_to(n)`
```python
def primes_up_to(n):
    """
    Returns a list of all prime numbers up to and including n.

    Parameters:
        n (int): The upper limit (inclusive) for finding prime numbers.

    Returns:
        list: A list of prime numbers from 2 up to n.
    """
    if n < 2:
        return []
    # Initialize a boolean list where index represents the number and value indicates if it's prime
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False  # 0 and 1 are not prime
    # Iterate from 2 up to the square root of n
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:  # i is a prime number
            # Mark all multiples of i starting from i*i as non-prime
            for j in range(i * i, n + 1, i):
                sieve[j] = False
    # Collect all indices that are still marked as True (i.e., prime numbers)
    return [i for i in range(n + 1) if sieve[i]]
```
---
### 📌 Example Usage
```python
print(primes_up_to(30))
# Output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```
---
### 🧠 How It Works
1. **Initialization**: A list `sieve` of size `n + 1` is created, initialized to `True`. This list will track which numbers are prime.
2. **Base Cases**: `0` and `1` are explicitly marked as non-prime.
3. **Marking Multiples**: For each number `i` starting from `2` up to the square root of `n`, if `i` is still marked as prime, all its multiples (starting from `i*i`) are marked as non-prime.
4. **Result Extraction**: Finally, all indices that remain `True` in the `sieve` list are collected and returned as the list of prime numbers.
---
### ✅ Summary
- Use `primes_up_to(n)` to get a list of all prime numbers up to `n`.
- Use `is_prime(n)` to check if a specific number `n` is prime.
- The Sieve of Eratosthenes is the most efficient method for generating all primes up to a large number.
Let me know if you'd like a version that generates primes indefinitely or finds the nth prime!

Cloudera Agent Coding Example In Cursor Agent Window

In the agent window, use the following prompt:

Use the cloudera-agent to write a python script that finds the first 10 prime numbers

When the agent finishes, you will see the new Python script created. In the image above, the new script is called generate_first_10_primes.py.

Test embeddings (RAG):

$ uv run python examples/example_agent_usage.py
Creating Cloudera Agent...
2026-03-04 13:35:24,344 - agents.cloudera_agent - INFO - Loaded LLM configuration: model=nvidia/llama-3.3-nemotron-super-49b-v1.5, endpoint=https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1
2026-03-04 13:35:24,344 - agents.cloudera_agent - INFO - Creating ClouderaAgent with endpoint: https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1, models: nvidia/nv-embedqa-e5-v5-query, nvidia/nv-embedqa-e5-v5-passage
2026-03-04 13:35:24,400 - agents.cloudera_agent - INFO - Initialized ClouderaEmbeddingClient with endpoint: https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1
2026-03-04 13:35:24,400 - agents.cloudera_agent - INFO - Initialized SimpleVectorStore
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Initialized ClouderaAgent with LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Initialized ClouderaAgent with models: query=nvidia/nv-embedqa-e5-v5-query, passage=nvidia/nv-embedqa-e5-v5-passage
Agent created successfully!
Agent stats: {'num_documents': 0, 'embedding_dim': 1024, 'query_model': 'nvidia/nv-embedqa-e5-v5-query', 'passage_model': 'nvidia/nv-embedqa-e5-v5-passage', 'llm_model': 'nvidia/llama-3.3-nemotron-super-49b-v1.5', 'llm_endpoint': 'https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1'}
Adding knowledge base documents...
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Adding 7 documents to knowledge base
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Adding 7 documents to vector store
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Generating 7 passage embeddings using model: nvidia/nv-embedqa-e5-v5-passage
2026-03-04 13:35:24,682 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,728 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,773 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,829 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,875 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,920 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,975 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Successfully generated 7 embeddings
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Successfully added 7 documents. Total documents: 7
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Knowledge base now contains 7 documents
Added 7 documents to knowledge base
Running queries...
Query: What is Python?
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:25,020 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:34,523 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:35:34,540 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, let's see. The user is asking "What is Python?" and there are two contexts provided. Context 1 says Python is a high-level programming language known for its simplicity and readability. Context 2 talks about machine learning as a subset of AI.
The question is about Python, so I should focus on Context 1. Context 2 isn't directly relevant here. The answer should be based on the information given in Context 1. The user wants a concise answer, so I just need to restate what Context 1 says. There's no need to mention machine learning since that's in Context 2 and not related to the question. Let me check if there's any other info in the contexts that might be relevant, but no, Context 2 is about machine learning. So the answer should be that Python is a high-level programming language known for its simplicity and readability. I don't need to add anything else because the question doesn't ask for more details beyond what's provided in the context.
</think>
Answer: Python is a high-level programming language known for its simplicity and readability.
Used 2 context sources:
[1] Similarity: 0.4648
Text: Python is a high-level programming language known for its simplicity and readabi...
[2] Similarity: 0.2402
Text: Machine learning is a subset of artificial intelligence that enables systems to ...
Query: How does machine learning work?
2026-03-04 13:35:34,540 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:34,675 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:45,064 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:35:45,065 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, let's tackle the question "How does machine learning work?" using the provided contexts. First, I need to check what information is given in Context 1 and Context 2.
Context 1 says: "Machine learning is a subset of artificial intelligence that enables systems to learn from data." That's a basic definition. It mentions that ML is part of AI and that it allows systems to learn from data. But it doesn't go into any detail about the process, algorithms, or steps involved in how machine learning actually works.
Context 2 talks about Cloudera providing enterprise data cloud solutions for machine learning and analytics. This seems more about the application or the platform used for ML rather than explaining the mechanics of how ML works. It doesn't add any information about the learning process itself.
The question is asking for an explanation of how machine learning works. The contexts provided only give a high-level definition and mention a company that uses ML in their solutions. There's no information about algorithms, data training, model building, or any of the technical processes involved.
Since the user instructions say that if the context doesn't contain enough information to answer the question, I should state that. The answer can't be inferred from the given contexts because they don't explain the mechanisms or methods of machine learning. They only provide a definition and a company's application. Therefore, the correct response here is to say that the context doesn't have enough information to answer the question.
</think>
The provided context does not contain enough information to explain how machine learning works. Context 1 only offers a high-level definition of machine learning as a subset of AI that enables systems to learn from data, while Context 2 mentions Cloudera's role in providing enterprise data cloud solutions for ML and analytics. Neither context details the technical processes, algorithms, or methodologies involved in machine learning.
Answer: The context doesn't contain enough information to answer the question.
Used 2 context sources:
[1] Similarity: 0.4828
Text: Machine learning is a subset of artificial intelligence that enables systems to ...
[2] Similarity: 0.3154
Text: Cloudera provides enterprise data cloud solutions for machine learning and analy...
Query: What are embeddings used for?
2026-03-04 13:35:45,065 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:45,207 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:57,606 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:35:57,607 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, let's tackle this question. The user is asking, "What are embeddings used for?" and they provided two contexts.
First, looking at Context 1: "Embeddings are vector representations of text that capture semantic meaning." So, this tells me that embeddings are a way to represent text as vectors, and they capture the meaning of the text. That's a starting point. But the question is about their uses, not just their definition.
Context 2 mentions the NVIDIA nv-embedqa-e5-v5 model, which is designed for question-answering and semantic search tasks. Now, since this model uses embeddings (given the name and the context), the tasks it's designed for might indicate the uses of embeddings. So, question-answering and semantic search are two applications.
Putting it together, the answer should link embeddings to these tasks. The first context gives the definition, and the second gives specific applications. So, the answer should state that embeddings are used for tasks like question-answering and semantic search, as they capture semantic meaning which is essential for these tasks.
Wait, but should I mention both contexts? The first context explains what embeddings are, but the question is about their use. The second context provides specific uses. So combining both, the answer would be that embeddings are used for capturing semantic meaning, which enables tasks like question-answering and semantic search. However, the user's question is "What are embeddings used for?" which is more about the applications rather than their definition. So the answer should focus on the uses mentioned in Context 2, but since Context 1 is also provided, maybe it's better to mention that they capture semantic meaning which allows them to be used in those tasks. But the answer should be concise. Let me check the example answer given. The example answer says: "Embeddings are used for capturing semantic meaning in text, which enables tasks like question-answering and semantic search, as demonstrated by models such as NVIDIA's nv-embedqa-e5-v5." So that combines both contexts. Therefore, the answer should include both the purpose (capturing semantic meaning) and the specific applications (question-answering and semantic search) based on the contexts provided.
</think>
Answer: Embeddings are used for capturing semantic meaning in text, which enables tasks like question-answering and semantic search, as demonstrated by models such as NVIDIA's nv-embedqa-e5-v5.
Used 2 context sources:
[1] Similarity: 0.4917
Text: Embeddings are vector representations of text that capture semantic meaning....
[2] Similarity: 0.3127
Text: The NVIDIA nv-embedqa-e5-v5 model is designed for question-answering and semanti...
Query: Tell me about Cloudera
2026-03-04 13:35:57,608 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:57,750 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:36:01,439 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:36:01,440 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, the user is asking about Cloudera. Let me check the provided contexts. Context 1 says Cloudera offers enterprise data cloud solutions for machine learning and analytics. That's the main point. The other context is about Python, which doesn't seem relevant here. Since the question is specifically about Cloudera, I should focus on Context 1. There's no additional info needed beyond what's in Context 1. The answer should state that Cloudera provides those solutions. I need to make sure not to include anything from Context 2 since it's unrelated. Let me phrase it clearly and concisely.
</think>
Answer: Cloudera provides enterprise data cloud solutions for machine learning and analytics.
Used 2 context sources:
[1] Similarity: 0.4652
Text: Cloudera provides enterprise data cloud solutions for machine learning and analy...
[2] Similarity: 0.2139
Text: Python is a high-level programming language known for its simplicity and readabi...
Example completed! Test the LLM directly:
uv run python3 examples/example_llm_usage.py
Or run the full pytest suite with a real agent:
RUN_REAL_LLM_TESTS=1 uv run pytest tests/test_llm_code_writing.py::TestRealLLMCodeWriting -v
Troubleshooting
Symptom: Endpoint URL required / API key required
What to check: Set CLOUDERA_EMBEDDING_URL and OPENAI_API_KEY (and for the LLM: CLOUDERA_LLM_URL, CLOUDERA_LLM_MODEL), or fix configs/config.json and configs/config-llm.json.
Symptom: LLM not configured
What to check: Ensure configs/config-llm.json exists with llm_endpoint.base_url and llm_endpoint.model, or that config.json has an llm_endpoint section, and that use_llm=True.
Symptom: 404 from LLM
What to check: The base URL must end with /v1 and match the exact deployment URL in Cloudera AI Platform (Deployments → Model Endpoints).
Symptom: Auth errors
What to check: Regenerate or refresh the JWT/API key and update the config or environment variables.
Symptom: Cursor not using Cloudera
What to check: In Cursor settings, confirm the single custom endpoint base URL, API key, and model name match your config-llm.json.
For more detail on Cursor setup and network enforcement, see CURSOR_INTEGRATION_GUIDE.md and CURSOR_ONLY_ENFORCEMENT.md in the project Git repo.
Best Practices
Always call enforce_cloudera_models() at the start of your application
Store your API keys securely using environment variables or a secrets manager
Handle configuration errors gracefully in your application
Monitor your Cloudera endpoint usage and set up alerts for configuration issues
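The checks in the troubleshooting table above can be automated at startup. A minimal sketch, assuming the configs/config-llm.json layout and environment variable names described in this article (the helper name validate_llm_config is hypothetical):

```python
import json
import os

def validate_llm_config(path="configs/config-llm.json"):
    """Check the LLM endpoint configuration for the common failure modes
    listed in the troubleshooting table. Returns a list of problems."""
    problems = []
    # Environment variables can stand in for the config file.
    env_url = os.environ.get("CLOUDERA_LLM_URL")
    env_model = os.environ.get("CLOUDERA_LLM_MODEL")

    cfg = {}
    if os.path.exists(path):
        with open(path) as f:
            cfg = json.load(f)
    endpoint = cfg.get("llm_endpoint", {})
    base_url = endpoint.get("base_url") or env_url
    model = endpoint.get("model") or env_model

    if not base_url:
        problems.append("LLM not configured: set llm_endpoint.base_url or CLOUDERA_LLM_URL")
    elif not base_url.rstrip("/").endswith("/v1"):
        # A 404 from the LLM usually means the base URL does not end with /v1.
        problems.append("base URL should end with /v1: " + base_url)
    if not model:
        problems.append("LLM not configured: set llm_endpoint.model or CLOUDERA_LLM_MODEL")
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("API key required: set OPENAI_API_KEY")
    return problems
```

Calling this before any model traffic surfaces the "Endpoint URL required", "LLM not configured", and "404 from LLM" symptoms as explicit messages instead of runtime failures.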
02-23-2026
08:58 AM
Government agencies and commercial entities must retain data for several years and commonly experience IT challenges due to increased data volumes and new sources coming online. Due to these factors, they are starting to see degradation in the performance of Security Information & Event Management (SIEM) tools like Splunk. To continue to meet mission needs, address the increase in data sources that require protection, and manage costs, they have started researching strategies that complement their SIEM investments while looking for solutions that meet or exceed their organization's policies. The following requirements must be met to increase the performance of Splunk and maximize IT investment:
Route data to utilize cost-effective data destinations
Preprocess incoming data by filtering and removing extraneous fields
Aggregate logs
Replay any data
Easily move workloads to Splunk Cloud
Standardize the onboarding of additional data sources
Provide observability of data pipelines
This blog will focus on how agencies use Cloudera Data Flow for universal data distribution as a solution for Splunk optimization, with the technical details required to re-create this work. I will not provide details on how to deploy Cloudera Data Flow; instead, I have provided sample data files and a template for a data flow, which can be found here. I have also included a data flow template demonstrating how to route, filter, and aggregate data from Windows Application Event or IP Stream logs. This approach is designed to be hybrid in nature, meaning it can be used on-premises, in the cloud, or as a combination of both. Due to the flexibility of this architecture, it can be slowly phased into an existing Splunk deployment without interrupting the service. Routing Data To Utilize Cost-Effective Data Destinations There are many reasons agencies are seeking to leverage more cost-effective data storage locations to continue to scale and meet business SLAs.
They also need flexibility in terms of supporting future architectures and applications. Currently, object storage has become the industry standard due to its ability to:
Grow with their needs as data volumes increase
Continue to perform analytics at scale
Allow organizations to leverage data tiers (hot, medium, or cold) for storage optimization
Facilitate a hybrid architecture
There are many different software and hardware vendors that offer object storage products, and many of them provide an API that is compliant with the AWS S3 API. For this blog, I will be using the native capabilities of Cloudera Data Flow to demonstrate writing a file as an object into a specific bucket location. To complete this task, the NiFi processor PutS3Object is required. In this image, I will be using the QueryRecord NiFi processor to query and route the corresponding results to a file, which can be set to a different file format (CSV, JSON). Once the output file has been created, we can write this to a pre-existing object storage bucket. This bucket can be hosted on-premise or in AWS cloud infrastructure. In this set of images, I am showing the configuration values of the PutS3Object NiFi processor to communicate with the object storage instance. This example is using an on-premise deployment of object storage using Apache Ozone. The IP address of the host containing the Ozone Manager is required to complete this step. The user will also need to create a bucket in this object storage instance. If you're using S3 in AWS's cloud, you will need to provide the proper Region of the bucket, and you can leave the Endpoint Override URL value empty. The user values may need to be adjusted for your environment, but I am using the user "hadoop" for these values. Please note that I have security enforcement on the AWS S3 protocol turned off. If security enforcement is turned on, the user must provide the values for Access Key ID and Secret Access Key that match their AWS credentials.
The configuration fields that are important to enter correctly are:
Object Key: $(unknown)
Bucket: bucket1
Storage Class: ReducedRedundancy
Endpoint Override URL: http://<Object Store Host IP>:<Object Store Port>
Use Path Style Access: true
Access Key: testUser
Secret Access Key: 123456
Pre-Process Incoming Data by Filtering and Removing Extraneous Fields To improve the performance of Splunk or other SIEMs, it is critical to be selective about the data sent to be indexed and searched in the future. There are many advantages to limiting the amount of data an analyst has to sift through later. With less data to search through, queries complete quicker, and critical events can be given more attention or immediate action. For filtering data, Cloudera Data Flow can complete this task using native processors. Data can also be filtered based on any data or metadata associated with the flow file. In the following example, I will be using the event severity level as a filter to determine if data should be routed to our SIEM or to object storage. This example will pull the severity level, which is a metadata value, and only send events that are non-informational to our SIEM (i.e., Splunk). Informational events will be routed to our object storage bucket. To complete this task, I will use the QueryRecord NiFi processor to query incoming records for severity level and event codes. This acts as a filter and routes the results based on the query. In this example, the following SQL statements are used for severity and event codes. Please note, the RPATH call is used to traverse through nested JSON elements in the input file.
Severity - Information: SELECT * FROM FLOWFILE WHERE RPATH("result", '/severity') = 'Information'
Severity - Warning: SELECT * FROM FLOWFILE WHERE RPATH("result", '/severity') = 'Warning'
Event Code 102: SELECT * FROM FLOWFILE WHERE RPATH("result", '/EventCode') = '102'
Event Code 1001: SELECT * FROM FLOWFILE WHERE RPATH("result", '/EventCode') = '1001'
Removing unrequired or extra fields can be accomplished using the built-in SQL capabilities in the QueryRecord NiFi processor. Once those attributes are present in the query, they will be added to the output file that is sent to a SIEM or an object storage instance. Aggregate Logs Log aggregation can provide organizations with greater control over the flow of data through their infrastructure as well as control over how data is written to its destination. In many ways, this can be considered transformation in an ETL process. By taking advantage of the native capabilities of the QueryRecord NiFi processor, the results from the queries are merged together out of the box. This allows developers to combine all of the results of the query into a single output file. The following images show the results of the query merged into a single output file. Replay Any Data All NiFi processors can replay any data that has been processed through them. This capability is a valuable tool for development or debugging purposes and is very easy to use. In the following images, a user must right-click on the processor of interest and select the option to view data provenance. A list of events will be displayed to the user, and one of those events can be selected. Under the Content tab, the replay button is available, which can be clicked to replay that data. Easily Move Workloads to Splunk Cloud with Built-In Connectors Moving data into a cloud-based Splunk instance can be accomplished using the native NiFi built-in processor called PutSplunk.
This processor easily allows organizations to push data from their on-premise or cloud environment to Splunk instances regardless of location. By taking advantage of this capability, SIEM workloads can be straightforwardly moved to the cloud in parallel as depicted below. The configuration of this processor must be set with the hostname of the Splunk instance. The following images display the processor and configuration screen. For this blog, the specific Splunk values have not been included. Standardize the Onboarding of Additional Data Sources One of the significant advantages of using Cloudera Data Flow is the ability to standardize how new data sources get collected and moved throughout an enterprise - build once, use many times. By giving teams a standard approach to ingesting new data sources, organizations can streamline their process to acquire new datasets and use data in their missions or feed downstream applications. This approach decouples data sources from their destinations and brings immediate business value. Provide Observability of Data Pipelines Organizations require observability of their data pipelines. Observability is critical in validating chains of custody for audits and validation, but it's also essential for performance. Cloudera Data Flow keeps a very granular level of detail about each piece of data that it ingests. According to the Apache NiFi documentation, “As the data is processed through the system, transformed, routed, split, aggregated, and distributed to other endpoints, an audit trail is created and stored within Apache NiFi's Provenance Repository.” This implies that any and all steps that are used to process data can be stored and tracked. We can select Data Provenance from the Global Menu. All provenance events will be listed, and the complete data provenance can be viewed by selecting the icon on the right-hand side of the event. 
The out-of-the-box observability capabilities of Cloudera Data Flow allow for a comprehensive view of data pipelines, which is critical for organizations today. The following images detail how the complete data provenance listing can be selected for all events. The last image displays all of the steps used for a particular event from the moment the data was acquired and ingested into Cloudera Data Flow. Conclusion Government agencies and commercial entities will need to continue to address growing requirements and IT challenges. They will need solutions that are flexible enough to enable hybrid deployments, possess the ability to scale to the growing volume of data, and complement existing IT investments like Splunk. In addition to these IT challenges, they will need a universal solution that can ingest data from new sources as they come online, such as IoT devices, and deliver data to destinations, such as cloud-based applications or future storage devices. Due to these factors, using Cloudera Data Flow allows organizations to address the degradation in the performance of their SIEMs and to continue to meet and exceed the future needs of the mission. Get Started Today Wherever you are on your hybrid cloud journey, a first-class data distribution service is critical for successfully adopting a modern hybrid data stack. Cloudera Data Flow for the Public Cloud provides a universal, hybrid, and streaming-first data distribution service that enables customers to gain control of their data flows. Take our interactive product tour to get an impression of Cloudera Data Flow in action or sign up for a free trial.
08-02-2023
07:47 AM
4 Kudos
Introduction:
This article is designed to get the reader to call a deployed CML model from Apache NiFi or Cloudera Data Flow as quickly as possible. It will require using one of Cloudera's Applied Machine Learning Prototypes (AMPs) on Customer Churn and the Apache NiFi Flow provided in this article. (Git Repo Here)
Prerequisites:
Apache NiFi instance is running.
CML's AMP Customer Churn App is running with a Churn Model deployed.
Download the user data file and Apache NiFi flow from this repo.
The user data file needs to be added to the file location where NiFi's GetFile Processor has permissions.
Launch CML Churn AMP
In the AMP catalog, select Churn Modeling with scikit-learn, and configure a new project. This will take 5-7 minutes to complete. You will need to deploy a model inside of CML, which isn't in the scope of this article. Once you finish, you will be able to see the following screen, which will provide the URL location and API key for the model.
NiFi Flow Overview:
This article will include a single customer file in JSON and an Apache NiFi Data flow, which will need to be configured to call the model deployed in CML. The following screenshot displays the template once it has been added to the working canvas.
Processor Configuration:
GetFile Processor
In the GetFile processor, the input directory needs to contain the customer JSON file that is included with the article.
EvaluateJsonPath Processor
The included Data Flow file has this processor configured, but this step is essential to understand. This processor will take values from the JSON file and add them as attribute values. Once these values are attributes, they can be used to call the model in the next step.
InvokeHTTP Processor
You will need to add the URL location of the model deployed in CML, which is pictured above. You will also need to include the API key with the model call in the image below:
Successful Results
Once the processors have been configured, the model can be called, and the results can be displayed by opening the queue and selecting the response value.
Here is an example of the deployed model's response.
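Outside of NiFi, the same call the InvokeHTTP processor makes can be sketched in Python with the standard library. The URL and access key below are placeholders, and the {"accessKey": ..., "request": ...} body shape is an assumption about the deployed CML model's API — verify it against the sample request shown on your model's page in CML:

```python
import json
import urllib.request

def build_model_request(url, access_key, features):
    """Build the HTTP request that calls a deployed CML model,
    mirroring what the InvokeHTTP processor sends.
    NOTE: the {"accessKey": ..., "request": ...} body shape is an
    assumption; check your model's sample request in CML."""
    body = json.dumps({"accessKey": access_key, "request": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical usage (nothing is sent until urllib.request.urlopen(req) is called):
req = build_model_request(
    "https://modelservice.example.com/model",  # placeholder model URL
    "my-access-key",                           # placeholder API key
    {"StreamingMovies": "No", "Contract": "Month-to-month"},
)
```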
12-20-2018
07:21 PM
4 Kudos
This article is designed to cover the steps needed to convert an existing Spark pipeline model into an MLeap bundle. To follow the steps required to build the pipeline model used in the tutorial, please download the complete Apache Zeppelin notebook here on my GitHub site.
Tools Used In Tutorial
Spark 2.3
MLeap Spark 0.13
Step 1 - Load MLeap Spark Dependencies This dependency is required to convert Spark models into MLeap bundles. In the following section of code, dependency files are added to my Zeppelin notebook. Keep in mind this dependency needs to be present on Maven. %dep
//In Zeppelin
//Run this before initializing Spark Context
z.load("ml.combust.mleap:mleap-spark_2.11:0.13.0")
Step 2 - Build Pipeline Model The Spark pipeline model needs to be built before it can be converted. In this section of code, a Spark pipeline model is built from previously defined stages. To view the complete code list, please download the notebook from my Github site. A pipeline model will be built once it is fitted against a Spark dataframe. %spark2
//In Zeppelin
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
// Build pipeline from previously defined stages
val pipeline = new Pipeline().setStages(Array(TFtokenizer, hashingTF, TFIDF))
// Fit pipeline on a dataframe
val PipeLineModel = pipeline.fit(df_abstracts)
// Save Pipeline to Disk (Optional)
PipeLineModel.write.overwrite().save("/tmp/spark-pipelineModel-tfidf")
// Save the dataframe's schema
val schema = df_abstracts.schema
Step 3 - Export Spark Pipeline Model Into MLeap Bundle (Zip and Directory) Using the MLeap libraries, the Spark pipeline model can be converted into an MLeap bundle. In the following section of code, the Spark pipeline model is converted into an MLeap bundle. I have included code for zip and directory serialization. Check the file system to see if the files were created. %spark2
//In Zeppelin
import ml.combust.bundle.BundleFile
import ml.combust.bundle.serializer.SerializationFormat
import org.apache.spark.ml.mleap.SparkUtil
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle._
import resource._
//Init Spark Bundle Context to MLeap
val sbc = SparkBundleContext().withDataset(PipeLineModel.transform(df_abstracts))
// Serialize Pipeline to Zip
// *** This will fail if the file exists ***
(for(bundlefile <- managed(BundleFile("jar:file:/tmp/MLeapModels/spark-nlp-tfidf-pipeline.zip"))) yield {PipeLineModel.writeBundle.save(bundlefile)(sbc).get}).tried.get
//Serialize Pipeline to Directory
for(bundle <- managed(BundleFile("file:/tmp/MLeapModels/spark-nlp-tfidf-pipeline-dir"))) {PipeLineModel.writeBundle.format(SerializationFormat.Json).save(bundle)(sbc)}
Step 4 - Import MLeap Bundle (Optional) An optional step is to load the converted MLeap bundle. In the following section of code, the MLeap bundle is loaded from a file. %spark2
//In Zeppelin
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._
// Deserialize a zip bundle
val bundle = (for(bundleFile <- managed(BundleFile("jar:file:/tmp/MLeapModels/spark-nlp-tfidf-pipeline.zip"))) yield { bundleFile.loadMleapBundle().get}).opt.get
08-03-2018
06:54 PM
6 Kudos
This tutorial is designed to extend my other work, and the goal is to provide the reader a framework for a complete end-to-end solution for sentiment analysis on streaming Twitter data. The solution is going to use many different components in the HDF/HDP stack including NiFi, Solr, Spark, and Zeppelin, and it will also utilize libraries that are available in the Open Source Community. I have included the Reference Architecture, which depicts the order of events for the solution. *Note: Before starting this tutorial, the reader would be best served reading my other tutorials listed below: Twitter Sentiment using Spark Core NLP in Apache Zeppelin Connecting Solr to Spark - Apache Zeppelin Notebook *Prerequisites Established Apache Solr collections called "Tweets" and "TweetSentiment" Downloaded and installed Open Scoring Completed this Tutorial: Build and Convert a Spark NLP Pipeline into PMML in Apache Zeppelin Step 1 - Download and deploy the NiFi flow. The NiFi flow can be found here: tweetsentimentflow.xml Step 2 - In the NiFi flow, configure the "Get Garden Hose" NiFi processor with your valid Twitter API credentials.
Consumer Key, Consumer Secret, Access Token, and Access Token Secret. Step 3 - In the NiFi flow, configure both of the "PutSolrContentStream" NiFi processors with the location of your Solr instance. In this example, the Solr instance is located on the same host. Step 4 - From the command line, start the OpenScoring server. Please note, OpenScoring was downloaded to the /tmp directory in this example. $ java -jar /tmp/openscoring/openscoring-server/target/openscoring-server-executable-1.4-SNAPSHOT.jar --port 3333
Step 5 - Start building the corpus of Tweets, which will be used to train the Spark Pipeline model in the following steps. In the NiFi flow, turn on the processors that will move the raw Tweets into Solr. Step 6 - In the NiFi flow, make sure the "Get File" processor remains turned off until the model has been trained and deployed. Step 7 - Validate the flow is working as expected by querying the Tweets Solr collection in the Solr UI. Step 8 - Follow the tutorial to Build and Convert a Spark NLP Pipeline into PMML in Apache Zeppelin, and save the PMML object as TweetPipeline.pmml to the same host that is running OpenScoring. Step 9 - Use OpenScoring to deploy the PMML model based on the Spark Pipeline. $ curl -X PUT --data-binary @TweetPipeline.pmml -H "Content-type: text/xml" http://localhost:3333/openscoring/model/TweetPipe Step 10 - In the NiFi flow, configure the "InvokeHTTP" NiFi processor with the host information of the OpenScoring API endpoint. Step 11 - In the NiFi flow, enable all of the processors, and validate the flow is working as expected by querying the TweetSentiment Solr collection in the Solr UI.
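Once the model is deployed, the InvokeHTTP processor is simply issuing a scoring POST against the OpenScoring endpoint. A minimal Python sketch of building that request (the "text_t" argument field mirrors the tweet text field used in this flow; the {"id": ..., "arguments": ...} body follows OpenScoring's evaluation API, but verify against your OpenScoring version):

```python
import json
import urllib.request

def build_scoring_request(host, model_name, arguments):
    """Build an OpenScoring evaluation request for a deployed PMML model.
    OpenScoring expects a JSON body of the form {"id": ..., "arguments": {...}}."""
    url = "http://{}/openscoring/model/{}".format(host, model_name)
    body = json.dumps({"id": "example-001", "arguments": arguments}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical usage; nothing is sent until urllib.request.urlopen(req) is called.
req = build_scoring_request("localhost:3333", "TweetPipe", {"text_t": "great product"})
```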
08-02-2018
04:13 PM
7 Kudos
This article is designed to extend my articles Twitter Sentiment using Spark Core NLP in Apache Zeppelin and Connecting Solr to Spark - Apache Zeppelin Notebook.
I have included the complete notebook, which can be found on my GitHub site.
Step 1 - Follow the tutorial in the provide articles above, and establish an Apache Solr collection called "tweets"
Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark to the version of the Solr-Spark connector. In the example below, the version of Spark is 2.2.0, and the connector version is 3.4.4.
%spark2
sc
sc.version
Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.
%dep
z.load("com.lucidworks.spark:spark-solr:jar:3.4.4")
//Must be used before SparkInterpreter (%spark2) initialized
//Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter
Step 4 - Download the Stanford CoreNLP libraries found here: http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
Unzip the download and move it to the /tmp directory. Note: This can be accomplished on the command line, or the following Zeppelin paragraph will work as well.
%sh
wget -O /tmp/stanford-corenlp-full-2018-02-27.zip http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip /tmp/stanford-corenlp-full-2018-02-27.zip
Step 5 - In Zeppelin's Interpreter configurations for Spark, include the following artifact: /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar
Step 6 - Include the following Spark dependencies for Stanford CoreNLP and Spark CoreNLP. Important note: This needs to be run before the Spark Context has been initialized.
%dep
z.load("edu.stanford.nlp:stanford-corenlp:3.9.1")
//In Spark Interper Settings Add the following artifact
// /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar
%dep
z.load("databricks:spark-corenlp:0.2.0-s_2.11")
Step 7 - Include the following Spark dependencies for JPMML-SparkML and JPMML-Model. Important note: This needs to be run before the Spark Context has been initialized.
%dep
z.load("org.jpmml:jpmml-sparkml:1.4.5")
%dep
z.load("org.jpmml:pmml-model:1.4.5")
Step 8 - Run a Solr query and return the results into a Spark DataFrame. Note: The Zookeeper host might need to use full names: "zkhost" -> "host-1.domain.com:2181,host-2.domain.com:2181,host-3.domain.com:2181/solr"
%spark2
val options = Map(
  "collection" -> "Tweets",
  "zkhost" -> "localhost:2181/solr"
  // "query" -> "Keyword, 'More Keywords'"
)
val df = spark.read.format("solr").options(options).load
df.cache()
Step 9 - Review the results of the Solr query.
%spark2
df.count()
df.printSchema()
df.take(1)
Step 10 - Filter the Tweets in the Spark DataFrame to ensure the Tweet text isn't null. Once the filter has been completed, add the sentiment value to the tweets.
%spark2
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import com.databricks.spark.corenlp.functions._
val df_TweetSentiment = df.filter("text_t is not null").select($"text_t", sentiment($"text_t").as('sentimentScore))
Step 11 - Validate the results.
%spark2
df_TweetSentiment.cache()
df_TweetSentiment.printSchema()
df_TweetSentiment.take(1)
df_TweetSentiment.count()
Step 12 - Build the stages that will create features to be fed into a Logistic Regression model for classification.
Stage 1 - Regex Tokenizer will be used to separate each word into individual "tokens"
Stage 2 - Count Vectorizer will count the number of occurrences of each token in the text corpus
Stage 3 - Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus
Stage 4 - Logistic Regression for classification to predict the sentiment score
%spark2
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer, RegexTokenizer, CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.classification.LogisticRegression
val tokenizer = new RegexTokenizer()
.setInputCol("text_t")
.setOutputCol("words")
.setPattern("\\W+")
.setGaps(true)
val wordsData = tokenizer.transform(df_TweetSentiment)
val cvModel = new CountVectorizer()
.setInputCol("words").setOutputCol("rawFeatures")
.setMinDF(4)
.fit(wordsData)
val featurizedData = cvModel.transform(wordsData)
val idf = new IDF()
.setInputCol("rawFeatures")
.setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("sentimentScore", "features").show()
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
.setLabelCol("sentimentScore")
Step 13 - Build the Spark Pipeline from the stages.
%spark2
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
val pipeline = new Pipeline()
.setStages(Array(tokenizer, cvModel, idfModel, lr))
val PipeLineModel = pipeline.fit(df_TweetSentiment)
//Save Pipeline to Disk (Optional)
PipeLineModel.write.overwrite().save("/tmp/spark-IDF-model")
val schema = df_TweetSentiment.schema
Step 14 - Export the Spark Pipeline to PMML using JPMML-SparkML.
%spark2
import org.jpmml.sparkml.PMMLBuilder
import java.io.File
val pmml = new PMMLBuilder(schema, PipeLineModel)
val file = pmml.buildFile(new File("/tmp/TweetPipeline.pmml"))
05-23-2018
06:00 PM
Something like this will do the trick assuming the column name is timestamp_s. This will create a new data frame with a new timestamp column added to it. The new timestamp format is defined by this string: "EEE MMM dd HH:mm:ss ZZZZZ yyyy" val df_second = df_First.withColumn("timestampZ", unix_timestamp($"timestamp_s","EEE MMM dd HH:mm:ss ZZZZZ yyyy").cast(TimestampType)).drop($"timestamp_s")
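As a side note, the same "EEE MMM dd HH:mm:ss ZZZZZ yyyy" pattern (Twitter's created_at layout) maps to Python's strptime directives like this — an illustrative equivalent, not part of the Spark solution:

```python
from datetime import datetime

# The Java/Spark pattern "EEE MMM dd HH:mm:ss ZZZZZ yyyy" corresponds to
# "%a %b %d %H:%M:%S %z %Y" in Python's strptime.
def parse_twitter_timestamp(s):
    return datetime.strptime(s, "%a %b %d %H:%M:%S %z %Y")

ts = parse_twitter_timestamp("Wed May 23 18:00:00 +0000 2018")
```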
05-23-2018
03:58 PM
6 Kudos
This article is designed to extend the great work by @Ali Bajwa: Sample HDF/NiFi flow to Push Tweets into Solr/Banana, HDFS/Hive and my article Connecting Solr to Spark - Apache Zeppelin Notebook
I have included the complete notebook, which can be found on my GitHub site.
Step 1 - Follow Ali's tutorial to establish an Apache Solr collection called "tweets"
Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark to the version of the Solr-Spark connector. In the example below, the version of Spark is 2.2.0, and the connector version is 3.4.4.
%spark2
sc
sc.version
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@617d134a
res1: String = 2.2.0.2.6.4.0-91
Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.
%dep
z.load("com.lucidworks.spark:spark-solr:jar:3.4.4")
//Must be used before SparkInterpreter (%spark2) initialized
//Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter
Step 4 - Download the Stanford CoreNLP libraries found here: http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip. Unzip the download and move it to the /tmp directory. Note: This can be accomplished on the command line, or the following Zeppelin paragraph will work as well.
%sh
wget -O /tmp/stanford-corenlp-full-2018-02-27.zip http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip /tmp/stanford-corenlp-full-2018-02-27.zip
Step 5 - In Zeppelin's Interpreter configurations for Spark, include the following artifact: /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar
Step 6 - Include the following Spark dependencies for Stanford CoreNLP and Spark CoreNLP. Important note: This needs to be run before the Spark Context has been initialized.
%dep
z.load("edu.stanford.nlp:stanford-corenlp:3.9.1")
//In Spark Interper Settings Add the following artifact
// /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar
%dep
z.load("databricks:spark-corenlp:0.2.0-s_2.11")
Step 7 - Run a Solr query and return the results into a Spark DataFrame. Note: The Zookeeper host might need to use full names: "zkhost" -> "host-1.domain.com:2181,host-2.domain.com:2181,host-3.domain.com:2181/solr"
%spark2
val options = Map(
"collection" -> "Tweets",
"zkhost" -> "localhost:2181/solr"
// "query" -> "Keyword, 'More Keywords'"
)
val df = spark.read.format("solr").options(options).load
df.cache()
Step 8 - Review results of the Solr query
%spark2
df.count()
df.printSchema()
df.take(1)
Step 9 - Filter the Tweets in the Spark DataFrame to ensure the timestamp and language aren't null. Once the filter has been completed, add the sentiment value to the tweets.
%spark2
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import com.databricks.spark.corenlp.functions._
val df_TweetSentiment = df.filter("text_t is not null and language_s = 'en' and timestamp_s is not null ").select($"timestamp_s", $"text_t", $"location", sentiment($"text_t").as('sentimentScore))
Step 10 - Correctly cast the timestamp value.
%spark2
val df_TweetSentiment1 = df_TweetSentiment.withColumn("timestampZ", unix_timestamp($"timestamp_s", "EEE MMM dd HH:mm:ss ZZZZZ yyyy").cast(TimestampType)).drop($"timestamp_s")
Step 11 - Validate the results and create the temporary table TweetSentiment.
df_TweetSentiment1.printSchema()
df_TweetSentiment1.take(1)
df_TweetSentiment1.count()
df_TweetSentiment1.cache()
df_TweetSentiment1.createOrReplaceTempView("TweetSentiment")
Step 12 - Query the table TweetSentiment.
%sql
select sentimentScore, count(sentimentScore) from TweetSentiment group by sentimentScore
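The %sql group-by above simply counts tweets per sentiment score. The equivalent aggregation, sketched in plain Python over a list of scored rows (illustrative only, with made-up sample scores):

```python
from collections import Counter

# Each value mirrors a sentimentScore pulled from the TweetSentiment table.
sample_scores = [1, 2, 1, 0, 2, 2]  # made-up sample data

def count_by_sentiment(scores):
    """Group-by-and-count, like:
    select sentimentScore, count(sentimentScore) from TweetSentiment group by sentimentScore"""
    return dict(Counter(scores))

counts = count_by_sentiment(sample_scores)
```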
05-23-2018
03:52 PM
4 Kudos
This article is designed to extend the great work by @Ali Bajwa: Sample HDF/NiFi flow to Push Tweets into Solr/Banana, HDFS/Hive
I have included the complete notebook on my Github site, which can be found here
Step 1 - Follow Ali's tutorial to establish an Apache Solr collection called "tweets"
Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark to the version of the Solr-Spark connector. In the example below, the version of Spark is 2.2.0, and the connector version is 3.4.4.
%spark2
sc
sc.version
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@617d134a
res1: String = 2.2.0.2.6.4.0-91
Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.
%dep
z.load("com.lucidworks.spark:spark-solr:jar:3.4.4")
//Must be used before SparkInterpreter (%spark2) initialized
//Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter
Step 4 - Run a Solr query and return the results into a Spark DataFrame. Note: The Zookeeper host might need to use full names: "zkhost" -> "host-1.domain.com:2181,host-2.domain.com:2181,host-3.domain.com:2181/solr"
%spark2
val options = Map(
"collection" -> "Tweets",
"zkhost" -> "localhost:2181/solr"
// "query" -> "Keyword, 'More Keywords'"
)
val df = spark.read.format("solr").options(options).load
df.cache()
Step 5 - Review the results of the Solr query.
%spark2
df.count()
df.printSchema()
df.take(1)