Cloudera AI Code Assistant Development Using CursorAI and CAI Inference Service

cursor_blog.png

As agencies build out their AI Factories, the biggest challenge is often bringing complex models into secure application development workflows. By leveraging LLMs hosted on Cloudera AI Inference Service, organizations can work directly in IDEs such as CursorAI to get AI-assisted coding without the risk of IP leakage, governance policy violations, or data sovereignty issues. The result is enterprise-ready, developer-friendly, and seamlessly integrated.

Why

AI-assisted coding tools powered by LLMs are changing the game. Organizations that adopt them in their software development practice are seeing order-of-magnitude increases in team productivity, reduced time to market for features, and higher team satisfaction. In other words, this shift is redefining how teams are structured and staffed while accelerating the delivery of high-demand features.

Naturally, there are risks associated with any new methodology, and this is no exception. By sending code-development prompts to publicly hosted LLMs, organizations run the risk of pushing secure or sensitive proprietary information to the models as prompt context. This can expose information about the enterprise's customers or user base, and it can include the application code itself, which nefarious actors could later mine for zero-day exploits or other black-hat activity.

By leveraging LLMs hosted on Cloudera AI Inference Service, organizations can work directly in IDEs such as CursorAI while keeping every prompt and completion inside their own environment. Developers and practitioners get the speed and fluency of AI-native tooling without compromising on enterprise compliance, IP protection, or data sovereignty.

What 

By requiring software development teams to use only LLMs hosted on Cloudera AI Inference Service, organizations can leverage AI coding assistants in air-gapped environments, meet compliance requirements, control costs, and adhere to security policies.

CursorAI is a powerful IDE for working with Cloudera-hosted AI models, providing seamless integration with agent-based coding assistants for building intelligent applications. The companion library in this blog's GitHub repo offers a unified interface to deploy, manage, and query models with enterprise-grade security and scalability. Cloudera's coding agent helps you interact with and manage your AI workflows. It provides:

  • Natural language interface to work with models
  • Automated model selection and optimization
  • Integration with Cloudera's data platform
  • Secure access to enterprise AI code assistant capabilities

How

For this technical blog, I will be using the CursorAI IDE because it offers agent-based code assist that can be configured to use models with custom endpoints. Please feel free to read more about their product offering. Other IDE options are available and follow a very similar process.

Who

This blog is written for several audiences: technical practitioners, mission/product owners, and decision makers. Technical practitioners, the builders of the software products, can reproduce the results by following the detailed instructions below to dramatically speed up the software development process. Product/mission owners can leverage this document to meet their timelines for new product features and new releases, including bug fixes; in other words, to keep their customers happy. Decision makers can leverage this document to architect AI code assist solutions in private environments, allowing their teams to take advantage of the latest technical innovations while maintaining enterprise security and data sovereignty.

Getting Started

To follow this blog, you will need an instance of Cloudera AI with an LLM hosted in Cloudera AI Inference Service, plus the ability to access or generate an API key for the models hosted there. From there, create a new project in the CAI workbench using the GitHub repo below to leverage the prebuilt testing and validation scripts: https://github.com/BrooksIan/Cloudera-Inference-With-CursorAI

What You’re Working With:

  • Cursor – AI-powered IDE that can use custom, OpenAI-compatible endpoints for chat and completions.
  • Cloudera AI Platform – Hosts embedding and LLM endpoints (e.g. NVIDIA Nemotron, E5 embeddings) in your environment.
  • ClouderaAgent – Python agent in this repo that talks to those Cloudera endpoints: embeddings for semantic search and (optionally) an LLM for RAG and chat-style generation.

This guide focuses on using the ClouderaAgent in your code (embeddings, LLM, RAG) and on pointing Cursor at your Cloudera LLM so you can code with Cursor while all model traffic stays on Cloudera.

Prerequisites

  • Cloudera AI Platform access
  • Cloudera endpoint URLs (for embeddings and/or LLM)
  • API key (JWT token) from Cloudera
  • Cursor IDE installed (optional, for agent window integration)

Installation

# 1. Clone the repository
git clone https://github.com/BrooksIan/Cloudera-Inference-With-CursorAI.git
cd Cloudera-Inference-With-CursorAI

# 2. Create virtual environment
python3 -m venv venv

# 3. Activate virtual environment
# On macOS/Linux:
source venv/bin/activate

# On Windows:
# venv\Scripts\activate

# 4. Install dependencies
pip install -r requirements.txt
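
After installing, a quick import check confirms the environment is wired up before you touch any endpoints. This is a minimal sketch, assuming you run it from the repo root inside the activated virtual environment; `agents` is the package this repo provides.

# smoke_test.py - minimal sketch: confirm the repo's agents package imports
# (run from the repo root inside the activated venv)
from agents import create_cloudera_agent

print("agents package importable:", callable(create_cloudera_agent))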

Configuration

The user will need to create or modify a configuration file, which contains the endpoint settings and credentials used by the project; the same values can also be supplied as environment variables.

Configuration Files

The ClouderaAgent reads:

  • Embeddings: configs/config.json (or env vars).
  • LLM (optional): configs/config-llm.json or an llm_endpoint section in config.json.

Embedding config (configs/config.json)

{
  "endpoint": {
    "base_url": "https://YOUR-CLOUDERA-SITE/namespaces/serving-default/endpoints/YOUR-EMBEDDING-ENDPOINT/v1"
  },
  "models": {
    "query_model": "nvidia/nv-embedqa-e5-v5-query",
    "passage_model": "nvidia/nv-embedqa-e5-v5-passage"
  },
  "api_key": "YOUR_JWT_OR_API_KEY",
  "embedding_dim": 1024
}

LLM config (configs/config-llm.json)

{
  "llm_endpoint": {
    "base_url": "https://YOUR-CLOUDERA-SITE/namespaces/serving-default/endpoints/YOUR-LLM-ENDPOINT/v1",
    "model": "nvidia/llama-3.3-nemotron-super-49b-v1"
  },
  "api_key": "YOUR_JWT_OR_API_KEY"
}

Use the exact endpoint paths and model IDs from your Cloudera AI Platform deployment. Prefer environment variables or a secrets manager over committing real keys.
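
If you go the environment-variable route, a minimal sketch of loading the same settings in Python could look like this. The variable names follow the Troubleshooting section below; the fallback-to-file logic is an illustrative assumption, not ClouderaAgent's actual loading code.

# Illustrative only: read LLM settings from environment variables,
# falling back to configs/config-llm.json if any are unset.
import json
import os

def load_llm_settings(path="configs/config-llm.json"):
    base_url = os.environ.get("CLOUDERA_LLM_URL")
    model = os.environ.get("CLOUDERA_LLM_MODEL")
    api_key = os.environ.get("OPENAI_API_KEY")
    if base_url and model and api_key:
        return {"base_url": base_url, "model": model, "api_key": api_key}
    with open(path) as f:
        cfg = json.load(f)
    return {
        "base_url": cfg["llm_endpoint"]["base_url"],
        "model": cfg["llm_endpoint"]["model"],
        "api_key": cfg["api_key"],
    }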

Pointing Cursor AI at Cloudera LLM

To make Cursor's AI features (chat and completions) use your Cloudera-hosted LLM instead of third-party APIs:

  1. In Cursor: Settings → Features → AI (or similar).
  2. Use a single custom OpenAI-compatible endpoint.
  3. Set:
    • Base URL: same as your LLM base_url in config-llm.json (e.g. https://.../v1).
    • API Key: same JWT/api key that works for that endpoint.
    • Model: same model as in config (e.g. nvidia/llama-3.3-nemotron-super-49b-v1).

Then Cursor’s model traffic goes to Cloudera AI Platform only. For project-scoped settings and locking Cursor to Cloudera-only traffic, see:

  • Workspace / project-specific: CURSOR_QUICK_START.md (workspace profile, create_cursor_workspace.py).
  • Cursor-only network rules: CURSOR_ONLY_ENFORCEMENT.md.
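
Before changing Cursor's settings, it can help to sanity-check the endpoint outside the IDE. Because the endpoint is OpenAI-compatible, the standard openai Python package should work with the same base URL, key, and model from config-llm.json; this snippet is a sketch, not part of the repo.

# Sanity check: send one chat completion to the Cloudera-hosted LLM
# through its OpenAI-compatible API. Assumes `pip install openai`.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-CLOUDERA-SITE/namespaces/serving-default/endpoints/YOUR-LLM-ENDPOINT/v1",
    api_key="YOUR_JWT_OR_API_KEY",
)
resp = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=[{"role": "user", "content": "Reply with OK if you can hear me."}],
    max_tokens=20,
)
print(resp.choices[0].message.content)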

Validation Test of Configuration

In the CursorAI agent window, enter the following prompt:

Verify ClouderaAgent Configuration with models hosted in Cloudera AI

CusorAgentWindow.png

Using the ClouderaAgent in Code

Create the Agent

from agents import create_cloudera_agent

# Embeddings + LLM (if config-llm.json or llm in config.json exists)
agent = create_cloudera_agent(use_llm=True)

# Embeddings only (no LLM)
agent = create_cloudera_agent(use_llm=False)

With explicit paths:

agent = create_cloudera_agent(
    config_file="configs/config.json",
    llm_config_file="configs/config-llm.json",
    use_llm=True
)

The agent is a ClouderaAgent instance: it has an embedding client, an in-memory vector store, and optionally an LLM client.

Using Embeddings Only (Semantic Search)

Use the agent’s embedding client for query and passage vectors, or use the built-in vector store.

Direct embeddings:

from agents import create_cloudera_agent

agent = create_cloudera_agent(use_llm=False)

# Query embedding (e.g. for search)
query_vec = agent.embedding_client.embed_query("How do I deploy a model?")

# Passage embedding (e.g. for indexing documents)
passage_vec = agent.embedding_client.embed_passage(
    "To deploy a model, use the Cloudera AI Platform UI or API."
)

# Batch passages
texts = ["Document one...", "Document two..."]
vecs = agent.embedding_client.embed_batch(texts, use_passage=True)

Search with the built-in vector store:

agent.add_knowledge([
    "Cloudera AI Platform supports NVIDIA Nemotron and E5 models.",
    "You can deploy models via the UI or the API.",
])

results = agent.retrieve_context("How do I deploy a model?", top_k=3)
for r in results:
    print(r["text"], r["similarity"])

ClouderaAgent + embeddings = semantic search over your own text, all via Cloudera-hosted models.
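
For intuition, semantic search of this kind typically ranks passages by cosine similarity between the query vector and each passage vector. Here is a minimal NumPy sketch; the repo's SimpleVectorStore may implement the ranking differently.

# Illustrative only: rank passages by cosine similarity to a query vector.
import numpy as np

def top_k_by_cosine(query_vec, passage_vecs, texts, k=3):
    q = np.asarray(query_vec, dtype=float)
    P = np.asarray(passage_vecs, dtype=float)
    # Row-wise dot products normalized by vector lengths = cosine similarity
    sims = P @ q / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-10)
    order = np.argsort(sims)[::-1][:k]
    return [(texts[i], float(sims[i])) for i in order]

# Toy 3-dimensional example (real E5 vectors have 1024 dims):
print(top_k_by_cosine([1, 0, 0], [[1, 0, 0], [0, 1, 0]], ["a", "b"], k=1))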

Test Script And Expected Output

Run the following test script:

$ .venv/bin/python tests/test_llm_agent.py

======================================================================
Comprehensive LLM Agent Test
======================================================================

Creating Cloudera agent...
✅ Agent created successfully

LLM Configuration:
  Model: nvidia/llama-3.3-nemotron-super-49b-v1.5
  Base URL: https://<YOUR MODEL DOMAIN>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1
Test 1: Simple LLM Query (No Context)
----------------------------------------------------------------------
Query: What is machine learning? Answer in 2-3 sentences.

⏳ Sending request...
✅ Response received in 2.77 seconds

Answer:
<think>
Okay, the user is asking, "What is machine learning?" and wants a concise answer in 2-3 sentences. Let me start by recalling the basic definition. Machine learning is a subset of AI, right? It involves algorithms that allow computers to learn from data.

Wait, I should make sure to mention that it's a part of artificial intelligence. The key point is that instead of explicit programming, the system learns patterns from data. I need to highlight that it's used

Test 2: LLM with RAG (Context from Knowledge Base)
----------------------------------------------------------------------
Adding knowledge base documents...
✅ Added 4 documents

Query: What is RAG and how does it work?

⏳ Sending request with context...
✅ Response received in 3.66 seconds

Answer:
<think>
Okay, let's tackle the question about RAG. The user wants to know what RAG is and how it works. The context provided has two parts.

First, Context 1 says RAG combines semantic search with language generation. So, I need to explain that RAG stands for Retrieval Augmented Generation. The key here is that it's a combination of two things: retrieval and generation. The retrieval part must be about finding relevant information, and the generation part is about creating text, probably using a language model.

Then Context 2 mentions vector embeddings converting text into numerical representations for similarity search. This seems related to the retrieval part. Vector embeddings are used to represent text in a way that allows for measuring similarity between different

Context used:
[Context 1]: RAG (Retrieval Augmented Generation) combines semantic search with language generation.

[Context 2]: Vector embeddings convert text into numerical representations for similarity search....

Test 3: Different Temperature Settings
----------------------------------------------------------------------

Temperature: 0.1
Query: Write a creative one-sentence story about AI.
⏳ Sending request...
✅ Response (1.23s): <think>
Okay, the user wants a one-sentence story about AI. Let me think. It needs to be creative an...


Temperature: 0.7
Query: Write a creative one-sentence story about AI.
⏳ Sending request...
✅ Response (1.27s): <think>
Okay, the user wants a one-sentence story about AI. Let me think. It needs to be creative an...


Temperature: 0.9
Query: Write a creative one-sentence story about AI.
⏳ Sending request...
✅ Response (1.33s): <think>
Okay, the user wants a one-sentence story about AI. Let me think. It needs to be creative an...

Test 4: Multi-turn Conversation
----------------------------------------------------------------------
Simulating a conversation...

User: What is Python?
⏳ Sending request...
Assistant (1.25s): <think>
Okay, the user is asking, "What is Python?" Let me start by breaking down what I know about Python. First, I should define it in simple terms....

User: Can you give me a code example?
⏳ Sending request...
Assistant (2.44s): <think>
Okay, the user is asking for a code example. But they didn't specify the programming language or the problem they want to solve. I need to ask...


======================================================================
Test Summary
======================================================================

  ✅ PASS: Simple LLM
  ✅ PASS: LLM with RAG
  ✅ PASS: Temperature Settings
  ✅ PASS: Conversation

Total: 4/4 tests passed

✅ All LLM agent tests passed!

Using LLM Only (No RAG)

When the agent is created with `use_llm=True`, you can call the LLM directly (no retrieval):

from agents import create_cloudera_agent

agent = create_cloudera_agent(use_llm=True)

result = agent.answer_with_llm(
    "Explain in one sentence what RAG is.",
    use_context=False,
    temperature=0.3,
    max_tokens=200,
)

print(result["answer"])
print(result["model"])  # e.g. nvidia/llama-3.3-nemotron-super-49b-v1

Test Script And Expected Output

$ .venv/bin/python hello_world/hello_world_simple.py

======================================================================
Hello World - Using Cloudera Agents
======================================================================

🔧 Initializing Cloudera agent...
✅ Agent initialized successfully!

💬 Asking LLM to say 'Hello, World!'...
   ⏳ This may take 30-60 seconds...

🤖 Response from Cloudera LLM:
----------------------------------------------------------------------
<think>
Okay, the user wants me to say 'Hello, World!' in a friendly way. Let me think about how to approach this.

First, the standard "Hello, World!" is straightforward, but making it friendly might require adding some
----------------------------------------------------------------------

✅ Hello World script completed successfully!
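
As the output shows, Nemotron responses begin with a <think> reasoning block. If you only want the final answer, a small post-processing step can strip it; this helper is an illustrative assumption, not part of ClouderaAgent.

# Illustrative helper: remove the model's <think>...</think> reasoning block.
import re

def strip_think(text):
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_think("<think>planning a friendly greeting...</think>Hello, World!"))
# Output: Hello, World!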

Using RAG (Retrieval + LLM)

RAG = retrieve relevant chunks from your knowledge base, then send them plus the user question to the LLM.

Step 1 – Add knowledge (documents):

agent.add_knowledge([
    "Cloudera AI Platform provides embedding and LLM endpoints.",
    "Cursor can use a custom OpenAI-compatible endpoint for chat.",
    "The ClouderaAgent uses config.json and config-llm.json for endpoints.",
])

Step 2 – Ask with context (RAG):

result = agent.answer_with_llm(
    "How do I configure Cursor to use Cloudera?",
    top_k=3,
    use_context=True,
    temperature=0.5,
)

print(result["answer"])
# Optional: inspect retrieved context

for ctx in result["context"]:
    print(ctx["text"], ctx["similarity"])

ClouderaAgent + embeddings + LLM = RAG pipeline: Cloudera embeddings for retrieval, Cloudera LLM for the answer. There is an example script with output listed below under the Test Embeddings section.
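
For intuition, here is one plausible way the retrieved chunks get stitched into the prompt sent to the LLM. The [Context N] labels mirror the test output shown earlier, but the exact template inside ClouderaAgent is an assumption.

# Illustrative RAG prompt assembly; ClouderaAgent's real template may differ.
def build_rag_prompt(question, retrieved):
    context_blocks = [
        f"[Context {i + 1}]: {chunk['text']}"
        for i, chunk in enumerate(retrieved)
    ]
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        + "\n\n".join(context_blocks)
        + f"\n\nQuestion: {question}"
    )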

Validation - LLMs Hosted on Cloudera AI Inference Service UI

validate_models.png

End-to-End Example: Small RAG Script

#!/usr/bin/env python3

"""Minimal RAG example using ClouderaAgent and Cloudera-hosted models."""

import sys
from pathlib import Path

# Ensure project root is on path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from agents import create_cloudera_agent

def main():
    # 1. Create agent (reads configs/config.json + configs/config-llm.json)
    agent = create_cloudera_agent(use_llm=True)

    # 2. Add knowledge
    agent.add_knowledge([
        "Cloudera AI Platform hosts embedding and LLM models.",
        "ClouderaAgent uses OpenAI-compatible APIs for embeddings and chat.",
        "Cursor IDE can use a custom endpoint to talk to Cloudera LLMs.",
    ])

    # 3. Ask with RAG
    result = agent.answer_with_llm(
        "What is ClouderaAgent and how does it relate to Cursor?",
        top_k=2,
        use_context=True,
        temperature=0.4,
    )

    print(result["answer"])
    return 0

if __name__ == "__main__":
    sys.exit(main())

LLM code-writing tests (uses ClouderaAgent with real endpoint):  

$ uv run python tests/test_code_writing_manual.py --prompt "Write a function that calculates prime numbers"

Initializing ClouderaAgent...
✓ ClouderaAgent initialized successfully
  Model: nvidia/llama-3.3-nemotron-super-49b-v1.5
  Endpoint: https://<YOUR DOMAIN >/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1


======================================================================
Code Generation Test (ClouderaAgent)
======================================================================

Prompt: Write a function that calculates prime numbers

### ✅ Function: `primes_up_to(n)`

```python
def primes_up_to(n):
    """
    Returns a list of all prime numbers up to and including n.

    Parameters:
    n (int): The upper limit (inclusive) for finding prime numbers.

    Returns:
    list: A list of prime numbers from 2 up to n.
    """
    if n < 2:
        return []

    # Initialize a boolean list where index represents the number and value indicates if it's prime
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False  # 0 and 1 are not prime

    # Iterate from 2 up to the square root of n
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:  # i is a prime number
            # Mark all multiples of i starting from i*i as non-prime
            for j in range(i * i, n + 1, i):
                sieve[j] = False

    # Collect all indices that are still marked as True (i.e., prime numbers)
    return [i for i in range(n + 1) if sieve[i]]
```

---

### 📌 Example Usage

```python
print(primes_up_to(30))
# Output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

---

### 🧠 How It Works

1. **Initialization**: A list `sieve` of size `n + 1` is created, initialized to `True`. This list will track which numbers are prime.
2. **Base Cases**: `0` and `1` are explicitly marked as non-prime.
3. **Marking Multiples**: For each number `i` starting from `2` up to the square root of `n`, if `i` is still marked as prime, all its multiples (starting from `i*i`) are marked as non-prime.
4. **Result Extraction**: Finally, all indices that remain `True` in the `sieve` list are collected and returned as the list of prime numbers.

---

### ✅ Summary

- Use `primes_up_to(n)` to get a list of all prime numbers up to `n`.
- Use `is_prime(n)` to check if a specific number `n` is prime.
- The Sieve of Eratosthenes is the most efficient method for generating all primes up to a large number.

Let me know if you'd like a version that generates primes indefinitely or finds the nth prime!

Cloudera Agent Coding Example In Cursor Agent Window

Screenshot 2026-03-04 at 11.47.22 AM.png

In the agent window, use the following prompt:

Use the cloudera-agent to write a python script that finds the first 10 prime numbers

When the agent finishes, you will see the newly created Python script. In the image above, the new script is called generate_first_10_primes.py.

Test embeddings (RAG):

$ uv run python examples/example_agent_usage.py

Creating Cloudera Agent...
2026-03-04 13:35:24,344 - agents.cloudera_agent - INFO - Loaded LLM configuration: model=nvidia/llama-3.3-nemotron-super-49b-v1.5, endpoint=https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1
2026-03-04 13:35:24,344 - agents.cloudera_agent - INFO - Creating ClouderaAgent with endpoint: https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1, models: nvidia/nv-embedqa-e5-v5-query, nvidia/nv-embedqa-e5-v5-passage
2026-03-04 13:35:24,400 - agents.cloudera_agent - INFO - Initialized ClouderaEmbeddingClient with endpoint: https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1
2026-03-04 13:35:24,400 - agents.cloudera_agent - INFO - Initialized SimpleVectorStore
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Initialized ClouderaAgent with LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Initialized ClouderaAgent with models: query=nvidia/nv-embedqa-e5-v5-query, passage=nvidia/nv-embedqa-e5-v5-passage
Agent created successfully!
Agent stats: {'num_documents': 0, 'embedding_dim': 1024, 'query_model': 'nvidia/nv-embedqa-e5-v5-query', 'passage_model': 'nvidia/nv-embedqa-e5-v5-passage', 'llm_model': 'nvidia/llama-3.3-nemotron-super-49b-v1.5', 'llm_endpoint': 'https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1'}

Adding knowledge base documents...
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Adding 7 documents to knowledge base
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Adding 7 documents to vector store
2026-03-04 13:35:24,418 - agents.cloudera_agent - INFO - Generating 7 passage embeddings using model: nvidia/nv-embedqa-e5-v5-passage
2026-03-04 13:35:24,682 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,728 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,773 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,829 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,875 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,920 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,975 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Successfully generated 7 embeddings
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Successfully added 7 documents. Total documents: 7
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Knowledge base now contains 7 documents
Added 7 documents to knowledge base

Running queries...

Query: What is Python?
2026-03-04 13:35:24,976 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:25,020 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:34,523 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:35:34,540 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, let's see. The user is asking "What is Python?" and there are two contexts provided. Context 1 says Python is a high-level programming language known for its simplicity and readability. Context 2 talks about machine learning as a subset of AI.

The question is about Python, so I should focus on Context 1. Context 2 isn't directly relevant here. The answer should be based on the information given in Context 1. The user wants a concise answer, so I just need to restate what Context 1 says. There's no need to mention machine learning since that's in Context 2 and not related to the question. Let me check if there's any other info in the contexts that might be relevant, but no, Context 2 is about machine learning. So the answer should be that Python is a high-level programming language known for its simplicity and readability. I don't need to add anything else because the question doesn't ask for more details beyond what's provided in the context.
</think>

Answer: Python is a high-level programming language known for its simplicity and readability.
Used 2 context sources:
  [1] Similarity: 0.4648
      Text: Python is a high-level programming language known for its simplicity and readabi...
  [2] Similarity: 0.2402
      Text: Machine learning is a subset of artificial intelligence that enables systems to ...

Query: How does machine learning work?
2026-03-04 13:35:34,540 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:34,675 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:45,064 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:35:45,065 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, let's tackle the question "How does machine learning work?" using the provided contexts. First, I need to check what information is given in Context 1 and Context 2.

Context 1 says: "Machine learning is a subset of artificial intelligence that enables systems to learn from data." That's a basic definition. It mentions that ML is part of AI and that it allows systems to learn from data. But it doesn't go into any detail about the process, algorithms, or steps involved in how machine learning actually works.

Context 2 talks about Cloudera providing enterprise data cloud solutions for machine learning and analytics. This seems more about the application or the platform used for ML rather than explaining the mechanics of how ML works. It doesn't add any information about the learning process itself.

The question is asking for an explanation of how machine learning works. The contexts provided only give a high-level definition and mention a company that uses ML in their solutions. There's no information about algorithms, data training, model building, or any of the technical processes involved. 

Since the user instructions say that if the context doesn't contain enough information to answer the question, I should state that. The answer can't be inferred from the given contexts because they don't explain the mechanisms or methods of machine learning. They only provide a definition and a company's application. Therefore, the correct response here is to say that the context doesn't have enough information to answer the question.
</think>

The provided context does not contain enough information to explain how machine learning works. Context 1 only offers a high-level definition of machine learning as a subset of AI that enables systems to learn from data, while Context 2 mentions Cloudera's role in providing enterprise data cloud solutions for ML and analytics. Neither context details the technical processes, algorithms, or methodologies involved in machine learning. 

Answer: The context doesn't contain enough information to answer the question.
Used 2 context sources:
  [1] Similarity: 0.4828
      Text: Machine learning is a subset of artificial intelligence that enables systems to ...
  [2] Similarity: 0.3154
      Text: Cloudera provides enterprise data cloud solutions for machine learning and analy...

Query: What are embeddings used for?
2026-03-04 13:35:45,065 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:45,207 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:35:57,606 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:35:57,607 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, let's tackle this question. The user is asking, "What are embeddings used for?" and they provided two contexts.

First, looking at Context 1: "Embeddings are vector representations of text that capture semantic meaning." So, this tells me that embeddings are a way to represent text as vectors, and they capture the meaning of the text. That's a starting point. But the question is about their uses, not just their definition.

Context 2 mentions the NVIDIA nv-embedqa-e5-v5 model, which is designed for question-answering and semantic search tasks. Now, since this model uses embeddings (given the name and the context), the tasks it's designed for might indicate the uses of embeddings. So, question-answering and semantic search are two applications.

Putting it together, the answer should link embeddings to these tasks. The first context gives the definition, and the second gives specific applications. So, the answer should state that embeddings are used for tasks like question-answering and semantic search, as they capture semantic meaning which is essential for these tasks.

Wait, but should I mention both contexts? The first context explains what embeddings are, but the question is about their use. The second context provides specific uses. So combining both, the answer would be that embeddings are used for capturing semantic meaning, which enables tasks like question-answering and semantic search. However, the user's question is "What are embeddings used for?" which is more about the applications rather than their definition. So the answer should focus on the uses mentioned in Context 2, but since Context 1 is also provided, maybe it's better to mention that they capture semantic meaning which allows them to be used in those tasks. But the answer should be concise. Let me check the example answer given. The example answer says: "Embeddings are used for capturing semantic meaning in text, which enables tasks like question-answering and semantic search, as demonstrated by models such as NVIDIA's nv-embedqa-e5-v5." So that combines both contexts. Therefore, the answer should include both the purpose (capturing semantic meaning) and the specific applications (question-answering and semantic search) based on the contexts provided.
</think>

Answer: Embeddings are used for capturing semantic meaning in text, which enables tasks like question-answering and semantic search, as demonstrated by models such as NVIDIA's nv-embedqa-e5-v5.
Used 2 context sources:
  [1] Similarity: 0.4917
      Text: Embeddings are vector representations of text that capture semantic meaning....
  [2] Similarity: 0.3127
      Text: The NVIDIA nv-embedqa-e5-v5 model is designed for question-answering and semanti...

Query: Tell me about Cloudera
2026-03-04 13:35:57,608 - agents.cloudera_agent - INFO - Processing query with RAG (top_k: 2, use_context: True)
2026-03-04 13:35:57,750 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---e5-embedding/v1/embeddings "HTTP/1.1 200 OK"
2026-03-04 13:36:01,439 - httpx - INFO - HTTP Request: POST https://<YOUR DOMAIN HERE>/namespaces/serving-default/endpoints/goes---nemotron-v1-5-49b-throughput/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-04 13:36:01,440 - agents.cloudera_agent - INFO - Query processed. Generated answer using LLM: nvidia/llama-3.3-nemotron-super-49b-v1.5
Answer: <think>
Okay, the user is asking about Cloudera. Let me check the provided contexts. Context 1 says Cloudera offers enterprise data cloud solutions for machine learning and analytics. That's the main point. The other context is about Python, which doesn't seem relevant here. Since the question is specifically about Cloudera, I should focus on Context 1. There's no additional info needed beyond what's in Context 1. The answer should state that Cloudera provides those solutions. I need to make sure not to include anything from Context 2 since it's unrelated. Let me phrase it clearly and concisely.
</think>

Answer: Cloudera provides enterprise data cloud solutions for machine learning and analytics.
Used 2 context sources:
  [1] Similarity: 0.4652
      Text: Cloudera provides enterprise data cloud solutions for machine learning and analy...
  [2] Similarity: 0.2139
      Text: Python is a high-level programming language known for its simplicity and readabi...

Example completed!

Test LLM:

uv run python3 examples/example_llm_usage.py

Or run the full pytest suite with a real agent:

RUN_REAL_LLM_TESTS=1 uv run pytest tests/test_llm_code_writing.py::TestRealLLMCodeWriting -v

Troubleshooting

Symptom and what to check:

  • Endpoint URL required / API key required: Set CLOUDERA_EMBEDDING_URL, OPENAI_API_KEY (and for LLM: CLOUDERA_LLM_URL, CLOUDERA_LLM_MODEL), or fix configs/config.json and configs/config-llm.json.
  • LLM not configured: Ensure configs/config-llm.json exists with llm_endpoint.base_url and llm_endpoint.model, or that config.json has an llm_endpoint section, and that use_llm=True.
  • 404 from LLM: The base URL must end with /v1 and match the exact deployment URL in Cloudera AI Platform (Deployments → Model Endpoints).
  • Auth errors: Regenerate or refresh the JWT/API key and update the config or environment.
  • Cursor not using Cloudera: In Cursor settings, confirm the single custom endpoint base URL, API key, and model name match your config-llm.json.

For more detail on Cursor setup and network enforcement, see CURSOR_INTEGRATION_GUIDE.md and CURSOR_ONLY_ENFORCEMENT.md in the project's GitHub repo.

Best Practices

  1. Always call enforce_cloudera_models() at the start of your application (see the sketch after this list)
  2. Store your API keys securely using environment variables or a secrets manager
  3. Handle configuration errors gracefully in your application
  4. Monitor your Cloudera endpoint usage and set up alerts for any configuration issues
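
As an illustration of what such a guard can do, here is a minimal hypothetical sketch that refuses to start unless the configured endpoints point at your Cloudera domain. The allowed-domain logic and error message are assumptions, not the repo's actual enforce_cloudera_models() implementation.

# Hypothetical sketch of an endpoint guard; the repo's real
# enforce_cloudera_models() may behave differently.
import os
from urllib.parse import urlparse

ALLOWED_DOMAIN = "YOUR-CLOUDERA-SITE"  # assumption: your Cloudera host

def enforce_cloudera_models():
    for var in ("CLOUDERA_EMBEDDING_URL", "CLOUDERA_LLM_URL"):
        url = os.environ.get(var)
        if not url:
            continue  # unset vars fall back to the config files
        host = urlparse(url).hostname or ""
        if not host.endswith(ALLOWED_DOMAIN):
            raise RuntimeError(
                f"{var} points at {host}, not an approved Cloudera endpoint"
            )

enforce_cloudera_models()  # call at application startup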
