
This guide is Part 2 of our series on GRPO-based fine-tuning. It assumes you are already familiar with the core concepts of reasoning models and GRPO (see Part 1, Understanding Reasoning Models with GRPO: A Conceptual Introduction for Building your own Med...) and that you are ready to apply them in practice using Python. All code is provided in a reproducible notebook, with detailed explanations and references to help you get started.


Abstract: In this guide, we’ll walk step by step through fine-tuning a large language model on a medical reasoning dataset from Hugging Face, using Group Relative Policy Optimization (GRPO). In the following sections, you will see exactly how the technical setup of GRPO works, how the data is structured, how to create custom reward functions, and how to test, evaluate, and save the medical reasoning model. By the end, you will have a working GRPO pipeline for building your own medical reasoning model that produces <reasoning>..</reasoning> and <answer>..</answer> outputs tailored for medical questions, complete with code snippets, narrative explanations, and links to key resources.


Below is the guide outline. Each section builds on the previous one and by the end, you will not only understand the theory behind GRPO but also have a fully reproducible medical‐reasoning reinforcement fine-tuning pipeline.

  1. Fundamentals of GRPO
    Discover what Group Relative Policy Optimization truly entails, understand why it’s a powerful Reinforcement Learning (RL) method for LLMs, and explore a high-level pseudocode sketch to solidify the details.

  2. Adapting GRPO to Your Dataset
    Learn how to prepare any dataset, specifically a medical-reasoning SFT corpus, for RL. This includes steps from loading data via Hugging Face to enforcing <reasoning><answer> formatting.

  3. Defining Reward Functions
    Dive into the three core reward signals: semantic correctness, fluency measured through perplexity, and tag presence. Understand how they are combined, normalized, and clamped to guide your model.

  4. Configuring Training
    Get a detailed walkthrough of the GRPOConfig and GRPOTrainer settings, including learning rates, batch sizes, generation counts, and hardware requirements, to ensure your training process is both efficient and stable.

  5. Executing the Training Loop
    This is where you initiate trainer.train(), monitor the increasing rewards, and analyze the step-by-step logs to ensure your model is learning effectively.

  6. Testing the Fine-Tuned Model
    Save and reload your LoRA weights, then compare “before” and “after” outputs on sample medical questions to validate the structured <reasoning><answer> behavior.

  7. Evaluating with an LLM Judge
    Scale your assessment by using an LLM (e.g. GPT-4o-mini) as a judge: test 100 sample cases, score outputs on medical accuracy, reasoning clarity, format adherence, fluency, and overall usefulness, and then aggregate the results.

  8. Saving Your Model
    Learn how to merge 4-bit + LoRA to 16-bit, upload the model to the Hugging Face Hub (including GGUF formats), and integrate it within applications like Cloudera AI Workbench.

  9. Impact & Takeaways
    Reflect on the VRAM savings, improvements in structured reasoning, and real-world scenarios. Additionally, explore tips for experimenting with group sizes, reward adjustments, and KL penalties.

  10. Conclusion
    Summarize what you’ve accomplished, understand why GRPO works so well for structured generation tasks, and explore next steps to adapt and extend this workflow to new domains.

1. Fundamentals of GRPO

Group Relative Policy Optimization (GRPO) is a reinforcement-learning method designed to steer language models toward desired behaviors by leveraging groupwise feedback instead of relying on individual examples. During each training iteration, the model:

  1. Generates a set of candidate outputs for each prompt.

  2. Scores each candidate using one or more custom reward functions.

  3. Calculates advantages relative to the group (i.e. how much better or worse each output is compared to the batch mean).

  4. Performs a clipped policy update that reinforces above-average outputs while applying a Kullback-Leibler (KL) divergence penalty to maintain stability.


By comparing candidates within each group, GRPO amplifies useful behaviors (e.g. clear reasoning, correct answers, proper formatting) without requiring manually crafted labels for every training example.

  • GRPO in a Nutshell

    • What: A reinforcement learning approach that fine-tunes models by rewarding entire generations using custom reward signals.
    • How: Desired output features are scored and reinforced, similar to how a student learns from feedback.
    • Why: The model gradually refines its responses toward the behaviors those tailored rewards encourage.

  • Reward Functions

    • Definition: Functions that score the model’s outputs.
    • Purpose: Evaluate outputs for correctness, format, and additional criteria (e.g. numeric accuracy).
    • Examples:
      • Check if the answer is correct.
      • Verify that the response adheres to an XML-like format.

1.1 GRPO Algorithm in Pseudocode

To implement GRPO, generate multiple responses, score them using reward functions, compare them within a batch, and update the LLM based on the best responses. 

  • Step 1: Generate Multiple Responses: The LLM outputs several different answers for the same prompt.
  • Step 2: Assign Rewards: Each response is evaluated and scored with a reward based on reasoning depth, formatting, and clinical accuracy.
  • Step 3: Compare Within the Group: Responses are compared to the group's average, and those that perform above average are reinforced.
  • Step 4: Optimize the Model: The LLM is fine-tuned to prioritize better responses using policy optimization.
  • Step 5: Ensure Stability: Kullback-Leibler (KL) Divergence Regularization is applied to prevent the model from undergoing drastic changes while still improving its performance. This safeguards against over-optimization, which may lead to reward hacking, where the model exploits the reward function instead of genuinely improving.
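
As a concrete illustration of Step 3, here is a tiny, self-contained sketch (not the trainer’s internal code) of how the rewards for one prompt’s group of completions are turned into group-relative advantages; the reward values are hypothetical:

import statistics

# Hypothetical rewards for 4 completions sampled for the same prompt.
rewards = [0.2, 0.5, -0.1, 0.4]

mean = statistics.mean(rewards)
std = statistics.stdev(rewards)

# Completions scoring above the group mean get positive advantages and are reinforced.
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
print(advantages)  # approximately [-0.19, 0.94, -1.32, 0.57]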

A visual representation of the RL pipeline used in DeepSeek-R1-Zero is illustrated below. This was the first open-source model to demonstrate that advanced reasoning capabilities can be developed purely through reinforcement learning. Without pre-labeled datasets, the model learns through trial and error, refining its behavior, parameters, and weights based solely on feedback from the solutions it generates.

Fig 1: RL pipeline used in DeepSeek-R1-Zero
Reference: A Visual Guide to Reasoning LLMs

Now that we have outlined the key components of GRPO, let’s look at the algorithm in pseudocode. While simplified, this version highlights the core concepts.

Input

- initial_policy: Starting model to be trained

- reward_function: Function that evaluates the quality of outputs

- training_prompts: Set of training examples

- group_size: Number of outputs generated per prompt (typically 4-16)

Algorithm for GRPO:

  1. For each training iteration:
     a. Set reference_policy = initial_policy (snapshot the current policy)
     b. For each prompt in the batch:
        i.   Generate group_size different outputs using initial_policy
        ii.  Compute rewards for each output using reward_function
        iii. Normalize rewards within the group:

             normalized_advantage = (reward - mean(rewards)) / std(rewards)

        iv.  Update the policy by maximizing the clipped ratio:

             min(prob_ratio * normalized_advantage,
                 clip(prob_ratio, 1-epsilon, 1+epsilon) * normalized_advantage)
             - kl_weight * KL(initial_policy || reference_policy)

             where prob_ratio is current_prob / reference_prob

Output: Optimized policy model

Reference: https://huggingface.co/learn/llm-course/chapter12/3

GRPO’s key innovations are:

  • Learning directly from any function or model: GRPO eliminates the need for a separate reward model, unlike methods such as PPO.
  • Group-based learning, which is more stable and efficient than traditional methods like pairwise comparisons.

This shows how GRPO combines group-based advantage estimation with policy optimization while maintaining stability through clipping and KL divergence constraints. By comparing each candidate with its peers and carefully regularizing updates, GRPO provides a stable yet adaptive reinforcement learning fine-tuning algorithm, well-suited for structured generation tasks, like medical reasoning.
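
To make the clipped update concrete, below is a minimal PyTorch-style sketch of the objective described in the pseudocode. It is an illustration rather than the TRL implementation: it assumes you already have the summed log-probabilities of each sampled completion under the current and reference policies, plus the group-normalized advantages from the previous step, and the epsilon and kl_weight defaults are illustrative values.

import torch

def grpo_objective(new_logprobs, ref_logprobs, advantages, epsilon=0.2, kl_weight=0.04):
    """Sketch of the clipped GRPO objective for one group of completions."""
    # Probability ratio between the current policy and the reference snapshot.
    prob_ratio = torch.exp(new_logprobs - ref_logprobs)

    # Clipped surrogate: reinforce above-average completions, but cap the update size.
    unclipped = prob_ratio * advantages
    clipped = torch.clamp(prob_ratio, 1 - epsilon, 1 + epsilon) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Unbiased KL estimate (the k3 estimator commonly used in GRPO implementations).
    log_ratio = ref_logprobs - new_logprobs
    kl = (torch.exp(log_ratio) - log_ratio - 1).mean()

    # We maximize (surrogate - kl_weight * KL), so return the negative as a loss.
    return -(surrogate - kl_weight * kl)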

2. Adapting GRPO to Your Dataset

This guide walks you through the process of adapting GRPO to your specific dataset, using a combination of tutorials, examples, and practical steps. 

2.1 Quickstart Tutorial

If you're eager to start implementing GRPO, our example notebook is an excellent resource to get you up and running quickly. The notebook demonstrates GRPO training using the Llama 3.1 (8B Instruct) base model and includes step-by-step instructions for: 

  • Installing Unsloth and its dependencies
    Set up Unsloth for accelerated fine-tuning while minimizing costs.
  • Configuring GRPO settings
    Use our pre-selected optimal parameters, or customize the settings for your specific dataset, model, and hardware constraints.
  • Selecting your dataset
    In this example, we use an open medical reasoning dataset. However, you can substitute your own dataset provided it contains at least two columns, one for questions and one for answers (with the answers omitting the reasoning process).
  • Reward Functions
    The example notebook uses rewards for semantic correctness, perplexity, and tag presence. Other common options include correctness, format, and additional criteria (e.g. numeric responses).

2.2 Prepare Your Own Dataset

  • Data Collection:

    • If you have a custom dataset for your application, such as  a medical dataset, you can adapt GRPO to fine-tune a model to specialize in that domain. Below, we outline key considerations for preparing a dataset tailored to medical reasoning tasks.

    • For our medical reasoning model, we want a dataset that has the following characteristics:
  1. Covers a breadth of medical reasoning scenarios
  2. Contains columns for question and answer pairs, with the answer column not revealing the reasoning behind its derivation
  3. Can be paired with a reasoning system prompt, so that each row of the dataset becomes a dictionary containing a system prompt that enforces structured reasoning plus a question-answer pair

  • Given that medical applications have their own terminology, we select an open-source medical dataset to gather expert-level content about the problem domain. For this demonstration, we use a Hugging Face dataset focused on complex medical reasoning tasks for GRPO training: 'FreedomIntelligence/medical-o1-reasoning-SFT'.

  • Format the Data:
    Structure your dataset into clear question-and-answer pairs. For example:

    • Question: "Is Aspirin good for cardiovascular function?"
    • Answer: "Aspirin can be beneficial for cardiovascular function, especially for secondary prevention (after a heart attack or stroke), but its use for primary prevention (to prevent a first heart attack or stroke) is now more carefully considered due to potential risks like bleeding."

  • Update the Data Loader:
    Write a loader that reads your custom data file. For example, if your data is a Hugging Face dataset, you can load the raw dataset from the hub as shown in Option 1. Alternatively, if the data is in CSV format, you can load it as shown in Option 2.

# Option 1: Load directly from Hugging Face
from datasets import load_dataset

split = "train"  # choose the split to load
data = load_dataset(
    'FreedomIntelligence/medical-o1-reasoning-SFT',
    'en'
)[split]


# Option 2: Load from your own CSV file
from datasets import load_dataset

dataset = load_dataset(
    'csv',
    data_files={'train': 'your_data_set.csv'}
)

 

  • Create a System Prompt:
    Use the system prompt to produce the desired output format:
SYSTEM_PROMPT = """
Respond in the following format:

<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

 

3. Defining Reward Functions 

At the heart of GRPO lies the reward signal: it tells the model what good behavior looks like and how to improve. In a medical reasoning context, “good behavior” has three dimensions—accuracy of the answer, readability of the explanation, and adherence to our prescribed <reasoning>/<answer> format. By carefully designing each reward function to target one of these dimensions, we can guide the model toward responses that are not only clinically correct but also clear and properly structured.

In this section, we’ll introduce three complementary reward signals:

  1. Semantic Correctness
    Measures how closely the model’s answer matches the ground-truth reference, ensuring medical accuracy.

  2. Perplexity (Fluency)
    Assesses the naturalness and readability of the generated text using a language model. It penalizes outputs that are less useful or harder to understand.

  3. Tag Presence
    Verifies that each completion wraps its chain of thought in <reasoning> tags and its final conclusion in <answer> tags, guaranteeing consistent formatting.

Together, these rewards form a balanced scorecard—accuracy, fluency, and format—so your fine-tuned model learns not just to say the right thing, but to say it in exactly the way you’ve specified.

Now, we explain the reward functions in more detail:

3.1. Device Configuration

  Sets whether to run models on GPU (CUDA) or CPU based on availability.

import torch

main_device = "cuda" if torch.cuda.is_available() else "cpu"
reward_device = "cuda" if torch.cuda.is_available() else "cpu"


3.2. Semantic Correctness Reward

Uses the cross-encoder/stsb-roberta-base model to score semantic similarity between each generated response and a ground truth answer. 

  • CrossEncoder takes pairs (response, answer) and returns similarity scores between 0 and 1.
  • If response is empty, the reward is -1 (penalizing no response)

 

def semantic_correctness(responses: List[str], answers: List[str]) -> List[float]:
    ...
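
A hedged sketch of what this function can look like, assuming the cross-encoder/stsb-roberta-base model and the reward_device defined above; the notebook’s exact implementation may batch and handle errors differently.

from typing import List
from sentence_transformers import CrossEncoder

similarity_model = CrossEncoder("cross-encoder/stsb-roberta-base", device=reward_device)

def semantic_correctness(responses: List[str], answers: List[str]) -> List[float]:
    rewards = []
    for response, answer in zip(responses, answers):
        if not response.strip():
            rewards.append(-1.0)  # penalize empty responses
            continue
        # The STS-B cross-encoder returns a similarity score roughly in [0, 1].
        score = float(similarity_model.predict([(response, answer)])[0])
        rewards.append(score)
    return rewards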

 

3.3. Perplexity Calculation Reward


 

class PerplexityCalculator:
    ...


Measures how fluent or "natural" a sentence is using a pretrained language model, such as microsoft/biogpt.

  •     Tokenizes and feeds batches of texts into the model.
  •     Calculates the loss of the model (how "surprised" it is by the input).
  •     Converts the loss to perplexity using the formula: exp(loss)
  •     High perplexity indicates lower reward (unreadable/unnatural), so later we invert this.
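
Below is a hedged sketch of such a calculator, assuming microsoft/biogpt loaded through the standard transformers API; the notebook’s class may differ in batching and truncation choices.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class PerplexityCalculator:
    def __init__(self, model_name="microsoft/biogpt", device=reward_device):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
        self.model.eval()

    @torch.no_grad()
    def compute(self, texts):
        perplexities = []
        for text in texts:
            inputs = self.tokenizer(
                text, return_tensors="pt", truncation=True, max_length=512
            ).to(self.device)
            # The model's loss is the mean negative log-likelihood of the tokens.
            loss = self.model(**inputs, labels=inputs["input_ids"]).loss
            perplexities.append(torch.exp(loss).item())  # perplexity = exp(loss)
        return perplexities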

3.4. Tag Presence Reward

This encourages generated outputs to include:

    <reasoning> ... </reasoning>

    <answer> ... </answer>

 

def tag_presence_reward(completions: List[dict]) -> List[float]:
    …

 

  •     Uses regular expressions to check for the presence of each tag.
  •     Gives a reward of 0.5 for each present tag.
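
A minimal sketch of this check, assuming chat-style completions where the generated text sits in the first message’s "content" field; adjust the extraction if your completions are plain strings.

import re
from typing import List

def tag_presence_reward(completions: List[dict]) -> List[float]:
    rewards = []
    for completion in completions:
        # Assumes each completion is a list of {"role", "content"} message dicts.
        text = completion[0]["content"] if isinstance(completion, list) else completion["content"]
        reward = 0.0
        if re.search(r"<reasoning>.*?</reasoning>", text, re.DOTALL):
            reward += 0.5
        if re.search(r"<answer>.*?</answer>", text, re.DOTALL):
            reward += 0.5
        rewards.append(reward)
    return rewards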

3.5. Combined Reward Function

This is the main reward function used during the RL training loop. 

def combined_reward_func(prompts, completions, answer, **kwargs):
    …

It does the following:

Step 1: Parse generated <answer> content

  • Extracts just the text inside <answer>...</answer> tags.
  • Removes empty responses or those that copy the prompt.

Step 2: Calculate rewards

  • Semantic similarity with reference answers.
  • Perplexity using BioGPT.
  • Tag presence reward.

Step 3: Normalize and combine scores

A weighted sum is calculated: 

combined = [
    0.5 * sim + 0.4 * perplex + 0.1 * tag
    for sim, perplex, tag in zip(...)
]

Weighted sum of:

  • 0.5 × similarity
  • 0.4 × perplexity reward
  • 0.1 × tag reward

Step 4: Clamp reward

  • Output values are clamped to the range [-1.0, 1.0].

Perplexity Normalization Logic:

perplex_rewards = 1 / (perplex_scores / (perplex_scores.mean() + 1e-9))

 

  • Lower perplexity → higher reward.
  • Then it's normalized to a [0, 1] scale: 
    • (x - min) / (max - min)

Outputs:

A List[float] of rewards, one per completion, which are aligned with the input batch. Any invalid or empty completions automatically receive a score of -1.0.
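
Tying the pieces together, here is a hedged end-to-end sketch of the combined reward with the 0.5/0.4/0.1 weights, the perplexity normalization, and the [-1, 1] clamp described above. It reuses the sketches from the previous subsections and is an illustration rather than the notebook’s exact code.

import re
import numpy as np

perplexity_calculator = PerplexityCalculator()

def combined_reward_func(prompts, completions, answer, **kwargs):
    # Assumes chat-style completions, matching the tag_presence_reward sketch.
    responses = [c[0]["content"] for c in completions]

    # Step 1: keep only the text inside <answer>...</answer>.
    extracted = []
    for r in responses:
        match = re.search(r"<answer>(.*?)</answer>", r, re.DOTALL)
        extracted.append(match.group(1).strip() if match else "")

    # Step 2: component rewards.
    sim_scores = semantic_correctness(extracted, answer)
    perplex_scores = np.array(perplexity_calculator.compute(responses))
    tag_scores = tag_presence_reward(completions)

    # Step 3: lower perplexity -> higher reward, then min-max normalize to [0, 1].
    perplex_rewards = 1 / (perplex_scores / (perplex_scores.mean() + 1e-9))
    span = perplex_rewards.max() - perplex_rewards.min() + 1e-9
    perplex_rewards = (perplex_rewards - perplex_rewards.min()) / span

    # Step 4: weighted sum, clamped to [-1, 1]; empty answers get -1.
    rewards = []
    for text, sim, perplex, tag in zip(extracted, sim_scores, perplex_rewards, tag_scores):
        if not text:
            rewards.append(-1.0)
            continue
        combined = 0.5 * sim + 0.4 * float(perplex) + 0.1 * tag
        rewards.append(float(max(-1.0, min(1.0, combined))))
    return rewards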

4. Configuring Training 

In this section, we define the GRPOConfig settings for fine-tuning the model. Here we break down the core GRPOConfig options (learning rates, optimizer choices, batch sizes, and number of generations per prompt) to show you how to balance iteration speed against memory constraints.

Here’s the code to configure hyperparameters for GRPO training:

from trl import GRPOConfig, GRPOTrainer
from unsloth import is_bfloat16_supported

training_args = GRPOConfig(
    use_vllm=True,  # use vLLM for fast inference!
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    logging_steps=1,
    bf16=is_bfloat16_supported(),
    fp16=not is_bfloat16_supported(),
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_generations=5,  # Decrease if out of memory
    max_prompt_length=128,  # 128 to balance longer input prompts with training time requirements
    max_completion_length=128,
    max_steps=total_steps,
    save_steps=int(total_steps // num_checkpoints),
    max_grad_norm=0.1,
    report_to="none",  # Can use Weights & Biases
    output_dir="grpo_outputs",
    save_strategy="steps",
)
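
The configuration above references a few variables (per_device_train_batch_size, gradient_accumulation_steps, total_steps, num_checkpoints) that are defined earlier in the notebook. As a rough illustration only, placeholder values might look like the following; the exact values should be tuned to your GPU memory and dataset size.

# Illustrative placeholders only, not the notebook's exact settings.
num_checkpoints = 4
total_steps = 200                 # matches the 200-step run described in Section 5
per_device_train_batch_size = 5   # often kept compatible with num_generations
gradient_accumulation_steps = 4   # trades memory for a larger effective batch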


4.1 Key Points about Training Configuration:


The GRPOConfig sets various hyperparameters for training:

  • use_vllm: Enables fast inference with vLLM
  • Hyperparameters: Define learning rate, batch size, epochs, and so on.
    • learning_rate: Controls how quickly the model learns. 
    • A low initial LR (5e-6) with a cosine decay and 10% warmup helps the model adapt gradually without overshooting
  • Generation Settings: Specify maximum lengths for prompts and completions.
    • num_generations: Number of completions to generate for each prompt. 
    • GRPO relies on comparing a group of outputs for each prompt. More generations → better signal, but higher latency and memory use.
  • optim="adamw_8bit": 8-bit Adam lets you train larger models on limited GPU memory with minimal quality loss
  • max_steps: Total number of training steps to perform
  • Precision (bf16 vs fp16): Uses bfloat16 for enhanced performance on supported hardware. If your GPU supports bfloat16, you’ll get extra stability; otherwise, fall back to fp16.
  • Max Prompt vs. completion lengths: Balancing input content and generation length keeps VRAM usage predictable. If you have longer medical cases, increase both values and reduce num_generations accordingly.

5. Executing the Training Loop

With data, rewards, and training configuration in place, the next step is to launch trainer.train(). In this section, we execute the training loop to verify that the model is learning. The call to trainer.train() will run for training_args.max_steps iterations (200 in this example), generating and scoring groups of outputs, then updating the model each step.

Here’s the code to execute GRPO training:

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

# Set up the GRPO trainer with reward functions and the dataset.
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        combined_reward_func
    ],
    args = training_args,
    train_dataset = train_dataset,
)
# Begin training.
trainer.train()

 

5.1 Key Points about Training Execution:

  • Model Loading: Loads a pre-trained language model with the appropriate precision. All training and evaluation logic works across supported architectures with minimal changes.
    • To switch models, update this line in the notebook:
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
	# Swap with another base model 
	# model_name = "Qwen/Qwen2.5-7B" 
	# model_name = "microsoft/Phi-4" 
	# model_name = "google/gemma-3-1b-it"
  • Tokenizer: Converts text to tokens and also converts tokens back into readable text for evaluation.
  • Trainer Setup: Utilizes the model, reward functions, dataset, and training configuration to fine-tune the model via reinforcement learning
  • Optional PEFT: Can be enabled for parameter-efficient fine-tuning.

5.2 Results of Training Performance

  • Initial reward: around -0.30, a sign of random, untrained behavior
  • Mid-training: rewards fluctuate as the model explores (e.g., from -0.2 to +0.2)
  • End of training: over the 200-step run, rewards climb steadily from negative to strongly positive values and plateau around +0.55 to +0.65, indicating improved semantic accuracy, fluency, and format adherence

6. Testing the Fine-Tuned Model

 

After completing the training process, we’ll save the LoRA weights and merge them back into the base model. Testing begins with running a set of representative medical questions, ideally ones not seen during training, through both the base and fine-tuned models. This allows us to qualitatively assess improvements in areas such as structured reasoning tags, more concise answers, and domain-specific correctness.

Let’s test our model to see how it performs. For this, let’s first save the LoRA weights:

model.save_lora("grpo_saved_lora")

Now, let’s test the model with a new question:

from vllm import SamplingParams
text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Is Aspirin good for cardiovascular function?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)
output = (
    model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_saved_lora"),
    )[0]
    .outputs[0]
    .text
)
print(output)

You should observe that the model now adheres to the specified format, presenting its reasoning before providing an answer.

6.1 Before vs After GRPO

  • Before GRPO Fine-Tuning:
    • Free form response, no <reasoning>/<answer> tags
    • Responses are detailed, but unstructured output
  • After GRPO Fine-Tuning
    • We notice structured output, with a <reasoning> section, then <answer> section
    • Strict adherence to  the system prompt format
    • Concise, relevant clinical output
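
To produce the "before GRPO" output for this comparison, one option (a sketch consistent with the generation call above) is to run the same prompt through the base weights by omitting the LoRA adapter:

# Baseline ("before GRPO") response: same prompt and sampling settings, no LoRA adapter.
baseline_output = (
    model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=None,  # base model only; fine-tuned weights are not applied
    )[0]
    .outputs[0]
    .text
)
print(baseline_output)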

Input: “Is Aspirin good for cardiovascular function?”

Output (Before GRPO): A response from the base LLM, answering the question “Is Aspirin good for cardiovascular function?”

[Screenshot: response from the base model]

 

Output (After GRPO): A response with reasoning leading to the final answer, with a more structured output presented within the answer tags. Additional training can lead to more fine-grained reasoning traces within the reasoning tags.

 

[Screenshot: response from the GRPO fine-tuned model, with <reasoning> and <answer> tags]


7. Evaluating the Model

Evaluation is crucial for assessing the fine-tuned model's performance across multiple dimensions. The core idea is to use an LLM “judge” (e.g. GPT-4o-mini or GPT-4) to score model outputs on multiple axes.

7.1. Evaluation Dataset

 

  1. Size & Sampling
    • Randomly select 100 examples from your held-out test_dataset.
    • Ensure diversity in question types (e.g., differential diagnosis, drug interactions, physiology).

  2. Data Structure
    Each example should contain:

{
    "question": str,            # e.g. "What are the risk factors for DVT?"
    "reference_answer": str,    # the ground-truth <reasoning>…</reasoning><answer>…</answer>
    "system_prompt": SYSTEM_PROMPT
}

 

7.2. Evaluation Dimensions

 

Use the judge to score each output (base vs. fine-tuned) on a 1–5 scale across the following metrics:

  1. Medical Accuracy
    • Does the answer reflect correct, evidence-based medical knowledge?

  2. Reasoning Quality
    • Is the chain of thought logical, step-wise, and medically coherent?

  3. Format Adherence
    • Are the <reasoning> and <answer> tags correctly used?

  4. Fluency & Clarity
    • Is the language clear, concise, and free of major grammatical issues?

  5. Overall Usefulness
    • Would this output help a clinician or student understand the reasoning and answer?

We then compute an Overall Score as the average of the five metrics above.


7.3. LLM-as-Judge Prompt Template

 

 

You are a medical expert and evaluator. For each case below, you will see:

- QUESTION: {question}
- REFERENCE: {reference_answer}
- MODEL OUTPUT: {model_output}

Please provide scores (1–5 where 1=poor, 5=excellent):
MEDICAL_SCORE:
REASONING_SCORE:
FORMAT_SCORE:
FLUENCY_SCORE:
USEFULNESS_SCORE:
OVERALL_SCORE: [average of above]
EXPLANATION: [2–3 sentences explaining your scoring]
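
A hedged sketch of how the judge call and score parsing might be wired up, assuming the OpenAI Python client; the model name, the JUDGE_PROMPT_TEMPLATE string (holding the template above), and the parsing regex are illustrative choices rather than the notebook’s exact code.

import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_case(question, reference_answer, model_output):
    # JUDGE_PROMPT_TEMPLATE is assumed to hold the template shown above,
    # with {question}, {reference_answer}, and {model_output} placeholders.
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        question=question,
        reference_answer=reference_answer,
        model_output=model_output,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # deterministic judging to reduce variance
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content

    # Pull each "NAME_SCORE: <number>" line into a dict of floats.
    scores = {
        name.lower(): float(value)
        for name, value in re.findall(r"([A-Z_]+_SCORE):\s*([0-9.]+)", text)
    }
    return scores, text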

 

 

7.4. Interpreting Results

 

We sample 100 test cases for statistical power. For each case:

  • Generate both base and fine-tuned outputs.
  • Use a deterministic judge call (temperature=0.0) to reduce variance.
  • Parse the judge’s structured response into numeric scores.
  • Aggregate the mean and standard deviation for each metric and model.

Δ in Mean Scores: An improvement in scores (e.g., from +0.8 to +1.2) on MEDICAL_SCORE or REASONING_SCORE indicates better domain reasoning.

8. Saving Your Model


Unsloth provides several options for saving your fine-tuned model, but we’ll focus on the most common use case.

Saving in 16-bit Precision

You can save your fine-tuned  model with 16-bit precision using the following command:

# Save to 16-bit precision

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

Pushing to Hugging Face Hub

To share your model with the community or collaborators, you can push it to the Hugging Face Hub using the push_to_hub_merged method. This method allows us to push the model in multiple quantization formats.

# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
    "your-username/model-name", tokenizer, save_method="merged_16bit", token="your-token"
)

Saving in GGUF Format for llama.cpp

Unsloth also supports saving your model in GGUF format for use with llama.cpp:

# Saving in GGUF Format for llama.cpp
model.push_to_hub_gguf(
    "your-username/model-name",
    tokenizer,
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
    token="your-token",
)

The GGUF files can be used with llama.cpp or UI-based systems like Jan or Open WebUI.

9. Impact & Takeaways

 

  • In this guide, you’ve learned how to:
    • Prepare data for GRPO training
    • Define custom reward functions to guide the model’s learning
    • Train a model using GRPO
    • Test the fine-tuned model
    • Evaluate the fine-tuned model
    • Save the model in various formats
  • Key Benefits:
    • Cost savings: up to 90% lower VRAM use compared to PPO-based methods.
    • LLMs trained with GRPO provide structured, explainable reasoning, improving AI trustworthiness.
  • Real-World Applications:
    • Beyond medical AI applications such as clinical decision support systems, virtual care triage assistants, and educational tools for medical residents, GRPO can enhance use cases in other domains, including legal analysis, content, and compliance workflows.
  • What’s Next?
    • Run this notebook within your environment.
    • One-click deploy and experiment with the GRPO AMP within your Cloudera AI environment, extend GRPO by customizing base models, datasets, and reward functions, and see how it improves structured reasoning.
    • If you don’t have a Cloudera Notebook, register for a 5-day Cloudera trial and experience this AMP within Cloudera AI: https://www.cloudera.com/products/cloudera-public-cloud-trial.html


10. Conclusion

  • Reward Functions:
    Serve as the model's scorecard by evaluating correctness and format, guiding the model to produce improved responses.

  • Custom Training Dataset:
    A curated dataset (e.g. for medical reasoning) transforms a general-purpose language model into a domain-specific expert for your specific use cases.

  • Improved Reasoning Outcomes:
    In this fine-tuning guide to building your own medical reasoning model with GRPO, we see how the fine-tuned model not only learns to maximize a balanced reward (across accuracy, fluency, and structure), but also adopts the prescribed <reasoning> and <answer> format. Quantitatively, the reward score shifts from around -0.3 (before fine-tuning) to about +0.6 (after fine-tuning), and qualitatively, the output becomes neatly structured and domain-focused.

  • Overall Benefit:
    This approach converts a general language model into a domain-specific medical reasoning model that can efficiently provide accurate and structured medical answers. These steps can be adapted by organizations with complex datasets for their own domain-specific use cases. As you continue exploring GRPO, consider experimenting with different group sizes, datasets, base models, reward functions, and KL penalty coefficients to see how they affect your model’s performance.

  • Additional Resources:

 
