This guide is Part 2 of our series on GRPO-based fine-tuning. It assumes you are already familiar with the core concepts of reasoning models and GRPO (see Part 1, Understanding Reasoning Models with GRPO: A Conceptual Introduction for Building your own Med... ), and that you are ready to apply them in practice using Python. All code is provided in a reproducible notebook, with detailed explanations and references to help you get started.
Abstract: In this guide, we’ll walk step by step through fine-tuning a large language model on a medical reasoning dataset from Hugging Face, using Group Relative Policy Optimization (GRPO). In the following sections, you will see exactly how the technical setup of GRPO works, how the data is structured, how to create custom reward functions, and how to test, evaluate, and save the medical reasoning model. By the end, you will have a working GRPO pipeline for building your own medical reasoning model that produces <reasoning>..</reasoning> as well as <answer>..</answer> outputs tailored for medical questions - complete with code snippets, narrative explanations, and links to key resources.
Below is the guide outline. Each section builds on the previous one and by the end, you will not only understand the theory behind GRPO but also have a fully reproducible medical‐reasoning reinforcement fine-tuning pipeline.
Group Relative Policy Optimization (GRPO) is a reinforcement-learning method designed to steer language models toward desired behaviors by leveraging groupwise feedback instead of relying on individual examples. During each training iteration, the model generates a group of candidate responses for each prompt, scores every candidate with one or more reward functions, and then updates its weights based on how each candidate compares with the others in its group.
By comparing candidates within each group, GRPO amplifies useful behaviors (e.g. clear reasoning, correct answers, proper formatting) without requiring manually crafted labels for every training example.
To implement GRPO, generate multiple responses, score them using reward functions, compare them within a batch, and update the LLM based on the best responses.
A visual representation of the RL pipeline used in DeepSeek-R1-Zero is illustrated below. This was the first open-source model to demonstrate that advanced reasoning capabilities can be developed purely through reinforcement learning. Without pre-labeled datasets, the model learns through trial and error, refining its behavior, parameters, and weights based solely on feedback from the solutions it generates.
Fig 1: RL pipeline used in DeepSeek-R1-Zero. Reference: A Visual Guide to Reasoning LLMs
Now that we have outlined the key components of GRPO, let’s look at the algorithm in pseudocode. While simplified, this version highlights the core concepts.
Input:
- initial_policy: Starting model to be trained
- reward_function: Function that evaluates the quality of outputs
- training_prompts: Set of training examples
- group_size: Number of outputs generated per prompt (typically 4-16)
Algorithm for GRPO:
1. For each training iteration:
   a. Set reference_policy = initial_policy (snapshot the current policy)
   b. For each prompt in the batch:
      i. Generate group_size different outputs using initial_policy
      ii. Compute rewards for each output using reward_function
      iii. Normalize rewards within group:
         normalized_advantage = (reward - mean(rewards)) / std(rewards)
      iv. Update the policy by maximizing the clipped ratio:
         min(prob_ratio * normalized_advantage,
             clip(prob_ratio, 1-epsilon, 1+epsilon) * normalized_advantage)
         - kl_weight * KL(initial_policy || reference_policy)
         where prob_ratio is current_prob / reference_prob
Output: Optimized policy model
Reference: https://huggingface.co/learn/llm-course/chapter12/3
GRPO’s key innovations are: (1) group-relative advantage estimation, which replaces a separately trained value function (critic) with simple within-group reward normalization; (2) clipped policy updates that keep any single step from moving the policy too far; and (3) a KL-divergence penalty that keeps the updated policy close to a reference policy.
This shows how GRPO combines group-based advantage estimation with policy optimization while maintaining stability through clipping and KL divergence constraints. By comparing each candidate with its peers and carefully regularizing updates, GRPO provides a stable yet adaptive reinforcement learning fine-tuning algorithm, well-suited for structured generation tasks like medical reasoning.
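To make the group-normalized advantage and the clipped update concrete, here is a minimal PyTorch sketch of the objective for a single group of sampled outputs. The grpo_loss helper, its argument names, and the epsilon/kl_weight defaults are illustrative assumptions; production implementations (such as TRL's GRPOTrainer) work on per-token log-probabilities and use a more careful KL estimator.
# Illustrative sketch only, not the TRL implementation.
import torch

def grpo_loss(current_logprobs, reference_logprobs, rewards, epsilon=0.2, kl_weight=0.04):
    # current_logprobs / reference_logprobs: (group_size,) summed log-probs of each sampled output
    # rewards: (group_size,) scalar reward for each output in the group
    # Group-relative advantage: normalize rewards within the group
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between the current policy and the reference policy
    ratio = torch.exp(current_logprobs - reference_logprobs)

    # Clipped surrogate objective (maximized), as in the pseudocode above
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    policy_objective = torch.min(unclipped, clipped).mean()

    # Crude Monte Carlo estimate of KL(current || reference) as a penalty
    kl_penalty = (current_logprobs - reference_logprobs).mean()

    # Return a loss to minimize: negative objective plus KL regularization
    return -policy_objective + kl_weight * kl_penalty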
This guide walks you through the process of adapting GRPO to your specific dataset, using a combination of tutorials, examples, and practical steps.
If you're eager to start implementing GRPO, our example notebook is an excellent resource to get you up and running quickly. The notebook demonstrates GRPO training using the Llama 3.1 (8B Instruct) base model and includes step-by-step instructions for loading the medical reasoning dataset, defining the reward functions, configuring the GRPO hyperparameters, running training, testing and evaluating the fine-tuned model, and saving the result.
# Option 1: Load directly from Hugging Face
from datasets import load_dataset
split = "train"  # split to use; the dataset provides a single 'train' split
data = load_dataset(
'FreedomIntelligence/medical-o1-reasoning-SFT',
'en'
)[split]
# Option 2: Load from your own CSV file
from datasets import load_dataset
dataset = load_dataset(
'csv',
data_files={'train': 'your_data_set.csv'}
)
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
At the heart of GRPO lies the reward signal: it tells the model what good behavior looks like and how to improve. In a medical reasoning context, “good behavior” has three dimensions—accuracy of the answer, readability of the explanation, and adherence to our prescribed <reasoning>/<answer> format. By carefully designing each reward function to target one of these dimensions, we can guide the model toward responses that are not only clinically correct but also clear and properly structured.
In this section, we’ll introduce three complementary reward signals: a semantic-correctness reward that compares each generated answer against the ground truth, a perplexity-based reward that favors fluent, natural-sounding text, and a tag-presence reward that enforces the <reasoning>/<answer> output format.
Together, these rewards form a balanced scorecard—accuracy, fluency, and format—so your fine-tuned model learns not just to say the right thing, but to say it in exactly the way you’ve specified.
Now, we explain the reward functions in more detail:
Sets whether to run models on GPU (CUDA) or CPU based on availability.
main_device = "cuda" if torch.cuda.is_available() else "cpu"
reward_device = "cuda" if torch.cuda.is_available() else "cpu"
Uses the cross-encoder/stsb-roberta-base model to score semantic similarity between each generated response and a ground truth answer.
def semantic_correctness(responses: List[str], answers: List[str]) -> List[float]:
...
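The full implementation is in the notebook; a minimal sketch, assuming the sentence-transformers CrossEncoder API, could look like this:
# Sketch: score each (response, reference answer) pair with a cross-encoder.
from typing import List
from sentence_transformers import CrossEncoder

similarity_model = CrossEncoder("cross-encoder/stsb-roberta-base", device=reward_device)

def semantic_correctness(responses: List[str], answers: List[str]) -> List[float]:
    pairs = [(response, answer) for response, answer in zip(responses, answers)]
    # STS-B cross-encoders return similarity scores roughly in the 0-1 range
    scores = similarity_model.predict(pairs)
    return [float(score) for score in scores]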
Uses a pretrained language model (such as microsoft/biogpt) to measure how fluent or "natural" each generated response is by computing its perplexity.
class PerplexityCalculator:
...
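Again, the notebook contains the full class; a minimal sketch using the Hugging Face transformers API (the method name and the 512-token truncation limit are assumptions) might be:
# Sketch: compute perplexity of each response under a pretrained causal LM;
# lower perplexity means more fluent, natural-sounding text.
import torch
from typing import List
from transformers import AutoModelForCausalLM, AutoTokenizer

class PerplexityCalculator:
    def __init__(self, model_name: str = "microsoft/biogpt", device: str = "cpu"):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
        self.model.eval()

    @torch.no_grad()
    def perplexity(self, texts: List[str]) -> List[float]:
        scores = []
        for text in texts:
            if not text.strip():
                scores.append(float("inf"))  # empty generations get the worst score
                continue
            inputs = self.tokenizer(
                text, return_tensors="pt", truncation=True, max_length=512
            ).to(self.device)
            # Causal-LM loss with labels = input_ids is the mean negative log-likelihood
            loss = self.model(**inputs, labels=inputs["input_ids"]).loss
            scores.append(float(torch.exp(loss)))
        return scores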
The tag-presence reward checks that the generated output follows the required structure. This encourages generated outputs to include:
<reasoning> ... </reasoning>
<answer> ... </answer>
def tag_presence_reward(completions: List[dict]) -> List[float]:
…
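A minimal sketch of this check is shown below; the exact scoring scheme in the notebook may differ (here each tag pair contributes half of the reward), and the extraction of text from a completion assumes TRL's conversational message format.
# Sketch: reward completions that contain the required tags.
import re
from typing import List

def tag_presence_reward(completions: List[dict]) -> List[float]:
    rewards = []
    for completion in completions:
        # In TRL's conversational format each completion is a list of messages;
        # adapt this extraction if your completions are plain strings or dicts.
        text = completion[0]["content"] if isinstance(completion, list) else completion["content"]
        score = 0.0
        if re.search(r"<reasoning>.*?</reasoning>", text, re.DOTALL):
            score += 0.5
        if re.search(r"<answer>.*?</answer>", text, re.DOTALL):
            score += 0.5
        rewards.append(score)
    return rewards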
This is the main reward function used during the RL training loop.
def combined_reward_func(prompts, completions, answer, **kwargs):
…
It does the following:
Step 1: Parse generated <answer> content
Step 2: Calculate rewards
Step 3: Normalize and combine scores
A weighted sum is calculated:
combined = [
0.5 * sim + 0.4 * perplex + 0.1 * tag
for sim, perplex, tag in zip(...)
]
The weights are 0.5 for semantic similarity, 0.4 for the normalized perplexity reward, and 0.1 for tag presence.
Step 4: Clamp the combined reward so that extreme values cannot dominate an update.
Perplexity Normalization Logic:
perplex_rewards = 1 / (perplex_scores / (perplex_scores.mean() + 1e-9))
Outputs:
A List[float] of rewards, one per completion, which are aligned with the input batch. Any invalid or empty completions automatically receive a score of -1.0.
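Putting the pieces together, a hedged sketch of combined_reward_func, reusing the sketched helpers above, could look like the following. The answer-parsing regex and the [-1.0, 1.0] clamping range are assumptions; only the component weights and the perplexity normalization follow the description above.
# Sketch: combine semantic similarity, fluency, and format rewards per completion.
import re
import numpy as np
from typing import List

perplexity_calculator = PerplexityCalculator(device=reward_device)

def _extract_answer(text: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def combined_reward_func(prompts, completions, answer, **kwargs) -> List[float]:
    # Step 1: parse generated <answer> content (assumes conversational completions)
    texts = [completion[0]["content"] for completion in completions]
    parsed = [_extract_answer(text) for text in texts]

    # Step 2: calculate the individual reward components
    sim_rewards = semantic_correctness(parsed, answer)
    perplex_scores = np.array(perplexity_calculator.perplexity(texts))
    perplex_scores = np.nan_to_num(perplex_scores, posinf=1e6)  # guard against empty outputs
    tag_rewards = tag_presence_reward(completions)

    # Step 3: normalize perplexity (lower perplexity -> higher reward) and combine
    perplex_rewards = 1 / (perplex_scores / (perplex_scores.mean() + 1e-9))
    combined = [
        0.5 * sim + 0.4 * perplex + 0.1 * tag
        for sim, perplex, tag in zip(sim_rewards, perplex_rewards, tag_rewards)
    ]

    # Step 4: clamp rewards; invalid or empty answers receive -1.0
    return [
        -1.0 if not parsed_answer else float(np.clip(score, -1.0, 1.0))
        for parsed_answer, score in zip(parsed, combined)
    ]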
In this section, we define the GRPOConfig settings for fine-tuning the model. Here we break down the core GRPOConfig options - learning rates, optimizer choices, batch sizes, and number of generations per prompt to show you how to balance iteration speed against memory constraints.
Here’s the code to configure hyperparameters for GRPO training. The batch-size, accumulation, step-count, and checkpoint variables below are example values; adjust them to your hardware and dataset:
from trl import GRPOConfig, GRPOTrainer
from unsloth import is_bfloat16_supported

# Example values for the variables referenced in the config below
per_device_train_batch_size = 5   # chosen to match num_generations below
gradient_accumulation_steps = 4
total_steps = 200                 # total optimization steps for this example run
num_checkpoints = 4               # number of intermediate checkpoints to save
training_args = GRPOConfig(
use_vllm=True, # use vLLM for fast inference!
learning_rate=5e-6,
adam_beta1=0.9,
adam_beta2=0.99,
weight_decay=0.1,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
optim="adamw_8bit",
logging_steps=1,
bf16=is_bfloat16_supported(),
fp16=not is_bfloat16_supported(),
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
num_generations=5, # Decrease if out of memory
max_prompt_length=128, # 128 to balance longer input prompts with training time requirements
max_completion_length=128,
max_steps=total_steps,
save_steps=int(total_steps // num_checkpoints),
max_grad_norm=0.1,
report_to="none", # Can use Weights & Biases
output_dir="grpo_outputs",
save_strategy="steps",)
With data, rewards, and training configuration in place, the next step is to launch trainer.train(). In this section, we execute the training loop to verify that the model is learning. The call to trainer.train() will run for training_args.max_steps iterations (200 in this example), generating and scoring groups of outputs, then updating the model each step.
Here’s the code to execute GRPO training:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
max_seq_length = max_seq_length,
load_in_4bit = True, # False for LoRA 16bit
fast_inference = True, # Enable vLLM fast inference
max_lora_rank = lora_rank,
gpu_memory_utilization = 0.5, # Reduce if out of memory
)
model = FastLanguageModel.get_peft_model(
model,
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
], # Remove QKVO if out of memory
lora_alpha = lora_rank,
use_gradient_checkpointing = "unsloth", # Enable long context finetuning
random_state = 3407,
)
# Set up the GRPO trainer with reward functions and the dataset.
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
combined_reward_func
],
args = training_args,
train_dataset = train_dataset,)
# Begin training.
trainer.train()
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Swap with another base model
# model_name = "Qwen/Qwen2.5-7B"
# model_name = "microsoft/Phi-4"
# model_name = "google/gemma-3-1b-it"
After completing the training process, we’ll save the LoRA weights and merge them back into the base model. Testing begins with running a set of representative medical questions (ideally ones not seen during training) through both the base and fine-tuned models. This allows us to qualitatively assess improvements in areas such as structured reasoning tags, more concise answers, and domain-specific correctness.
Let’s test our model to see how it performs. For this, let’s first save the LoRA weights:
model.save_lora("grpo_saved_lora")
Now, let’s test the model with a new question:
from vllm import SamplingParams
text = tokenizer.apply_chat_template(
[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": "Is Aspirin good for cardiovascular function?"},
],
tokenize=False,
add_generation_prompt=True,
)
sampling_params = SamplingParams(
temperature=0.8,
top_p=0.95,
max_tokens=1024,
)
output = (
model.fast_generate(
text,
sampling_params=sampling_params,
lora_request=model.load_lora("grpo_saved_lora"),
)[0]
.outputs[0]
.text
)
print(output)
You should observe that the model now adheres to the specified format, presenting its reasoning before providing an answer.
Input: “Is Aspirin good for cardiovascular function?”
Output (Before GRPO): A response from the base LLM, answering the question "Is Aspirin good for cardiovascular function?"
Output (After GRPO): A response with reasoning leading to the final answer, with a more structured output presented within the answer tags. Additional training can lead to more fine-grained reasoning traces within the reasoning tags.
Evaluation is crucial for assessing the fine-tuned model's performance across multiple dimensions. The core idea is to use an LLM “judge” (e.g. GPT-4o-mini or GPT-4) to score model outputs on multiple axes.
Data Structure
Each example should contain:
{
"question": str, # e.g. "What are the risk factors for DVT?"
"reference_answer": str, # the ground-truth <reasoning>…</reasoning><answer>…</answer>
"system_prompt": SYSTEM_PROMPT
}
Use the judge to score each output (base vs. fine-tuned) on a 1–5 scale across the following metrics:
1. Medical Accuracy
2. Reasoning Quality
3. Format Adherence
4. Fluency & Clarity
5. Overall Usefulness
Now we compute an Overall Score as the average of the above.
You are a medical expert and evaluator. For each case below, you will see:
- QUESTION: {question}
- REFERENCE: {reference_answer}
- MODEL OUTPUT: {model_output}
Please provide scores (1–5 where 1=poor, 5=excellent):
MEDICAL_SCORE:
REASONING_SCORE:
FORMAT_SCORE:
FLUENCY_SCORE:
USEFULNESS_SCORE:
OVERALL_SCORE: [average of above]
EXPLANATION: [2–3 sentences explaining your scoring]
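The judging loop itself is straightforward to script; a minimal sketch using the OpenAI Python client is shown below. The JUDGE_PROMPT_TEMPLATE name (the rubric above as a format string with {question}, {reference_answer}, and {model_output} placeholders) and the regex-based score parsing are assumptions, not the notebook's exact code.
# Sketch: score one model output with an LLM judge and parse the rubric scores.
# JUDGE_PROMPT_TEMPLATE is an assumed name for the rubric prompt defined above.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_output(question: str, reference_answer: str, model_output: str) -> dict:
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        question=question,
        reference_answer=reference_answer,
        model_output=model_output,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Pull out lines such as "MEDICAL_SCORE: 4"
    return {
        key: float(value)
        for key, value in re.findall(r"([A-Z_]+_SCORE):\s*([\d.]+)", text)
    }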
We sample 100 test cases for statistical power.
Δ in Mean Scores: An improvement in mean scores (e.g., a gain of +0.8 to +1.2 points) on MEDICAL_SCORE or REASONING_SCORE indicates better domain reasoning.
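To compute these deltas, aggregate the judge scores for the base and fine-tuned outputs and compare their means; a small sketch (assuming lists of per-case score dictionaries produced by judge_output above) is:
# Sketch: mean-score delta between fine-tuned and base model for one metric.
from statistics import mean

def score_delta(base_scores, tuned_scores, metric):
    base_mean = mean(case[metric] for case in base_scores)
    tuned_mean = mean(case[metric] for case in tuned_scores)
    return tuned_mean - base_mean

# Example: score_delta(base_results, tuned_results, "MEDICAL_SCORE")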
Unsloth provides several options for saving your fine-tuned model, but we’ll focus on the most common use case.
Saving in 16-bit Precision
You can save your fine-tuned model with 16-bit precision using the following command:
# Save to 16-bit precision
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
Pushing to Hugging Face Hub
To share your model with the community or collaborators, you can push it to the Hugging Face Hub using the push_to_hub_merged method. This method allows us to push the model in multiple quantization formats.
# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
"your-username/model-name", tokenizer, save_method="merged_16bit", token="your-token"
)
Saving in GGUF Format for llama.cpp
Unsloth also supports saving your model in GGUF format for use with llama.cpp:
# Save in GGUF format for llama.cpp and push the quantized files to the Hugging Face Hub
model.push_to_hub_gguf(
"your-username/model-name",
tokenizer,
quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
token="your-token",
)
The GGUF files can be used with llama.cpp or UI-based systems like Jan or Open WebUI.