Writing & Thoughts
The Blog
Deep-dives into AI, ML systems, data engineering, and full-stack development — written from the trenches.
Explore how reinforcement learning agents with frozen language models can curate high-quality training datasets — the architecture behind modern RLHF pipelines.
Reinforcement learning from human feedback (RLHF) has powered models like GPT-4 and Claude. But collecting human feedback at scale is expensive. What if an agent — itself powered by a frozen language model — could do the curation automatically?
This is the architecture I built at Preference Model: an agentic RL framework where frozen LLMs act as evaluators and dataset curators.
```
┌────────────────────────────────────────────┐
│              Rollout Engine                │
│    (Environment generates trajectories)    │
└───────────────┬────────────────────────────┘
                │
                ▼
┌────────────────────────────────────────────┐
│             Frozen LLM Judge               │
│        (Scores each trajectory 0–1)        │
└───────────────┬────────────────────────────┘
                │
                ▼
┌────────────────────────────────────────────┐
│           Offline Dataset Buffer           │
│    (Only high-quality pairs are stored)    │
└───────────────┬────────────────────────────┘
                │
                ▼
┌────────────────────────────────────────────┐
│         Policy Update (DPO / PPO)          │
└────────────────────────────────────────────┘
```
Freezing the inference model avoids catastrophic forgetting and keeps evaluation consistent across training iterations. The frozen model essentially acts as a stable reward oracle.
```python
from anthropic import Anthropic

client = Anthropic()


def judge_trajectory(trajectory: str) -> float:
    """Score a trajectory using a frozen LLM judge."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Rate this trajectory on a scale 0.0–1.0. "
                f"Reply with only a number.\n\n{trajectory}"
            ),
        }],
    )
    try:
        score = float(response.content[0].text.strip())
    except ValueError:
        return 0.0  # treat an unparseable reply as lowest quality
    return min(max(score, 0.0), 1.0)  # clamp to the valid range
```
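Because the judge is frozen, a trajectory's score never drifts between training iterations, which means scores can be memoized instead of re-querying the API. A minimal sketch — the `cached_judge` wrapper is illustrative, not part of the pipeline above, and assumes deterministic scoring (e.g. temperature 0):

```python
import hashlib

_score_cache: dict[str, float] = {}


def cached_judge(trajectory: str, judge) -> float:
    """Memoize judge scores, keyed by a hash of the trajectory text.

    `judge` is any callable str -> float (e.g. judge_trajectory).
    Safe only because the judge model is frozen: the same input
    always maps to the same score.
    """
    key = hashlib.sha256(trajectory.encode()).hexdigest()
    if key not in _score_cache:
        _score_cache[key] = judge(trajectory)
    return _score_cache[key]
```

Across multiple epochs over the same rollout buffer, this turns repeated judge calls into dictionary lookups, which matters when each call is a paid API request.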
Rather than storing every trajectory, we apply a top-K filter per batch:
```python
def curate_batch(
    trajectories: list[str], top_k: float = 0.3
) -> list[tuple[str, float]]:
    """Keep only the top-K fraction of trajectories by judge score."""
    scored = [(t, judge_trajectory(t)) for t in trajectories]
    scored.sort(key=lambda x: x[1], reverse=True)
    cutoff = int(len(scored) * top_k)
    return scored[:cutoff]
```
This ensures the offline buffer only contains the best demonstrations, making downstream policy updates far more sample-efficient.
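To make the downstream DPO step concrete, here is a minimal sketch of turning a scored buffer into chosen/rejected pairs — `build_preference_pairs` is a hypothetical helper, not part of the pipeline above, and the top-half-vs-bottom-half pairing is just one simple strategy:

```python
def build_preference_pairs(
    scored: list[tuple[str, float]],
) -> list[tuple[str, str]]:
    """Pair high-scoring trajectories against low-scoring ones.

    scored: (trajectory, score) tuples for a batch.
    Returns (chosen, rejected) pairs suitable for a DPO-style loss.
    """
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    mid = len(ranked) // 2
    winners, losers = ranked[:mid], ranked[mid:]
    # Pair the i-th best with the i-th worst of the bottom half.
    return [(w[0], l[0]) for w, l in zip(winners, losers)]
```

DPO only needs relative preferences, so pairing curated winners against the trajectories the filter rejected gives it exactly the contrastive signal the judge's scores encode.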
If you're building similar systems, feel free to reach out — always happy to talk RL architectures.