Writing & Thoughts
The Blog
Deep-dives into AI, ML systems, data engineering, and full-stack development — written from the trenches.
Explore how reinforcement learning agents with frozen language models can curate high-quality training datasets — the architecture behind modern RLHF pipelines.
Reinforcement learning from human feedback (RLHF) has powered models like GPT-4 and Claude. But collecting human feedback at scale is expensive. What if an agent — itself powered by a frozen language model — could do the curation automatically?
This is the architecture I built at Preference Model: an agentic RL framework where frozen LLMs act as evaluators and dataset curators.
```
┌────────────────────────────────────────────┐
│              Rollout Engine                │
│    (Environment generates trajectories)    │
└───────────────┬────────────────────────────┘
                │
                ▼
┌────────────────────────────────────────────┐
│             Frozen LLM Judge               │
│        (Scores each trajectory 0–1)        │
└───────────────┬────────────────────────────┘
                │
                ▼
┌────────────────────────────────────────────┐
│           Offline Dataset Buffer           │
│    (Only high-quality pairs are stored)    │
└───────────────┬────────────────────────────┘
                │
                ▼
┌────────────────────────────────────────────┐
│         Policy Update (DPO / PPO)          │
└────────────────────────────────────────────┘
```
Freezing the inference model avoids catastrophic forgetting and keeps evaluation consistent across training iterations. The frozen model essentially acts as a stable reward oracle.
```python
from anthropic import Anthropic

client = Anthropic()


def judge_trajectory(trajectory: str) -> float:
    """Score a trajectory using a frozen LLM judge."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Rate this trajectory on a scale 0.0–1.0. "
                f"Reply with only a number.\n\n{trajectory}"
            ),
        }],
    )
    try:
        score = float(response.content[0].text.strip())
    except ValueError:
        return 0.0  # treat an unparseable reply as lowest quality
    return min(max(score, 0.0), 1.0)  # clamp to the valid range
```
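Because the judge is frozen, a trajectory's score never drifts between training iterations, which means scores can be memoized instead of re-querying the API. A minimal sketch — the `cached_judge` wrapper is illustrative, not part of the pipeline above, and assumes deterministic scoring (e.g. temperature 0):

```python
import hashlib

_score_cache: dict[str, float] = {}


def cached_judge(trajectory: str, judge) -> float:
    """Memoize judge scores, keyed by a hash of the trajectory text.

    `judge` is any callable str -> float (e.g. judge_trajectory).
    Safe only because the judge model is frozen: the same input
    always maps to the same score.
    """
    key = hashlib.sha256(trajectory.encode()).hexdigest()
    if key not in _score_cache:
        _score_cache[key] = judge(trajectory)
    return _score_cache[key]
```

Across multiple epochs over the same rollout buffer, this turns repeated judge calls into dictionary lookups, which matters when each call is a paid API request.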
Rather than storing every trajectory, we apply a top-K filter per batch:
```python
def curate_batch(
    trajectories: list[str], top_k: float = 0.3
) -> list[tuple[str, float]]:
    """Keep only the top-K fraction of trajectories by judge score."""
    scored = [(t, judge_trajectory(t)) for t in trajectories]
    scored.sort(key=lambda x: x[1], reverse=True)
    cutoff = int(len(scored) * top_k)
    return scored[:cutoff]
```
This ensures the offline buffer only contains the best demonstrations, making downstream policy updates far more sample-efficient.
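To make the downstream DPO step concrete, here is a minimal sketch of turning a scored buffer into chosen/rejected pairs — `build_preference_pairs` is a hypothetical helper, not part of the pipeline above, and the top-half-vs-bottom-half pairing is just one simple strategy:

```python
def build_preference_pairs(
    scored: list[tuple[str, float]],
) -> list[tuple[str, str]]:
    """Pair high-scoring trajectories against low-scoring ones.

    scored: (trajectory, score) tuples for a batch.
    Returns (chosen, rejected) pairs suitable for a DPO-style loss.
    """
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    mid = len(ranked) // 2
    winners, losers = ranked[:mid], ranked[mid:]
    # Pair the i-th best with the i-th worst of the bottom half.
    return [(w[0], l[0]) for w, l in zip(winners, losers)]
```

DPO only needs relative preferences, so pairing curated winners against the trajectories the filter rejected gives it exactly the contrastive signal the judge's scores encode.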
If you're building similar systems, feel free to reach out — always happy to talk RL architectures.