RL Framework Worklog I -- Handling Off-Policiness
Training large language models with reinforcement learning runs into a fundamental issue: off-policiness. As the policy updates, the rollouts still arriving from earlier checkpoints become stale—their distribution drifts away from the current model. In latency-heavy, agentic/tool-calling settings (web calls, DB queries, GUI actions), rollouts can take minutes, so staleness is the norm, not the edge case.
In our experiments, we found a sharp asymmetry: positive-reward data tolerates staleness surprisingly well, but training on negative-reward data collapses under even moderate off-policiness. We then show a simple, effective fix: asymmetric importance ratio clipping, which stabilizes training on stale negatives while preserving sample efficiency on positives.
What is Off-Policiness?
Off-policiness measures the divergence between the policy that generated training data and the current policy being optimized. Let $\pi_{\text{old}}$ be the policy that generated trajectory $\tau$, and $\pi_\theta$ be the current policy. The importance ratio at trajectory level is:

$$\rho(\tau) = \frac{\pi_\theta(\tau)}{\pi_{\text{old}}(\tau)}$$
As training progresses, trajectories sampled from $\pi_\text{old}$ become increasingly unrepresentative of $\pi_\theta$'s behavior. When $\rho \gg 1$ or $\rho \ll 1$, gradient estimates become high-variance and potentially destructive to model performance.
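To make this concrete, here is a minimal sketch of computing $\rho(\tau)$ from per-token log-probabilities; the function name, tensor shapes, and example values are illustrative assumptions, not tied to any particular framework:

```python
import torch

def trajectory_importance_ratio(theta_token_logprobs: torch.Tensor,
                                old_token_logprobs: torch.Tensor) -> torch.Tensor:
    """Trajectory-level importance ratio rho = pi_theta(tau) / pi_old(tau).

    Both inputs hold the per-token log-probabilities of the sampled tokens of
    one trajectory (shape [seq_len]). Summing log-ratios before exponentiating
    avoids underflow from multiplying many small probabilities.
    """
    log_ratio = theta_token_logprobs.sum() - old_token_logprobs.sum()
    return log_ratio.exp()

# Toy numbers: a small per-token drift compounds over a long sequence,
# which is exactly what blows up the variance of the gradient estimate.
theta_lp = torch.full((100,), -1.00)   # hypothetical per-token logprobs under pi_theta
old_lp = torch.full((100,), -1.05)     # hypothetical per-token logprobs under pi_old
print(trajectory_importance_ratio(theta_lp, old_lp))  # exp(5) ≈ 148
```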
Why Off-Policiness Hurts GRPO/PPO Training
The policy gradient with importance sampling in GRPO/PPO takes the form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{old}}}\!\left[\rho(\tau)\, A(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\right]$$

where $A(\tau)$ is the advantage (computed via GAE over a learned value function in PPO, or via group normalization in GRPO). Separating by advantage sign reveals the core asymmetry:

$$\nabla_\theta J(\theta) = \underbrace{\mathbb{E}_{\tau \sim \pi_{\text{old}}}\!\left[\rho(\tau)\, A(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\,\mathbf{1}[A(\tau) > 0]\right]}_{\text{pulls } \pi_\theta \text{ toward } \tau} + \underbrace{\mathbb{E}_{\tau \sim \pi_{\text{old}}}\!\left[\rho(\tau)\, A(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\,\mathbf{1}[A(\tau) < 0]\right]}_{\text{pushes } \pi_\theta(\tau) \text{ toward } 0}$$
The critical insight:
- For positives $(A > 0)$ : The gradient increases $\log \pi_\theta(\tau)$ , pulling the policy toward these trajectories
- For negatives $(A < 0)$ : The gradient decreases $\log \pi_\theta(\tau)$, pushing $\pi_\theta(\tau) \rightarrow 0$
The negative term creates an unbounded optimization problem: we can always decrease the loss further by driving $\pi_\theta(\tau)$ toward zero for any negative trajectory from $\pi_{\text{old}}$. Under stale data, this becomes catastrophic—the model wastes capacity "digging holes" in probability space where it no longer explores, potentially erasing useful features.
Baselines reduce variance but do not remove this sign asymmetry; the effect persists across GRPO, PPO, and vanilla REINFORCE formulations.
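The "dig a hole" failure is easy to reproduce in a toy setting. The snippet below is purely illustrative (a single trainable logit under a two-way softmax, plain REINFORCE with a fixed negative advantage): the surrogate loss keeps decreasing while the trajectory's log-probability is driven toward $-\infty$.

```python
import torch

# Illustrative toy, not the experimental setup: one "trajectory" is the first
# outcome of a 2-way softmax with a single trainable logit.
logit = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([logit], lr=1.0)
advantage = -1.0  # a stale negative-reward trajectory

for step in range(5):
    log_prob = torch.log_softmax(torch.cat([logit, torch.zeros(1)]), dim=0)[0]
    loss = -(advantage * log_prob)   # REINFORCE-style surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: log_prob={log_prob.item():.3f}, loss={loss.item():.3f}")
# log_prob heads toward -inf and the loss keeps falling: the objective is
# unbounded below for negative-advantage trajectories.
```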
In-Flight Weight Updates: Embracing Staleness
Rather than waiting for all rollouts to complete before updating, in-flight weight updates continuously update the policy while rollouts are still being generated. This approach, detailed in Magistral (arxiv:2506.10910) and related work (arxiv:2505.24298, arxiv:2509.19128), addresses long-tail latency by:
- Updating the policy asynchronously without blocking on slow rollouts
- Either recomputing KV cache or continuing from stale KV cache when weights update mid-generation
- Using importance weighting to correct for the resulting staleness (V-trace, used by both AReaL and LlamaRL to mitigate off-policiness)
This is crucial for real-world agentic systems where tool calls (database queries, API requests, GUI agents) have unpredictable latency.
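The bookkeeping this implies can be sketched with plain Python threads. Everything below (the shared version counter, the worker loop, the timings) is an illustrative assumption rather than the actual Magistral or AReaL machinery; the point is simply that a single trajectory ends up spanning several policy versions, which the trainer must later correct for with importance weighting.

```python
import threading
import queue
import time
import random

latest_version = {"v": 0}          # stand-in for a parameter server
lock = threading.Lock()
finished = queue.Queue()           # completed rollouts, tagged with versions

def trainer(num_updates=5):
    for _ in range(num_updates):
        time.sleep(0.1)                # pretend to run one gradient step
        with lock:
            latest_version["v"] += 1   # publish new weights in-flight

def rollout_worker(worker_id, num_tool_calls=5):
    segment_versions = []
    for _ in range(num_tool_calls):
        with lock:
            version = latest_version["v"]    # swap in the newest weights here
        segment_versions.append(version)     # this segment was generated under `version`
        time.sleep(random.uniform(0.05, 0.3))  # unpredictable tool-call latency
    finished.put((worker_id, segment_versions))

threads = [threading.Thread(target=trainer)]
threads += [threading.Thread(target=rollout_worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
while not finished.empty():
    # e.g. (2, [0, 1, 2, 2, 4]): one trajectory spans several policy versions
    print(finished.get())
```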
Experiments: Negative Rewards and Off-Policiness
We trained Qwen3-30B-A3B on MATH500 (a standard mathematical reasoning benchmark) to quantify how negative rewards degrade under off-policy conditions. Training was conducted using the Slime Framework on 8× H100 GPUs with a global batch size of 256.
Our setup uses two inference servers:
- Current server: Always uses the latest policy checkpoint
- Stale server: Uses the policy from $k$ gradient steps ago (i.e., on-policy for the first $k$ steps of training, then perpetually stale by $k$ steps thereafter)
Unlike standard off-policy training where all data comes from a single stale policy, we isolate staleness effects by reward type:
For each prompt, we generate responses from:
- π(t): Current policy checkpoint
- π(t-k): Policy from k gradient steps ago

We then create hybrid datasets:
- Stale Positive: Positive-reward trajectories from π(t-k), negative from π(t)
- Stale Negative: Negative-reward trajectories from π(t-k), positive from π(t)
This lets us measure staleness impact on each reward type independently, unlike standard off-policy setups where all rewards degrade together.
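A minimal sketch of this hybrid-batch construction is below; the rollout format (per-prompt lists of (trajectory, reward) pairs) is an assumption for illustration, not the actual Slime data layout.

```python
def build_hybrid_batch(fresh_rollouts, stale_rollouts, stale_sign):
    """stale_sign=+1 builds 'Stale Positive', stale_sign=-1 builds 'Stale Negative'.

    fresh_rollouts / stale_rollouts: per-prompt lists of (trajectory, reward)
    pairs generated by pi(t) and pi(t-k) respectively.
    """
    batch = []
    for fresh, stale in zip(fresh_rollouts, stale_rollouts):
        # Trajectories of the targeted reward sign come from the stale policy...
        batch += [(t, r) for t, r in stale if (r > 0) == (stale_sign > 0)]
        # ...and trajectories of the opposite sign come from the current policy.
        batch += [(t, r) for t, r in fresh if (r > 0) != (stale_sign > 0)]
    return batch

# Example: the 'Stale Negative' condition mixes negatives from pi(t-k) with
# positives from pi(t).
fresh = [[("resp_a", 1.0), ("resp_b", -1.0)]]
stale = [[("resp_c", 1.0), ("resp_d", -1.0)]]
print(build_hybrid_batch(fresh, stale, stale_sign=-1))
# [('resp_d', -1.0), ('resp_a', 1.0)]
```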
We measure maximum reward (accuracy) achieved over 30 training steps.
Results: Max Reward Achieved on MATH500
| Training Setup | Stale Positive Rewards | Stale Negative Rewards |
|---|---|---|
| On-policy (0 stale) | 0.73 | 0.72 |
| 3-gen stale | 0.76 | 0.48 |
| 5-gen stale | 0.78 | 0.18 ⚠️ |

Key findings:
- With positive rewards, staleness has minimal impact—performance even improves slightly (0.73 → 0.78)
- With negative rewards, training catastrophically collapses at 5 steps of staleness (0.72 → 0.18, a 75% drop)
- The model never recovers from this collapse, getting stuck in a destructive loop
This asymmetry occurs because negative gradients actively push probabilities toward zero. When the data is stale, the model has already shifted away from these trajectories—the negative gradient then destroys representations that might now support positive behaviors.
Asymmetric Importance Clipping
V-trace from IMPALA (https://arxiv.org/abs/1802.01561) addresses off-policiness through importance ratio clipping:
$$\bar{\rho} = \min(\rho_\text{max}, \rho(\tau))$$
However, this treats positive and negative rewards symmetrically. We propose asymmetric clipping that respects the differential fragility:

$$\bar{\rho}(\tau) = \begin{cases} \min\!\left(c_{\text{pos}},\, \rho(\tau)\right) & \text{if } A(\tau) > 0 \\ \operatorname{clip}\!\left(\rho(\tau),\, \underline{c},\, \bar{c}\right) & \text{if } A(\tau) < 0 \end{cases}$$

where $c_{\text{pos}} = 10.0$ (generous headroom for positives), $\bar{c} = 1.0$ (no upweighting), and $\underline{c} = 0.5$ (gentle downweighting).
Rationale: The asymmetric clipping provides different treatment based on reward sign and importance ratio:
- Positive rewards + high $\rho$ : Allow upweighting (up to $c_{\text{pos}}$ ) for faster learning
- Positive rewards + low $\rho$: Natural downweighting is fine, still learning the right direction
- Negative rewards + high $\rho$: Cap at 1.0 to prevent amplifying negatives the model now prefers
- Negative rewards + low $\rho$: Floor at 0.5 to maintain signal without destructive "dig a hole" behavior
Pseudo-code
```python
import torch

def asymmetric_clip(rho, advantage, c_pos=10.0, c_neg_up=1.0, c_neg_down=0.5):
    """
    Asymmetric importance-ratio clipping based on advantage sign.

    Args:
        rho: Importance ratio π_θ(τ) / π_old(τ)
        advantage: GRPO-normalized advantage (can be ±)
        c_pos: Maximum ratio for positive advantages
        c_neg_up: Maximum ratio for negative advantages when ρ > 1
        c_neg_down: Minimum ratio for negative advantages when ρ ≤ 1
    """
    # Positive advantages: allow upweighting (up to c_pos) for acceleration
    pos = torch.clamp(rho, max=c_pos)
    # Negative advantages: cap at c_neg_up so we never amplify trajectories the
    # current policy prefers, and floor at c_neg_down to keep a minimum signal
    neg = torch.clamp(rho, min=c_neg_down, max=c_neg_up)
    return torch.where(advantage > 0, pos, neg)

# In the GRPO training loop
importance_ratio = (pi_theta_logprobs - pi_old_logprobs).exp()
clipped_ratio = asymmetric_clip(importance_ratio, advantage)
# Detach the clipped ratio so it acts as a fixed importance weight on the
# REINFORCE-style gradient ρ·A·∇log π_θ, rather than being differentiated itself
loss = -(clipped_ratio.detach() * advantage * pi_theta_logprobs).mean()
```
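As a quick sanity check, the four cases from the rationale above map to the following clipped values (the numbers are illustrative):

```python
rho = torch.tensor([15.0, 0.3, 2.0, 0.2])
adv = torch.tensor([+1.0, +1.0, -1.0, -1.0])
print(asymmetric_clip(rho, adv))  # tensor([10.0000,  0.3000,  1.0000,  0.5000])
```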
Why does this work?
The asymmetric approach addresses the core failure mode of off-policy training with negatives:
- Prevents unbounded optimization: By capping negative importance weights at 1.0, we prevent the model from aggressively pushing away from trajectories it has grown to like
- Maintains learning signal: The floor at 0.5 for stale negatives ensures gradients don't vanish entirely
- Accelerates positive learning: High caps for positives allow the model to quickly exploit good trajectories even when off-policy
- No KL penalty needed: This approach achieves stability without the computational overhead of KL regularization
Future Directions
- Dynamic Clipping Schedules
Rather than fixed thresholds, adaptively scale clipping based on observed staleness:
```python
import math

# Track staleness impact; compute_kl stands in for whatever estimator of
# KL(pi_current || pi_old) the training framework exposes
staleness_metric = compute_kl(pi_current, pi_old)
c_neg_down = max(0.1, 0.5 * math.exp(-staleness_metric))
c_pos = min(20.0, 10.0 * (1 + staleness_metric))
```
Conclusion
Off-policiness is not merely a theoretical concern—it manifests as catastrophic performance degradation, particularly for negative rewards in long-tail rollout scenarios. Our experiments clearly demonstrate an asymmetry: positive rewards tolerate and even benefit from staleness, while negative rewards collapse dramatically.
Asymmetric importance clipping provides a simple, effective solution that:
- Stabilizes training under high off-policiness
- Maintains sample efficiency
- Requires no KL regularization
- Integrates seamlessly with existing GRPO pipelines
For real-world agentic RL with unpredictable tool-calling latencies, this technique is essential for stable, scalable training.