RAGEN-V2

Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning

Even when entropy stays high, agents can quietly stop listening to inputs — producing fluent but input-agnostic boilerplate. We call this template collapse.

Zihan Wang*†1, Chi Gui†2, Xing Jin†3, Qineng Wang†1, Licheng Liu†4, Kangrui Wang1

Shiqi Chen5, Linjie Li6, Zhengyuan Yang7, Pingyue Zhang1, Yiping Lu1, Jiajun Wu8, Li Fei-Fei8

Lijuan Wang7, Yejin Choi8, Manling Li1

1Northwestern University   2UIUC   3University of British Columbia   4Imperial College London   5City University of Hong Kong   6University of Washington   7Microsoft   8Stanford University

† Core Contributors   * Project Lead

Method

Framework

A Two-Axis View of Reasoning Quality

The entropy of the output decomposes as

H(Z)  =  I(X; Z)  +  H(Z | X)

where X is the input prompt and Z is the model output.
[Figure: Four quadrants of reasoning on entropy H and MI I]

Figure 1. Template collapse (top-left): entropy looks healthy, but outputs are input-agnostic.
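The decomposition can be verified numerically on a toy joint distribution (a minimal sketch; the distribution below is illustrative, not from the paper):

```python
import numpy as np

# Toy joint distribution p(X, Z) over 3 prompts x 3 outputs (illustrative values).
# Rows index inputs X, columns index outputs Z.
p_xz = np.array([
    [0.30, 0.02, 0.01],
    [0.02, 0.30, 0.01],
    [0.01, 0.02, 0.31],
])
p_xz /= p_xz.sum()

def H(p):
    """Shannon entropy in nats, ignoring zero cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_z = p_xz.sum(axis=0)          # marginal over outputs Z
p_x = p_xz.sum(axis=1)          # marginal over inputs X
H_Z = H(p_z)
H_Z_given_X = H(p_xz) - H(p_x)  # chain rule: H(Z|X) = H(X,Z) - H(X)
I_XZ = H_Z - H_Z_given_X        # mutual information I(X;Z)

# H(Z) = I(X;Z) + H(Z|X) holds exactly for any joint distribution.
assert np.isclose(H_Z, I_XZ + H_Z_given_X)
```

Template collapse corresponds to I(X;Z) shrinking toward zero while H(Z) stays large: the output distribution remains spread out, but it no longer varies with the input.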

Diagnosis

Detecting Template Collapse

Score every output Zi against every input Xj. In a healthy model the diagonal of the score matrix dominates; under collapse, all rows look alike.

Diverse Reasoning

|    | X1  | X2  | X3  |
|----|-----|-----|-----|
| Z1 | 0.9 | 0.1 | 0.2 |
| Z2 | 0.2 | 0.8 | 0.1 |
| Z3 | 0.1 | 0.2 | 0.9 |

Retrieval Acc = 100%  ·  MI high

Template Collapse

|    | X1  | X2  | X3  |
|----|-----|-----|-----|
| Z1 | 0.5 | 0.5 | 0.4 |
| Z2 | 0.4 | 0.5 | 0.5 |
| Z3 | 0.5 | 0.4 | 0.5 |

Retrieval Acc → random  ·  MI → 0
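The diagnostic can be sketched directly from the score matrices above (a minimal illustration; `retrieval_accuracy` is a hypothetical helper name, not the released API):

```python
import numpy as np

def retrieval_accuracy(score):
    """score[i, j]: how well output Z_i matches input X_j.
    An output is retrieved correctly if its own input scores highest,
    i.e. the diagonal entry wins its row."""
    return float(np.mean(np.argmax(score, axis=1) == np.arange(len(score))))

# The two matrices from the tables above.
diverse = np.array([[0.9, 0.1, 0.2],
                    [0.2, 0.8, 0.1],
                    [0.1, 0.2, 0.9]])
collapsed = np.array([[0.5, 0.5, 0.4],
                      [0.4, 0.5, 0.5],
                      [0.5, 0.4, 0.5]])

print(retrieval_accuracy(diverse))    # 1.0: the diagonal dominates
print(retrieval_accuracy(collapsed))  # near chance: rows are nearly identical
```

Because the collapsed rows are almost indistinguishable, matching outputs back to their inputs degrades toward random guessing, which is exactly the signature entropy alone cannot see.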
Mechanism

Why Collapse Happens: Signal vs. Noise

[Figure: SNR mechanism]
gtotal  =  gtask  +  greg
Signal gtask — scales with reward variance. Low variance → weak signal.
Noise greg — KL/entropy penalty, constant regardless of input.

Low reward variance → noise dominates → outputs converge to a single template.
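A toy numeric sketch of the signal-vs-noise argument, assuming a constant regularization-gradient norm `C_REG` (a hypothetical value) and using the bound ‖gtask‖ ≤ c·√Var(R|X) as the signal term:

```python
import numpy as np

C_REG = 0.05  # hypothetical constant norm of the KL/entropy-penalty gradient

def snr(rewards, c=1.0):
    """SNR proxy for one prompt: the task-gradient norm is bounded by
    c * sqrt(Var(R|X)), while the regularization gradient norm is a
    constant that does not depend on the prompt."""
    g_task_bound = c * np.sqrt(np.var(rewards))
    return g_task_bound / C_REG

informative = np.array([0., 1., 1., 0., 1., 0., 0., 1.])  # mixed rollout outcomes
saturated = np.ones(8)                                     # every rollout succeeds

print(snr(informative))  # ~10: task signal dominates the regularizer
print(snr(saturated))    # 0: only regularization noise remains -> template pull
```

When all rollouts for a prompt score the same (all success or all failure), Var(R|X) is zero, the task gradient vanishes, and the constant regularization term is the only force left acting on the policy.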

Intervention

RAGEN-V2: SNR-Adaptive Filtering

  1. Sample multiple rollouts per prompt and compute the per-prompt reward variance V̂ar(R | X).
  2. Keep the top-ρ fraction of prompts by reward variance (the high-signal prompts).
  3. Run PPO / GRPO on the retained subset.
Compatible with PPO · DAPO · GRPO · Dr. GRPO, etc.
[Figure: RV-Filter overview]
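The steps above can be sketched as follows (a minimal illustration, not the released implementation; `rv_filter` and its arguments are hypothetical names):

```python
import numpy as np

def rv_filter(prompts, rollout_rewards, rho=0.9):
    """Keep the top-rho fraction of prompts by reward variance.
    rollout_rewards[i] holds the per-rollout rewards sampled for prompts[i];
    rho plays the role of the top-p threshold swept in the experiments."""
    rv = np.array([np.var(r) for r in rollout_rewards])
    n_keep = max(1, int(np.ceil(rho * len(prompts))))
    keep = np.argsort(-rv)[:n_keep]  # highest-variance prompts first
    return [prompts[i] for i in keep]

prompts = ["p0", "p1", "p2", "p3"]
rewards = [[1, 1, 1, 1],   # saturated: zero variance, no learning signal
           [0, 1, 1, 0],   # informative: mixed outcomes
           [0, 0, 0, 0],   # all-fail: zero variance
           [0, 0, 1, 0]]   # informative: mixed outcomes
kept = rv_filter(prompts, rewards, rho=0.5)
print(kept)  # ['p1', 'p3']: only the mixed-outcome prompts survive
```

The retained subset then feeds directly into the unmodified PPO / GRPO update, which is why the filter composes with any of the listed algorithms.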

Experiments


MI declines before performance — entropy stays high

[Figure: Training dynamics: MI declines before task performance while entropy remains elevated]

MI proxy declines before task performance degrades, while conditional entropy remains elevated — demonstrating that entropy alone is insufficient to detect template collapse.

Spearman correlation: MI vs Entropy as diagnostics

[Figure: Spearman correlation: MI-family metrics correlate positively with task success; entropy metrics correlate negatively]

MI-family metrics (blue) achieve positive correlations with task success (+0.39), while entropy metrics show near-zero or negative correlations (−0.11 to −0.14) — confirming entropy is directionally misleading.

Correlation table: RV vs candidate proxies

| Variable | Spearman ↑ | Pearson ↑ | Interpretation |
|---|---|---|---|
| Reward | 0.630 | 0.650 | Strong positive |
| Retrieval Acc (our diagnostic) | 0.130 | 0.150 | Weak positive |
| Response Length | 0.120 | 0.080 | Weak positive |
| MI (traj, est) | -0.100 | -0.170 | Negative |
| Conditional Entropy | -0.140 | -0.180 | Negative — misleading |

Format validity cannot detect collapse either

[Figure: Format validity vs MI and entropy: validity is decoupled from collapse metrics]

Runs can remain highly valid while exhibiting low MI. Validity-based metrics alone are not sufficient to detect collapse — content-sensitive measures are needed.

Gradient decomposition by reward-variance buckets

[Figure: Gradient decomposition by reward-variance buckets]
RV buckets separate cleanly — 6 equal buckets by V̂ar(R|X) yield well-separated distributions.
Task gradient ↑ with RV — consistent with ‖gtask‖ ≤ c·√Var(R|X).
Regularization gradient is flat across buckets — input-agnostic as predicted.

Extended gradient decomposition at step 101

[Figure: Gradient decomposition at step 101 for PPO and GRPO]

Complementary gradient analysis at training step 101 under both PPO (top) and GRPO (bottom). The same three trends hold: RV increases Q1→Q6, task gradient scales with RV, and regularization gradient stays flat.

Reward variance decreases over training

[Figure: Prompt-level reward and RV evolution across early, mid, and late training]

From early to late training, the hard-prompt subset contracts while mixed prompts expand, and prompt-level RV shifts downward — rollout rewards become progressively more uniform, directly driving template collapse.

Adaptive filtering responds to SNR dynamics

[Figure: Effective kept ratio and zero-variance prompts over training]

The kept ratio decreases as training progresses and more prompts yield near-zero reward variance. RV-aware filtering is adaptive: it applies stronger selection pressure precisely when task-discriminative signal is weakest.

Filtering across diverse environments

[Figure: Top-p vs Top-k vs No Filtering across Sokoban, SearchQA, WebShop, DeepCoder]

Top-p Filtering (orange) consistently outperforms both Top-k Filtering (blue) and No Filtering (gray) across Sokoban, SearchQA, WebShop, and DeepCoder — adaptive proportional selection is more effective than fixed-count filtering.

Comprehensive results matrix

| Experiment / Variant | Sokoban | FrozenLake | MetaMathQA | Countdown | Average Δ |
|---|---|---|---|---|---|
| **Baseline Algorithm** | | | | | |
| PPO · Qwen2.5-3B | 12.9 (+16.0) | 67.0 (+10.9) | 92.6 (+0.6) | 97.9 (+0.0) | +6.9 |
| **RL Algorithm (Qwen2.5-3B)** | | | | | |
| DAPO | 16.2 (+5.1) | 66.8 (+2.1) | 90.8 (+2.8) | 95.7 (+1.6) | +2.9 |
| GRPO | 12.1 (+9.0) | 70.9 (+2.3) | 91.2 (+1.2) | 95.7 (+2.2) | +3.7 |
| Dr. GRPO | 12.1 (-0.4) | 23.2 (+0.6) | 91.2 (+1.4) | 96.5 (+1.4) | +0.8 |
| **Model Scale (PPO)** | | | | | |
| Qwen2.5-0.5B | 3.3 (+22.9) | 19.5 (+0.0) | 10.0 (-0.2) | 23.0 (-0.7) | +5.5 |
| Qwen2.5-1.5B | 17.0 (+6.2) | 36.5 (+1.6) | 80.3 (+7.0) | 56.6 (+1.6) | +4.1 |
| Qwen2.5-7B | 42.4 (+4.9) | 85.0 (-0.6) | 84.0 (+11.7) | 97.7 (+0.3) | +4.1 |
| **Model Type (PPO)** | | | | | |
| Qwen2.5-3B-Instruct | 22.5 (+14.2) | 83.6 (+2.3) | 91.2 (+0.4) | 96.3 (-0.6) | +4.1 |
| Llama3.2-3B | 24.4 (+18.8) | 84.6 (-0.2) | 86.1 (+3.7) | 99.2 (-1.2) | +5.3 |
| **Modality — Qwen2.5-VL-3B (PPO)** | | | | | |
| Text input | 53.0 (+6.0) | 16.0 (+53.5) | — | — | +29.8 |
| Image input | 65.0 (+12.0) | 19.5 (+59.5) | — | — | +35.8 |
[Figure: FrozenLake success rate vs stochasticity for Top-p filtering vs no filtering]
More randomness → lower success. Median success rates decrease as stochasticity increases from 0% to 100% for both filtered and unfiltered runs.
RV filtering helps at 0–50% stochasticity. Top-p filtering maintains a clear advantage in this range, consistent with the SNR view.
Gap closes at 80–100%. High transition noise weakens reward variance as an informative signal proxy, reducing the filter’s effectiveness.
[Figure: RV-filter performance heatmap across top-p thresholds and training steps]

Each cell reports validation success rate at the corresponding checkpoint. Moderate filtering (top-p ∈ [0.8, 0.98]) consistently achieves the highest performance. Overly aggressive filtering (top-p ≤ 0.6) or no filtering (top-p = 1.0) yields lower returns.

Sweet spot at top-p 0.8–0.98 — retains enough informative rollouts while discarding low-signal prompts.
Too aggressive (top-p ≤ 0.6) — narrows exploration coverage, reducing effective training signal.
No filtering (top-p = 1.0) — low-RV prompts inject noise, leading to template collapse.
[Figure: Reasoning length over training steps across eight environments]

Output token count over training steps across eight environments. All environments exhibit consistent decline (−5% to −66%), across model sizes (3B and 7B) and both with and without RV filtering.

Universal phenomenon — reasoning length reduction is consistent across all environments, not an artifact of any specific intervention.
Scale-invariant — the trend holds across both 3B and 7B models.
Filter-independent — occurs with and without RV filtering, suggesting it is a general property of agent RL optimization.

Citation

If you find this work useful, please cite:

@article{ragenv2026collapse,
  title={RAGEN-v2: Understanding Reasoning Collapse in Multi-turn Agent Reinforcement Learning},
  author={Zihan Wang and Chi Gui and Xing Jin and Qineng Wang and Licheng Liu and Kangrui Wang and Shiqi Chen and Linjie Li and Zhengyuan Yang and Pingyue Zhang and Yiping Lu and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  year={2026}
}