
RAGEN

Training Agents by Reinforcing Reasoning

RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.

Figure: Comparison between RAGEN and existing LLM training methods.

Zihan Wang*1, Kangrui Wang*1, Qineng Wang*1, Pingyue Zhang*1, Linjie Li*2

Zhengyuan Yang4, Kefan Yu1, Minh Nhat Nguyen6, Licheng Liu7, Eli Gottlieb1, Monica Lam3,

Yiping Lu1, Kyunghyun Cho5, Jiajun Wu3, Li Fei-Fei3, Lijuan Wang4, Yejin Choi3, Manling Li1

* Equal Contribution

1 Northwestern University    2 University of Washington    3 Stanford University    4 Microsoft   
5 New York University    6 Singapore Management University    7 Imperial College London   

StarPO (State-Thinking-Action-Reward Policy Optimization)

Figure (StarPO Framework): Initial State → Reasoning → Action → Reward, repeated over turns.

The StarPO (State-Thinking-Action-Reward Policy Optimization) framework with two interleaved stages: rollout stage and update stage.

The framework consists of two key components:

MDP Formulation

We formulate agent-environment interactions as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics.

State \(s_t\) (token sequence) → Action \(a_t\) → State \(s_{t+1}\) (new token sequence)

At time \(t\), state \(s_t\) transitions to the next state through action \(a_t\) following a transition function \(P(s_{t+1} | s_t, a_t)\). The policy \(\pi(a_t | s_t)\) generates actions given the trajectory history. The objective is to maximize expected cumulative rewards \(\mathbb{E}_\pi[\sum_t \gamma^t r_t]\) across multiple interaction turns.
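As a concrete illustration of this objective, the short sketch below computes the discounted return \(\sum_t \gamma^t r_t\) for a single trajectory of per-turn rewards; the function name and the toy reward list are illustrative rather than part of RAGEN's API.

# Minimal sketch: discounted return of one trajectory (illustrative, not RAGEN's API).
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single rollout."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: per-turn rewards collected over three interaction turns.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.99^2 * 1.0 = 0.9801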

StarPO: Reinforcing Reasoning via Trajectory-Level Optimization

StarPO is a general RL framework for optimizing entire multi-turn interaction trajectories for LLM agents. The algorithm alternates between two phases:

Rollout Stage: Reasoning-Interaction Trajectories

Given an initial state, the LLM generates multiple trajectories. At each step, the model receives the trajectory history and generates a reasoning-guided action:

<think>...reasoning process...</think><ans> action </ans>

The environment receives the action and returns feedback (reward and next state).

For example, the agent reasons "I need to solve this equation by first isolating the variable..." and answers "x = 5"; the environment then returns its feedback.
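The sketch below shows one possible shape of this rollout loop; the generate function and the env interface (reset/step) are hypothetical stand-ins for the LLM and the environment, and only the <think>/<ans> action format comes from the description above.

import re

# Hypothetical rollout loop: `generate` and `env` are stand-ins, not RAGEN's API.
ANS_RE = re.compile(r"<ans>(.*?)</ans>", re.DOTALL)

def rollout(env, generate, max_turns=5):
    trajectory = []
    state = env.reset()                                     # initial state as a token sequence / string
    for _ in range(max_turns):
        history = "\n".join(f"{s}\n{o}" for s, o, _ in trajectory)
        output = generate(history + "\n" + state)           # "<think>...</think><ans> ... </ans>"
        match = ANS_RE.search(output)
        action = match.group(1).strip() if match else output  # fall back to the raw output
        next_state, reward, done = env.step(action)          # environment feedback
        trajectory.append((state, output, reward))
        state = next_state
        if done:
            break
    return trajectory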

Update Stage: Multi-turn Trajectory Optimization

After generating trajectories, we train LLMs to optimize expected rewards. Instead of step-by-step optimization, StarPO optimizes entire trajectories using importance sampling. This approach enables long-horizon reasoning while maintaining computational efficiency.
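One way to picture trajectory-level (rather than step-by-step) optimization is to compute the loss over the whole multi-turn sequence at once, masking out environment tokens; the sketch below assumes a generic tokenizer interface and a 0/1 loss-mask convention, which are illustrative choices rather than the exact implementation.

# Sketch: flatten a multi-turn trajectory into one training sequence, with a mask
# so that the loss covers only LLM-generated tokens (tokenizer/mask are assumed).
def flatten_trajectory(trajectory, tokenizer):
    input_ids, loss_mask = [], []
    for state, llm_output, _reward in trajectory:
        state_ids = tokenizer.encode(state)
        output_ids = tokenizer.encode(llm_output)
        input_ids += state_ids + output_ids
        loss_mask += [0] * len(state_ids) + [1] * len(output_ids)  # train only on agent tokens
    return input_ids, loss_mask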

StarPO supports multiple optimization strategies:


PPO (Proximal Policy Optimization): We estimate token-level advantages using a value function over trajectories

\[L(\theta) = \mathbb{E}[\min(r_t(\theta)A^\pi, \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)A^\pi)]\]
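A minimal PyTorch sketch of this clipped surrogate is given below; the tensor shapes, the loss mask, and the assumption that token-level advantages come from a separately trained value function are illustrative choices, not the exact training code.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2, loss_mask=None):
    """Clipped PPO surrogate over token-level log-probabilities.

    logp_new, logp_old, advantages: tensors of shape (batch, seq_len).
    Returns a scalar loss (the negated objective, to be minimized).
    """
    ratio = torch.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    objective = torch.min(unclipped, clipped)
    if loss_mask is not None:                                    # ignore environment tokens
        return -(objective * loss_mask).sum() / loss_mask.sum()
    return -objective.mean()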

GRPO (Group Relative Policy Optimization): We assign a normalized reward to the full trajectory

\[L(\theta) = \mathbb{E}[R(\tau) \cdot \log \pi_\theta(\tau)]\]
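A matching PyTorch sketch of this trajectory-level objective is shown below, assuming a group of rollouts sampled from the same initial state and rewards normalized within the group; the shapes and the normalization epsilon are illustrative assumptions.

import torch

def grpo_loss(logp_tokens, rewards, loss_mask, eps=1e-6):
    """Group-normalized trajectory-reward policy gradient (sketch).

    logp_tokens: (group, seq_len) log-probs of generated tokens per trajectory.
    rewards:     (group,) scalar trajectory rewards for rollouts of one initial state.
    loss_mask:   (group, seq_len) 1 for LLM-generated tokens, 0 for environment tokens.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)  # normalize within the group
    logp_traj = (logp_tokens * loss_mask).sum(dim=-1)                # log pi_theta(tau)
    return -(advantages * logp_traj).mean()                          # maximize R(tau) * log pi_theta(tau)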

Rollout and update stages interleave in StarPO, enabling both online and offline learning.

RAGEN Trajectory Examples

Explore agent trajectories across different tasks. View state transitions, LLM-generated actions, and the decision-making process.

Interactive trajectory viewer: each step shows the environment state, the LLM's reasoning, and the chosen action.

Findings

Key findings from our research on LLM reasoning stability and reinforcement learning dynamics.

Finding 1: Multi-turn training introduces new instability patterns

Adaptations of single-turn RL methods such as PPO and GRPO achieve early gains in agent settings but often collapse. A critic in PPO may delay instability but does not prevent reasoning degradation, highlighting the need for specialized stabilization in agent settings.

Finding 2: Model collapse in agent RL is reflected as "Echo Trap" over training

We find that early-stage agents respond with diverse symbolic reasoning but collapse into deterministic, repetitive templates as training proceeds. Models converge to fixed phrasing, indicating that RL may reinforce superficial patterns instead of general reasoning, forming an "Echo Trap" that hinders long-term generalization.

Finding 3: Collapse follows similar dynamics and can be anticipated by indicators

Reward standard deviation and entropy often fluctuate before performance degrades, while gradient norm spikes typically mark the point of irreversible collapse. These metrics provide early indicators and motivate the need for stabilization strategies.
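As an illustration, such indicators can be logged once per rollout batch; the metric names and the comments below are our own framing of these signals, and any alerting thresholds would be task-specific.

import numpy as np

# Illustrative per-iteration logging of collapse indicators (names are our own).
def collapse_indicators(batch_rewards, token_entropies, grad_norm):
    return {
        "reward_std": float(np.std(batch_rewards)),  # shrinking variance: rollouts losing diversity
        "entropy": float(np.mean(token_entropies)),  # dropping entropy: repetitive templates
        "grad_norm": float(grad_norm),               # sudden spikes often mark irreversible collapse
    }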

Finding 4: Uncertainty-based filtering improves training stability and efficiency

Filtering training data based on reward variance effectively combats the "Echo Trap". Retaining only the highly-uncertain training instances delays or prevents collapse across tasks and improves data efficiency.
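A minimal sketch of this filtering step is given below, assuming several rollouts per prompt so that a per-prompt reward standard deviation can be computed; the keep ratio is an illustrative placeholder rather than the value used in training.

import numpy as np

# Keep only the prompt groups whose rollout rewards vary the most (most "uncertain").
def filter_by_reward_std(groups, keep_ratio=0.25):
    """groups: list of (prompt, rollouts, rewards), with several rollouts per prompt."""
    scored = sorted(groups, key=lambda g: np.std(g[2]), reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return scored[:n_keep]  # these instances are used for the policy update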

Finding 5: Task diversity, action budget, and rollout frequency affect data quality

Diverse task instances enable better policy contrast and generalization across environments. Suitable action budgets provide enough planning space and avoid the noise introduced by overly long sequences. Up-to-date rollouts ensure optimization targets remain aligned with current policy behavior.
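For concreteness, these three levers can be thought of as rollout configuration knobs; the dictionary below is a hypothetical illustration, not RAGEN's actual configuration schema, and the values are placeholders.

# Hypothetical rollout configuration illustrating the three levers above.
rollout_config = {
    "num_task_instances": 64,      # task diversity: distinct initial states per batch
    "max_actions_per_turn": 5,     # action budget: enough planning room without overly long sequences
    "rollout_every_n_updates": 1,  # rollout freshness: keep training data aligned with the current policy
}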

Finding 6: Reasoning fails to emerge without meticulous reward design

While symbolic reasoning emerges naturally in single-turn tasks under weak supervision, it fails to persist in multi-turn environments unless the reward design explicitly encourages interpretable intermediate reasoning steps. We observe that even with structured prompts, reasoning gradually decays during training if the reward signal focuses only on final outcomes. This suggests that without meticulous reward shaping, agents tend to collapse into shortcut behaviors that bypass reasoning altogether.
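One simple way to make the reward explicitly encourage intermediate reasoning is to add a small format-aware bonus on top of the task outcome, as in the sketch below; the bonus size, length threshold, and function name are illustrative assumptions rather than the exact reward design.

import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def shaped_reward(task_reward, llm_output, bonus=0.1, min_len=20):
    """Add a small bonus when the output contains a non-trivial <think> block (illustrative)."""
    match = THINK_RE.search(llm_output)
    has_reasoning = match is not None and len(match.group(1).strip()) >= min_len
    return task_reward + (bonus if has_reasoning else 0.0)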

Citation

If you find RAGEN useful in your research, please consider citing our paper:

@misc{RAGEN,
  author       = {Zihan Wang* and Kangrui Wang* and Qineng Wang* and Pingyue Zhang* and Linjie Li* and Zhengyuan Yang and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Monica Lam and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  title        = {Training Agents by Reinforcing Reasoning},
  year         = {2025},
  organization = {GitHub},
  url          = {https://github.com/ZihanWang314/ragen},
}