
VAGEN

Training VLM agents with multi-turn reinforcement learning

Reinforcing Visual State Reasoning for Multi-Turn VLM Agents

VAGEN Framework Diagram

Animated Case Studies

Animated rollouts (two cases per environment) are shown for FrozenLake, Navigation, Sokoban, ManiSkill, and SVG.
* For SVG: Top is the model's generated SVG, Bottom is the target SVG.

Method

Problem Formulation: VLM Agent Training under POMDP


We frame multi-turn VLM agentic tasks as a Partially Observable Markov Decision Process (POMDP), represented by the tuple \((S, O, \phi, A, p, r, \gamma)\), where \(S\) is the set of environment states, \(O\) is the space of observations perceived by the agent, \(\phi\) is the observation function, \(A\) is the action space, \(p\) is the transition dynamics, \(r\) is the reward function, and \(\gamma\) is the discount factor.

Environment state \(s_t \in S\): the full environment state.
Observation function: \(o_t = \phi(s_t)\).
Agent observation \(o_t \in O\): a visual image together with a text prompt.

Each observation \(o_t \in O\) is a partial view of the environment state \(s_t \in S\), given by the observation function \(\phi\). The agent's objective is to learn a policy \(\pi_\theta\) that maximizes the expected cumulative discounted return: \[\max_\theta \, \mathbb{E}_{\pi_\theta, p} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]\]

In our setting, the policy \(\pi_\theta\) is parameterized by a VLM that takes in visual images with their prompts as observations, and outputs language token sequences as actions.
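For concreteness, a small Python sketch of the discounted return being maximized (the reward values below are made up for illustration):

def discounted_return(rewards, gamma=0.95):
    """Compute sum_{t=1..T} gamma^(t-1) * r_t for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: a three-turn trajectory with rewards r_1, r_2, r_3.
print(discounted_return([0.0, 0.5, 1.0], gamma=0.9))  # 0.9^0*0.0 + 0.9^1*0.5 + 0.9^2*1.0 = 1.26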

Multi-Turn Reinforcement Learning with Visual State Reasoning


Our training algorithm optimizes multi-turn interactions to better meet the demands of agentic tasks, adapting trajectory-based RL optimization to VLMs in a multi-turn setting.

Trajectory Rollout with Visual State Reasoning

Each trajectory begins with an initial observation \(o_0\) provided by the environment. The agent generates a structured output \(a_t = \langle z_t, \bar{a}_t \rangle\), where \(z_t\) represents reasoning tokens and \(\bar{a}_t\) represents executable actions.
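Below is a minimal sketch of how one such rollout could be collected; the environment interface (`env.reset`, `env.step`) and the `vlm_generate` helper are hypothetical placeholders rather than the VAGEN API, and the executable action is assumed to be wrapped in <answer> tags as in the templates listed next:

import re

def parse_output(text):
    """Split the structured output a_t into reasoning tokens z_t and the
    executable action inside <answer>...</answer>."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    executable = match.group(1).strip() if match else ""
    reasoning = re.sub(r"<answer>.*?</answer>", "", text, flags=re.DOTALL).strip()
    return reasoning, executable

def rollout(env, vlm_generate, max_turns=10):
    """Collect one multi-turn trajectory of (observation, output, reward) tuples."""
    trajectory = []
    obs = env.reset()                       # o_0: visual image + text prompt
    for _ in range(max_turns):
        output = vlm_generate(obs)          # a_t = <z_t, a_bar_t> as a token string
        z_t, a_bar_t = parse_output(output)
        next_obs, reward, done = env.step(a_bar_t)
        trajectory.append((obs, output, reward))
        if done:
            break
        obs = next_obs
    return trajectory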

Visual State Reasoning Strategies: NoThink, FreeThink, Grounding, WorldModeling, and Grounding+WorldModeling.

NoThink: Direct action generation without explicit reasoning

<answer>...</answer>

FreeThink: Emergent reasoning without specific structure

<think>...</think><answer>...</answer>

Grounding: Explicit current state description

<think><observation>...</observation>...</think><answer>...</answer>

Learning: \(o_t \rightarrow s_t\)

WorldModeling: Explicit future state prediction

<think>...<prediction>...</prediction></think><answer>...</answer>

Learning: \(o_t, a_t \rightarrow s_{t+1}\)

Grounding+WorldModeling: Combined current and future state reasoning

<think><observation>...</observation><reasoning>...</reasoning><prediction>...</prediction></think><answer>...</answer>

Learning: \(o_t \rightarrow s_t\) and \(s_t, a_t \rightarrow s_{t+1}\)
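Below is a minimal sketch of how these tag templates could be validated with regular expressions; the pattern set, strategy keys, and the example response are our own illustration, not the VAGEN implementation:

import re

# One regular expression per visual state reasoning strategy; each pattern
# mirrors the tag template listed above (assumed validation logic).
STRATEGY_PATTERNS = {
    "NoThink": r"^<answer>.+</answer>$",
    "FreeThink": r"^<think>.+</think><answer>.+</answer>$",
    "Grounding": r"^<think><observation>.+</observation>.+</think><answer>.+</answer>$",
    "WorldModeling": r"^<think>.+<prediction>.+</prediction></think><answer>.+</answer>$",
    "Grounding+WorldModeling": (
        r"^<think><observation>.+</observation><reasoning>.+</reasoning>"
        r"<prediction>.+</prediction></think><answer>.+</answer>$"
    ),
}

def follows_format(response, strategy):
    """Return True if the VLM response matches the tag template of `strategy`."""
    pattern = STRATEGY_PATTERNS[strategy]
    return re.fullmatch(pattern, response.strip(), flags=re.DOTALL) is not None

# Example: a Grounding+WorldModeling style response.
resp = ("<think><observation>The box is left of the target.</observation>"
        "<reasoning>Pushing right aligns it.</reasoning>"
        "<prediction>The box will sit on the target.</prediction></think>"
        "<answer>Right</answer>")
assert follows_format(resp, "Grounding+WorldModeling")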

Advantage Estimation with Masked GAE

We use a modified form of Generalized Advantage Estimation (GAE) that applies masking to exclude tokens generated by the environment (i.e., non-action tokens) from advantage estimation and loss computation. This ensures that only relevant tokens contribute to the learning signal.

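Below is a minimal sketch of masked GAE, assuming a flattened token sequence with per-token rewards, critic values, and a 0/1 loss mask; the function interface is illustrative, not the VAGEN source:

import torch

def masked_gae(rewards, values, loss_mask, gamma=1.0, lam=1.0):
    """Compute GAE over a token sequence while skipping masked tokens.

    rewards, values, loss_mask: 1-D tensors of equal length.
    loss_mask[t] == 1 for tokens generated by the policy (reasoning + action),
    0 for environment/prompt tokens, which receive zero advantage.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = 0.0          # bootstrap value after the last token (episode end)
    for t in reversed(range(T)):
        if loss_mask[t] == 0:
            advantages[t] = 0.0     # environment tokens carry no learning signal
            continue
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values * loss_mask
    return advantages, returns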

Policy Update with PPO

We update the actor by maximizing the masked PPO objective and the critic by minimizing the masked squared error to the return targets \(\hat{R}_i\):

\[J^{\text{PPO}}(\theta) = \frac{1}{\sum_i M_i^{\text{loss}}} \sum_i M_i^{\text{loss}} \cdot \min \left( r_i(\theta)\, A_i^{\text{GAE}},\ \text{clip}\left(r_i(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\right) A_i^{\text{GAE}} \right)\]
\[J^{\text{Critic}}(\phi) = \frac{1}{\sum_i M_i^{\text{loss}}} \sum_i M_i^{\text{loss}} \cdot \left( V_\phi(s_i) - \hat{R}_i \right)^2\]

where \(M_i^{\text{loss}}\) masks out environment-generated (non-action) tokens. Trajectory collection, advantage estimation, and policy update iterate until convergence.
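Below is a sketch of how these masked objectives might be computed per token in PyTorch; the tensor layout and variable names are our assumptions:

import torch

def masked_ppo_losses(logprobs, old_logprobs, advantages, values, returns,
                      loss_mask, clip_eps=0.2):
    """Token-level PPO actor/critic losses weighted by the loss mask M^loss."""
    mask = loss_mask.float()
    denom = mask.sum().clamp(min=1.0)

    # Actor: clipped surrogate objective, averaged over unmasked tokens only.
    ratio = torch.exp(logprobs - old_logprobs)          # r_i(theta)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -(torch.min(surr1, surr2) * mask).sum() / denom

    # Critic: squared error to return targets, also masked.
    value_loss = (((values - returns) ** 2) * mask).sum() / denom
    return policy_loss, value_loss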

Boost #1: Visual Reasoning Reward


We use an LLM-as-Judge to reward the agent when its described (Grounding) or predicted (WorldModeling) visual state matches the ground truth.
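Below is a minimal sketch of such a judge-based reward; the prompt wording, the `query_judge_llm` callable, and the bonus value are hypothetical placeholders rather than the actual VAGEN judge:

JUDGE_PROMPT = """You are a strict judge.
Ground-truth state: {ground_truth}
Agent's description/prediction: {agent_state}
Answer with a single word: MATCH or MISMATCH."""

def visual_reasoning_reward(agent_state, ground_truth, query_judge_llm, bonus=0.5):
    """Return a bonus reward when the judge says the agent's grounded or
    predicted state matches the ground-truth environment state."""
    verdict = query_judge_llm(
        JUDGE_PROMPT.format(ground_truth=ground_truth, agent_state=agent_state)
    )
    return bonus if verdict.strip().upper().startswith("MATCH") else 0.0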

Boost #2: Bi-Level GAE


To address the limitation of providing only trajectory-level feedback, we propose Bi-Level GAE, which delivers fine-grained turn-level reward signals. It assigns rewards at the end of each turn's action and introduces two discount factors: one for tokens within a turn and one for transitions across turns.

Bi-Level GAE Diagram

Bi-Level GAE framework illustration.

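Below is a minimal sketch of the two-level idea, assuming one reward per turn (credited at the turn's last token) and per-token critic values; the bookkeeping and default discount values are our assumptions, not the VAGEN source:

import torch

def bi_level_gae(turn_rewards, token_values, turn_slices,
                 gamma_turn=0.95, gamma_token=1.0, lam=1.0):
    """Two-level GAE sketch: gamma_turn discounts across turn boundaries,
    gamma_token discounts across tokens within a turn.

    turn_rewards: one scalar reward per turn, credited at the turn's last token.
    token_values: 1-D tensor of critic values for all response tokens.
    turn_slices:  list of (start, end) index pairs marking each turn's tokens.
    """
    advantages = torch.zeros_like(token_values)
    next_value = 0.0        # bootstrap after the final turn (assumed terminal)
    gae = 0.0
    for reward, (start, end) in zip(reversed(turn_rewards), reversed(turn_slices)):
        for t in reversed(range(start, end)):
            if t == end - 1:
                # Turn boundary: the turn reward arrives here; the next turn's
                # leading value is discounted by gamma_turn.
                delta = reward + gamma_turn * next_value - token_values[t]
                gae = delta + gamma_turn * lam * gae
            else:
                # Inside a turn: no reward, token-level discounting.
                delta = gamma_token * token_values[t + 1] - token_values[t]
                gae = delta + gamma_token * lam * gae
            advantages[t] = gae
        next_value = token_values[start]    # value at the start of this turn
    returns = advantages + token_values
    return advantages, returns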

Results

Main Results Table

Explicit visual state reasoning is crucial for VLM agents.

Ablation Study
  • Bi-Level GAE alone brings notable but unstable improvements; it is sensitive to reward sparsity and therefore less reliable in sparse-reward environments.
  • The Visual Reasoning Reward alone consistently boosts performance by providing essential visual learning signals, but is limited by coarse credit assignment.
  • VAGEN-Full is the most robust and achieves strong, stable results across all tasks.

Cases

Case: Enhanced Visual State Reasoning with VAGEN-Full

Case Study Appendix
VAGEN-Full vs. VAGEN-Base in Navigation, Sokoban, and FrozenLake (left to right). For each environment, VAGEN-Full is shown in the left column and VAGEN-Base in the right column.

Summary of Findings

Finding 1: Explicit Visual State Reasoning is Crucial for Multi-Turn VLM Agents


Vanilla VLMs struggle with multi-turn agentic tasks requiring visual state understanding. Integrating explicit visual state reasoning steps, specifically Grounding and World Modeling, into the VLM's thinking process during RL training significantly enhances task performance. The combined Grounding-WorldModeling strategy, in particular, demonstrates strong and stable performance, enabling a trained open-source VLM to outperform its untrained counterpart and even surpass benchmarked proprietary models.

Finding 2: Optimal Visual State Representation is Task-Dependent


The choice of representation for visual states during explicit reasoning significantly impacts performance:

  • Natural Language: Performs consistently well, especially when structured information must be inferred from raw visual input.
  • Structured Formats: Excel in manipulation-heavy tasks (e.g., PrimitiveSkill) where object-centric state abstractions are readily available.
  • Symbolic Representations: Proved less effective, as the model has only a limited prior for interpreting symbolic states from visual input.

Finding 3: Visual Reasoning RL with Targeted Rewards and Bi-Level GAE Enhances Reasoning Quality and Task Success


To specifically improve visual state reasoning, Visual Reasoning RL incorporates:

  • Turn-level Visual Reasoning Reward: An LLM-as-a-Judge assesses the accuracy of the VLM's explicit state descriptions and predictions, effectively supervising reasoning.
  • Bi-Level Generalized Advantage Estimation (GAE): Estimates advantages at both the turn and token levels, providing finer-grained reward signals and improving credit assignment.

This approach consistently outperforms Base RL, leading to improved reasoning quality, higher task success rates, and better generalization.

Finding 4: Emergent Reasoning Patterns and Challenges


Beyond quantitative measurements, we qualitatively analyzed how agents learn to reason:

  • Reasoning Stability Varies by Task: While reasoning in tasks like Navigation and PrimitiveSkill (and often Sokoban) remains relatively stable and beneficial with explicit rewards, FrozenLake shows more erratic reasoning patterns, potentially correlating with its lower performance and the difficulty of its visual state reasoning.
  • Potential for Reward Hacking: Instances of "reward hacking" were observed, particularly with certain reward configurations. Agents might learn to generate reasoning-like text that satisfies the reward mechanism without genuinely reflecting deep understanding or accurate future prediction.
  • Bi-Level GAE as a Double-Edged Sword: While Bi-Level GAE can improve credit assignment, its interaction with visual reasoning rewards might sometimes allow for more "divergent" or less grounded thinking if the reasoning reward itself can be easily hacked.
  • Convergence to Standardized Phrasing: Agents across different environments tend to converge towards using a more uniform, templated sentence structure for their reasoning and actions over prolonged training, primarily varying only the directional or specific action tokens.
  • Rule-Based Filtering as a Potential Mitigation: For simpler forms of reward hacking where reasoning outputs fail basic semantic checks, simple rule-based filtering before reward assignment could be a pragmatic interim solution.

These observations underscore that while explicit reasoning and rewards are beneficial, the design of these rewards must be robust against exploitation, and continuous monitoring of reasoning quality is essential.

Citation

If you find VAGEN useful in your research, we would appreciate it if you consider citing our work:

@misc{wang2025vagen,
  title={Reinforcing Visual State Reasoning for Multi-Turn VLM Agents},
  author={Kangrui Wang* and Pingyue Zhang* and Zihan Wang* and Yaning Gao* and Linjie Li* and Qineng Wang and Hanyang Chen and Chi Wan and Yiping Lu and Zhengyuan Yang and Lijuan Wang and Ranjay Krishna and Jiajun Wu and Li Fei-Fei and Yejin Choi and Manling Li},
  year={2025},
  url={https://github.com/RAGEN-AI/VAGEN}
}