VAGEN
Training VLM agents with multi-turn reinforcement learning
Reinforcing Visual State Reasoning for Multi-Turn VLM Agents

Animated Case Studies
Method
Problem Formulation: VLM Agent Training under POMDP
We frame multi-turn VLM agentic tasks as a Partially Observable Markov Decision Process (POMDP), represented by the tuple \((S, O, \phi, A, p, r, \gamma)\), where \(S\) denotes the set of environment states, and \(O\) is the space of observations perceived by the agent.
Each observation \(o_t \in O\) is a partial view of the environment state \(s_t \in S\), given by the observation function \(\phi\). The agent's objective is to learn a policy \(\pi_\theta\) that maximizes the expected cumulative discounted return \[\max_\theta \, E_{\pi_\theta, p} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]\].
In our setting, the policy \(\pi_\theta\) is parameterized by a VLM that takes visual observations (images) together with their textual prompts as input, and outputs language token sequences as actions.
Multi-Turn Reinforcement Learning with Visual State Reasoning
Our training algorithm optimizes multi-turn interactions to better address the demands of agentic tasks, with adaptations specific to VLMs in a multi-turn, trajectory-based optimization setting.
Trajectory Rollout with Visual State Reasoning
Each trajectory begins with an initial observation \(o_0\) provided by the environment. The agent generates a structured output \(a_t = \langle z_t, \bar{a}_t \rangle\), where \(z_t\) represents reasoning tokens and \(\bar{a}_t\) represents executable actions.
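Concretely, one rollout can be pictured as the loop below. This is a minimal sketch assuming a gym-style environment interface (reset/step), a policy.generate helper, and a parse_response function; these names are illustrative, not VAGEN's actual API.

```python
# Minimal sketch of one trajectory rollout. The environment interface
# (reset/step), policy.generate, and parse_response are illustrative
# assumptions, not VAGEN's actual API.
from dataclasses import dataclass, field


@dataclass
class Turn:
    observation: object   # o_t: image plus textual prompt shown to the VLM
    reasoning: str        # z_t: content of the <think>...</think> block (may be empty)
    action: str           # executable action a_bar_t from <answer>...</answer>
    reward: float


@dataclass
class Trajectory:
    turns: list = field(default_factory=list)


def rollout(env, policy, parse_response, max_turns: int = 10) -> Trajectory:
    """Collect one multi-turn trajectory: at each turn the VLM sees o_t and emits <z_t, a_bar_t>."""
    traj = Trajectory()
    obs = env.reset()                                 # initial observation o_0
    for _ in range(max_turns):
        response = policy.generate(obs)               # raw token sequence from the VLM
        reasoning, action = parse_response(response)  # split into z_t and executable action a_bar_t
        next_obs, reward, done = env.step(action)     # environment transition and reward r_t
        traj.turns.append(Turn(obs, reasoning, action, reward))
        obs = next_obs
        if done:
            break
    return traj
```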
Visual State Reasoning Strategies:
- NoThink: direct action generation without explicit reasoning.
  Template: <answer>...</answer>
- FreeThink: emergent reasoning without a prescribed structure.
  Template: <think>...</think><answer>...</answer>
- Grounding: explicit description of the current state.
  Template: <think><observation>...</observation>...</think><answer>...</answer>
  Learning: \(o_t \rightarrow s_t\)
- WorldModeling: explicit prediction of the future state.
  Template: <think>...<prediction>...</prediction></think><answer>...</answer>
  Learning: \(o_t, a_t \rightarrow s_{t+1}\)
- Grounding+WorldModeling: combined current-state description and future-state prediction.
  Template: <think><observation>...</observation><reasoning>...</reasoning><prediction>...</prediction></think><answer>...</answer>
  Learning: \(o_t \rightarrow s_t\), then \(s_t, a_t \rightarrow s_{t+1}\)
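Parsing these templates back into \(\langle z_t, \bar{a}_t \rangle\) only requires extracting the tagged segments. The regex-based sketch below is illustrative; it is not VAGEN's actual parser.

```python
import re

# Extract the tagged segments of a structured response a_t = <z_t, a_bar_t>.
# A simple regex over the known tags is one way to do this; this is a sketch,
# not VAGEN's actual parsing code.
TAGS = ("think", "observation", "reasoning", "prediction", "answer")


def parse_structured_response(text: str) -> dict:
    """Return the content of each recognized tag (empty string if the tag is absent)."""
    parsed = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else ""
    return parsed


# Example: a Grounding+WorldModeling style response.
response = (
    "<think><observation>The box is one tile left of the target.</observation>"
    "<reasoning>Pushing it right aligns it with the target.</reasoning>"
    "<prediction>The box will sit on the target.</prediction></think>"
    "<answer>Right</answer>"
)
print(parse_structured_response(response)["answer"])  # -> Right
```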
Advantage Estimation with Masked GAE
We use a modified form of Generalized Advantage Estimation (GAE) that applies masking to exclude tokens generated by the environment (i.e., non-action tokens) from advantage estimation and loss computation. This ensures that only the tokens generated by the policy contribute to the learning signal.
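The sketch below illustrates the idea on plain Python lists; the variable names and the exact treatment of masked positions are assumptions, and the actual implementation operates on batched tensors.

```python
# Token-masked GAE sketch: positions where loss_mask is 0 (environment or prompt
# tokens) receive zero advantage and are skipped when propagating the signal.
# Plain-Python illustration, not the framework's tensor implementation.
def masked_gae(rewards, values, loss_mask, gamma=1.0, lam=1.0):
    """rewards, values, loss_mask are per-token lists of equal length."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = 0.0
    for t in reversed(range(T)):
        if loss_mask[t] == 0:                 # non-action token: no learning signal
            continue
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [advantages[t] + values[t] for t in range(T)]
    return advantages, returns
```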
Policy Update with PPO
We update the actor with a token-masked PPO clipped surrogate objective and the critic with a correspondingly masked value loss, where the mask \(M_i^{loss}\) excludes non-action tokens from both losses. Trajectory collection, advantage estimation, and policy update iterate until convergence.
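As a sketch of what such a masked update can look like, the PyTorch snippet below applies a standard PPO clipped surrogate and a squared-error value loss with a per-token mask; the tensor names and details are assumptions rather than the paper's exact objective.

```python
import torch

# Standard PPO clipped surrogate and value loss with a per-token loss mask
# (M_i^{loss}). Tensor names and shapes are assumptions about upstream rollout
# processing; this is a sketch, not the paper's exact formulation.
def masked_ppo_losses(log_probs, old_log_probs, values, returns,
                      advantages, loss_mask, clip_eps=0.2):
    """All arguments are per-token tensors of the same shape; loss_mask is 1 for
    action tokens and 0 for environment/prompt tokens."""
    mask = loss_mask.float()
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_actor = -torch.min(surr1, surr2)
    per_token_critic = (values - returns) ** 2

    denom = mask.sum().clamp(min=1.0)         # avoid dividing by zero on empty masks
    actor_loss = (per_token_actor * mask).sum() / denom
    critic_loss = (per_token_critic * mask).sum() / denom
    return actor_loss, critic_loss
```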
Boost #1: Visual Reasoning Reward
We use an LLM-as-Judge to reward the agent when its described or predicted visual state matches the ground truth.
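A minimal sketch of such a judge-based reward is shown below; call_judge_model, the prompt wording, and the reward magnitude are placeholders rather than VAGEN's actual configuration.

```python
# Sketch of a turn-level visual reasoning reward using an LLM judge.
# `call_judge_model` is a placeholder for whatever LLM API is available;
# the prompt format and the reward value are illustrative assumptions.
JUDGE_PROMPT = (
    "Ground-truth state description:\n{truth}\n\n"
    "Agent's description/prediction:\n{claim}\n\n"
    "Do they describe the same state? Answer YES or NO."
)


def visual_reasoning_reward(agent_claim: str, ground_truth: str,
                            call_judge_model, reward_value: float = 0.5) -> float:
    """Reward the agent when its grounded or predicted state matches the ground truth."""
    if not agent_claim.strip():               # no <observation>/<prediction> content
        return 0.0
    verdict = call_judge_model(JUDGE_PROMPT.format(truth=ground_truth, claim=agent_claim))
    return reward_value if verdict.strip().upper().startswith("YES") else 0.0
```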

Boost #2: Bi-Level GAE
To address the limitation of only providing trajectory-level feedback, we propose Bi-Level GAE, which delivers fine-grained turn-level reward signals. This approach assigns rewards at the end of each action and introduces two discount factors: one for tokens within a turn, and one for transitions across turns.

Bi-Level GAE framework illustration.
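The sketch below illustrates the two-level computation with separate discount factors, gamma_turn across turns and gamma_token within a turn. The data layout and the way the turn-level advantage is injected at each turn's final token are assumptions about one reasonable implementation, not the exact algorithm.

```python
# Bi-Level GAE sketch: a turn-level GAE pass over turn rewards (discounted by
# gamma_turn), then a token-level pass inside each turn (discounted by
# gamma_token). Data layout and names are illustrative assumptions.
def bilevel_gae(turns, gamma_turn=0.95, gamma_token=1.0, lam=1.0):
    """turns: list of dicts with 'values' (per-token), 'mask' (per-token),
    and 'reward' (scalar reward assigned at the end of the turn)."""
    # Level 1: turn-level GAE, bootstrapping from the next turn's final-token value.
    turn_adv = [0.0] * len(turns)
    gae, next_turn_value = 0.0, 0.0
    for i in reversed(range(len(turns))):
        turn_value = turns[i]["values"][-1]
        delta = turns[i]["reward"] + gamma_turn * next_turn_value - turn_value
        gae = delta + gamma_turn * lam * gae
        turn_adv[i] = gae
        next_turn_value = turn_value

    # Level 2: token-level GAE inside each turn, seeded by that turn's advantage
    # at its last action token so credit flows backwards within the turn.
    all_token_adv = []
    for i, turn in enumerate(turns):
        values, mask = turn["values"], turn["mask"]
        token_adv = [0.0] * len(values)
        token_adv[-1] = turn_adv[i]
        gae, next_value = turn_adv[i], values[-1]
        for t in reversed(range(len(values) - 1)):
            if mask[t] == 0:                  # environment/prompt token: skipped
                continue
            delta = gamma_token * next_value - values[t]   # no per-token reward within a turn
            gae = delta + gamma_token * lam * gae
            token_adv[t] = gae
            next_value = values[t]
        all_token_adv.append(token_adv)
    return all_token_adv
```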
Results

Explicit visual state reasoning is crucial for VLM agents.

- Bi-Level GAE alone brings notable but unstable improvements; it is sensitive to reward sparsity and less reliable in sparse-reward environments.
- The Visual Reasoning Reward alone consistently boosts performance by providing essential visual learning signals, but is limited by coarse credit assignment.
- VAGEN-Full is the most robust and achieves strong, stable results across all tasks.
Cases
Case: Enhanced Visual State Reasoning with VAGEN-Full

Summary of Findings
Finding 1: Explicit Visual State Reasoning is Crucial for Multi-Turn VLM Agents
Vanilla VLMs struggle with multi-turn agentic tasks requiring visual state understanding. Integrating explicit visual state reasoning steps—specifically Grounding and World Modeling—into the VLM's thinking process during RL training significantly enhances task performance. The combined Grounding-WorldModeling strategy, in particular, demonstrates strong and stable performance, enabling a trained open-source VLM to outperform its untrained counterpart and even surpass benchmarked proprietary models.
Finding 2: Optimal Visual State Representation is Task-Dependent
The choice of representation for visual states during explicit reasoning significantly impacts performance:
- Natural Language: Performs consistently well, especially when structured information must be inferred from raw visual input.
- Structured Formats: Excel in manipulation-heavy tasks (e.g., PrimitiveSkill) where object-centric state abstractions are readily available.
- Symbolic Representations: Less effective, as the model has limited prior ability to interpret such symbols from visual input.
Finding 3: Visual Reasoning RL with Targeted Rewards and Bi-Level GAE Enhances Reasoning Quality and Task Success
To specifically improve visual state reasoning, Visual Reasoning RL incorporates:
- Turn-level Visual Reasoning Reward: An LLM-as-a-Judge assesses the accuracy of the VLM's explicit state descriptions and predictions, effectively supervising reasoning.
- Bi-Level Generalized Advantage Estimation (GAE): Estimates advantages at both turn and token levels, providing finer-grained reward signals and improving credit assignment.
This approach consistently outperforms Base RL, leading to improved reasoning quality, higher task success rates, and better generalization.
Finding 4: Emergent Reasoning Patterns and Challenges
Beyond quantitative measurements, we qualitatively analyzed how agents learn to reason:
- Reasoning Stability Varies by Task: While reasoning in tasks like Navigation and PrimitiveSkill (and often Sokoban) remains relatively coherent and beneficial with explicit rewards, tasks like FrozenLake show more erratic reasoning patterns, potentially correlating with its lower performance and the difficulty of its visual state reasoning.
- Potential for Reward Hacking: Instances of "reward hacking" were observed, particularly with certain reward configurations. Agents might learn to generate reasoning-like text that satisfies the reward mechanism without genuinely reflecting deep understanding or accurate future prediction.
- Bi-Level GAE as a Double-Edged Sword: While Bi-Level GAE can improve credit assignment, its interaction with visual reasoning rewards might sometimes allow for more "divergent" or less grounded thinking if the reasoning reward itself can be easily hacked.
- Convergence to Standardized Phrasing: Agents across different environments tend to converge towards using a more uniform, templated sentence structure for their reasoning and actions over prolonged training, primarily varying only the directional or specific action tokens.
- Rule-Based Filtering as a Potential Mitigation: For simpler forms of reward hacking where reasoning outputs fail basic semantic checks, simple rule-based filtering before reward assignment could be a pragmatic interim solution (a minimal sketch follows at the end of this section).
These observations underscore that while explicit reasoning and rewards are beneficial, the design of these rewards must be robust against exploitation, and continuous monitoring of reasoning quality is essential.
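As an illustration of the rule-based filtering mentioned in Finding 4, the sketch below applies a few simple checks before a reasoning reward is granted; the specific rules are illustrative only.

```python
# Sketch of rule-based filtering applied before the visual reasoning reward is
# assigned. The specific checks are illustrative examples, not VAGEN's rules.
def passes_basic_checks(reasoning_text: str, min_length: int = 10) -> bool:
    """Reject degenerate reasoning so it cannot earn a reasoning reward."""
    text = reasoning_text.strip()
    if len(text) < min_length:                      # empty or near-empty reasoning
        return False
    words = text.lower().split()
    if len(set(words)) < max(3, len(words) // 4):   # highly repetitive, templated filler
        return False
    if "<answer>" in text or "<think>" in text:     # leaked tags instead of content
        return False
    return True
```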