
Are LLM Agents Budget-Aware?

NeurIPS 2026

Foundation-model agents are deployed with growing resource constraints like token, money, and time budgets, yet it remains unclear whether they know how much budget they will spend. We formalize budget awareness as progressive interval estimation: mid-execution, can the agent provide a calibrated interval on remaining budget and detect when a task is no longer finishable?

Yuxiang Lin*1,2, Zihan “Zenus” Wang*1,2, Mengyang Liu*3, Yuxuan Shan*2, Longju Bai*4

Junyao Zhang2, Xing Jin3, Boshan Chen2, Jinyan Su5, Xingyao Wang6, Jiaxin Pei7,8, Manling Li1

* Core Contributors

1Northwestern   2O2 AI   3Independent   4University of Michigan   5Cornell   6All Hands AI   7Stanford   8UT Austin

🎯 Budget awareness decouples from task performance — better agents do not reliably estimate their own costs (r ≈ 0.35).

📈 Universal optimistic bias — all models underestimate remaining budget; weaker models are more optimistic, not less.

Actionable signal — early stopping on impossible predictions saves 28–64% of wasted tokens at a cost of only 1.6–4.2 pp in success rate.

🏭 Trainable — SFT alone raises feasibility accuracy from 25.5% to ~90%, but interval coverage caps near 47% even after SFT+RL, exposing a hard reasoning gap.

Method

Setup

Two Budget Modalities

Internal Budget

The compute the model itself consumes: primarily tokens generated across reasoning steps. Tested on Sokoban, Search-R1, and SWE-bench.

$C_k^{\mathrm{in}} = \sum_{t=1}^{k} c_t^{\mathrm{in}}$
External Budget

The cost the agent commits in the environment: money, time, warehouse capacity. Budgets are multi-dimensional with coupled constraints. Tested on our Warehouse environment.

$C_k^{\mathrm{ex}} = \sum_{t=1}^{k} c_t^{\mathrm{ex}} \in \mathbb{R}^D$
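In code, the two running sums can be sketched as follows (a minimal sketch; the function names are illustrative, not the paper's code):

```python
# Hypothetical helpers for tracking cumulative budgets over agent turns.

def cumulative_internal(per_turn_tokens):
    """C_k^in: running sum of per-turn token costs up to turn k."""
    totals, running = [], 0
    for c in per_turn_tokens:
        running += c
        totals.append(running)
    return totals

def cumulative_external(per_turn_vectors):
    """C_k^ex in R^D: element-wise running sum over D coupled budgets
    (e.g. cost in USD, time in weeks, occupancy in item-weeks)."""
    totals, running = [], None
    for v in per_turn_vectors:
        running = list(v) if running is None else [a + b for a, b in zip(running, v)]
        totals.append(list(running))
    return totals

# Three turns of token usage; two turns of (USD, weeks, item-weeks) costs.
print(cumulative_internal([120, 340, 95]))           # [120, 460, 555]
print(cumulative_external([[10, 1, 5], [4, 0, 2]]))  # [[10, 1, 5], [14, 1, 7]]
```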
Protocol

Progressive Interval Estimation + Rollout-Replay

At every turn k, the estimator returns either a budget interval or an impossible declaration:

$\hat{y}_k = \begin{cases} [\hat{R}_k^{\mathrm{lo}},\, \hat{R}_k^{\mathrm{hi}}] & \text{if the task is still feasible} \\ \texttt{impossible} & \text{if completion is no longer achievable} \end{cases}$
1. Rollout generation — run the agent without any budget cap and log the full trajectory with per-turn costs and the final outcome.
2. Prefix replay & estimation — for each non-terminal turn k, replay the logged prefix and ask the agent for ŷk, scored against the true remaining cost Rk.

This decouples estimation ability from task completion: if estimation were done inline, it would itself consume tokens and alter the trajectory being measured.
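The two-step protocol can be sketched as a function over a logged trajectory. This is a minimal sketch under assumed interfaces: `estimate` is a hypothetical stand-in for the replayed agent query, and the toy estimator below is purely illustrative.

```python
def rollout_replay(trajectory, estimate):
    """trajectory: list of per-turn costs from one uncapped rollout (step 1).
    estimate: callable(prefix_costs) -> (lo, hi) interval or "impossible".
    For each non-terminal turn k, replays the logged prefix (step 2) and
    pairs the prediction with the true remaining cost R_k."""
    total = sum(trajectory)
    records, spent = [], 0
    for k, cost in enumerate(trajectory[:-1], start=1):  # non-terminal turns only
        spent += cost
        records.append({
            "turn": k,
            "prediction": estimate(trajectory[:k]),
            "true_remaining": total - spent,
        })
    return records

# Toy estimator: guess 2-4 more turns at the mean per-turn cost seen so far.
toy = lambda prefix: (2 * sum(prefix) / len(prefix), 4 * sum(prefix) / len(prefix))
recs = rollout_replay([100, 200, 150, 50], toy)
```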

Metrics

Three Sub-Capability Metrics

Feasibility Prediction

Can the agent tell whether the task will succeed? Scored as Macro-F₁ over feasible / impossible classes, at first turn and all turns.

Early Failure Detection

For tasks that ultimately fail, does the agent raise the alarm early? Scored as Fail-F₁ on the impossible class only.

Interval Calibration

On successful trajectories, how accurate and tight is the predicted interval? Scored as coverage × tightness, plus midpoint relative error.
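One plausible instantiation of the interval metrics, assuming an exponential width penalty for tightness (the exact functional form used in the paper is not reproduced here, but any width penalty in (0, 1] gives the same coverage × tightness shape):

```python
import math

def interval_reward(lo, hi, true_remaining):
    """Coverage x tightness: zero if the interval misses R_k, otherwise
    discounted by how wide the interval is relative to R_k (assumed penalty)."""
    covered = lo <= true_remaining <= hi
    tightness = math.exp(-(hi - lo) / max(true_remaining, 1e-9))
    return float(covered) * tightness

def midpoint_relative_error(lo, hi, true_remaining):
    """|midpoint - R_k| / R_k, the bias metric used alongside coverage."""
    mid = (lo + hi) / 2
    return abs(mid - true_remaining) / max(true_remaining, 1e-9)

# A covering tight interval scores higher than a covering loose one.
print(interval_reward(90, 110, 100))   # ~0.819
print(interval_reward(0, 400, 100))    # ~0.018
print(interval_reward(150, 200, 100))  # 0.0 (missed)
```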

Setup

4 Environments × 5 Frontier Models

Sokoban (internal) — 8×8 grid planning task, 2,500-token cap.

Search-R1 (internal) — multi-hop information retrieval, 3,500-token cap.

SWE-bench (internal) — GitHub issue resolution, 160-turn cap.

Warehouse (external) — supply-chain environment built from real enterprise data, with 3 coupled budgets: cost (USD), time (weeks), occupancy (item-weeks).

Models: GPT-5.2 Instant, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1 Pro, Qwen3-235B.

Key Findings


Task performance vs estimation quality

Left & middle: task success rate correlates only weakly with budget estimation quality for both internal and external budgets. Right: on failed Sokoban trajectories, estimation accuracy improves as more progress is observed — but the largest gains appear only very late.

On Search-R1, Opus achieves the highest task success rate (75.8%) but Sonnet produces better interval estimates (36.5% hit rate vs. 23.1% for Opus).
On SWE-bench, rankings split three ways: Opus leads task success, Gemini leads feasibility F₁, GPT-5.2 leads interval hit rate.
What separates good estimators is calibration, not feasibility. Interval hit rate correlates strongly with midpoint bias (r ≈ −0.67) but only weakly with feasibility F₁ (r ≈ 0.35).

Full Results Table

| Model | Success | F₁@1 | F₁@All | Fail-F₁ | Hit Rate | Reward |
|---|---|---|---|---|---|---|
| **SWE-bench** | | | | | | |
| Claude Opus 4.7 | 71.9% | 41.1% | 51.1% | 48.8% | 30.3% | 0.160 |
| Claude Sonnet 4.6 | 68.8% | 32.2% | 37.7% | 23.4% | 22.3% | 0.130 |
| Gemini 3.1 Pro | 68.8% | 39.2% | 58.2% | 52.0% | 23.2% | 0.112 |
| GPT-5.2 Instant | 57.8% | 43.5% | 40.2% | 21.2% | 44.3% | 0.115 |
| Qwen3-235B | 33.3% | 47.6% | 35.1% | 32.8% | 6.5% | 0.021 |
| **Search-R1** | | | | | | |
| Claude Opus 4.7 | 75.8% | 39.4% | 40.5% | 5.6% | 23.1% | 0.114 |
| Claude Sonnet 4.6 | 71.1% | 37.9% | 33.3% | 0.0% | 36.5% | 0.154 |
| GPT-5.2 Instant | 68.0% | 40.2% | 38.3% | 0.0% | 21.4% | 0.031 |
| Gemini 3.1 Pro | 53.9% | 24.5% | 24.8% | 0.0% | 20.7% | 0.079 |
| Qwen3-235B | 35.2% | 33.2% | 23.9% | 30.9% | 0.0% | 0.000 |
| **Sokoban** | | | | | | |
| Claude Opus 4.7 | 56.2% | 46.3% | 45.6% | 16.0% | 46.4% | 0.112 |
| Claude Sonnet 4.6 | 51.6% | 46.4% | 53.6% | 33.9% | 45.1% | 0.148 |
| Gemini 3.1 Pro | 39.1% | 40.0% | 61.9% | 79.9% | 8.8% | 0.313 |
| GPT-5.2 Instant | 35.2% | 27.7% | 40.6% | 32.8% | 36.0% | 0.167 |
| Qwen3-235B | 7.0% | 6.3% | 12.6% | 20.4% | 10.8% | 0.029 |
| **Warehouse (external budget)** | | | | | | |
| GPT-5.2 Instant | 35.0% | 63.4% | 56.9% | 24.7% | — | 0.577 |
| Claude Opus 4.7 | 33.3% | 63.2% | 55.7% | 35.9% | — | 0.690 |
| Claude Sonnet 4.6 | 33.3% | 64.9% | 59.0% | 17.3% | — | 0.572 |
| Gemini 3.1 Pro | 42.0% | 67.0% | 62.8% | 50.2% | — | 0.698 |
| Qwen3-235B | 41.0% | 60.8% | 56.0% | 17.3% | — | 0.483 |

Cumulative optimism across models


Models generally estimate budget too optimistically throughout the rollout. Conservative bias increases with rollout progress, but remains secondary to optimism overall.

First-turn vs. all-turn feasibility F₁


Points scatter on both sides of the equality line — first-turn predictions do not reliably summarize what the same agent would say after seeing partial progress. On Sokoban, Gemini improves by +21.9 F₁ points; on Search-R1, Qwen3-235B drops 9.3 points.

Optimistic misses dominate at every progress bin for all 20 model-environment pairs — the bias is not an artifact of early turns.
Bias tracks model confidence, not task difficulty. Within an environment, weaker models are more optimistic — consistent with overconfidence under limited self-awareness.
First-turn estimates are unstable: they are neither a consistent over- nor underestimate of the all-turn judgment, and the direction depends on the specific model and task.
Failure recognition timing

Failure is recognized late: across environments, models often label failed trajectories as impossible only after much of the token budget has already been spent.

On failed trajectories, models still predict feasible more than 70% of the time even after 60% of the budget is consumed; the prediction drops sharply only in the final 20%.
This late recognition wastes substantial compute on trajectories that are already doomed — an opportunity for early-stop policies to intervene.

Early Stopping Policy Results

A simple policy: terminate whenever the model predicts impossible. False aborts stop a would-have-succeeded trajectory; false continues miss a chance to save compute.

| Model | False-Abort Rate | Tokens Saved | Stopped Failed Rollouts |
|---|---|---|---|
| GPT-5.2 Instant | 6.6% | 64.1% | 124 / 215 |
| Claude Opus 4.7 | 2.2% | 28.2% | 62 / 169 |
| Claude Sonnet 4.6 | 3.3% | 49.6% | 101 / 183 |
| Gemini 3.1 Pro | 2.8% | 55.7% | 123 / 221 |
| Qwen3-235B | 4.9% | 38.8% | 140 / 306 |

SFT + RL training on Sokoban


Left: coverage vs. midpoint error for SFT checkpoints and their SFT+RL continuations. Right: reward before and after RL from the same SFT starts. RL improves estimation only when starting from a suitable SFT initialization — RL without SFT collapses.

SFT interval-width ablation


SFT choices control the estimator's behavior. Wider interval targets improve coverage but increase midpoint error. Longer SFT makes the model more conservative by reducing feasible predictions and improving recall on impossible cases.

Feasibility is a calibration problem. Base Qwen-7B: 25.5% accuracy. SFT alone raises it to ~90%. The capability was already there — the model just needed the right format.
Interval estimation is a reasoning problem. Base: 10.5% coverage. SFT: 26–53% depending on target width. SFT+RL: 47% with 28% median midpoint error. Nearly half of intervals still miss the true remaining budget.
RL without SFT collapses entirely. The model either labels everything impossible or emits invalid formats — RL cannot recover capability from sparse reward on its own.
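One way to see why RL needs an SFT start is to write out a candidate reward. The sketch below is an assumed design, not the paper's exact reward: it pays for correct feasibility calls and for covering, tight intervals, and gives zero to invalid formats — so a base model emitting mostly invalid or always-impossible outputs sees an almost uniformly sparse signal.

```python
import math

def estimation_reward(prediction, task_succeeded, true_remaining):
    """Hypothetical per-turn reward for RL on budget estimation.
    prediction: "impossible" or an (lo, hi) interval."""
    if prediction == "impossible":
        # Feasibility correctness: right only if the task indeed fails.
        return 1.0 if not task_succeeded else 0.0
    try:
        lo, hi = prediction
    except (TypeError, ValueError):
        return 0.0                       # invalid format -> no reward
    if not task_succeeded or lo > hi:
        return 0.0                       # intervals scored on successes only
    covered = lo <= true_remaining <= hi
    tightness = math.exp(-(hi - lo) / max(true_remaining, 1e-9))  # assumed penalty
    return float(covered) * tightness
```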

Citation

If you find this work useful, please cite:

@misc{lin2026bagen,
  title={BAGEN: Are LLM Agents Budget-Aware?},
  author={Yuxiang Lin and Zihan Wang and Mengyang Liu and Yuxuan Shan and Longju Bai and Junyao Zhang and Xing Jin and Boshan Chen and Jinyan Su and Xingyao Wang and Jiaxin Pei and Manling Li},
  year={2026},
  note={NeurIPS 2026},
}