
Are LLM Agents Budget-Aware?

NeurIPS 2026

Foundation-model agents are deployed with growing resource constraints like token, money, and time budgets, yet it remains unclear whether they know how much budget they will spend. We formalize budget awareness as progressive interval estimation: mid-execution, can the agent provide a calibrated interval on remaining budget and detect when a task is no longer finishable?

Yuxiang Lin*1,2, Zihan “Zenus” Wang*1,2, Mengyang Liu*3, Yuxuan Shan*2, Longju Bai*4

Junyao Zhang2, Xing Jin3, Boshan Chen2, Jinyan Su5, Xingyao Wang6, Jiaxin Pei7,8, Manling Li1

* Core Contributors

1Northwestern   2O2 AI   3Independent   4University of Michigan   5Cornell   6All Hands AI   7Stanford   8UT Austin

🎯 Budget awareness decouples from task performance — better agents do not reliably estimate their own costs (r ≈ 0.35).

📈 Universal optimistic bias — all models underestimate remaining budget; weaker models are more optimistic, not less.

Actionable signal — early stopping on impossible predictions saves 28–64% of wasted tokens at a cost of only 1.6–4.2 pp in success rate.

🏭 Trainable — SFT alone raises feasibility accuracy from 25.5% to ~90%, but interval coverage caps near 47% even after SFT+RL, exposing a hard reasoning gap.

Method

Setup

Two Budget Modalities

Internal Budget

The compute the model itself consumes: primarily tokens generated across reasoning steps. Tested on Sokoban, Search-R1, and SWE-bench.

$C_k^{\mathrm{in}} = \sum_{t=1}^{k} c_t^{\mathrm{in}}$
External Budget

The cost the agent commits in the environment: money, time, warehouse capacity. Budgets are multi-dimensional with coupled constraints. Tested on our Warehouse environment.

$C_k^{\mathrm{ex}} = \sum_{t=1}^{k} c_t^{\mathrm{ex}} \in \mathbb{R}^D$
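In code, the two running sums can be sketched as follows (a minimal sketch; the function names are illustrative, not the paper's code):

```python
# Hypothetical helpers for tracking cumulative budgets over agent turns.

def cumulative_internal(per_turn_tokens):
    """C_k^in: running sum of per-turn token costs up to turn k."""
    totals, running = [], 0
    for c in per_turn_tokens:
        running += c
        totals.append(running)
    return totals

def cumulative_external(per_turn_vectors):
    """C_k^ex in R^D: element-wise running sum over D coupled budgets
    (e.g. cost in USD, time in weeks, occupancy in item-weeks)."""
    totals, running = [], None
    for v in per_turn_vectors:
        running = list(v) if running is None else [a + b for a, b in zip(running, v)]
        totals.append(list(running))
    return totals

# Three turns of token usage; two turns of (USD, weeks, item-weeks) costs.
print(cumulative_internal([120, 340, 95]))           # [120, 460, 555]
print(cumulative_external([[10, 1, 5], [4, 0, 2]]))  # [[10, 1, 5], [14, 1, 7]]
```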
Protocol

Progressive Interval Estimation + Rollout-Replay

At every turn k, the estimator returns either a budget interval or an impossible declaration:

$\hat{y}_k = \begin{cases} [\hat{R}_k^{\mathrm{lo}},\, \hat{R}_k^{\mathrm{hi}}] & \text{if the task is still feasible} \\ \texttt{impossible} & \text{if completion is no longer achievable} \end{cases}$
1. Rollout generation — run the agent without any budget cap and log the full trajectory with per-turn costs and the final outcome.
2. Prefix replay & estimation — for each non-terminal turn k, replay the logged prefix and ask the agent for ŷk, scored against the true remaining cost Rk.

This decouples estimation ability from task completion: if estimation were done inline, it would itself consume tokens and alter the trajectory being measured.
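The two-step protocol can be sketched as a function over a logged trajectory. This is a minimal sketch under assumed interfaces: `estimate` is a hypothetical stand-in for the replayed agent query, and the toy estimator below is purely illustrative.

```python
def rollout_replay(trajectory, estimate):
    """trajectory: list of per-turn costs from one uncapped rollout (step 1).
    estimate: callable(prefix_costs) -> (lo, hi) interval or "impossible".
    For each non-terminal turn k, replays the logged prefix (step 2) and
    pairs the prediction with the true remaining cost R_k."""
    total = sum(trajectory)
    records, spent = [], 0
    for k, cost in enumerate(trajectory[:-1], start=1):  # non-terminal turns only
        spent += cost
        records.append({
            "turn": k,
            "prediction": estimate(trajectory[:k]),
            "true_remaining": total - spent,
        })
    return records

# Toy estimator: guess 2-4 more turns at the mean per-turn cost seen so far.
toy = lambda prefix: (2 * sum(prefix) / len(prefix), 4 * sum(prefix) / len(prefix))
recs = rollout_replay([100, 200, 150, 50], toy)
```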

Metrics

Three Sub-Capability Metrics

Feasibility Prediction

Can the agent tell whether the task will succeed? Scored as Macro-F₁ over feasible / impossible classes, at first turn and all turns.

Early Failure Detection

For tasks that ultimately fail, does the agent raise the alarm early? Scored as Fail-F₁ on the impossible class only.

Interval Calibration

On successful trajectories, how accurate and tight is the predicted interval? Scored as coverage × tightness, plus midpoint relative error.
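One plausible instantiation of the interval metrics, assuming an exponential width penalty for tightness (the exact functional form used in the paper is not reproduced here, but any width penalty in (0, 1] gives the same coverage × tightness shape):

```python
import math

def interval_reward(lo, hi, true_remaining):
    """Coverage x tightness: zero if the interval misses R_k, otherwise
    discounted by how wide the interval is relative to R_k (assumed penalty)."""
    covered = lo <= true_remaining <= hi
    tightness = math.exp(-(hi - lo) / max(true_remaining, 1e-9))
    return float(covered) * tightness

def midpoint_relative_error(lo, hi, true_remaining):
    """|midpoint - R_k| / R_k, the bias metric used alongside coverage."""
    mid = (lo + hi) / 2
    return abs(mid - true_remaining) / max(true_remaining, 1e-9)

# A covering tight interval scores higher than a covering loose one.
print(interval_reward(90, 110, 100))   # ~0.819
print(interval_reward(0, 400, 100))    # ~0.018
print(interval_reward(150, 200, 100))  # 0.0 (missed)
```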

Setup

4 Environments × 5 Frontier Models

Sokoban (internal) — 8×8 grid planning task, 2,500-token cap.

Search-R1 (internal) — multi-hop information retrieval, 3,500-token cap.

SWE-bench (internal) — GitHub issue resolution, 160-turn cap.

Warehouse (external) — supply-chain environment built from real enterprise data, with 3 coupled budgets: cost (USD), time (weeks), occupancy (item-weeks).

Models: GPT-5.2 Instant, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1 Pro, Qwen3-235B.

Key Findings


Task performance vs estimation quality

Left & middle: task success rate correlates only weakly with budget estimation quality for both internal and external budgets. Right: on failed Sokoban trajectories, estimation accuracy improves as more progress is observed — but the largest gains appear only very late.

On Search-R1, Opus achieves the highest task success rate (75.8%) but Sonnet produces better interval estimates (36.5% hit rate vs. 23.1% for Opus).
On SWE-bench, rankings split three ways: Opus leads task success, Gemini leads feasibility F₁, GPT-5.2 leads interval hit rate.
What separates good estimators is calibration, not feasibility. Interval hit rate correlates strongly with midpoint bias (r ≈ −0.67) but only weakly with feasibility F₁ (r ≈ 0.35).

Full Results Table

| Model | Success | F₁@1 | F₁@All | Fail-F₁ | Hit Rate | Reward |
|---|---|---|---|---|---|---|
| **SWE-bench** | | | | | | |
| Claude Opus 4.7 | 71.9% | 41.1% | 51.1% | 48.8% | 30.3% | 0.160 |
| Claude Sonnet 4.6 | 68.8% | 32.2% | 37.7% | 23.4% | 22.3% | 0.130 |
| Gemini 3.1 Pro | 68.8% | 39.2% | 58.2% | 52.0% | 23.2% | 0.112 |
| GPT-5.2 Instant | 57.8% | 43.5% | 40.2% | 21.2% | 44.3% | 0.115 |
| Qwen3-235B | 33.3% | 47.6% | 35.1% | 32.8% | 6.5% | 0.021 |
| **Search-R1** | | | | | | |
| Claude Opus 4.7 | 75.8% | 39.4% | 40.5% | 5.6% | 23.1% | 0.114 |
| Claude Sonnet 4.6 | 71.1% | 37.9% | 33.3% | 0.0% | 36.5% | 0.154 |
| GPT-5.2 Instant | 68.0% | 40.2% | 38.3% | 0.0% | 21.4% | 0.031 |
| Gemini 3.1 Pro | 53.9% | 24.5% | 24.8% | 0.0% | 20.7% | 0.079 |
| Qwen3-235B | 35.2% | 33.2% | 23.9% | 30.9% | 0.0% | 0.000 |
| **Sokoban** | | | | | | |
| Claude Opus 4.7 | 56.2% | 46.3% | 45.6% | 16.0% | 46.4% | 0.112 |
| Claude Sonnet 4.6 | 51.6% | 46.4% | 53.6% | 33.9% | 45.1% | 0.148 |
| Gemini 3.1 Pro | 39.1% | 40.0% | 61.9% | 79.9% | 8.8% | 0.313 |
| GPT-5.2 Instant | 35.2% | 27.7% | 40.6% | 32.8% | 36.0% | 0.167 |
| Qwen3-235B | 7.0% | 6.3% | 12.6% | 20.4% | 10.8% | 0.029 |
| **Warehouse (external budget)** | | | | | | |
| GPT-5.2 Instant | 35.0% | 63.4% | 56.9% | 24.7% | — | 0.577 |
| Claude Opus 4.7 | 33.3% | 63.2% | 55.7% | 35.9% | — | 0.690 |
| Claude Sonnet 4.6 | 33.3% | 64.9% | 59.0% | 17.3% | — | 0.572 |
| Gemini 3.1 Pro | 42.0% | 67.0% | 62.8% | 50.2% | — | 0.698 |
| Qwen3-235B | 41.0% | 60.8% | 56.0% | 17.3% | — | 0.483 |

Cumulative optimism across models


Models generally estimate budget too optimistically throughout the rollout. Conservative bias increases with rollout progress, but remains secondary to optimism overall.

First-turn vs. all-turn feasibility F₁


Points scatter on both sides of the equality line — first-turn predictions do not reliably summarize what the same agent would say after seeing partial progress. On Sokoban, Gemini improves by +21.9 F₁ points; on Search-R1, Qwen3-235B drops 9.3 points.

Optimistic misses dominate at every progress bin for all 20 model-environment pairs — the bias is not an artifact of early turns.
Bias tracks model confidence, not task difficulty. Within an environment, weaker models are more optimistic — consistent with overconfidence under limited self-awareness.
First-turn estimates are unstable: they are neither a consistent over- nor underestimate of the all-turn judgment, and the direction depends on the specific model and task.
Failure recognition timing

Failure is recognized late: across environments, models often label failed trajectories as impossible only after much of the token budget has already been spent.

On failed trajectories, models still predict feasible more than 70% of the time even after 60% of the budget is consumed; the prediction drops sharply only in the final 20%.
This late recognition wastes substantial compute on trajectories that are already doomed — an opportunity for early-stop policies to intervene.

Early Stopping Policy Results

A simple policy: terminate whenever the model predicts impossible. False aborts stop a would-have-succeeded trajectory; false continues miss a chance to save compute.

| Model | False-Abort Rate | Tokens Saved | Stopped Failed Rollouts |
|---|---|---|---|
| GPT-5.2 Instant | 6.6% | 64.1% | 124 / 215 |
| Claude Opus 4.7 | 2.2% | 28.2% | 62 / 169 |
| Claude Sonnet 4.6 | 3.3% | 49.6% | 101 / 183 |
| Gemini 3.1 Pro | 2.8% | 55.7% | 123 / 221 |
| Qwen3-235B | 4.9% | 38.8% | 140 / 306 |

SFT + RL training on Sokoban


Left: coverage vs. midpoint error for SFT checkpoints and their SFT+RL continuations. Right: reward before and after RL from the same SFT starts. RL improves estimation only when starting from a suitable SFT initialization — RL without SFT collapses.

SFT interval-width ablation


SFT choices control the estimator's behavior. Wider interval targets improve coverage but increase midpoint error. Longer SFT makes the model more conservative by reducing feasible predictions and improving recall on impossible cases.

Feasibility is a calibration problem. Base Qwen-7B: 25.5% accuracy. SFT alone raises it to ~90%. The capability was already there — the model just needed the right format.
Interval estimation is a reasoning problem. Base: 10.5% coverage. SFT: 26–53% depending on target width. SFT+RL: 47% with 28% median midpoint error. Nearly half of intervals still miss the true remaining budget.
RL without SFT collapses entirely. The model either labels everything impossible or emits invalid formats — RL cannot recover capability from sparse reward on its own.
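One way to see why RL needs an SFT start is to write out a candidate reward. The sketch below is an assumed design, not the paper's exact reward: it pays for correct feasibility calls and for covering, tight intervals, and gives zero to invalid formats — so a base model emitting mostly invalid or always-impossible outputs sees an almost uniformly sparse signal.

```python
import math

def estimation_reward(prediction, task_succeeded, true_remaining):
    """Hypothetical per-turn reward for RL on budget estimation.
    prediction: "impossible" or an (lo, hi) interval."""
    if prediction == "impossible":
        # Feasibility correctness: right only if the task indeed fails.
        return 1.0 if not task_succeeded else 0.0
    try:
        lo, hi = prediction
    except (TypeError, ValueError):
        return 0.0                       # invalid format -> no reward
    if not task_succeeded or lo > hi:
        return 0.0                       # intervals scored on successes only
    covered = lo <= true_remaining <= hi
    tightness = math.exp(-(hi - lo) / max(true_remaining, 1e-9))  # assumed penalty
    return float(covered) * tightness
```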

Citation

If you find this work useful, please cite:

@misc{lin2026bagen,
  title={BAGEN: Are LLM Agents Budget-Aware?},
  author={Yuxiang Lin and Zihan Wang and Mengyang Liu and Yuxuan Shan and Longju Bai and Junyao Zhang and Xing Jin and Boshan Chen and Jinyan Su and Xingyao Wang and Jiaxin Pei and Manling Li},
  year={2026},
  note={NeurIPS 2026},
}