Results¶
AppWorld Benchmark¶
We evaluated Evolve on AppWorld, a benchmark where agents complete realistic multi-step tasks via APIs; tasks involve 9.5 API calls across 1.8 apps on average. Hard tasks require more complex control flow spanning multiple services.
A ReAct agent received each task instruction together with the top 5 guidelines retrieved from memory built during a single prior run over the train/dev splits, and was evaluated on an unseen partition (test-normal). We report Scenario Goal Completion (SGC), a strict consistency metric that requires success on every variant of a scenario.
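The setup can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the token-overlap scoring, and the example guidelines are all hypothetical stand-ins for the actual retriever and memory store.

```python
# Hypothetical sketch: retrieve the top-k stored guidelines most relevant
# to a task instruction (here scored by simple token overlap, which is an
# assumption, not the paper's retrieval method) and prepend them to the prompt.

def retrieve_guidelines(task, guidelines, k=5):
    """Return the k guidelines sharing the most tokens with the task."""
    task_tokens = set(task.lower().split())
    scored = sorted(
        guidelines,
        key=lambda g: len(task_tokens & set(g.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(task, guidelines):
    """Prepend retrieved guidelines to the task instruction."""
    tips = "\n".join(f"- {g}" for g in guidelines)
    return f"Guidelines from prior runs:\n{tips}\n\nTask: {task}"

# Illustrative memory contents (invented for this sketch).
memory = [
    "Always paginate API responses before filtering.",
    "Verify login state before calling account endpoints.",
    "Prefer batch endpoints when updating many items.",
    "Cache contact lookups to avoid repeated API calls.",
    "Confirm side effects such as payments with a read-back call.",
    "Use ISO dates when an API accepts date strings.",
]

task = "Pay back my roommate via the payment app"
prompt = build_prompt(task, retrieve_guidelines(task, memory))
```

In this sketch the agent's prompt simply grows by five bullet points per task; the real system presumably uses a learned or embedding-based retriever rather than token overlap.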
| Difficulty | Baseline SGC | + Evolve | Gain (pp) |
|---|---|---|---|
| Easy | 79.0% | 84.2% | +5.2 |
| Medium | 56.2% | 62.5% | +6.3 |
| Hard | 19.1% | 33.3% | +14.2 |
| Aggregate | 50.0% | 58.9% | +8.9 |
Key findings¶
- Generalization: The agent improves on unseen test tasks, showing it learns transferable principles rather than memorizing solutions.
- Complexity scaling: The harder the task, the more the agent benefits from learned guidelines. Hard tasks saw a 74% relative increase in success rate.
- Consistency: SGC gains exceeded raw pass-rate improvements, reducing "flaky" behavior across scenario variants. Guidelines help the agent solve tasks reliably, not just occasionally.
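The 74% relative-increase figure for Hard tasks follows directly from the table: it is the absolute gain divided by the baseline rate.

```python
# Relative gain on Hard tasks, from the table above:
# (33.3 - 19.1) / 19.1 ~= 74%.
baseline, improved = 19.1, 33.3
relative_gain = (improved - baseline) / baseline
print(f"{relative_gain:.0%}")  # -> 74%
```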
Paper¶
For full details on the architecture, experiments, and analysis, see:
Trajectory-Informed Memory Generation for Self-Improving Agent Systems (arXiv:2603.10600)