Results

AppWorld Benchmark

We evaluated Evolve on AppWorld, a benchmark in which agents complete realistic multi-step tasks via APIs. Tasks use an average of 9.5 APIs across 1.8 apps, and hard tasks require more complex control flow across multiple services.

A ReAct agent received the task instruction plus the top 5 retrieved guidelines, which were generated from a single prior run on the train/dev partitions. It was then tested on an unseen partition (test-normal). We report Scenario Goal Completion (SGC), a strict consistency metric that credits a scenario only when the agent succeeds on every variant of it.
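To make the strictness of SGC concrete, here is a minimal sketch of how such a metric could be computed next to a plain per-task pass rate. It assumes a scenario counts as complete only if every one of its task variants passes. The function names, the `(scenario_id, passed)` input format, and the sample data are hypothetical illustrations, not the AppWorld evaluator's actual interface.

```python
from collections import defaultdict


def scenario_goal_completion(results):
    """Percentage of scenarios whose variants ALL passed.

    `results` is an iterable of (scenario_id, passed) pairs, one per task
    variant. This input format is an assumption made for illustration.
    """
    by_scenario = defaultdict(list)
    for scenario_id, passed in results:
        by_scenario[scenario_id].append(passed)
    if not by_scenario:
        return 0.0
    completed = sum(all(variants) for variants in by_scenario.values())
    return 100.0 * completed / len(by_scenario)


def task_pass_rate(results):
    """Percentage of individual task variants that passed."""
    results = list(results)
    if not results:
        return 0.0
    return 100.0 * sum(passed for _, passed in results) / len(results)


# Hypothetical data: two scenarios with three variants each.
example = [
    ("s1", True), ("s1", True), ("s1", True),
    ("s2", True), ("s2", False), ("s2", True),
]

print(f"Task pass rate: {task_pass_rate(example):.1f}%")
print(f"SGC: {scenario_goal_completion(example):.1f}%")
```

In this toy input, one failed variant in `s2` costs the whole scenario, so the scenario-level score is lower than the per-task pass rate. This is why SGC rewards consistent behavior rather than occasional success.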

Difficulty	Baseline SGC	+ Evolve SGC	Gain (pp)
Easy	79.0%	84.2%	+5.2
Medium	56.2%	62.5%	+6.3
Hard	19.1%	33.3%	+14.2
Aggregate	50.0%	58.9%	+8.9

Key findings

Generalization: The agent improves on unseen test tasks, showing it learns transferable principles rather than memorizing solutions.
Complexity scaling: The harder the task, the more the agent benefits from learned guidelines. Hard tasks saw a 74% relative increase in SGC (19.1% to 33.3%), compared with about 7% on easy tasks.
Consistency: SGC gains exceeded raw pass-rate improvements, reducing "flaky" behavior across scenario variants. Guidelines help the agent solve tasks reliably, not just occasionally.
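The absolute and relative gains quoted above can be checked directly from the results table. The short sketch below recomputes them. The dictionary layout and the `gains` helper are illustrative names, and the only inputs are the SGC values reported in the table.

```python
# SGC values (percent) copied from the results table: (baseline, with Evolve).
sgc = {
    "Easy": (79.0, 84.2),
    "Medium": (56.2, 62.5),
    "Hard": (19.1, 33.3),
    "Aggregate": (50.0, 58.9),
}


def gains(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


for level, (baseline, improved) in sgc.items():
    absolute, relative = gains(baseline, improved)
    print(f"{level}: +{absolute} pp absolute, +{relative}% relative")
```

For hard tasks this gives a gain of 14.2 percentage points on a 19.1% baseline, which is the roughly 74% relative increase cited above. The same 14.2 points on the easy-task baseline would be a much smaller relative change, which is why the two views are worth reporting separately.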

Paper

For full details on the architecture, experiments, and analysis, see:

Trajectory-Informed Memory Generation for Self-Improving Agent Systems (arXiv:2603.10600)