Can one-off task experience become reusable instructions that future agents can follow?
tuozhang@amazon.com, Mi Zhang mizhang.1@osu.edu
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing transfer under context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces.
Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes. However, these gains are unstable under context shift, adversarial shortcuts, and composition. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. More frequent skill updating is also not monotonically beneficial, as additional revisions can improve coverage while introducing episode-specific drift. These findings highlight SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge.
Real-world agents leave behind rich trajectories: tool calls, files, tests, failed hypotheses, observations, and verifier feedback. SkillEvolBench asks whether those noisy one-off episodes can be distilled into compact procedural skills that future agents can load, follow, and transfer.
Unlike static question-answering systems, practical agents interact with external environments over multi-step trajectories by reasoning, calling tools, inspecting files, executing code, editing artifacts, and observing feedback. Each task attempt therefore records how the agent acted, where it recovered, where it failed, and which checks exposed the failure. Prior experience-reuse methods show that these traces contain useful evidence, but reusing an episode is not the same as extracting a procedure. A trajectory records what happened once and often entangles transferable decisions with fixture-specific details, failed hypotheses, and mistakes.
SkillEvolBench targets the missing step between episodic memory and procedural reuse. If curated skills demonstrate that procedural knowledge can help agents, and trajectory-reuse methods demonstrate that episodes contain task-solving evidence, the central question is whether agents can convert noisy one-off experience into explicit skills that state what to do again, when to apply it, and what to verify along the way.
Core question. During acquisition, task attempts are compacted into trajectory summaries and paired with structured verifier feedback. A host-side Skill Author writes, refines, or skips a skill update; the resulting library is frozen before harder deployment tasks.
SkillEvolBench contains 180 tasks across six real-world agent environments. Each environment includes five recurring procedural families, and each family follows a six-role progression from learning episodes to frozen deployment.
The benchmark spans common forms of agent work: code modification, API orchestration, data processing, document transformation, research synthesis, and communication operations. A task family corresponds to a recurring procedural capability rather than a topic label, so families are related enough for experience to matter but different enough to separate procedural learning from memorization.
Each family is instantiated through six roles. The canonical task presents the base procedure, the enriched task exposes a missing sub-capability, and the variant task changes the surface form while preserving the same procedure. The context-shift task embeds the skill need in a broader request, the adversarial task introduces shortcut solutions that can pass shallow checks, and the composition task requires the target skill to interact with other skills. This progression tests transfer, implicit invocation, shortcut resistance, and modular composition within the same family.
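The task layout described above can be sketched as a small data model. This is only an illustration: the environment labels are shorthand for the six workflow types named in the text, and the family names are placeholders rather than the benchmark's actual family identifiers.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Role(Enum):
    CANONICAL = "canonical"
    ENRICHED = "enriched"
    VARIANT = "variant"
    CONTEXT_SHIFT = "context_shift"
    ADVERSARIAL = "adversarial"
    COMPOSITION = "composition"


# The first three roles are used for acquisition, the last three for frozen deployment.
ACQUISITION_ROLES = (Role.CANONICAL, Role.ENRICHED, Role.VARIANT)
DEPLOYMENT_ROLES = (Role.CONTEXT_SHIFT, Role.ADVERSARIAL, Role.COMPOSITION)

# Shorthand labels (assumed) for the six environments.
ENVIRONMENTS = ("code", "tool_api", "data", "document", "research", "communication")
FAMILIES_PER_ENVIRONMENT = 5


@dataclass(frozen=True)
class Task:
    environment: str
    family: str  # placeholder name, not a real family identifier
    role: Role

    @property
    def phase(self) -> str:
        return "acquisition" if self.role in ACQUISITION_ROLES else "deployment"


def build_task_grid() -> list:
    """Enumerate one task per (environment, family, role) combination."""
    family_ids = range(1, FAMILIES_PER_ENVIRONMENT + 1)
    return [
        Task(env, f"{env}_family_{i}", role)
        for env, i, role in product(ENVIRONMENTS, family_ids, Role)
    ]


tasks = build_task_grid()
print(len(tasks))  # 6 environments * 5 families * 6 roles = 180
```

Half of the grid (90 tasks) falls in the acquisition roles and half in the deployment roles, which is why replay reruns exactly the acquisition half.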
180 tasks: role-conditioned tasks spanning acquisition, replay, and frozen deployment.
6 environments: code, tool/API, data, document, research, and communication workflows.
30 task families: recurring procedural capabilities that vary surface form and failure mode.
Taxonomy. The benchmark organizes agent work into controlled skill-evolution arcs. The first three roles expose and stress-test the target procedure; the last three evaluate transfer, shortcut resistance, and composition.
Each family is defined by a reusable capability such as diagnosis, validation, extraction, reconciliation, or coordination.
Canonical, enriched, and variant tasks reveal what an initial procedure covers and where it needs revision.
Context shift, adversarial shortcuts, and composition test whether the evolved skill survives realistic pressure.
Each environment is evaluated as an independent lifelong episode: the agent learns within each family, freezes the library for deployment, replays acquisition tasks with the final library, and then resets before switching environments.
Acquisition uses the canonical, enriched, and variant roles. All acquisition tasks in an environment are completed before deployment begins. The active library is scoped to the environment: skills learned from earlier families may be visible to later families in the same environment, but skills never transfer across environments. Each acquisition attempt yields a compacted trajectory summary from harness-recorded artifacts such as instructions, file accesses, tool calls, commands, edits, generated outputs, tests, and final responses; verifier feedback includes outcome results, process checks, rewards, and diagnostics.
Skill authoring is family-local. Although the task-solving agent may read the environment-level library, the Skill Author receives only same-family skills and same-family acquisition history. After acquisition, the library is frozen: deployment tasks may read and apply accumulated skills but may not create, revise, retire, or otherwise modify them. Replay then reruns the original acquisition tasks using the final frozen library, providing a within-environment counterfactual for whether accumulated knowledge helps the tasks it was built from.
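The per-environment protocol above can be sketched as a loop over acquisition, freeze, deployment, and replay. This is a minimal sketch under stated assumptions, not the benchmark's implementation: `SkillLibrary`, `run_environment`, and the `solve` and `author` callables are hypothetical names, and the stub functions at the bottom exist only to make the example runnable.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Environment-scoped skill store that rejects writes once frozen."""

    skills: dict = field(default_factory=dict)  # family -> {skill name -> skill text}
    frozen: bool = False

    def write(self, family, name, text):
        if self.frozen:
            raise RuntimeError("library is frozen: deployment and replay are read-only")
        self.skills.setdefault(family, {})[name] = text

    def read_environment(self):
        """Task-solving agent's view: every skill learned so far in this environment."""
        return {
            f"{fam}/{name}": text
            for fam, entries in self.skills.items()
            for name, text in entries.items()
        }

    def read_family(self, family):
        """Skill Author's view: same-family skills only."""
        return dict(self.skills.get(family, {}))


def run_environment(families, solve, author):
    """families maps a family name to {"acquisition": [...], "deployment": [...]}."""
    library = SkillLibrary()  # fresh library: nothing carries over between environments
    history = {family: [] for family in families}
    results = {"acquisition": [], "deployment": [], "replay": []}

    # 1. Acquisition: all families finish before any deployment task runs.
    for family, tasks in families.items():
        for task in tasks["acquisition"]:
            outcome = solve(task, library.read_environment())
            results["acquisition"].append(outcome)
            history[family].append(outcome)
            update = author(family, library.read_family(family), history[family])
            if update is not None:  # None means the author skipped the update
                name, text = update
                library.write(family, name, text)

    # 2. Freeze: no creation, revision, or retirement of skills from here on.
    library.frozen = True

    # 3. Deployment, then 4. replay of the original acquisition tasks.
    for phase, split in (("deployment", "deployment"), ("replay", "acquisition")):
        for tasks in families.values():
            for task in tasks[split]:
                results[phase].append(solve(task, library.read_environment()))
    return results, library


# Stub agent and author, for illustration only.
def demo_solve(task, skills):
    return {"task": task, "passed": bool(skills)}  # passes once any skill is visible


def demo_author(family, family_skills, family_history):
    if family_history[-1]["passed"]:
        return None  # skip the update after a successful attempt
    return ("procedure", f"notes for {family}")


families = {
    "family_a": {
        "acquisition": ["canonical", "enriched", "variant"],
        "deployment": ["context_shift", "adversarial", "composition"],
    }
}
results, library = run_environment(families, demo_solve, demo_author)
print([r["passed"] for r in results["acquisition"]])
print([r["passed"] for r in results["replay"]])
```

With these stubs, the canonical task fails during acquisition but passes on replay, which is the kind of within-environment counterfactual replay is meant to expose.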
Protocol. Skill evolution is evaluated under self-generated and curated-start settings against no-skill and raw-trajectory controls. This isolates procedural abstraction from base capability, curated prior knowledge, and direct episodic reuse.
Canonical, enriched, and variant tasks produce compacted execution artifacts and verifier feedback for possible skill updates.
A separated Skill Author updates an environment-scoped library, making the learned procedure inspectable and reusable.
Context-shift, adversarial, and composition tasks test transfer without further updates; replay isolates local recovery.
We evaluate ten model configurations across Claude Code, Codex CLI, and Gemini CLI. The experiments compare skill evolution against no-skill and raw-trajectory controls to separate reusable procedural abstraction from local episode reuse.
The experimental suite uses three agent harnesses under the same benchmark protocol: Claude Code, Codex CLI, and Gemini CLI. Claude Code is evaluated with Opus 4.6, Opus 4.5, Sonnet 4.6, and Sonnet 4.5; Codex CLI is evaluated with GPT-5.4, GPT-5.3-Codex, and GPT-5.2-Codex; Gemini CLI is evaluated with Gemini 3.1 Pro, Gemini 3 Flash, and Gemini 2.5 Pro. Success is measured at acquisition, replay, and frozen deployment, with deployment further decomposed into context shift, adversarial robustness, and composition.
The variants below distinguish several possible sources of improvement. NO-SKILL measures base agent capability without persistent memory. RAW-TRAJECTORY retrieves compacted acquisition traces without inducing skills, testing direct episodic reuse. Curated-start variants measure whether a useful but incomplete human-written skill can be refined from experience. Self-generated variants measure whether agents can induce skill artifacts from their own execution evidence and verifier feedback.
No persistent memory or skill library is provided.
Retrieves compacted same-family acquisition trajectories without inducing procedural skills.
Starts from a fixed gap-exposed curated skill and never revises it.
Starts from the curated seed and revises it after failed eligible acquisition attempts.
Starts from the curated seed and invokes skill revision after every acquisition attempt.
Generates a metadata-only skill before execution; the skill remains fixed.
Induces a skill from canonical trajectory evidence, then revises after failed later acquisition attempts.
Updates the self-generated skill after every acquisition attempt.
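The eight conditions above differ along two axes: what memory the agent starts with, and when skill revision is invoked. A minimal sketch of that grid follows. The enum and field names are hypothetical labels, not the benchmark's official condition identifiers, and the eligibility rule for failed attempts is not specified in the text, so it is reduced here to a simple pass/fail flag.

```python
from dataclasses import dataclass
from enum import Enum


class Memory(Enum):
    NONE = "none"                          # no persistent memory or skill library
    RAW_TRAJECTORIES = "raw_trajectories"  # compacted same-family traces, no skills
    CURATED_SEED = "curated_seed"          # fixed gap-exposed human-written skill
    SELF_GENERATED = "self_generated"      # skill written by the agent pipeline


class UpdatePolicy(Enum):
    NEVER = "never"            # skill, if any, stays fixed
    ON_FAILURE = "on_failure"  # revise only after a failed acquisition attempt
    ALWAYS = "always"          # invoke revision after every acquisition attempt


@dataclass(frozen=True)
class Condition:
    memory: Memory
    policy: UpdatePolicy


CONDITIONS = [
    Condition(Memory.NONE, UpdatePolicy.NEVER),
    Condition(Memory.RAW_TRAJECTORIES, UpdatePolicy.NEVER),
    Condition(Memory.CURATED_SEED, UpdatePolicy.NEVER),
    Condition(Memory.CURATED_SEED, UpdatePolicy.ON_FAILURE),
    Condition(Memory.CURATED_SEED, UpdatePolicy.ALWAYS),
    Condition(Memory.SELF_GENERATED, UpdatePolicy.NEVER),  # zero-shot skill, written before execution
    Condition(Memory.SELF_GENERATED, UpdatePolicy.ON_FAILURE),
    Condition(Memory.SELF_GENERATED, UpdatePolicy.ALWAYS),
]


def should_invoke_author(condition, attempt_passed, skill_exists):
    """Decide whether to call the Skill Author after one acquisition attempt."""
    if condition.memory in (Memory.NONE, Memory.RAW_TRAJECTORIES):
        return False  # control conditions never induce skills
    if condition.policy is UpdatePolicy.ALWAYS:
        return True
    if condition.policy is UpdatePolicy.ON_FAILURE:
        # Assumption: with no skill yet (self-generated case), the first attempt
        # induces one; afterwards only failed attempts trigger revision.
        return (not skill_exists) or (not attempt_passed)
    return False


curated_revision = Condition(Memory.CURATED_SEED, UpdatePolicy.ON_FAILURE)
print(should_invoke_author(curated_revision, attempt_passed=False, skill_exists=True))
```

Separating the memory source from the update policy keeps the comparison between targeted revision and always-on revision a one-field change.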
Fixed curated seeds and revision variants compared across acquisition, replay, and frozen deployment metrics.
| Rank | Agent Harness | Model | Condition | Overall | LSR | RSR | ESR | CSSR | ARSR | CompSR |
|---|---|---|---|---|---|---|---|---|---|---|
Zero-shot, experience-induced, and always-updated skills compared under the same six reported metrics.
| Rank | Agent Harness | Model | Condition | Overall | LSR | RSR | ESR | CSSR | ARSR | CompSR |
|---|---|---|---|---|---|---|---|---|---|---|
Overall is the arithmetic mean of the reported percentage metrics. Rows with unavailable RSR values keep RSR blank and average the remaining reported metrics.
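The Overall rule stated above can be written as a short function. This sketch assumes Overall averages the six metric columns of the leaderboard and that an unavailable RSR is represented as `None`; the function name and the example values are illustrative, not results from the benchmark.

```python
METRICS = ("LSR", "RSR", "ESR", "CSSR", "ARSR", "CompSR")


def overall(row):
    """Arithmetic mean of the reported percentage metrics, skipping blanks."""
    reported = [row[m] for m in METRICS if row.get(m) is not None]
    if not reported:
        raise ValueError("row has no reported metrics")
    return sum(reported) / len(reported)


# Illustrative values only.
full_row = {"LSR": 60.0, "RSR": 70.0, "ESR": 50.0, "CSSR": 40.0, "ARSR": 30.0, "CompSR": 20.0}
blank_rsr_row = dict(full_row, RSR=None)

print(overall(full_row))       # 270 / 6 = 45.0
print(overall(blank_rsr_row))  # 200 / 5 = 40.0
```

Because a blank RSR shrinks the denominator rather than counting as zero, rows with and without RSR are averaged over different metric sets and their Overall values are not strictly comparable.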
Skill-based conditions relative to RAW-TRAJECTORY. Cells report percentage-point differences from raw episodic reuse across LSR, RSR, ESR, CSSR, ARSR, and CompSR. Predominantly negative cells indicate that distilled skills often lose information that raw trajectories still preserve.
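Each heatmap cell described above is a per-metric difference in percentage points. A minimal sketch, assuming each condition is a mapping from metric name to a percentage and that missing values stay blank; `pp_delta` and the example rows are hypothetical and do not reproduce reported results.

```python
METRICS = ("LSR", "RSR", "ESR", "CSSR", "ARSR", "CompSR")


def pp_delta(condition_row, raw_row):
    """Percentage-point difference of a skill condition from the raw-trajectory row."""
    deltas = {}
    for metric in METRICS:
        value, baseline = condition_row.get(metric), raw_row.get(metric)
        if value is None or baseline is None:
            deltas[metric] = None  # keep the cell blank when either side is unavailable
        else:
            deltas[metric] = round(value - baseline, 1)
    return deltas


# Illustrative values only.
skill_row = {"LSR": 55.5, "RSR": None, "ESR": 40.0, "CSSR": 35.0, "ARSR": 30.0, "CompSR": 20.0}
raw_row = {"LSR": 50.0, "RSR": 60.0, "ESR": 45.0, "CSSR": 35.0, "ARSR": 40.0, "CompSR": 25.0}

deltas = pp_delta(skill_row, raw_row)
print(deltas)
```

A negative cell means the distilled skill scored below raw episodic reuse on that metric.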
Skill-based conditions sometimes improve LSR or RSR without improving frozen deployment. For example, SELFGEN-REVISION with Claude Opus 4.6 improves LSR by 5.5 pp and RSR by 10.0 pp over NO-SKILL, but decreases ESR, CSSR, and CompSR. These gains look like local recovery rather than robust skill formation.
ESR, CSSR, ARSR, and CompSR expose different bottlenecks. GPT-5.4 under CURATED-REVISION-ALWAYS improves ESR, ARSR, and CompSR but drops CSSR, suggesting missed invocation under broader context. Gemini 3 Flash gains on CSSR and CompSR but drops sharply on ARSR, indicating shortcut vulnerability.
The RAW-TRAJECTORY comparison is the clearest stress test for skill abstraction. If distilled skills preserved reusable procedures, they should match or exceed raw episodic traces. Instead, the heatmap is largely negative; GPT-5.4 under SELFGEN-REVISION improves over RAW-TRAJECTORY on LSR and RSR but falls on ESR, CSSR, ARSR, and CompSR.
The ALWAYS policy reveals a coverage-drift trade-off. More frequent updates can improve deployment coverage, as with GPT-5.2-Codex and Gemini 2.5 Pro on some deployment metrics, but they can also reduce local gains or introduce episode-specific clutter. In curated-start settings, ALWAYS does not consistently dominate targeted revision.
@article{skillevolbench2026,
title = {SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills},
author = {Author names to be added},
journal = {arXiv preprint},
year = {2026}
}