SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Can one-off task experience become reusable instructions that future agents can follow?

Yingtie Lei1,*, Zhongwei Wan1,*, Jiankun Zhang2, Samiul Alam1, Zixuan Zhong3, Peizhou Huang4, Xin Wang1, Jingxuan Zhang1, Donghao Zhou5, Yunta Hsieh4, Zhihao Dou6, Hui Shen4, Yan Xu7, Dimitrios Dimitriadis7, Tuo Zhang7, Mi Zhang1
1 The Ohio State University, 2 The University of Chicago, 3 University College London, 4 University of Michigan, 5 The Chinese University of Hong Kong, 6 Case Western Reserve University, 7 Amazon
* Equal contribution
SkillEvolBench Team
Correspondence: Tuo Zhang (tuozhang@amazon.com), Mi Zhang (mizhang.1@osu.edu)

Abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing transfer under context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces.

Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes. However, these gains are unstable under context shift, adversarial shortcuts, and composition. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. More frequent skill updating is also not monotonically beneficial, as additional revisions can improve coverage while introducing episode-specific drift. These findings highlight SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge.

From Episodes to Skills

Real-world agents leave behind rich trajectories: tool calls, files, tests, failed hypotheses, observations, and verifier feedback. SkillEvolBench asks whether those noisy one-off episodes can be distilled into compact procedural skills that future agents can load, follow, and transfer.

Unlike static question-answering systems, practical agents interact with external environments over multi-step trajectories by reasoning, calling tools, inspecting files, executing code, editing artifacts, and observing feedback. Each task attempt therefore records how the agent acted, where it recovered, where it failed, and which checks exposed the failure. Prior experience-reuse methods show that these traces contain useful evidence, but reusing an episode is not the same as extracting a procedure. A trajectory records what happened once and often entangles transferable decisions with fixture-specific details, failed hypotheses, and mistakes.

SkillEvolBench targets the missing step between episodic memory and procedural reuse. If curated skills demonstrate that procedural knowledge can help agents, and trajectory-reuse methods demonstrate that episodes contain task-solving evidence, the central question is whether agents can convert noisy one-off experience into explicit skills that state what to do again, when to apply it, and what to verify along the way.

[Figure: SkillEvolBench teaser showing episodic attempts, verifier feedback, skill authoring, and a procedural skill library]

Core question. During acquisition, task attempts are compacted into trajectory summaries and paired with structured verifier feedback. A host-side Skill Author writes, refines, or skips a skill update; the resulting library is frozen before harder deployment tasks.

  • Episodes are rich but noisy. Agent trajectories contain useful tool calls, observations, tests, and mistakes, but they also mix transferable decisions with incidental details and failed hypotheses.
  • Experience reuse is not skill formation. Replaying an old trajectory can help with nearby tasks, yet future tasks require explicit guidance about what to do again, when to apply it, and what to verify.
  • Verifier feedback supplies abstraction evidence. Outcome checks, process violations, hidden tests, and rubric diagnoses expose which parts of an episode should become reusable procedure.
  • Frozen deployment separates real transfer from repair. By freezing the skill library before harder related tasks, SkillEvolBench measures whether procedural knowledge was formed before deployment.

Benchmark Design

SkillEvolBench contains 180 tasks across six real-world agent environments. Each environment includes five recurring procedural families, and each family follows a six-role progression from learning episodes to frozen deployment.

The benchmark spans common forms of agent work: code modification, API orchestration, data processing, document transformation, research synthesis, and communication operations. A task family corresponds to a recurring procedural capability rather than a topic label, so families are related enough for experience to matter but different enough to separate procedural learning from memorization.

Each family is instantiated through six roles. The canonical task presents the base procedure, the enriched task exposes a missing sub-capability, and the variant task changes the surface form while preserving the same procedure. The context-shift task embeds the skill need in a broader request, the adversarial task introduces shortcut solutions that can pass shallow checks, and the composition task requires the target skill to interact with other skills. This progression tests transfer, implicit invocation, shortcut resistance, and modular composition within the same family.

  • 180 tasks. Role-conditioned tasks spanning acquisition, replay, and frozen deployment.
  • 6 environments. Code, tool/API, data, document, research, and communication workflows.
  • 30 skill families. Recurring procedural capabilities that vary surface form and failure mode.

[Figure: SkillEvolBench taxonomy across environments, skill families, and task roles]

Taxonomy. The benchmark organizes agent work into controlled skill-evolution arcs. The first three roles expose and stress-test the target procedure; the last three evaluate transfer, shortcut resistance, and composition.

Learning / Acquisition Tasks. The agent observes episodes and may update the family skill.

Canonical
Enriched
Variant

Frozen Deployment Tasks. The evolved library is frozen; tasks test transfer, robustness, and composition.

Context Shift
Adversarial
Composition

Procedural families, not topic labels

Each family is defined by a reusable capability such as diagnosis, validation, extraction, reconciliation, or coordination.

Learning roles expose gaps

Canonical, enriched, and variant tasks reveal what an initial procedure covers and where it needs revision.

Deployment roles stress generality

Context shift, adversarial shortcuts, and composition test whether the evolved skill survives realistic pressure.
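
The layout described in this section can be sketched as a small enumeration. This is an illustrative sketch only: the environment labels paraphrase the six workflow types named above, and the family identifiers (`family_1` ... `family_5`) are hypothetical placeholders, since the actual family names are not listed here.

```python
from itertools import product

# Labels paraphrase the six workflow types named in the text.
ENVIRONMENTS = ["code", "tool_api", "data", "document", "research", "communication"]
FAMILIES_PER_ENV = 5  # family names are not given, so placeholder indices are used

ACQUISITION_ROLES = ["canonical", "enriched", "variant"]
DEPLOYMENT_ROLES = ["context_shift", "adversarial", "composition"]
ROLES = ACQUISITION_ROLES + DEPLOYMENT_ROLES


def build_task_grid():
    """Enumerate one task per (environment, family, role) triple."""
    grid = []
    for env, idx, role in product(ENVIRONMENTS, range(1, FAMILIES_PER_ENV + 1), ROLES):
        grid.append({
            "env": env,
            "family": f"{env}/family_{idx}",  # hypothetical identifier
            "role": role,
            "phase": "acquisition" if role in ACQUISITION_ROLES else "deployment",
        })
    return grid


tasks = build_task_grid()
families = {t["family"] for t in tasks}
print(len(tasks), len(families))  # 6 * 5 * 6 = 180 tasks, 6 * 5 = 30 families
```

Half of the grid (the canonical, enriched, and variant roles) belongs to acquisition and the other half to frozen deployment, which is what allows each family to be scored both on learning and on transfer.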

Skill Evolution Protocol

Each environment is evaluated as an independent lifelong episode: the agent learns within each family, freezes the library for deployment, replays acquisition tasks with the final library, and then resets before switching environments.

Acquisition uses the canonical, enriched, and variant roles. All acquisition tasks in an environment are completed before deployment begins. The active library is scoped to the environment: skills learned from earlier families may be visible to later families in the same environment, but skills never transfer across environments. Each acquisition attempt yields a compacted trajectory summary from harness-recorded artifacts such as instructions, file accesses, tool calls, commands, edits, generated outputs, tests, and final responses; verifier feedback includes outcome results, process checks, rewards, and diagnostics.

Skill authoring is family-local. Although the task-solving agent may read the environment-level library, the Skill Author receives only same-family skills and same-family acquisition history. After acquisition, the library is frozen: deployment tasks may read and apply accumulated skills but may not create, revise, retire, or otherwise modify them. Replay then reruns the original acquisition tasks using the final frozen library, providing a within-environment counterfactual for whether accumulated knowledge helps the tasks it was built from.
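
A minimal sketch of this protocol for one environment follows. It is not the benchmark's implementation: `SkillLibrary`, `run_environment`, `stub_solve`, and `stub_author` are hypothetical names, and the stubs replace the real agent harness and Skill Author with toy logic so that the control flow (acquisition, freeze, deployment, replay) and the two visibility scopes can be seen in isolation.

```python
from dataclasses import dataclass, field

ACQUISITION_ROLES = ["canonical", "enriched", "variant"]
DEPLOYMENT_ROLES = ["context_shift", "adversarial", "composition"]


@dataclass
class SkillLibrary:
    """Environment-scoped skill store, keyed by family (hypothetical structure)."""
    skills: dict = field(default_factory=dict)
    frozen: bool = False

    def solver_view(self):
        # The task-solving agent may read every skill in the environment.
        return dict(self.skills)

    def author_view(self, family):
        # The Skill Author sees only same-family skills.
        return {f: s for f, s in self.skills.items() if f == family}

    def update(self, family, skill):
        if self.frozen:
            raise RuntimeError("library is frozen; deployment may not modify skills")
        self.skills[family] = skill


def run_environment(families, solve, author):
    """Acquisition for all families, then freeze, deployment, and replay."""
    library = SkillLibrary()  # fresh per environment: skills never cross environments
    history = {f: [] for f in families}
    results = {"acquisition": [], "deployment": [], "replay": []}

    for family in families:
        for role in ACQUISITION_ROLES:
            outcome = solve(family, role, library.solver_view())
            results["acquisition"].append(outcome)
            history[family].append(outcome)
            new_skill = author(family, library.author_view(family), history[family])
            if new_skill is not None:  # the author may skip an update
                library.update(family, new_skill)

    library.frozen = True
    for family in families:
        for role in DEPLOYMENT_ROLES:
            results["deployment"].append(solve(family, role, library.solver_view()))
    for family in families:
        for role in ACQUISITION_ROLES:  # replay with the final frozen library
            results["replay"].append(solve(family, role, library.solver_view()))
    return library, results


def stub_solve(family, role, skills):
    # Toy rule: an attempt succeeds only if a same-family skill already exists.
    return {"family": family, "role": role,
            "n_skills_visible": len(skills), "success": family in skills}


def stub_author(family, same_family_skills, family_history):
    # Toy rule: write a skill only after a failed attempt.
    if family_history[-1]["success"]:
        return None
    return f"procedure for {family} (v{len(family_history)})"


library, results = run_environment(["a", "b"], stub_solve, stub_author)
print({phase: len(r) for phase, r in results.items()})
```

In this toy run the second family's canonical attempt can already see the first family's skill through `solver_view`, while `author_view` for that family is still empty, mirroring the distinction between environment-level reading and family-local authoring.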

[Figure: Skill evolution protocol showing initialization, acquisition, frozen evaluation, replay, and reset]

Protocol. Skill evolution is evaluated under self-generated and curated-start settings against no-skill and raw-trajectory controls. This isolates procedural abstraction from base capability, curated prior knowledge, and direct episodic reuse.

Acquisition creates evidence

Canonical, enriched, and variant tasks produce compacted execution artifacts and verifier feedback for possible skill updates.

Skill updates are externalized

A separate Skill Author updates an environment-scoped library, making the learned procedure inspectable and reusable.

Deployment is frozen

Context-shift, adversarial, and composition tasks test transfer without further updates; replay isolates local recovery.

Experimental Variants and Findings

We evaluate ten model configurations across Claude Code, Codex CLI, and Gemini CLI. The experiments compare skill evolution against no-skill and raw-trajectory controls to separate reusable procedural abstraction from local episode reuse.

The experimental suite uses three agent harnesses under the same benchmark protocol: Claude Code, Codex CLI, and Gemini CLI. Claude Code is evaluated with Opus 4.6, Opus 4.5, Sonnet 4.6, and Sonnet 4.5; Codex CLI is evaluated with GPT-5.4, GPT-5.3-Codex, and GPT-5.2-Codex; Gemini CLI is evaluated with Gemini 3.1 Pro, Gemini 3 Flash, and Gemini 2.5 Pro. Success is measured at acquisition, replay, and frozen deployment, with deployment further decomposed into context shift, adversarial robustness, and composition.

The variants below distinguish several possible sources of improvement. NO-SKILL measures base agent capability without persistent memory. RAW-TRAJECTORY retrieves compacted acquisition traces without inducing skills, testing direct episodic reuse. Curated-start variants measure whether a useful but incomplete human-written skill can be refined from experience. Self-generated variants measure whether agents can induce skill artifacts from their own execution evidence and verifier feedback.

NO-SKILL

No persistent memory or skill library is provided.

RAW-TRAJECTORY

Retrieves compacted same-family acquisition trajectories without inducing procedural skills.

CURATED-STATIC

Starts from a fixed gap-exposed curated skill and never revises it.

CURATED-REVISION

Starts from the curated seed and revises it after failed eligible acquisition attempts.

CURATED-REVISION-ALWAYS

Starts from the curated seed and invokes skill revision after every acquisition attempt.

SELFGEN-ZERO-SHOT

Generates a metadata-only skill before execution; the skill remains fixed.

SELFGEN-REVISION

Induces a skill from canonical trajectory evidence, then revises after failed later acquisition attempts.

SELFGEN-ALWAYS

Updates the self-generated skill after every acquisition attempt.
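
The eight variants above differ mainly in what memory the agent receives, where the initial skill comes from, and when the Skill Author is invoked. A hedged way to tabulate them is sketched below; the field names and values (`memory`, `seed`, `update`) are hypothetical encodings of the descriptions above, the seed for SELFGEN-ALWAYS is an assumption because the text does not state it, and the "eligible" qualifier on revision attempts is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    memory: str  # "none", "trajectories", or "skills"
    seed: str    # "none", "curated", "zero_shot", or "experience"
    update: str  # "never", "on_failure", or "always"


VARIANTS = [
    Variant("NO-SKILL", "none", "none", "never"),
    Variant("RAW-TRAJECTORY", "trajectories", "none", "never"),
    Variant("CURATED-STATIC", "skills", "curated", "never"),
    Variant("CURATED-REVISION", "skills", "curated", "on_failure"),
    Variant("CURATED-REVISION-ALWAYS", "skills", "curated", "always"),
    Variant("SELFGEN-ZERO-SHOT", "skills", "zero_shot", "never"),
    Variant("SELFGEN-REVISION", "skills", "experience", "on_failure"),
    # Assumption: the seed source for SELFGEN-ALWAYS is not specified in the text.
    Variant("SELFGEN-ALWAYS", "skills", "experience", "always"),
]


def should_invoke_author(variant, attempt_succeeded):
    """Decide whether the Skill Author runs after an acquisition attempt."""
    if variant.update == "always":
        return True
    if variant.update == "on_failure":
        return not attempt_succeeded
    return False


for v in VARIANTS:
    print(f"{v.name:24s} after success: {should_invoke_author(v, True)!s:5s} "
          f"after failure: {should_invoke_author(v, False)}")
```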

Curated-start Leaderboard

Fixed curated seeds and revision variants compared across acquisition, replay, and frozen deployment metrics.

Overall: all metrics are reported as percentages and higher is better.
Overall = mean(LSR, RSR, ESR, CSSR, ARSR, CompSR)
Rows without an available RSR value average the remaining reported metrics.
Rank Agent Harness Model Condition Overall LSR RSR ESR CSSR ARSR CompSR

Self-generated Leaderboard

Zero-shot, experience-induced, and always-updated skills compared under the same six reported metrics.

Rank Agent Harness Model Condition Overall LSR RSR ESR CSSR ARSR CompSR

Overall is the arithmetic mean of the reported percentage metrics. Rows with unavailable RSR values keep RSR blank and average the remaining reported metrics.
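
As a small worked example, the Overall rule can be written as follows. The function name `overall` and the numbers in the two rows are illustrative only and are not benchmark results.

```python
from statistics import mean

METRICS = ("LSR", "RSR", "ESR", "CSSR", "ARSR", "CompSR")


def overall(row):
    """Arithmetic mean of the reported metrics; RSR is skipped when unavailable."""
    values = [row[m] for m in METRICS if m != "RSR"]
    if row.get("RSR") is not None:
        values.append(row["RSR"])
    return mean(values)


# Made-up rows for illustration, not leaderboard entries.
full_row = {"LSR": 60.0, "RSR": 70.0, "ESR": 50.0, "CSSR": 40.0, "ARSR": 30.0, "CompSR": 50.0}
no_rsr_row = {"LSR": 60.0, "RSR": None, "ESR": 50.0, "CSSR": 40.0, "ARSR": 30.0, "CompSR": 50.0}

print(overall(full_row))    # mean of six values: 50.0
print(overall(no_rsr_row))  # mean of the five remaining values: 46.0
```

Because the denominator changes when RSR is missing, Overall values for rows with and without RSR are not strictly comparable, which is worth keeping in mind when reading the rankings.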

  • LSR: Learning Success Rate on acquisition tasks before the library is frozen.
  • RSR: Replay Success Rate on the original acquisition tasks using the final frozen library.
  • ESR: Deployment Success Rate aggregated over all frozen deployment tasks.
  • CSSR: Context-Shift Success Rate, measuring whether the skill is invoked when the need is embedded in a broader request.
  • ARSR: Adversarial Robustness Success Rate, measuring resistance to shortcut solutions that can pass shallow checks.
  • CompSR: Composition Success Rate, measuring whether the target skill combines with other skills in realistic workflows.
[Figure: Skill-based conditions relative to raw-trajectory controls]

Skill-based conditions relative to RAW-TRAJECTORY. Cells report percentage-point differences from raw episodic reuse across LSR, RSR, ESR, CSSR, ARSR, and CompSR. Predominantly negative cells indicate that distilled skills often lose information that raw trajectories still preserve.
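
The cell values described in this caption are simple differences in percentage points. A minimal sketch, assuming per-condition metric dictionaries and using made-up numbers rather than benchmark results (`pp_delta` is a hypothetical helper name):

```python
METRICS = ("LSR", "RSR", "ESR", "CSSR", "ARSR", "CompSR")


def pp_delta(condition, baseline):
    """Percentage-point difference per metric; metrics missing on either side are skipped."""
    return {
        m: round(condition[m] - baseline[m], 1)
        for m in METRICS
        if condition.get(m) is not None and baseline.get(m) is not None
    }


# Made-up numbers for illustration only.
skill_condition = {"LSR": 62.0, "RSR": 70.0, "ESR": 40.0, "CSSR": 35.0, "ARSR": 45.0, "CompSR": 30.0}
raw_trajectory = {"LSR": 60.0, "RSR": None, "ESR": 45.0, "CSSR": 40.0, "ARSR": 45.0, "CompSR": 37.5}

delta = pp_delta(skill_condition, raw_trajectory)
negative = [m for m, d in delta.items() if d < 0]
print(delta)
print(f"{len(negative)} of {len(delta)} cells are negative")
```

A positive cell means the skill-based condition beats raw episodic reuse on that metric; a negative cell means the raw trajectories did better.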

Local gains do not imply reusable procedures.

Skill-based conditions sometimes improve LSR or RSR without improving frozen deployment. For example, SELFGEN-REVISION with Claude Opus 4.6 improves LSR by 5.5 pp and RSR by 10.0 pp over NO-SKILL, but decreases ESR, CSSR, and CompSR. These gains look like local recovery rather than robust skill formation.

Deployment axes reveal different failures.

ESR, CSSR, ARSR, and CompSR expose different bottlenecks. GPT-5.4 under CURATED-REVISION-ALWAYS improves ESR, ARSR, and CompSR but drops CSSR, suggesting missed invocation under broader context. Gemini 3 Flash gains on CSSR and CompSR but drops sharply on ARSR, indicating shortcut vulnerability.

Raw trajectories expose an abstraction bottleneck.

The RAW-TRAJECTORY comparison is the clearest stress test for skill abstraction. If distilled skills preserved reusable procedures, they should match or exceed raw episodic traces. Instead, the heatmap is largely negative; GPT-5.4 under SELFGEN-REVISION improves over RAW-TRAJECTORY on LSR and RSR but falls on ESR, CSSR, ARSR, and CompSR.

More updates are not monotonically better.

The ALWAYS policy reveals a coverage-drift trade-off. More frequent updates can improve deployment coverage, as with GPT-5.2-Codex and Gemini 2.5 Pro on some deployment metrics, but they can also reduce local gains or introduce episode-specific clutter. In curated-start settings, ALWAYS does not consistently dominate targeted revision.

Citation

@article{skillevolbench2026,
  title   = {SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills},
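  author  = {Lei, Yingtie and Wan, Zhongwei and Zhang, Jiankun and Alam, Samiul and Zhong, Zixuan and Huang, Peizhou and Wang, Xin and Zhang, Jingxuan and Zhou, Donghao and Hsieh, Yunta and Dou, Zhihao and Shen, Hui and Xu, Yan and Dimitriadis, Dimitrios and Zhang, Tuo and Zhang, Mi},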
  journal = {arXiv preprint},
  year    = {2026}
}