AI agent “skills” are quickly turning into a pragmatic way to steer coding agents toward repeatable workflows—but “How to Evaluate Agent Skills (And Why You Should)” argues that the real work starts after writing them: proving they actually improve outcomes on concrete tasks.
The post’s core point is simple and very developer-friendly: a skill isn’t “good” because it reads nicely in a prompt file. It’s good if it measurably changes pass/fail outcomes on a task with a clear definition of success. That matters because even a skill that reads like helpful guidance can reduce performance (SkillsBench observed negative deltas for some skills), and its usefulness can fade as models improve, or vary entirely from model to model.
What “useful evaluation” looks like (and what teams skip)
Rather than treating skills as documentation, the post frames evaluation as a small, repeatable experiment with three ingredients:
- a bounded task the agent can complete in one run
- a deterministic verifier that returns pass/fail
- a no-skill baseline to compare against
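The second ingredient is worth making concrete: a deterministic verifier is just a function that inspects the task's output and returns the same pass/fail verdict every time. A minimal sketch, where the `result.json` artifact and its expected fields are hypothetical placeholders (not from the post or the tutorial repo):

```python
import json
from pathlib import Path

def verify(workdir: str) -> bool:
    """Deterministic pass/fail check: same inputs always yield the same verdict."""
    out = Path(workdir) / "result.json"  # hypothetical artifact the task must produce
    if not out.exists():
        return False
    try:
        data = json.loads(out.read_text())
    except json.JSONDecodeError:
        return False
    # Success is defined up front, not judged subjectively after the fact.
    return data.get("status") == "ok" and isinstance(data.get("items"), list)
```

Because the check is a pure function of the workspace contents, reruns with and without the skill are directly comparable.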
That baseline emphasis is the sharp edge here: without running the same task both with and without the skill, it’s hard to tell whether the skill helped, did nothing, or quietly made things worse. The post also calls out secondary metrics—runtime, event count, tool usage—as useful context around pass/fail.
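The with/without comparison can be scripted as a tiny harness. A sketch of that loop, where `run_agent` is a hypothetical callable standing in for launching the agent on the task (its record shape here is assumed, not the tutorial repo's schema); the structure (N runs per variant, pass rates, delta, secondary metrics) mirrors the loop the post describes:

```python
import time
from statistics import mean

def evaluate(run_agent, task, skill, runs=5):
    """Run the same task with and without the skill and compare outcomes.

    `run_agent(task, skill_or_none)` is a hypothetical launcher returning
    at least {"passed": bool, "events": int}; swap in a real one.
    """
    results = {"baseline": [], "with_skill": []}
    for variant, s in (("baseline", None), ("with_skill", skill)):
        for _ in range(runs):
            start = time.monotonic()
            record = dict(run_agent(task, s))
            record["runtime_s"] = time.monotonic() - start  # secondary metric
            results[variant].append(record)

    def pass_rate(records):
        return mean(1.0 if r["passed"] else 0.0 for r in records)

    return {
        "baseline_pass_rate": pass_rate(results["baseline"]),
        "skill_pass_rate": pass_rate(results["with_skill"]),
        # Negative delta means the skill quietly made things worse.
        "delta": pass_rate(results["with_skill"]) - pass_rate(results["baseline"]),
        "mean_events": {v: mean(r["events"] for r in rs) for v, rs in results.items()},
    }
```

Injecting `run_agent` keeps the harness testable and agent-agnostic; the delta is the number that answers "did this skill actually help?"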
A tutorial repo with three tasks (and three different outcomes)
To make the evaluation loop concrete, the post links a hands-on repo: evaluating-skills-tutorial. It includes three deterministic tasks, paired “no-skill” and “improved-skill” variants, and local verifiers, with runs executed across multiple models (including Claude Sonnet 4.5, Gemini variants, Kimi K2, and MiniMax M2.5).
Crucially, each task tells a different story: an “essential” procedural skill that flips outcomes dramatically, a “safety net” skill that mainly improves consistency, and a skill whose impact is mixed, varying by model and backend.
Why traces matter for iteration
Beyond pass/fail, the post also highlights traces as the fastest way to understand why a run succeeded or failed—spotting patterns like meandering exploration in baseline runs or brittle over-constraint in skill-enabled ones. The example uses Laminar, with a note that OpenHands is OTEL-compatible, so the loop isn’t locked to one observability stack.
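One cheap way to mine traces for a pattern like meandering exploration is to scan the tool-call sequence for loops: the same call issued with identical arguments over and over. A sketch over a hypothetical flattened event format (real OTEL/Laminar spans carry much richer structure):

```python
from collections import Counter

def repeated_calls(events, threshold=3):
    """Flag tool calls issued with identical arguments `threshold` or more times.

    `events` is a hypothetical flattened trace: a list of
    {"tool": str, "args": str} records in execution order.
    """
    counts = Counter((e["tool"], e["args"]) for e in events)
    return {call: n for call, n in counts.items() if n >= threshold}
```

A baseline run that greps the same file five times is a very different failure from a skill-enabled run that stops after one over-constrained attempt, and this kind of quick pass over the trace makes that distinction visible before reading the full transcript.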
For the full breakdown (including the task designs, the pass-rate tables, and the evaluation loop), the original post is here: https://www.all-hands.dev/blog/evaluating-agent-skills.