Why AI agent “skills” need real benchmarks, not prompts

In a new post, Rajiv Shah explains why AI agent “skills” only matter if they measurably improve pass/fail outcomes. He outlines a simple eval loop—task, deterministic verifier, and no-skill baseline—plus a tutorial repo showing mixed results across models.

TL;DR

  • Skill quality = measurable pass/fail change: readable prompts aren't enough; a skill must improve outcomes on a defined task
  • Skills can reduce performance: negative deltas observed (e.g., in SkillsBench); impact varies by model and can fade as models improve
  • Evaluation loop: Bounded task, deterministic verifier, and no-skill baseline comparison
  • Secondary metrics: Runtime, event count, tool usage provide context around pass/fail results
  • Tutorial repo: evaluating-skills-tutorial with three deterministic tasks and local verifiers across multiple models
  • Traces for iteration: Laminar example; OpenHands OTEL-compatible for swapping observability stacks

AI agent “skills” are quickly turning into a pragmatic way to steer coding agents toward repeatable workflows—but “How to Evaluate Agent Skills (And Why You Should)” argues that the real work starts after writing them: proving they actually improve outcomes on concrete tasks.

The post’s core point is simple and very developer-friendly: a skill isn’t “good” because it reads nicely in a prompt file. It’s good if it measurably changes pass/fail outcomes on a task with a clear definition of success. That matters because even when a skill looks like “helpful guidance,” it can still reduce performance (including producing negative deltas, as seen in SkillsBench), and its usefulness can fade as models get better—or vary across models entirely.

What “useful evaluation” looks like (and what teams skip)

Rather than treating skills as documentation, the post frames evaluation as a small, repeatable experiment with three ingredients:

  • a bounded task the agent can complete in one run
  • a deterministic verifier that returns pass/fail
  • a no-skill baseline to compare against

That baseline emphasis is the sharp edge here: without running the same task both with and without the skill, it’s hard to tell whether the skill helped, did nothing, or quietly made things worse. The post also calls out secondary metrics—runtime, event count, tool usage—as useful context around pass/fail.
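That loop is easy to encode. The sketch below is illustrative, not the post's actual harness: the `agent` callables and the string-equality verifier are hypothetical stand-ins for a real agent runner and a real deterministic check, but the shape of the experiment (same bounded task, pass/fail verifier, with-skill vs. no-skill comparison, plus runtime and event-count context) matches what the post describes.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    passed: bool       # deterministic verifier verdict
    runtime_s: float   # secondary metric: wall-clock time
    events: int        # secondary metric: agent event count


def evaluate(agent: Callable[[str], tuple[str, int]],
             verifier: Callable[[str], bool],
             task: str, trials: int = 5) -> dict:
    """Run one bounded task several times and aggregate pass/fail
    plus secondary metrics. The verifier must be deterministic."""
    results = []
    for _ in range(trials):
        start = time.perf_counter()
        output, events = agent(task)
        results.append(RunResult(verifier(output),
                                 time.perf_counter() - start, events))
    return {
        "pass_rate": sum(r.passed for r in results) / trials,
        "mean_runtime_s": sum(r.runtime_s for r in results) / trials,
        "mean_events": sum(r.events for r in results) / trials,
    }


# Toy stand-ins for real no-skill / with-skill runs (hypothetical).
baseline_agent = lambda task: ("wrong answer", 12)
skill_agent = lambda task: ("expected answer", 7)
verifier = lambda out: out == "expected answer"  # deterministic pass/fail

task = "produce the expected answer"
baseline = evaluate(baseline_agent, verifier, task)
with_skill = evaluate(skill_agent, verifier, task)

# The number that actually judges the skill. It can be negative,
# which is exactly the SkillsBench-style regression the post warns about.
delta = with_skill["pass_rate"] - baseline["pass_rate"]
```

The point of the structure is that `delta` is only meaningful because the baseline run exists; without it, a high with-skill pass rate tells you nothing about whether the skill contributed.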

A tutorial repo with three tasks (and three different outcomes)

To make the evaluation loop concrete, the post links a hands-on repo: evaluating-skills-tutorial. It includes three deterministic tasks, paired “no-skill” and “improved-skill” variants, and local verifiers, with runs executed across multiple models (including Claude Sonnet 4.5, Gemini variants, Kimi K2, and MiniMax M2.5).
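The repo's verifiers are its own; as a generic illustration of what "local, deterministic" means in practice, a verifier can be as simple as a pure function over the agent's output artifact. This hypothetical example passes only if the agent wrote valid JSON containing a set of required keys, so repeated runs on the same output always agree:

```python
import json
import tempfile
from pathlib import Path


def verify_json_output(path, required_keys):
    """Deterministic pass/fail: the file must exist, parse as JSON,
    and contain every required top-level key."""
    p = Path(path)
    if not p.is_file():
        return False
    try:
        data = json.loads(p.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


# Quick demonstration against a throwaway file.
tmp = Path(tempfile.mkdtemp()) / "out.json"
tmp.write_text(json.dumps({"name": "demo", "version": "1.0"}))
print(verify_json_output(tmp, {"name", "version"}))  # True
```

Because the check depends only on the artifact, the same run can never flip between pass and fail, which is what makes with-skill and no-skill pass rates comparable.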

Crucially, each task tells a different story: an “essential” procedural skill that flips outcomes dramatically, a “safety net” skill that mainly improves consistency, and a skill whose impact is mixed depending on model and backend.

Why traces matter for iteration

Beyond pass/fail, the post also highlights traces as the fastest way to understand why a run succeeded or failed—spotting patterns like meandering exploration in baseline runs or brittle over-constraint in skill-enabled ones. The example uses Laminar, with a note that OpenHands is OTEL-compatible, so the loop isn’t locked to one observability stack.

For the full breakdown (including the task designs, the pass-rate tables, and the evaluation loop), the original post is here: https://www.all-hands.dev/blog/evaluating-agent-skills.
