Skill Judge

Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.

Core Philosophy

What is a Skill?

A Skill is NOT a tutorial. A Skill is a knowledge externalization mechanism.

Traditional AI knowledge is locked in model parameters. To teach new capabilities:

Traditional: Collect data → GPU cluster → Train → Deploy new version
Cost: $10,000 - $1,000,000+
Timeline: Weeks to months

Skills change this:

Skill: Edit SKILL.md → Save → Takes effect on next invocation
Cost: $0
Timeline: Instant

This is the paradigm shift from "training AI" to "educating AI" — like a hot-swappable LoRA adapter that requires no training. You edit a Markdown file in natural language, and the model's behavior changes.

The Core Formula

Good Skill = Expert-only Knowledge − What Claude Already Knows

A Skill's value is measured by its knowledge delta — the gap between what it provides and what the model already knows.

Expert-only knowledge: Decision trees, trade-offs, edge cases, anti-patterns, domain-specific thinking frameworks — things that take years of experience to accumulate
What Claude already knows: Basic concepts, standard library usage, common programming patterns, general best practices

When a Skill explains "what is PDF" or "how to write a for-loop", it repeats knowledge Claude already has. This wastes tokens: the context window is a shared resource, also used by system prompts, conversation history, other Skills, and user requests.

Tool vs Skill

Concept	Essence	Function	Example
Tool	What model CAN do	Execute actions	bash, read_file, write_file, WebSearch
Skill	What model KNOWS how to do	Guide decisions	PDF processing, MCP building, frontend design

Tools define capability boundaries: without the bash tool, the model can't execute commands. Skills inject knowledge: without the frontend-design Skill, the model produces generic UI.

The equation:

General Agent + Excellent Skill = Domain Expert Agent

The same Claude model, with different Skills loaded, becomes a different expert.

Three Types of Knowledge in Skills

When evaluating, categorize each section:

Type	Definition	Treatment
Expert	Claude genuinely doesn't know this	Must keep — this is the Skill's value
Activation	Claude knows but may not think of	Keep if brief — serves as reminder
Redundant	Claude definitely knows this	Should delete — wastes tokens

The art of Skill design is maximizing Expert content, using Activation sparingly, and eliminating Redundant ruthlessly.
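
The three-way categorization above can be tracked with a small tally. The following is a minimal sketch, not part of any official tooling: the names `KnowledgeType` and `summarize` are hypothetical, the labels are assigned by the human reviewer (the code classifies nothing itself), and the "expert share" ratio is an illustrative convenience rather than a metric defined by this Skill.

```python
from collections import Counter
from enum import Enum


class KnowledgeType(Enum):
    """The three knowledge types, each paired with its treatment from the table."""
    EXPERT = "Must keep"
    ACTIVATION = "Keep if brief"
    REDUNDANT = "Should delete"


def summarize(labels):
    """Tally reviewer-assigned labels and return (counts, expert share).

    `labels` holds one KnowledgeType per section. The share is an
    illustrative ratio (an assumption), not an official score.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    expert_share = counts[KnowledgeType.EXPERT] / total if total else 0.0
    return counts, expert_share


# Hypothetical labels for a five-section Skill.
labels = [
    KnowledgeType.EXPERT,
    KnowledgeType.EXPERT,
    KnowledgeType.ACTIVATION,
    KnowledgeType.REDUNDANT,
    KnowledgeType.EXPERT,
]
counts, expert_share = summarize(labels)
for kind in KnowledgeType:
    print(f"{kind.name}: {counts[kind]} section(s) -> {kind.value}")
print(f"Expert share: {expert_share:.0%}")
```

A low expert share suggests the Skill spends most of its tokens on content Claude already has.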

Evaluation Dimensions (120 points total)

D1: Knowledge Delta (20 points) — THE CORE DIMENSION

The most important dimension. Does the Skill add genuine expert knowledge?

Score	Criteria
0-5	Explains basics Claude knows (what is X, how to write code, standard library tutorials)
6-10	Mixed: some expert knowledge diluted by obvious content
11-15	Mostly expert knowledge with minimal redundancy
16-20	Pure knowledge delta — every paragraph earns its tokens

Red flags (instant score ≤5):

"What is [basic concept]" sections
Step-by-step tutorials for standard operations
Explaining how to use common libraries
Generic best practices ("write clean code", "handle errors")
Definitions of industry-standard terms

Green flags (indicators of high knowledge delta):

Decision trees for non-obvious choices ("when X fails, try Y because Z")
Trade-offs only an expert would know ("A is faster but B handles edge case C")
Edge cases from real-world experience
"NEVER do X because [non-obvious reason]"
Domain-specific thinking frameworks

Evaluation questions:

For each section, ask: "Does Claude already know this?"
If explaining something, ask: "Is this explaining TO Claude or FOR Claude?"
Count paragraphs that are Expert vs Activation vs Redundant
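
The D1 bands and the red-flag rule can be written down as a lookup. This is a sketch under stated assumptions: `d1_score` and `d1_band` are hypothetical helper names, the raw score still comes from the reviewer's judgment, and reading "instant score ≤5" as a hard cap applied whenever any red flag is present is one interpretation of the rule above.

```python
# Score bands copied from the D1 table: (low, high, criteria).
D1_BANDS = [
    (0, 5, "Explains basics Claude knows"),
    (6, 10, "Mixed: some expert knowledge diluted by obvious content"),
    (11, 15, "Mostly expert knowledge with minimal redundancy"),
    (16, 20, "Pure knowledge delta"),
]

# Assumption: "instant score <=5" is treated as a cap on the final score.
RED_FLAG_CAP = 5


def d1_score(raw_score, red_flags):
    """Clamp the reviewer's raw score to 0-20, then cap it if any red flag was found."""
    score = max(0, min(20, raw_score))
    if red_flags:
        score = min(score, RED_FLAG_CAP)
    return score


def d1_band(score):
    """Return the criteria text for the band containing `score`."""
    for low, high, criteria in D1_BANDS:
        if low <= score <= high:
            return criteria
    raise ValueError(f"score {score} is outside 0-20")


# Hypothetical review: strong content, but one tutorial-style section was found.
flags = ["Step-by-step tutorial for a standard operation"]
final = d1_score(17, flags)
print(final, "->", d1_band(final))
```

Keeping the cap separate from the raw score makes it visible when a single red flag, rather than the overall content, determined the result.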

D2: Mindset + Appropriate Procedures (15 points)

Does the Skill transfer expert thinking patterns along with necessary domain-specific procedures?

The difference between experts and novices isn't "knowing how to operate" — it's "how to think about the problem." But thinking patterns alone aren't enough when Claude lacks domain-specific procedural knowledge.

Key distinction:

Type	Example	Value
Thinking patterns	"Before designing, ask: What makes this memorable?"	High — shapes decision-making
Domain-specific procedures	"OOXML workflow: unpack → edit XML → validate → pack"	High — Claude may not know this
Generic procedures	"Step 1: Open file, Step 2: Edit, Step 3: Save"	Low — Claude already knows

Score	Criteria
0-3	Only generic procedures Claude already knows
4-7	Has domain proce

...
