When most people write their first Skill, they instinctively write it as a longer prompt.
Background, rules, caveats, examples, references — all stuffed into one SKILL.md. It looks complete. It often isn’t.
The point of a Skill isn’t “let the model read more in one shot.” The point is this:
When the user describes a certain kind of task, the agent recognizes the situation, loads the right workflow, picks the right tools, and executes a stable method.
A prompt is a one-off instruction.
A Skill is a reusable, triggerable, maintainable workflow.
It’s not solving the problem of “how do I prompt the model this time.” It’s solving “how should the agent reliably handle this kind of task the next hundred times.”
1. Respect what makes a Skill a Skill: on-demand loading
A Skill is not the same as CLAUDE.md or a system prompt.
CLAUDE.md is persistent context. As long as you’re working in the project, it stays loaded and shapes the model’s behavior.
A slash command is a manual command. The user has to type it; only then does the agent run the workflow.
A Skill sits between the two.
Its defining property is on-demand loading.
The full SKILL.md is not in context by default. It only loads when the user’s input matches the Skill’s description.
Two benefits:
- Saves context.
- Cuts down noise from rules that don’t apply to the current task.
But here’s the detail that matters:
The full Skill body isn’t persistent, but the description is — it stays in the matching pool the whole time.
So whether the description is well-written directly determines whether the Skill ever fires.
A “cover image” Skill with a generic one-line description probably isn’t enough.
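Something like this hypothetical frontmatter, for instance (name and wording are illustrative):

```yaml
---
name: cover-image
description: Generates cover images.
---
```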
Because what the user actually says is:
- Make me a cover image
- Add an image to this article
- Design a Xiaohongshu cover
- Generate an image for this long X post
- Make a tech-style poster for this title
Same with a “long-form X post” Skill — the user isn’t going to say “invoke the longform skill.”
They’re more likely to say:
- Expand this idea into a long thread
- Write me a long X post
- Can this be turned into longer content
- Rewrite this in my voice
- Give me a more attention-grabbing opening
Those real-world phrasings belong in the description or the trigger examples.
Otherwise the Skill exists but never gets called.
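A description with a better chance of matching looks more like this sketch (the wording is illustrative, built from the phrasings above):

```yaml
---
name: cover-image
description: >
  Creates cover images, posters, and article header images.
  Use when the user asks to make a cover image, add an image
  to an article, design a Xiaohongshu cover, or generate a
  tech-style poster for a title or long post.
---
```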
There’s another situation: if you have a lot of Skills, or one Skill is unusually long, you can disable it to turn off auto-loading.
The point of disable isn’t just preventing false triggers. The bigger point is saving context.
Auto-triggered Skills require their description to stay in the matching pool. More Skills, more descriptions, higher persistent matching cost.
Some Skills are highly specialized, long, and rarely used. No reason to keep them resident.
For example:
- Year-end review Skill: maybe used a few times a year.
- Resume rewrite Skill: only during a job search.
- Cover image Skill: only when publishing content.
- Course notes Skill: only when studying a specific course.
These can be disabled.
Invoke them manually when needed.
This converts the Skill from “an automatic workflow” into “a tool you summon”:
- Pro: less context, fewer false fires.
- Con: the user has to remember it exists.
So the first decision when writing a Skill isn’t “what rules do I write.” It’s:
Should this auto-trigger, or only fire on demand?
2. Give the Skill the right tool boundary
A Skill can restrict which tools it’s allowed to use.
This isn’t required, but it’s useful.
Different Skills need different permissions.
In Claude Code, this is declared in the Skill’s frontmatter; the sketches below show the idea.
For organizing study material, a minimal read/write set might be enough.
For document organization, add file search.
For cover image generation, the Skill needs to read references, write a prompt, then call the image tool.
For batch organization, you’d open up more permissions.
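A few illustrative sketches, assuming Claude Code’s allowed-tools frontmatter field (one line per scenario; the tool names are examples, not a canonical list):

```yaml
# Study-material organizer: read and write only
allowed-tools: Read, Write

# Document organization: add file search
allowed-tools: Read, Write, Glob, Grep

# Cover image generation: read references, write the prompt, run the image tool
allowed-tools: Read, Write, Bash

# Batch organization: broader permissions
allowed-tools: Read, Write, Edit, Glob, Grep, Bash
```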
The principle is simple:
Grant only the minimum permissions needed to do the job.
Don’t let a Skill that just generates suggestions modify files by default.
Don’t let a read-only analysis Skill execute arbitrary commands by default.
More tools doesn’t mean more capable. Sometimes it just means more risk.
3. Different Skills can use different models
A Skill can specify a model.
This gets overlooked, but it matters.
Different tasks place different demands on the model.
Writing docs, generating images, analyzing data, scraping, organizing study notes, generating long-form content — these shouldn’t all default to the same model.
A few examples:
- Writing docs: needs clear expression and stable structure. Use a model strong at writing and summarization.
- Image / design work: needs visual understanding and layout sense. Multimodal or design-oriented models fit better.
- Data analysis: needs reliable handling of tables and result interpretation. A solid reasoning model with lower cost works.
- Scraping: often batch processing and extraction. Doesn’t need the strongest model — a cheap fast one is better suited.
- Study notes: needs categorization, distillation of key points, preservation of meaning. A stable cheap model is fine.
- Long-form X posts: needs structure, voice, pacing, judgment about what’s worth saying. Use a model strong at writing.
- Resume editing: needs to weigh the job description against what to emphasize. Use a more careful model.
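In frontmatter, this can look something like the following sketch (the field and the alias names are illustrative placeholders, not exact values):

```yaml
# Long-form writing Skill: use a stronger writing model
model: opus

# Scraping / batch extraction Skill: a cheap, fast model is enough
model: haiku
```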
Model selection isn’t about flexing. It’s a balance of cost and reliability.
Some models are stronger but more expensive.
Some are cheaper but adequate for simple tasks.
A mature Skill system shouldn’t default every task to the same model.
It should pick the right execution model based on the task type.
4. Progressive disclosure: keep SKILL.md small
SKILL.md shouldn’t carry everything.
Think of it as the entry point.
It should hold:
- When to trigger.
- The principles.
- The execution steps.
- Which tools are needed.
- Where the rest of the material lives.
- How to validate at the end.
Don’t dump every reference, long example, script, and template into it.
A better approach is a layered layout.
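A sketch of such a layout (the file names are illustrative):

```
cover-image/
├── SKILL.md
├── references/
│   ├── style-guide.md
│   └── platform-sizes.md
├── scripts/
│   ├── check_dimensions.py
│   └── export_platform_versions.py
└── assets/
    ├── cover-template.png
    └── output-format.md
```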
Different directories hold different things.
references/ holds long docs, style guides, platform sizes, detailed cases.
scripts/ holds executable code — check image dimensions, batch rename, export multiple platform versions, format tables. Deterministic operations are more reliable as scripts than as model improv every time.
assets/ holds templates, schemas, images, sample files, output formats.
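For example, a dimension check is exactly the kind of deterministic operation that belongs in scripts/. A hypothetical scripts/check_dimensions.py, assuming Pillow is installed:

```python
# scripts/check_dimensions.py (illustrative)
# Fails if a generated cover doesn't match the expected size.
import sys
from PIL import Image

EXPECTED = (1200, 1600)  # example 3:4 cover size

path = sys.argv[1]
size = Image.open(path).size
if size != EXPECTED:
    print(f"FAIL: {path} is {size}, expected {EXPECTED}")
    sys.exit(1)
print(f"OK: {path} is {size}")
```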
The benefit:
The agent doesn’t load everything every time.
It reads SKILL.md first to understand the overall flow. When it actually needs detail, it reads a reference or runs a script.
That’s progressive disclosure.
Read on demand. Execute on demand.
Don’t shove everything into context up front.
This is also why SKILL.md is best kept under 500 lines.
Not because 500 is magic. Because past that length, you’ve usually mixed too many things together.
If a Skill is too long, three things to try first:
- Move long explanations to references/.
- Turn stable operations into scripts/.
- Move templates and samples into assets/.
If it’s still long after that, you might not have one Skill — you have several.
5. After writing the Skill, you still need to validate, score, and iterate
Writing the Skill doesn’t mean it works.
This part matters.
If you want a Skill to be reliable, run at least three kinds of validation:
- Does it run.
- Does it trigger correctly.
- Are the results actually better than not using the Skill.
The third is the easiest one to skip.
Many Skills just “look done.” But did it raise quality? Did it cut errors? Did it make the output match what the user wanted? Without an eval, nobody actually knows.
Claude Code’s Skill Creator gives a useful framing: don’t ship a Skill on vibes. Set up test data, run an eval, look at the results, iterate.
The core flow:
- Define the task the Skill solves.
- Write a first draft of the Skill.
- Prepare test data / test prompts.
- Run the Skill against those tests.
- Evaluate each result.
- Score each one.
- Modify the Skill based on the failures.
- Run another round.
- Until the main tests hit the minimum acceptable score.
Think of it as writing tests for the Skill.
Not as rigid as unit tests, but the idea is similar: don’t trust the Skill file itself, look at how it performs on sample tasks.
Layer 1: does it run
Test the most basic execution first.
Things like:
- Are the file paths right?
- Are the tool permissions enough?
- Does the script execute?
- Can the references be read?
- Are the assets/ templates used correctly?
- Does the output match the expected format?
- Are any required steps missing?
If a Skill can’t even complete a real task, talking about triggers and quality is pointless.
Layer 2: does it trigger correctly
For auto-loaded Skills, also test triggering.
You can’t just test the one phrase that literally names the Skill, because real users don’t say that.
Prepare a set of phrases real users might actually say. For a cover image Skill, that means things like “Make me a cover image,” “Design a Xiaohongshu cover,” or “Make a tech-style poster for this title,” plus a few requests that look adjacent but shouldn’t trigger it.
That’s trigger eval.
Each test should be tagged with whether it’s supposed to trigger the Skill.
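Sketched in an arbitrary format (the file name and field names are hypothetical, nothing here is prescribed):

```yaml
# trigger-tests.yaml (illustrative)
- prompt: "Make me a cover image for this article"
  should_trigger: true
- prompt: "Design a Xiaohongshu cover"
  should_trigger: true
- prompt: "Summarize this article in three bullet points"
  should_trigger: false
```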
If something that should trigger doesn’t, add keywords to the description.
If something that shouldn’t trigger does, narrow the description.
This step optimizes the entry point of the Skill.
If the entry is wrong, the body doesn’t matter.
Layer 3: prepare test data and eval cases
The bigger task is validating result quality.
You need test data.
That doesn’t mean the user has to hand-craft a pile of test material.
A more reasonable approach: have the agent generate a first batch of test data and eval cases based on the Skill’s purpose, and the user just reviews whether they match real scenarios.
For example:
- Cover image Skill: 5 titles, 5 content types, a few reference images.
- Long-form X Skill: 5 raw ideas, target audience, ideal final-draft style.
- Study notes Skill: a few class notes, excerpts, OCR text from screenshots.
- Resume rewrite Skill: different job descriptions, original resume, ideal direction.
- Weekly summary Skill: a week of scattered notes, meeting minutes, completed items.
Each test case should ideally include the input, the expected result, and the criteria for judging it.
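A single case might be recorded like this (the field names are arbitrary):

```yaml
# One illustrative eval case for a cover image Skill
input: "Title: Why your first Skill reads like a long prompt"
expected: "A 3:4 cover with the title legible at thumbnail size"
criteria:
  - "Title text is readable"
  - "Style matches the reference images"
  - "Correct dimensions for the target platform"
```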
The format isn’t the point. The mindset is:
You define “what counts as good” up front.
Otherwise the eval becomes a vibe check after the fact.
Layer 4: score it, don’t just say “good or bad”
After running the test data, don’t just write “results look fine.”
Score each case. Use 0 to 10.
A simple rubric:
- 0–2: didn’t complete the task, or wrong direction.
- 3–4: tangentially related, missed key requirements.
- 5–6: usable but with obvious problems.
- 7–8: stable quality, minor details to fix.
- 9–10: matches expectations, can serve as a reference output.
A reasonable bar: main test cases score at least 5.
If an important case scores below 5, the Skill isn’t reliable yet.
For a high-frequency Skill, that usually means it’s broken in a common scenario.
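A score sheet after one round might look like this (the cases, scores, and notes are purely illustrative):

```yaml
- case: "tech article cover"
  score: 7
  notes: "title legible, palette drifts from the references"
- case: "Xiaohongshu cover"
  score: 4
  notes: "wrong aspect ratio, missed the platform size"
```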
Layer 5: compare against a baseline
If you want to be more rigorous, do a baseline comparison.
Same test, run twice:
- Without the Skill.
- With the Skill.
Then compare.
Claude Code’s Skill Creator eval includes something like this — A/B benchmark, skill-enabled vs baseline.
Worth doing.
A Skill shouldn’t just “run.” It should prove it’s useful.
If the no-Skill result is already 7 and the Skill version is also 7, the Skill isn’t adding much.
If no-Skill is 4 and with-Skill is 8 — that’s the Skill actually encoding useful experience.
Layer 6: fix the Skill based on the failures
Scoring isn’t decoration.
Scoring tells you what to change.
Common fixes:
- Trigger fails: edit the description, add real trigger phrases.
- False trigger: narrow the description, add negative cases.
- Missing steps: edit the workflow in SKILL.md.
- Unstable output format: add an output template to assets/.
- Unstable deterministic checks: write a script.
- Reference info too long: split it into references/.
- Tool permissions too tight: add allowed tools.
- Tool permissions too loose: tighten allowed tools.
- Model too weak: switch to a better-fit model.
Then run the eval again.
That’s the optimization loop: run the tests, score the results, fix the Skill, run them again.
For a high-frequency, complex, or shared Skill, run this loop.
Otherwise it’s just an unvalidated prompt file.
What does a good Skill look like?
A good Skill isn’t mystical.
It just has to do a few things:
- Trigger when it should.
- Stay quiet when it shouldn’t.
- Have enough tool permissions, but not too many.
- Pick a model that matches the task’s cost and difficulty.
- Keep SKILL.md short, with details that load on demand.
- For important Skills, have test data, an eval, failure cases, and iteration.
So when writing a Skill, don’t only ask:
“What do I tell the model?”
Also ask:
- What user phrases should make this Skill appear?
- Once it appears, what’s the workflow?
- Which tools is it allowed to use, and which is it not?
- Which model does it need?
- What stays in the main file, and what gets split out?
- How do I prove it actually helps?
That’s the real difference between a Skill and a prompt.
A prompt solves this conversation.
A Skill solves the next hundred similar tasks.
It’s not making the prompt longer.
It’s encoding experience into a triggerable, executable, testable, maintainable workflow.