promptfoo-evaluation
from daymade/claude-code-skills
Professional Claude Code skills marketplace featuring production-ready skills for enhanced development workflows.
npx skills add https://github.com/daymade/claude-code-skills --skill promptfoo-evaluation
Promptfoo Evaluation
Overview
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
Quick Start
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
Configuration Structure
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml # Main configuration
├── prompts/
│ ├── system.md # System prompt
│ └── chat.json # Chat format prompt
├── tests/
│ └── cases.yaml # Test cases
└── scripts/
└── metrics.py # Custom Python assertions
Core Configuration (promptfooconfig.yaml)
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
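Providers can also take an inline config block for generation parameters. A minimal sketch, assuming the standard promptfoo provider config keys (temperature, max_tokens):
providers:
  - id: openai:gpt-4.1
    label: GPT-4.1-deterministic
    config:
      temperature: 0      # greedy decoding for reproducible evals
      max_tokens: 1024    # cap response length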
Prompt Formats
Text Prompt (system.md)
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
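The {{task}} and {{context}} placeholders are template variables (promptfoo uses Nunjucks syntax) that are filled in from each test case's vars. For example, a test case supplying both (the values here are illustrative):
- description: "Summarization task"
  vars:
    task: "Summarize the text below in two sentences."
    context: "Promptfoo is an open-source CLI for evaluating and comparing LLM outputs."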
Chat Format (chat.json)
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
Few-Shot Pattern
Embed examples directly in the prompt, or use the chat format with assistant messages:
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
Test Cases (tests/cases.yaml)
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
Python Custom Assertions
Create a Python file for custom assertions (e.g., scripts/metrics.py):
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }

def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
Key points:
- Default function name is get_assert
- Specify a different function with file://path.py:function_name
- Return bool, float (a score), or dict with pass/score/reason (see the sketch below)
- Access test variables via context['vars']
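For instance, an assertion that returns a bare float is treated as the score. A minimal sketch using a hypothetical length_score function, which would be referenced as file://scripts/metrics.py:length_score:
def length_score(output: str, context: dict) -> float:
    """Hypothetical example: a bare float return value is used as the score."""
    # Illustrative heuristic: longer answers score higher, capped at 1.0
    return min(1.0, len(output.split()) / 300)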
LLM-as-Judge (llm-rubric)
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
Best practices:
- Provide clear scoring criteria
- Use threshold to set the minimum passing score
- The default grader uses whichever API keys are available (OpenAI → Anthropic → Google); a project-wide override is sketched below
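To pin one grader for every model-graded assertion, promptfoo also accepts a grader override under defaultTest. A minimal sketch, assuming the defaultTest.options.provider setting:
defaultTest:
  options:
    provider: anthropic:messages:claude-sonnet-4-5-20250929  # grader for all llm-rubric assertions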
Common Assertion Types
| Type | Usage | Example |
|---|---|---|
| contains | Check substring | value: "hello" |
| icontains | Case-insensitive substring | value: "HELLO" |
| equals | Exact match | value: "42" |
| regex | Pattern match | value: "\\d{4}" |
| python | Custom logic | value: file://script.py |
| llm-rubric | LLM grading | value: "Is professional" |
| latency | Response time (ms) | threshold: 1000 |
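Several of these types can be combined on a single test case; the values below are placeholders:
- vars:
    user_input: "Summarize the attached report in one paragraph."
  assert:
    - type: icontains
      value: "summary"
    - type: regex
      value: "\\d{4}"
    - type: latency
      threshold: 1000  # milliseconds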
File References
All paths are resolved relative to the location of the config file:
# Load file content as a variable
vars:
  content: file://data/input.txt

# Load a prompt from a file
prompts:
  - file://prompts/main.md

# Load test cases from a file
tests: file://tests/cases.yaml

# Load a Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
Running Evaluations
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval -c path/to/promptfooconfig.yaml
...