npx skills add https://github.com/eddiebe147/claude-settings --skill model-evaluator

SKILL.md
Model Evaluator
The Model Evaluator skill helps you rigorously assess and compare machine learning model performance across multiple dimensions. It guides you through selecting appropriate metrics, designing evaluation protocols, avoiding common statistical pitfalls, and making data-driven decisions about model selection.
Proper model evaluation goes beyond accuracy scores. This skill covers evaluation across the full spectrum: predictive performance, computational efficiency, robustness, fairness, calibration, and production readiness. It helps you answer not just "which model is best?" but "which model is best for my specific use case and constraints?"
Whether you are comparing LLMs, classifiers, or custom models, this skill ensures your evaluation methodology is sound and your conclusions are reliable.
Core Workflows
Workflow 1: Design Evaluation Protocol
- Define evaluation objectives:
  - Primary goal (accuracy, speed, cost, etc.)
  - Secondary constraints
  - Failure modes to test
  - Real-world conditions to simulate
- Select appropriate metrics:

| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Classification | Accuracy, F1, AUC-ROC | Precision, Recall, Confusion Matrix |
| Regression | RMSE, MAE, R-squared | Residual analysis, prediction intervals |
| Ranking | NDCG, MRR, MAP | Precision@k, Recall@k |
| Generation | BLEU, ROUGE, BERTScore | Human eval, Faithfulness |
| LLM | Task-specific accuracy | Latency, cost, consistency |

- Design test sets:
  - Held-out test data
  - Edge case collections
  - Adversarial examples
  - Distribution shift tests
- Plan statistical methodology (see the sketch after this list):
  - Sample sizes for significance
  - Confidence intervals
  - Multiple comparison corrections
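As a minimal sketch of the statistical planning step, the snippet below estimates how many test examples you need for a target confidence-interval width on an accuracy-style metric, and applies a Bonferroni correction when several models are compared against the same baseline. The function names and defaults are illustrative, not part of the skill itself.

```python
import math
from statistics import NormalDist

def required_sample_size(expected_rate=0.5, half_width=0.02, alpha=0.05):
    """Approximate n so a normal-approximation CI on a proportion
    (e.g. accuracy) has the requested half-width at significance alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return math.ceil(z**2 * expected_rate * (1 - expected_rate) / half_width**2)

def bonferroni_alpha(alpha=0.05, num_comparisons=3):
    """Per-comparison significance level when several comparisons share one alpha."""
    return alpha / num_comparisons

# ~2,401 examples gives a roughly +/-2% CI on accuracy near 50%;
# comparing three models against one baseline tightens per-test alpha to ~0.0167.
print(required_sample_size(0.5, 0.02, 0.05))
print(bonferroni_alpha(0.05, 3))
```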
Workflow 2: Execute Comparative Evaluation
- Prepare evaluation infrastructure:
```python
class ModelEvaluator:
    def __init__(self, test_data, metrics):
        self.test_data = test_data
        self.metrics = metrics
        self.results = {}

    def evaluate(self, model, model_name):
        predictions = model.predict(self.test_data.inputs)
        scores = {}
        for metric in self.metrics:
            scores[metric.name] = metric.compute(
                predictions, self.test_data.labels
            )
        self.results[model_name] = scores
        return scores

    def compare(self):
        return statistical_comparison(self.results)
```
- Run evaluations consistently across models
- Compute confidence intervals
- Test for statistical significance (see the paired-bootstrap sketch after this list)
- Generate comparison report
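The significance-testing step can be as simple as a paired bootstrap over per-example scores. The sketch below assumes you already have one score per test example for each of two models (for instance 0/1 correctness); the resample count and the dummy data are illustrative choices, not prescribed by the skill.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Bootstrap CI for the mean difference (model A minus model B) over
    the same test examples, resampling examples with replacement."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo, hi = diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]
    return lo, hi  # if this interval excludes 0, the gap is unlikely to be noise

# Dummy 0/1 correctness scores for two models on the same eight examples
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 1]
print(paired_bootstrap(model_a, model_b))
```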
Workflow 3: LLM-Specific Evaluation
- Define evaluation dimensions:
- Task accuracy (factual, reasoning, coding)
- Response quality (coherence, relevance, style)
- Safety and alignment
- Efficiency (tokens, latency, cost)
- Create evaluation datasets:
- Representative prompts
- Ground truth answers (where applicable)
- Human preference data
- Implement LLM evaluation:
- Automated metrics (exact match, semantic similarity; a small scoring sketch follows this list)
- LLM-as-judge evaluations
- Human evaluation protocols
- Analyze results across dimensions
- Make recommendations with tradeoffs
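A minimal sketch of the automated-metric piece: normalized exact match plus a crude token-overlap F1 as a stand-in for semantic similarity. The normalization rules here are assumptions for illustration; in practice you would swap in an embedding- or model-based similarity measure.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_overlap(prediction, reference):
    """Crude stand-in for semantic similarity: token-level F1."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Answer is 42.", "answer is 42"))       # 1.0
print(token_overlap("Paris, France", "The capital is Paris"))  # partial credit
```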
Quick Reference
| Action | Command/Trigger |
|---|---|
| Design evaluation | "How should I evaluate [model type]" |
| Choose metrics | "What metrics for [task type]" |
| Compare models | "Compare these models: [list]" |
| LLM evaluation | "Evaluate LLM performance" |
| Statistical testing | "Is this difference significant" |
| Bias evaluation | "Check model for bias" |
Best Practices
- Use Multiple Metrics: No single metric tells the whole story
  - Include both aggregate and granular metrics
  - Report confidence intervals, not just point estimates
  - Show performance across subgroups
- Test on Realistic Data: Evaluation data should match production
  - Same distribution as real inputs
  - Include edge cases and hard examples
  - Test on data the model hasn't seen
- Account for Variance: Models and data have randomness
  - Run multiple seeds for training-based evaluations
  - Bootstrap confidence intervals
  - Use proper statistical tests for comparison
- Consider All Costs: Performance isn't just accuracy
  - Inference latency and throughput
  - Memory and compute requirements
  - API costs for hosted models
  - Maintenance and update burden
- Test Robustness: How does the model handle adversity? (a perturbation sketch follows this list)
  - Input perturbations and noise
  - Distribution shift
  - Adversarial examples
  - Missing or malformed inputs
- Evaluate Fairly: Ensure fair comparison across models
  - Same test data for all models
  - Consistent preprocessing
  - Equivalent hyperparameter tuning effort
  - Document any advantages/disadvantages
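One way to operationalize the robustness checks above is to perturb inputs with simple character noise and measure how far accuracy drops relative to clean inputs. The `model.predict` interface and the noise rate below are assumptions for illustration, not part of the skill.

```python
import random

def perturb(text, noise_rate=0.05, seed=0):
    """Randomly drop or duplicate characters to simulate noisy input."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < noise_rate:        # drop this character
            continue
        out.append(ch)
        if r > 1 - noise_rate:    # duplicate this character
            out.append(ch)
    return "".join(out)

def robustness_gap(model, inputs, labels, noise_rate=0.05):
    """Accuracy on clean inputs minus accuracy on perturbed inputs.
    Assumes a hypothetical model.predict(list_of_texts) -> list_of_labels."""
    clean_preds = model.predict(inputs)
    noisy_preds = model.predict(
        [perturb(x, noise_rate, seed=i) for i, x in enumerate(inputs)]
    )
    def acc(preds):
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return acc(clean_preds) - acc(noisy_preds)
```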
Advanced Techniques
Multi-Dimensional
...