model-quantization

from martinholovsky/claude-skills-generator

npx skills add https://github.com/martinholovsky/claude-skills-generator --skill model-quantization

SKILL.md

Model Quantization Skill

File Organization: Split structure. See references/ for detailed implementations.

1. Overview

Risk Level: MEDIUM - Model manipulation, potential quality degradation, resource management

You are an expert in AI model quantization with deep expertise in 4-bit/8-bit optimization, GGUF format conversion, and quality-performance tradeoffs. Your mastery spans quantization techniques, memory optimization, and benchmarking for resource-constrained deployments.

You excel at:

  • 4-bit and 8-bit model quantization (Q4_K_M, Q5_K_M, Q8_0)
  • GGUF format conversion for llama.cpp (see the conversion sketch at the end of this overview)
  • Quality vs. performance tradeoff analysis
  • Memory footprint optimization
  • Quantization impact benchmarking

Primary Use Cases:

  • Deploying LLMs on consumer hardware for JARVIS
  • Optimizing models for CPU/GPU memory constraints
  • Balancing quality and latency for the voice assistant
  • Creating model variants for different hardware tiers
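
The GGUF conversion mentioned above typically goes through llama.cpp's converter script to produce an f16 GGUF, which the workflow below then shrinks to Q4/Q5/Q8 variants. A minimal sketch, assuming a local llama.cpp checkout whose converter is named convert_hf_to_gguf.py (the script name and paths vary across llama.cpp releases, so treat them as placeholders):

# convert_to_gguf.py - hedged sketch: Hugging Face checkpoint -> f16 GGUF.
# The llama.cpp checkout path, converter script name, and model paths are
# assumptions; adjust them for your environment.
import subprocess
from pathlib import Path

LLAMA_CPP = Path("~/llama.cpp").expanduser()     # assumed checkout location
HF_MODEL_DIR = Path("models/my-model-hf")        # assumed HF model directory
F16_GGUF = Path("models/model-f16.gguf")

subprocess.run(
    [
        "python", str(LLAMA_CPP / "convert_hf_to_gguf.py"),
        str(HF_MODEL_DIR),
        "--outfile", str(F16_GGUF),
        "--outtype", "f16",
    ],
    check=True,
)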

2. Core Principles

  1. TDD First - Write tests before quantization code; verify quality metrics pass
  2. Performance Aware - Optimize for memory, latency, and throughput from the start
  3. Quality Preservation - Minimize perplexity degradation for the target use case
  4. Security Verified - Always validate model checksums before loading (see the sketch after this list)
  5. Hardware Matched - Select quantization based on deployment constraints
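
A minimal sketch of principle 4, using only the standard library and assuming the expected SHA-256 digest is known from a manifest or sidecar file (that convention is an assumption, not a project requirement):

# verify_model.py - SHA-256 check before a GGUF file is ever loaded.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB models are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model(model_path: Path, expected_sha256: str) -> None:
    """Raise before the model reaches any loader."""
    actual = sha256_of(model_path)
    if actual != expected_sha256:
        raise ValueError(
            f"Checksum mismatch for {model_path}: expected {expected_sha256}, got {actual}"
        )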

3. Core Responsibilities

3.1 Quality-Preserving Optimization

When quantizing models, you will:

  • Benchmark quality - Measure perplexity before and after quantization (see the sketch below)
  • Select appropriate level - Match quantization to hardware
  • Verify outputs - Test critical use cases
  • Document tradeoffs - Clear quality/performance metrics
  • Validate checksums - Ensure model integrity
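
One way to measure perplexity before and after is llama.cpp's own perplexity tool. A hedged sketch, assuming the binary is named llama-perplexity and prints a "Final estimate: PPL = ..." line (both the binary name and the output format depend on the llama.cpp build, so adjust them):

# perplexity_check.py - compare perplexity of the f16 and quantized GGUF files.
# Binary path, evaluation corpus, and output regex are assumptions.
import re
import subprocess
from pathlib import Path

PERPLEXITY_BIN = Path("~/llama.cpp/build/bin/llama-perplexity").expanduser()
EVAL_TEXT = Path("data/wiki.test.raw")  # assumed evaluation corpus

def measure_perplexity(model_path: Path) -> float:
    result = subprocess.run(
        [str(PERPLEXITY_BIN), "-m", str(model_path), "-f", str(EVAL_TEXT)],
        capture_output=True, text=True, check=True,
    )
    match = re.search(r"Final estimate: PPL = ([0-9.]+)", result.stdout + result.stderr)
    if match is None:
        raise RuntimeError("Could not parse perplexity output")
    return float(match.group(1))

baseline = measure_perplexity(Path("models/model-f16.gguf"))
quantized = measure_perplexity(Path("models/model-Q5_K_M.gguf"))
print(f"f16: {baseline:.3f}  Q5_K_M: {quantized:.3f}  "
      f"degradation: {100 * (quantized / baseline - 1):.1f}%")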

3.2 Resource Optimization

  • Target specific memory constraints (see the estimator sketch below)
  • Optimize for inference latency
  • Balance batch size and throughput
  • Consider GPU vs CPU deployment
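
A rough way to match a quantization level to a memory budget is to estimate weight plus KV-cache size up front. A sketch with approximate bits-per-weight figures for llama.cpp k-quants (the effective bits vary by model and llama.cpp version, so treat the numbers as ballpark):

# memory_estimate.py - ballpark memory footprint for a quantized model.
# Bits-per-weight values are approximate effective sizes; real files differ slightly.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def estimate_memory_gb(
    n_params_b: float,          # parameters in billions, e.g. 7.0
    quant: str,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    kv_bytes: int = 2,          # f16 KV cache
) -> float:
    weights = n_params_b * 1e9 * APPROX_BITS_PER_WEIGHT[quant] / 8
    # K and V caches: 2 tensors per layer, one vector per position per KV head.
    kv_cache = 2 * n_layers * n_ctx * n_kv_heads * head_dim * kv_bytes
    return (weights + kv_cache) / 1024**3

# Example: a 7B Llama-style model (32 layers, 32 KV heads, head_dim 128) at 4k context.
print(f"{estimate_memory_gb(7.0, 'Q5_K_M', 32, 32, 128, 4096):.1f} GiB")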

4. Implementation Workflow (TDD)

Step 1: Write Failing Test First

# tests/test_quantization.py
import pytest
from pathlib import Path

# Assumed import: QuantizationBenchmark and TEST_PROMPTS are provided by the
# project's quantization module (see references/ for the actual implementation).
from quantization import QuantizationBenchmark, TEST_PROMPTS

class TestQuantizationQuality:
    """Test quantized model quality metrics."""

    @pytest.fixture
    def baseline_metrics(self):
        """Baseline metrics from original model."""
        return {
            "perplexity": 5.2,
            "accuracy": 0.95,
            "latency_ms": 100
        }

    def test_perplexity_within_threshold(self, quantized_model, baseline_metrics):
        """Quantized model perplexity within 10% of baseline."""
        benchmark = QuantizationBenchmark(TEST_PROMPTS)
        results = benchmark.benchmark(quantized_model)

        max_perplexity = baseline_metrics["perplexity"] * 1.10
        assert results["perplexity"] <= max_perplexity, \
            f"Perplexity {results['perplexity']} exceeds threshold {max_perplexity}"

    def test_accuracy_maintained(self, quantized_model, test_cases):
        """Critical use cases maintain accuracy."""
        correct = 0
        for prompt, expected in test_cases:
            response = quantized_model(prompt, max_tokens=50)
            if expected.lower() in response["choices"][0]["text"].lower():
                correct += 1

        accuracy = correct / len(test_cases)
        assert accuracy >= 0.90, f"Accuracy {accuracy} below 90% threshold"

    def test_memory_under_limit(self, quantized_model, max_memory_mb):
        """Model fits within memory constraint."""
        import psutil
        process = psutil.Process()
        memory_mb = process.memory_info().rss / (1024 * 1024)

        assert memory_mb <= max_memory_mb, \
            f"Memory {memory_mb}MB exceeds limit {max_memory_mb}MB"

    def test_latency_acceptable(self, quantized_model, baseline_metrics):
        """Inference latency within acceptable range."""
        benchmark = QuantizationBenchmark(TEST_PROMPTS)
        results = benchmark.benchmark(quantized_model)

        # Allow headroom: quantized inference should be no slower than 1.5x baseline
        max_latency = baseline_metrics["latency_ms"] * 1.5
        assert results["latency_ms"] <= max_latency
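
The tests above also rely on fixtures (quantized_model, test_cases, max_memory_mb) that pytest resolves from a conftest.py. A minimal conftest sketch, assuming llama-cpp-python for loading; the model path, test cases, and memory budget here are placeholders, with real values belonging in references/:

# tests/conftest.py - hedged sketch of the fixtures the tests rely on.
import pytest
from llama_cpp import Llama

QUANTIZED_MODEL_PATH = "models/model-Q5_K_M.gguf"   # assumed artifact from Step 2

@pytest.fixture(scope="session")
def quantized_model():
    """Load the quantized GGUF once per test session."""
    return Llama(model_path=QUANTIZED_MODEL_PATH, n_ctx=2048, verbose=False)

@pytest.fixture
def test_cases():
    """(prompt, expected substring) pairs covering critical JARVIS use cases."""
    return [
        ("What is the capital of France? Answer in one word.", "Paris"),
    ]

@pytest.fixture
def max_memory_mb():
    """Memory budget for the deployment tier under test."""
    return 8192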

Step 2: Implement Minimum to Pass

# Implement quantization to make tests pass.
# SecureQuantizer is implemented in references/; models_dir points at the model
# store and llama_cpp_dir at the local llama.cpp checkout.
quantizer = SecureQuantizer(models_dir, llama_cpp_dir)
output = quantizer.quantize(
    input_model="model-f16.gguf",
    output_name="model-Q5_K_M.gguf",
    quantization="Q5_K_M"
)
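
The real SecureQuantizer lives in references/; a minimal sketch of the shape such a wrapper could take, combining the checksum check from section 2 with a subprocess call to llama.cpp's llama-quantize binary (the binary path and the ".sha256" sidecar convention are assumptions):

# secure_quantizer.py - hedged sketch; the actual class is in references/.
import subprocess
from pathlib import Path

from verify_model import verify_model   # checksum helper sketched in section 2

class SecureQuantizer:
    def __init__(self, models_dir: Path, llama_cpp_dir: Path):
        self.models_dir = Path(models_dir)
        self.quantize_bin = Path(llama_cpp_dir) / "build" / "bin" / "llama-quantize"

    def quantize(self, input_model: str, output_name: str, quantization: str) -> Path:
        """Verify the source model's checksum, then emit the quantized GGUF."""
        src = self.models_dir / input_model
        dst = self.models_dir / output_name
        expected = (src.parent / (src.name + ".sha256")).read_text().split()[0]
        verify_model(src, expected)
        subprocess.run(
            [str(self.quantize_bin), str(src), str(dst), quantization],
            check=True,
        )
        return dst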

Step 3: Refactor Following Patterns

  • Apply calibration data selection for better quality (see the imatrix sketch below)
  • Implement layer-wise quantization for sensitive layers
  • Add comprehensive logging and metrics
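
In the llama.cpp ecosystem, calibration data selection usually means building an importance matrix and feeding it to the quantizer. A hedged sketch, assuming the llama-imatrix binary and the --imatrix flag on llama-quantize (both depend on the llama.cpp version), with calibration.txt as a placeholder corpus that should resemble real JARVIS traffic:

# imatrix_quantize.py - hedged sketch of importance-matrix calibrated quantization.
# Binary names, flags, and file paths are assumptions tied to recent llama.cpp builds.
import subprocess
from pathlib import Path

LLAMA_BIN = Path("~/llama.cpp/build/bin").expanduser()

def build_imatrix(f16_model: Path, calibration: Path, out: Path) -> None:
    """Collect activation statistics over the calibration corpus."""
    subprocess.run(
        [str(LLAMA_BIN / "llama-imatrix"),
         "-m", str(f16_model), "-f", str(calibration), "-o", str(out)],
        check=True,
    )

def quantize_with_imatrix(f16_model: Path, out_model: Path, imatrix: Path,
                          level: str = "Q4_K_M") -> None:
    """Quantize while weighting sensitive layers by the importance matrix."""
    subprocess.run(
        [str(LLAMA_BIN / "llama-quantize"),
         "--imatrix", str(imatrix), str(f16_model), str(out_model), level],
        check=True,
    )

build_imatrix(Path("models/model-f16.gguf"), Path("data/calibration.txt"),
              Path("models/imatrix.dat"))
quantize_with_imatrix(Path("models/model-f16.gguf"),
                      Path("models/model-Q4_K_M.gguf"),
                      Path("models/imatrix.dat"))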

Step 4: Run Full Verification

# Run all quantization tests
pytest tests/test_quantization.py -v

# Run with coverage
pytest tests/test_quantization.py --cov=quantization --cov-report=term-missing

# Run only the quality benchmark suite
python -m pytest tests/test_quantization.py::TestQuantizationQuality -v

5. Technical Foundation

5.1 Quantization Levels

| Quantization | Bits | Memory |

...

