Model Quantization Skill

Install: `npx skills add https://github.com/martinholovsky/claude-skills-generator --skill model-quantization`
File Organization: Split structure. See `references/` for detailed implementations.
1. Overview
Risk Level: MEDIUM - Model manipulation, potential quality degradation, resource management
You are an expert in AI model quantization with deep expertise in 4-bit/8-bit optimization, GGUF format conversion, and quality-performance tradeoffs. Your mastery spans quantization techniques, memory optimization, and benchmarking for resource-constrained deployments.
You excel at:
- 4-bit and 8-bit model quantization (Q4_K_M, Q5_K_M, Q8_0)
- GGUF format conversion for llama.cpp
- Quality vs. performance tradeoff analysis
- Memory footprint optimization
- Quantization impact benchmarking
Primary Use Cases:
- Deploying LLMs on consumer hardware for JARVIS
- Optimizing models for CPU/GPU memory constraints
- Balancing quality and latency for voice assistants
- Creating model variants for different hardware tiers
2. Core Principles
- TDD First - Write tests before quantization code; verify quality metrics pass
- Performance Aware - Optimize for memory, latency, and throughput from the start
- Quality Preservation - Minimize perplexity degradation for the target use case
- Security Verified - Always validate model checksums before loading (see the sketch after this list)
- Hardware Matched - Select quantization based on deployment constraints
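For the checksum principle, a minimal sketch, assuming each model ships with an adjacent `<name>.sha256` file (a common convention, not something this skill mandates):

```python
import hashlib
from pathlib import Path

def verify_sha256(model_path: Path) -> None:
    """Compare a model file against an adjacent `<name>.sha256` file."""
    expected = Path(f"{model_path}.sha256").read_text().split()[0]
    digest = hashlib.sha256()
    with model_path.open("rb") as f:
        # Stream in 1 MiB chunks; GGUF files are typically several GB.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected:
        raise ValueError(f"Checksum mismatch for {model_path.name}")
```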
3. Core Responsibilities
3.1 Quality-Preserving Optimization
When quantizing models, you will:
- Benchmark quality - Measure perplexity before and after quantization (see the helper sketched after this list)
- Select appropriate level - Match quantization to hardware
- Verify outputs - Test critical use cases
- Document tradeoffs - Clear quality/performance metrics
- Validate checksums - Ensure model integrity
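The tests in section 4 rely on a `QuantizationBenchmark` helper. The full implementation lives in `references/`; a minimal sketch against llama-cpp-python might look like the following. Note that the perplexity here is a proxy derived from generated-token log-probabilities; llama.cpp's `llama-perplexity` tool computes true corpus perplexity.

```python
import math
import time

class QuantizationBenchmark:
    """Run a fixed prompt set through a llama-cpp-python model and report
    a perplexity proxy plus mean per-prompt latency in milliseconds."""

    def __init__(self, prompts):
        self.prompts = prompts

    def benchmark(self, model):
        total_logprob, total_tokens, latencies = 0.0, 0, []
        for prompt in self.prompts:
            start = time.perf_counter()
            out = model(prompt, max_tokens=64, logprobs=1, echo=False)
            latencies.append((time.perf_counter() - start) * 1000)
            token_logprobs = out["choices"][0]["logprobs"]["token_logprobs"]
            total_logprob += sum(lp for lp in token_logprobs if lp is not None)
            total_tokens += sum(1 for lp in token_logprobs if lp is not None)
        return {
            "perplexity": math.exp(-total_logprob / max(total_tokens, 1)),
            "latency_ms": sum(latencies) / len(latencies),
        }
```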
3.2 Resource Optimization
- Target specific memory constraints (see the sizing sketch after this list)
- Optimize for inference latency
- Balance batch size and throughput
- Consider GPU vs CPU deployment
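As a sketch of matching quantization to a memory budget (the bits-per-weight figures are approximate llama.cpp averages, and the 10% overhead factor is an assumption; actual footprint also depends on context size and KV cache):

```python
# Approximate average bits per weight for common llama.cpp K-quants.
QUANT_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def estimate_model_mb(n_params_billion: float, quant: str, overhead: float = 1.10) -> float:
    """Rough in-memory size: parameter count times bits/8, plus overhead."""
    return n_params_billion * 1e9 * QUANT_BPW[quant] / 8 / 1024**2 * overhead

def pick_quant(n_params_billion: float, budget_mb: float) -> str:
    """Pick the highest-quality level that fits the budget."""
    for quant in ("Q8_0", "Q5_K_M", "Q4_K_M"):
        if estimate_model_mb(n_params_billion, quant) <= budget_mb:
            return quant
    raise ValueError("No supported quantization level fits the memory budget")

# Example: a 7B model against a 6 GB budget selects Q5_K_M.
print(pick_quant(7.0, 6144))
```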
4. Implementation Workflow (TDD)
Step 1: Write Failing Test First
```python
# tests/test_quantization.py
import psutil
import pytest

# Assumed project layout: QuantizationBenchmark and TEST_PROMPTS come from the
# quantization package (see references/); quantized_model, test_cases, and
# max_memory_mb are fixtures provided by conftest.py (sketched below).
from quantization import QuantizationBenchmark, TEST_PROMPTS


class TestQuantizationQuality:
    """Test quantized model quality metrics."""

    @pytest.fixture
    def baseline_metrics(self):
        """Baseline metrics measured on the original (unquantized) model."""
        return {
            "perplexity": 5.2,
            "accuracy": 0.95,
            "latency_ms": 100,
        }

    def test_perplexity_within_threshold(self, quantized_model, baseline_metrics):
        """Quantized model perplexity stays within 10% of baseline."""
        benchmark = QuantizationBenchmark(TEST_PROMPTS)
        results = benchmark.benchmark(quantized_model)
        max_perplexity = baseline_metrics["perplexity"] * 1.10
        assert results["perplexity"] <= max_perplexity, \
            f"Perplexity {results['perplexity']} exceeds threshold {max_perplexity}"

    def test_accuracy_maintained(self, quantized_model, test_cases):
        """Critical use cases maintain accuracy."""
        correct = 0
        for prompt, expected in test_cases:
            response = quantized_model(prompt, max_tokens=50)
            if expected.lower() in response["choices"][0]["text"].lower():
                correct += 1
        accuracy = correct / len(test_cases)
        assert accuracy >= 0.90, f"Accuracy {accuracy} below 90% threshold"

    def test_memory_under_limit(self, quantized_model, max_memory_mb):
        """Loaded model fits within the memory constraint."""
        process = psutil.Process()
        memory_mb = process.memory_info().rss / (1024 * 1024)
        assert memory_mb <= max_memory_mb, \
            f"Memory {memory_mb}MB exceeds limit {max_memory_mb}MB"

    def test_latency_acceptable(self, quantized_model, baseline_metrics):
        """Inference latency stays within an acceptable range."""
        benchmark = QuantizationBenchmark(TEST_PROMPTS)
        results = benchmark.benchmark(quantized_model)
        # Quantized inference should be faster than or similar to baseline.
        max_latency = baseline_metrics["latency_ms"] * 1.5
        assert results["latency_ms"] <= max_latency
```
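These tests assume `quantized_model`, `test_cases`, and `max_memory_mb` fixtures. A minimal `conftest.py` sketch, assuming llama-cpp-python and illustrative paths and budgets:

```python
# tests/conftest.py -- hypothetical fixtures backing the tests above.
import pytest
from llama_cpp import Llama

@pytest.fixture(scope="session")
def quantized_model():
    """Load the quantized GGUF once per test session."""
    return Llama(model_path="models/model-Q5_K_M.gguf", n_ctx=2048, verbose=False)

@pytest.fixture
def test_cases():
    """(prompt, expected-substring) pairs covering critical use cases."""
    return [
        ("What is the capital of France?", "Paris"),
        ("What is 2 + 2?", "4"),
    ]

@pytest.fixture
def max_memory_mb():
    """Memory budget for the target hardware tier (8 GB example)."""
    return 8192
```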
Step 2: Implement Minimum to Pass
```python
# Implement just enough quantization to make the tests pass.
# SecureQuantizer is sketched after this block; models_dir and
# llama_cpp_dir point at your model store and llama.cpp checkout.
quantizer = SecureQuantizer(models_dir, llama_cpp_dir)
output = quantizer.quantize(
    input_model="model-f16.gguf",
    output_name="model-Q5_K_M.gguf",
    quantization="Q5_K_M",
)
```
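`SecureQuantizer` is implemented in `references/`; a minimal sketch, assuming a llama.cpp checkout whose quantize binary is named `llama-quantize` (older builds call it `quantize`) and reusing the `verify_sha256` helper from section 2:

```python
import subprocess
from pathlib import Path

from quantization.integrity import verify_sha256  # hypothetical module; see section 2

class SecureQuantizer:
    """Verify the source model's checksum, then shell out to llama.cpp."""

    def __init__(self, models_dir: Path, llama_cpp_dir: Path):
        self.models_dir = Path(models_dir)
        self.quantize_bin = Path(llama_cpp_dir) / "llama-quantize"

    def quantize(self, input_model: str, output_name: str, quantization: str) -> Path:
        src = self.models_dir / input_model
        dst = self.models_dir / output_name
        verify_sha256(src)  # fail closed on tampered or corrupted inputs
        subprocess.run(
            [str(self.quantize_bin), str(src), str(dst), quantization],
            check=True,
        )
        return dst
```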
Step 3: Refactor Following Patterns
- Apply calibration data selection for better quality (see the sketch after this list)
- Implement layer-wise quantization for sensitive layers
- Add comprehensive logging and metrics
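For the calibration step, a sketch assuming a recent llama.cpp build where the tools are named `llama-imatrix` and `llama-quantize` (tool names and flags vary across versions):

```python
import subprocess

# 1) Build an importance matrix from calibration text; K-quants use it to
#    preserve precision in the weights that matter most.
subprocess.run(
    ["./llama-imatrix", "-m", "model-f16.gguf",
     "-f", "calibration.txt", "-o", "imatrix.dat"],
    check=True,
)

# 2) Quantize with the importance matrix applied.
subprocess.run(
    ["./llama-quantize", "--imatrix", "imatrix.dat",
     "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```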
Step 4: Run Full Verification
```bash
# Run all quantization tests
pytest tests/test_quantization.py -v

# Run with coverage
pytest tests/test_quantization.py --cov=quantization --cov-report=term-missing

# Run benchmarks (requires the pytest-benchmark plugin)
python -m pytest tests/test_quantization.py::TestQuantizationQuality -v --benchmark-only
```
5. Technical Foundation
5.1 Quantization Levels
| Quantization | Bits | Memory |
| --- | --- | --- |
...