npx skills add https://github.com/adaptationio/skrillz --skill bedrock-agentcore-evaluations
Amazon Bedrock AgentCore Evaluations
Overview
AgentCore Evaluations transforms agent testing from "vibes-based" to metric-based quality assurance. Test agents before production, then continuously monitor live interactions using 13 built-in evaluators and custom scoring systems.
Purpose: Ensure AI agents meet quality, safety, and effectiveness standards
Pattern: Task-based (5 operations)
Key Principles (validated by AWS, December 2025):
- Pre-Production Testing - Validate before deployment
- Continuous Monitoring - Sample and score live interactions
- 13 Built-in Evaluators - Standard quality dimensions
- Custom Evaluators - LLM-as-Judge for domain-specific metrics
- Alerting Integration - CloudWatch for proactive monitoring (a sample alarm is sketched after this list)
- On-Demand + Continuous - Both testing modes supported
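As a concrete illustration of the alerting principle, the sketch below creates a CloudWatch alarm on an evaluation score using the standard boto3 CloudWatch API. The namespace, metric name, dimension, and SNS topic are assumptions for illustration only; check which metrics AgentCore Evaluations actually publishes in your account before relying on them.

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# The namespace, metric name, and dimension below are assumptions used for
# illustration; confirm which metrics AgentCore Evaluations publishes in
# your account and adjust accordingly.
cloudwatch.put_metric_alarm(
    AlarmName='agent-correctness-below-target',
    AlarmDescription='Average correctness score dropped below 0.9',
    Namespace='AWS/BedrockAgentCore',   # assumed namespace
    MetricName='EvaluationScore',       # assumed metric name
    Dimensions=[{'Name': 'Evaluator', 'Value': 'CORRECTNESS'}],  # assumed dimension
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.9,
    ComparisonOperator='LessThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:agent-quality-alerts']  # example SNS topic
)
```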
Quality Targets:
- Correctness: ≥90% accuracy
- Helpfulness: ≥85% satisfaction
- Safety: 0 harmful outputs
- Goal Success: ≥80% completion
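The targets above translate directly into a pass/fail gate you can run in CI against evaluation results. This is a minimal sketch, not part of the AgentCore API: the `scores` dict is a hypothetical stand-in for whatever shape your evaluation results take, and the "0 harmful outputs" target is expressed here as a perfect safety score.

```python
# Minimal quality gate against the targets above. The `scores` dict keys are
# hypothetical; adapt them to your evaluation results format.
QUALITY_TARGETS = {
    'correctness': 0.90,
    'helpfulness': 0.85,
    'safety': 1.00,       # "0 harmful outputs" expressed as a perfect safety score
    'goal_success': 0.80,
}

def failed_targets(scores: dict) -> list:
    """Return the names of metrics that fall below their target."""
    return [
        name for name, target in QUALITY_TARGETS.items()
        if scores.get(name, 0.0) < target
    ]

# Example run: helpfulness intentionally misses its target to show the gate firing.
failures = failed_targets({'correctness': 0.93, 'helpfulness': 0.82,
                           'safety': 1.0, 'goal_success': 0.86})
if failures:
    raise SystemExit(f'Quality gate failed: {failures}')
```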
When to Use
Use bedrock-agentcore-evaluations when:
- Testing agents before production deployment
- Monitoring production agent quality continuously
- Setting up quality alerts and dashboards
- Validating tool selection accuracy
- Measuring goal completion rates
- Creating domain-specific quality metrics
When NOT to Use:
- Policy enforcement (use bedrock-agentcore-policy)
- Content filtering (use Bedrock Guardrails)
- Unit testing code (use pytest/jest)
Prerequisites
Required
- Deployed AgentCore agent or test data
- IAM permissions for evaluation operations
- CloudWatch for monitoring integration
Recommended
- Test scenarios documented
- Baseline metrics established
- Alert thresholds defined
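Because the evaluation APIs are comparatively new, it is worth confirming up front that your installed SDK exposes the control-plane client used throughout the examples below (the service name here simply mirrors those examples). A quick sanity check:

```python
import boto3
import botocore.exceptions

# Fail fast if the installed botocore data files do not yet include the
# control-plane service name used in the examples in this document.
try:
    boto3.client('bedrock-agentcore-control', region_name='us-east-1')  # any region works for this check
    print('bedrock-agentcore-control client is available')
except botocore.exceptions.UnknownServiceError as exc:
    raise SystemExit(f'Update boto3/botocore before continuing: {exc}')
```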
The 13 Built-in Evaluators
| # | Evaluator | Purpose | Score Range |
|---|---|---|---|
| 1 | Correctness | Factual accuracy of responses | 0-1 |
| 2 | Helpfulness | Value and usefulness to user | 0-1 |
| 3 | Tool Selection Accuracy | Did agent call correct tool? | 0-1 |
| 4 | Tool Parameter Accuracy | Were tool arguments correct? | 0-1 |
| 5 | Safety | Detection of harmful content | 0-1 |
| 6 | Faithfulness | Grounded in source context | 0-1 |
| 7 | Goal Success Rate | User intent satisfied | 0-1 |
| 8 | Context Relevance | On-topic responses | 0-1 |
| 9 | Coherence | Logical flow | 0-1 |
| 10 | Conciseness | Brevity and efficiency | 0-1 |
| 11 | Stereotype Harm | Bias detection | 0-1 (lower=better) |
| 12 | Maliciousness | Intent to harm | 0-1 (lower=better) |
| 13 | Self-Harm | Self-harm content detection | 0-1 (lower=better) |
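Note the inverted orientation of the last three evaluators: for them, a low score is good. When you aggregate or threshold scores in your own code, it helps to normalize everything onto a single higher-is-better scale. The helper below is a convention of this sketch, not part of the service, and the constant names for rows 11-13 are assumed from the table and the naming pattern of the other evaluators.

```python
# Evaluators where a LOWER raw score is better (rows 11-13 in the table).
# These constant names are assumed from the table, not confirmed API values.
LOWER_IS_BETTER = {'STEREOTYPE_HARM', 'MALICIOUSNESS', 'SELF_HARM'}

def normalized_score(evaluator_name: str, raw_score: float) -> float:
    """Map any evaluator onto a 0-1 scale where higher is always better."""
    if evaluator_name in LOWER_IS_BETTER:
        return 1.0 - raw_score
    return raw_score

assert normalized_score('CORRECTNESS', 0.92) == 0.92
assert normalized_score('MALICIOUSNESS', 0.25) == 0.75
```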
Operations
Operation 1: Create Evaluators
Time: 5-10 minutes | Automation: 90% | Purpose: Configure built-in evaluators for your agent
Create Built-in Evaluator:
import boto3

control = boto3.client('bedrock-agentcore-control')

# Create correctness evaluator
response = control.create_evaluator(
    name='correctness-evaluator',
    description='Evaluates factual accuracy of agent responses',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'CORRECTNESS',
        'scoringThreshold': 0.8  # Flag if below 80%
    }
)
correctness_evaluator_id = response['evaluatorId']

# Create safety evaluator
response = control.create_evaluator(
    name='safety-evaluator',
    description='Detects harmful or unsafe content',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'SAFETY',
        'scoringThreshold': 0.95  # Must be 95%+ safe
    }
)
safety_evaluator_id = response['evaluatorId']

# Create tool selection evaluator
response = control.create_evaluator(
    name='tool-selection-evaluator',
    description='Validates correct tool selection',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'TOOL_SELECTION_ACCURACY',
        'scoringThreshold': 0.9
    }
)
tool_evaluator_id = response['evaluatorId']
Create All Standard Evaluators:
built_in_evaluators = [
    ('CORRECTNESS', 0.8),
    ('HELPFULNESS', 0.85),
    ('TOOL_SELECTION_ACCURACY', 0.9),
    ('TOOL_PARAMETER_ACCURACY', 0.9),
    ('SAFETY', 0.95),
    ('FAITHFULNESS', 0.8),
    ('GOAL_SUCCESS_RATE', 0.8),
    ('CONTEXT_RELEVANCE', 0.85),
    ('COHERENCE', 0.85),
    ('CONCISENESS', 0.7)
]

evaluator_ids = []
for evaluator_name, threshold in built_in_evaluators:
    response = control.create_evaluator(
        name=f'{evaluator_name.lower().replace("_", "-")}-evaluator',
        description=f'Built-in {evaluator_name} evaluator',
        evaluatorType='BUILT_IN',
        builtInConfig={
            'evaluatorName': evaluator_name,
            'scoringThreshold': threshold
        }
    )
    evaluator_ids.append(response['evaluatorId'])
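The loop above keeps only a flat list of IDs. In practice you usually want to look evaluators up by name later, and to be able to re-run the setup without aborting on the first failure. The variant below is a sketch of that: it continues from the setup above (reusing `control` and `built_in_evaluators`), and only assumes that a duplicate or failed create surfaces as a botocore `ClientError`, which is logged and skipped rather than matched against a specific error code.

```python
from botocore.exceptions import ClientError

# Continues from the setup above: reuses `control` and `built_in_evaluators`.
evaluator_ids_by_name = {}
for evaluator_name, threshold in built_in_evaluators:
    try:
        response = control.create_evaluator(
            name=f'{evaluator_name.lower().replace("_", "-")}-evaluator',
            description=f'Built-in {evaluator_name} evaluator',
            evaluatorType='BUILT_IN',
            builtInConfig={
                'evaluatorName': evaluator_name,
                'scoringThreshold': threshold
            }
        )
        evaluator_ids_by_name[evaluator_name] = response['evaluatorId']
    except ClientError as err:
        # Assumption: a duplicate or failed create raises a ClientError.
        # Log and continue so one failure does not abort the whole batch.
        print(f'Skipping {evaluator_name}: {err}')

print(evaluator_ids_by_name)
```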
Operation 2: Custom LLM-as-Judge Evaluators
Time: 10-15 minutes | Automation: 80% | Purpose: Create domain-specific quality metrics
Custom Evaluator for Brand Tone:
...