npx skills add https://github.com/adaptationio/skrillz --skill bedrock-agentcore-evaluations
Amazon Bedrock AgentCore Evaluations
Overview
AgentCore Evaluations transforms agent testing from "vibes-based" to metric-based quality assurance. Test agents before production, then continuously monitor live interactions using 13 built-in evaluators and custom scoring systems.
Purpose: Ensure AI agents meet quality, safety, and effectiveness standards
Pattern: Task-based (5 operations)
Key Principles (validated by AWS, December 2025):
- Pre-Production Testing - Validate before deployment
- Continuous Monitoring - Sample and score live interactions
- 13 Built-in Evaluators - Standard quality dimensions
- Custom Evaluators - LLM-as-Judge for domain-specific metrics
- Alerting Integration - CloudWatch for proactive monitoring (a sample alarm is sketched after this list)
- On-Demand + Continuous - Both testing modes supported
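As a concrete illustration of the alerting principle, the sketch below creates a CloudWatch alarm on an evaluation score using the standard boto3 CloudWatch API. The namespace, metric name, dimension, and SNS topic are assumptions for illustration only; check which metrics AgentCore Evaluations actually publishes in your account before relying on them.

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# The namespace, metric name, and dimension below are assumptions used for
# illustration; confirm which metrics AgentCore Evaluations publishes in
# your account and adjust accordingly.
cloudwatch.put_metric_alarm(
    AlarmName='agent-correctness-below-target',
    AlarmDescription='Average correctness score dropped below 0.9',
    Namespace='AWS/BedrockAgentCore',   # assumed namespace
    MetricName='EvaluationScore',       # assumed metric name
    Dimensions=[{'Name': 'Evaluator', 'Value': 'CORRECTNESS'}],  # assumed dimension
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.9,
    ComparisonOperator='LessThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:agent-quality-alerts']  # example SNS topic
)
```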
Quality Targets:
- Correctness: ≥90% accuracy
- Helpfulness: ≥85% satisfaction
- Safety: 0 harmful outputs
- Goal Success: ≥80% completion
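The targets above translate directly into a pass/fail gate you can run in CI against evaluation results. This is a minimal sketch, not part of the AgentCore API: the `scores` dict is a hypothetical stand-in for whatever shape your evaluation results take, and the "0 harmful outputs" target is expressed here as a perfect safety score.

```python
# Minimal quality gate against the targets above. The `scores` dict keys are
# hypothetical; adapt them to your evaluation results format.
QUALITY_TARGETS = {
    'correctness': 0.90,
    'helpfulness': 0.85,
    'safety': 1.00,       # "0 harmful outputs" expressed as a perfect safety score
    'goal_success': 0.80,
}

def failed_targets(scores: dict) -> list:
    """Return the names of metrics that fall below their target."""
    return [
        name for name, target in QUALITY_TARGETS.items()
        if scores.get(name, 0.0) < target
    ]

# Example run: helpfulness intentionally misses its target to show the gate firing.
failures = failed_targets({'correctness': 0.93, 'helpfulness': 0.82,
                           'safety': 1.0, 'goal_success': 0.86})
if failures:
    raise SystemExit(f'Quality gate failed: {failures}')
```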
When to Use
Use bedrock-agentcore-evaluations when:
- Testing agents before production deployment
- Monitoring production agent quality continuously
- Setting up quality alerts and dashboards
- Validating tool selection accuracy
- Measuring goal completion rates
- Creating domain-specific quality metrics
When NOT to Use:
- Policy enforcement (use bedrock-agentcore-policy)
- Content filtering (use Bedrock Guardrails)
- Unit testing code (use pytest/jest)
Prerequisites
Required
- Deployed AgentCore agent or test data
- IAM permissions for evaluation operations
- CloudWatch for monitoring integration
Recommended
- Test scenarios documented
- Baseline metrics established
- Alert thresholds defined
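Because the evaluation APIs are comparatively new, it is worth confirming up front that your installed SDK exposes the control-plane client used throughout the examples below (the service name here simply mirrors those examples). A quick sanity check:

```python
import boto3
import botocore.exceptions

# Fail fast if the installed botocore data files do not yet include the
# control-plane service name used in the examples in this document.
try:
    boto3.client('bedrock-agentcore-control', region_name='us-east-1')  # any region works for this check
    print('bedrock-agentcore-control client is available')
except botocore.exceptions.UnknownServiceError as exc:
    raise SystemExit(f'Update boto3/botocore before continuing: {exc}')
```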
The 13 Built-in Evaluators
| # | Evaluator | Purpose | Score Range |
|---|---|---|---|
| 1 | Correctness | Factual accuracy of responses | 0-1 |
| 2 | Helpfulness | Value and usefulness to user | 0-1 |
| 3 | Tool Selection Accuracy | Did agent call correct tool? | 0-1 |
| 4 | Tool Parameter Accuracy | Were tool arguments correct? | 0-1 |
| 5 | Safety | Detection of harmful content | 0-1 |
| 6 | Faithfulness | Grounded in source context | 0-1 |
| 7 | Goal Success Rate | User intent satisfied | 0-1 |
| 8 | Context Relevance | On-topic responses | 0-1 |
| 9 | Coherence | Logical flow | 0-1 |
| 10 | Conciseness | Brevity and efficiency | 0-1 |
| 11 | Stereotype Harm | Bias detection | 0-1 (lower=better) |
| 12 | Maliciousness | Intent to harm | 0-1 (lower=better) |
| 13 | Self-Harm | Self-harm content detection | 0-1 (lower=better) |
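Note the inverted orientation of the last three evaluators: for them, a low score is good. When you aggregate or threshold scores in your own code, it helps to normalize everything onto a single higher-is-better scale. The helper below is a convention of this sketch, not part of the service, and the constant names for rows 11-13 are assumed from the table and the naming pattern of the other evaluators.

```python
# Evaluators where a LOWER raw score is better (rows 11-13 in the table).
# These constant names are assumed from the table, not confirmed API values.
LOWER_IS_BETTER = {'STEREOTYPE_HARM', 'MALICIOUSNESS', 'SELF_HARM'}

def normalized_score(evaluator_name: str, raw_score: float) -> float:
    """Map any evaluator onto a 0-1 scale where higher is always better."""
    if evaluator_name in LOWER_IS_BETTER:
        return 1.0 - raw_score
    return raw_score

assert normalized_score('CORRECTNESS', 0.92) == 0.92
assert normalized_score('MALICIOUSNESS', 0.25) == 0.75
```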
Operations
Operation 1: Create Evaluators
Time: 5-10 minutes | Automation: 90% | Purpose: Configure built-in evaluators for your agent
Create Built-in Evaluator:
import boto3

control = boto3.client('bedrock-agentcore-control')

# Create correctness evaluator
response = control.create_evaluator(
    name='correctness-evaluator',
    description='Evaluates factual accuracy of agent responses',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'CORRECTNESS',
        'scoringThreshold': 0.8  # Flag if below 80%
    }
)
correctness_evaluator_id = response['evaluatorId']

# Create safety evaluator
response = control.create_evaluator(
    name='safety-evaluator',
    description='Detects harmful or unsafe content',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'SAFETY',
        'scoringThreshold': 0.95  # Must be 95%+ safe
    }
)
safety_evaluator_id = response['evaluatorId']

# Create tool selection evaluator
response = control.create_evaluator(
    name='tool-selection-evaluator',
    description='Validates correct tool selection',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'TOOL_SELECTION_ACCURACY',
        'scoringThreshold': 0.9
    }
)
tool_evaluator_id = response['evaluatorId']
Create All Standard Evaluators:
built_in_evaluators = [
    ('CORRECTNESS', 0.8),
    ('HELPFULNESS', 0.85),
    ('TOOL_SELECTION_ACCURACY', 0.9),
    ('TOOL_PARAMETER_ACCURACY', 0.9),
    ('SAFETY', 0.95),
    ('FAITHFULNESS', 0.8),
    ('GOAL_SUCCESS_RATE', 0.8),
    ('CONTEXT_RELEVANCE', 0.85),
    ('COHERENCE', 0.85),
    ('CONCISENESS', 0.7)
]

evaluator_ids = []
for evaluator_name, threshold in built_in_evaluators:
    response = control.create_evaluator(
        name=f'{evaluator_name.lower().replace("_", "-")}-evaluator',
        description=f'Built-in {evaluator_name} evaluator',
        evaluatorType='BUILT_IN',
        builtInConfig={
            'evaluatorName': evaluator_name,
            'scoringThreshold': threshold
        }
    )
    evaluator_ids.append(response['evaluatorId'])
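The loop above keeps only a flat list of IDs. In practice you usually want to look evaluators up by name later, and to be able to re-run the setup without aborting on the first failure. The variant below is a sketch of that: it continues from the setup above (reusing `control` and `built_in_evaluators`), and only assumes that a duplicate or failed create surfaces as a botocore `ClientError`, which is logged and skipped rather than matched against a specific error code.

```python
from botocore.exceptions import ClientError

# Continues from the setup above: reuses `control` and `built_in_evaluators`.
evaluator_ids_by_name = {}
for evaluator_name, threshold in built_in_evaluators:
    try:
        response = control.create_evaluator(
            name=f'{evaluator_name.lower().replace("_", "-")}-evaluator',
            description=f'Built-in {evaluator_name} evaluator',
            evaluatorType='BUILT_IN',
            builtInConfig={
                'evaluatorName': evaluator_name,
                'scoringThreshold': threshold
            }
        )
        evaluator_ids_by_name[evaluator_name] = response['evaluatorId']
    except ClientError as err:
        # Assumption: a duplicate or failed create raises a ClientError.
        # Log and continue so one failure does not abort the whole batch.
        print(f'Skipping {evaluator_name}: {err}')

print(evaluator_ids_by_name)
```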
Operation 2: Custom LLM-as-Judge Evaluators
Time: 10-15 minutes | Automation: 80% | Purpose: Create domain-specific quality metrics
Custom Evaluator for Brand Tone:
...