bedrock-agentcore-evaluations

from adaptationio/skrillz

npx skills add https://github.com/adaptationio/skrillz --skill bedrock-agentcore-evaluations

SKILL.md

Amazon Bedrock AgentCore Evaluations

Overview

AgentCore Evaluations transforms agent testing from "vibes-based" to metric-based quality assurance. Test agents before production, then continuously monitor live interactions using 13 built-in evaluators and custom scoring systems.

Purpose: Ensure AI agents meet quality, safety, and effectiveness standards

Pattern: Task-based (5 operations)

Key Principles (validated by AWS December 2025):

  1. Pre-Production Testing - Validate before deployment
  2. Continuous Monitoring - Sample and score live interactions
  3. 13 Built-in Evaluators - Standard quality dimensions
  4. Custom Evaluators - LLM-as-Judge for domain-specific metrics
  5. Alerting Integration - CloudWatch for proactive monitoring (see the alarm sketch after this list)
  6. On-Demand + Continuous - Both testing modes supported
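
The alerting principle can be wired up with a standard CloudWatch alarm. A minimal sketch using boto3's put_metric_alarm; the metric namespace, metric name, dimension, and SNS topic ARN are assumptions about how you publish evaluation scores, not a documented AgentCore Evaluations metric schema:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when the average correctness score drops below its threshold.
# Namespace, metric name, and dimension are assumed/custom values.
cloudwatch.put_metric_alarm(
    AlarmName='agent-correctness-below-threshold',
    Namespace='Custom/AgentEvaluations',   # assumed custom namespace
    MetricName='CorrectnessScore',         # assumed metric name
    Dimensions=[{'Name': 'AgentId', 'Value': 'my-agent'}],
    Statistic='Average',
    Period=300,                # 5-minute windows
    EvaluationPeriods=3,       # 3 consecutive breaches before alarming
    Threshold=0.8,
    ComparisonOperator='LessThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:agent-quality-alerts']  # hypothetical SNS topic
)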

Quality Targets:

  • Correctness: ≥90% accuracy
  • Helpfulness: ≥85% satisfaction
  • Safety: 0 harmful outputs
  • Goal Success: ≥80% completion
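
The targets above can be turned into a simple pre-deployment gate. A minimal sketch in plain Python, assuming you have already aggregated scores from an evaluation run (how those scores are collected is outside this sketch):

# Quality targets from above, expressed as a deployment gate.
QUALITY_TARGETS = {
    'correctness': 0.90,    # >= 90% accuracy
    'helpfulness': 0.85,    # >= 85% satisfaction
    'goal_success': 0.80,   # >= 80% completion
}
MAX_HARMFUL_OUTPUTS = 0     # safety target: zero harmful outputs

def passes_quality_gate(scores: dict, harmful_output_count: int) -> bool:
    """Return True only if every target is met; otherwise block the release."""
    if harmful_output_count > MAX_HARMFUL_OUTPUTS:
        return False
    return all(scores.get(metric, 0.0) >= target
               for metric, target in QUALITY_TARGETS.items())

# Example with hypothetical aggregated scores:
print(passes_quality_gate(
    {'correctness': 0.93, 'helpfulness': 0.87, 'goal_success': 0.82},
    harmful_output_count=0
))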

When to Use

Use bedrock-agentcore-evaluations when:

  • Testing agents before production deployment
  • Monitoring production agent quality continuously
  • Setting up quality alerts and dashboards
  • Validating tool selection accuracy
  • Measuring goal completion rates
  • Creating domain-specific quality metrics

When NOT to Use:

  • Policy enforcement (use bedrock-agentcore-policy)
  • Content filtering (use Bedrock Guardrails)
  • Unit testing code (use pytest/jest)

Prerequisites

Required

  • Deployed AgentCore agent or test data
  • IAM permissions for evaluation operations
  • CloudWatch for monitoring integration

Recommended

  • Test scenarios documented
  • Baseline metrics established
  • Alert thresholds defined

The 13 Built-in Evaluators

| # | Evaluator | Purpose | Score Range |
|---|-----------|---------|-------------|
| 1 | Correctness | Factual accuracy of responses | 0-1 |
| 2 | Helpfulness | Value and usefulness to user | 0-1 |
| 3 | Tool Selection Accuracy | Did agent call correct tool? | 0-1 |
| 4 | Tool Parameter Accuracy | Were tool arguments correct? | 0-1 |
| 5 | Safety | Detection of harmful content | 0-1 |
| 6 | Faithfulness | Grounded in source context | 0-1 |
| 7 | Goal Success Rate | User intent satisfied | 0-1 |
| 8 | Context Relevance | On-topic responses | 0-1 |
| 9 | Coherence | Logical flow | 0-1 |
| 10 | Conciseness | Brevity and efficiency | 0-1 |
| 11 | Stereotype Harm | Bias detection | 0-1 (lower = better) |
| 12 | Maliciousness | Intent to harm | 0-1 (lower = better) |
| 13 | Self-Harm | Self-harm content detection | 0-1 (lower = better) |
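
Note that evaluators 11-13 use an inverted scale, so their scores cannot be averaged naively with the others. A small sketch that normalizes everything to one higher-is-better scale; the harm evaluator identifiers are assumed to follow the same naming convention as the built-in names used in Operation 1:

# Evaluators where a LOWER raw score is better (harm/bias detectors).
INVERTED_EVALUATORS = {'STEREOTYPE_HARM', 'MALICIOUSNESS', 'SELF_HARM'}  # assumed identifiers

def normalized_score(evaluator_name: str, raw_score: float) -> float:
    """Return a 0-1 score where higher is always better."""
    if evaluator_name in INVERTED_EVALUATORS:
        return 1.0 - raw_score   # e.g. 0.02 maliciousness -> 0.98 "clean"
    return raw_score

# Example: aggregate a mixed batch of results into one average.
results = {'CORRECTNESS': 0.92, 'HELPFULNESS': 0.88, 'MALICIOUSNESS': 0.02}
average = sum(normalized_score(name, score) for name, score in results.items()) / len(results)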

Operations

Operation 1: Create Evaluators

Time: 5-10 minutes | Automation: 90% | Purpose: Configure built-in evaluators for your agent

Create Built-in Evaluator:

import boto3

control = boto3.client('bedrock-agentcore-control')

# Create correctness evaluator
response = control.create_evaluator(
    name='correctness-evaluator',
    description='Evaluates factual accuracy of agent responses',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'CORRECTNESS',
        'scoringThreshold': 0.8  # Flag if below 80%
    }
)
correctness_evaluator_id = response['evaluatorId']

# Create safety evaluator
response = control.create_evaluator(
    name='safety-evaluator',
    description='Detects harmful or unsafe content',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'SAFETY',
        'scoringThreshold': 0.95  # Must be 95%+ safe
    }
)
safety_evaluator_id = response['evaluatorId']

# Create tool selection evaluator
response = control.create_evaluator(
    name='tool-selection-evaluator',
    description='Validates correct tool selection',
    evaluatorType='BUILT_IN',
    builtInConfig={
        'evaluatorName': 'TOOL_SELECTION_ACCURACY',
        'scoringThreshold': 0.9
    }
)
tool_evaluator_id = response['evaluatorId']

Create All Standard Quality Evaluators (the three inverted harm evaluators from the table are omitted here):

built_in_evaluators = [
    ('CORRECTNESS', 0.8),
    ('HELPFULNESS', 0.85),
    ('TOOL_SELECTION_ACCURACY', 0.9),
    ('TOOL_PARAMETER_ACCURACY', 0.9),
    ('SAFETY', 0.95),
    ('FAITHFULNESS', 0.8),
    ('GOAL_SUCCESS_RATE', 0.8),
    ('CONTEXT_RELEVANCE', 0.85),
    ('COHERENCE', 0.85),
    ('CONCISENESS', 0.7)
]

evaluator_ids = []
for evaluator_name, threshold in built_in_evaluators:
    response = control.create_evaluator(
        name=f'{evaluator_name.lower().replace("_", "-")}-evaluator',
        description=f'Built-in {evaluator_name} evaluator',
        evaluatorType='BUILT_IN',
        builtInConfig={
            'evaluatorName': evaluator_name,
            'scoringThreshold': threshold
        }
    )
    evaluator_ids.append(response['evaluatorId'])
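
When creating all of the evaluators in one pass, it can help to keep a name-to-ID map and tolerate individual failures instead of aborting the whole loop. A sketch reusing the same create_evaluator call from above; the helper name and error-handling policy are illustrative, not part of the skill:

from botocore.exceptions import ClientError

def create_evaluator_safely(control, name, description, evaluator_name, threshold):
    """Create a built-in evaluator, returning its ID or None if the call fails."""
    try:
        response = control.create_evaluator(
            name=name,
            description=description,
            evaluatorType='BUILT_IN',
            builtInConfig={
                'evaluatorName': evaluator_name,
                'scoringThreshold': threshold
            }
        )
        return response['evaluatorId']
    except ClientError as error:
        print(f'Failed to create {name}: {error}')  # log and continue
        return None

evaluators_by_name = {}
for evaluator_name, threshold in built_in_evaluators:
    evaluator_id = create_evaluator_safely(
        control,
        name=f'{evaluator_name.lower().replace("_", "-")}-evaluator',
        description=f'Built-in {evaluator_name} evaluator',
        evaluator_name=evaluator_name,
        threshold=threshold
    )
    if evaluator_id:
        evaluators_by_name[evaluator_name] = evaluator_id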

Operation 2: Custom LLM-as-Judge Evaluators

Time: 10-15 minutes | Automation: 80% | Purpose: Create domain-specific quality metrics

Custom Evaluator for Brand Tone:
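
A minimal sketch of what such a request might look like, mirroring the builtInConfig shape from Operation 1. The CUSTOM evaluator type, the customLlmAsJudgeConfig field, and its keys are assumptions rather than confirmed create_evaluator syntax, and the judge model ID is only an example:

# ASSUMED request shape -- verify field names against the AgentCore Evaluations docs.
response = control.create_evaluator(
    name='brand-tone-evaluator',
    description='Scores whether responses match our brand voice guidelines',
    evaluatorType='CUSTOM',                    # assumed counterpart to BUILT_IN
    customLlmAsJudgeConfig={                   # assumed field name
        'modelId': 'anthropic.claude-3-5-sonnet-20241022-v2:0',  # hypothetical judge model
        'instructions': (
            'Rate from 0 to 1 how well the response matches a friendly, concise, '
            'jargon-free brand voice. 1 = fully on-brand, 0 = off-brand.'
        ),
        'scoringThreshold': 0.85
    }
)
brand_tone_evaluator_id = response['evaluatorId']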


...