scorable-integration

from root-signals/scorable-skills

Skills for using the Scorable evaluation platform

npx skills add https://github.com/root-signals/scorable-skills --skill scorable-integration

SKILL.md

Scorable Integration

Guides integration of Scorable's LLM-as-a-Judge evaluation system into codebases with LLM interactions. Scorable creates custom evaluators (judges) that assess LLM outputs for quality, safety, and policy adherence.

Overview

Your role is to:

  1. Analyze the codebase to identify LLM interactions
  2. Create judges via Scorable API to evaluate those interactions
  3. Integrate judge execution into the code at appropriate points
  4. Provide usage documentation for the evaluation setup

Workflow

Step 1: Analyze the Application

Examine the codebase to understand:

  • What LLM interactions exist (prompts, completions, agent calls)
  • What the application does at each interaction point
  • Which interactions are most critical to evaluate

If multiple LLM interactions exist, help the user prioritize them and recommend starting with the most critical one.

Step 2: Get Scorable API Key

Ask the user if they want to:

  • Create a temporary API key (for quick testing); warn that the judge will be public and visible to everyone
  • Use an existing API key (for production)

Creating a Temporary API Key

curl --request POST \
  --url https://api.scorable.ai/create-demo-user/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json'

The response includes an api_key field (see the Python sketch after this list). Warn the user that:

  • The judge will be public and visible to everyone
  • The key only works for a limited time
  • For private judges, they should create a permanent key at https://scorable.ai/register
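
For a Python codebase, the same call can be made with the requests library. This is a minimal sketch, assuming requests is installed; it relies only on the endpoint and the api_key response field shown above:

import requests

# Minimal sketch: request a temporary demo key from the documented endpoint.
resp = requests.post(
    "https://api.scorable.ai/create-demo-user/",
    headers={"accept": "application/json", "content-type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
api_key = resp.json()["api_key"]  # field documented above
print("Temporary Scorable API key:", api_key)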

Step 3: Generate a Judge

Call /v1/judges/generate/ with a detailed intent string describing what to evaluate.

Intent String Guidelines

  • Describe the application context and what you're evaluating
  • Mention the specific execution point (stage name)
  • Include critical quality dimensions you care about
  • Add examples, documentation links, or policies if relevant
  • Be specific and detailed (multiple sentences/paragraphs are good)
  • Code-level details (frameworks, libraries, etc.) do not need to be mentioned

Example Request

curl --request POST \
  --url https://api.scorable.ai/v1/judges/generate/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --header 'Authorization: Api-Key <SCORABLE_API_KEY>' \
  --data '{
    "visibility": "unlisted",
    "intent": "An email automation system that creates summary emails using an LLM based on database query results and user input. Evaluate the LLM output for: accuracy in summarizing data, appropriate tone for the audience, inclusion of all key information from queries, proper formatting, and absence of hallucinations. The system is used for customer-facing communications.",
    "generating_model_params": {
      "temperature": 0.2,
      "reasoning_effort": "medium"
    }
  }'

Note: This can take up to 2 minutes to complete.
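
The same request from Python, as a minimal sketch assuming the requests library; it mirrors the curl call above and allows for the long generation time:

import requests

SCORABLE_API_KEY = "<SCORABLE_API_KEY>"  # key from Step 2

# Minimal sketch of the generate request; generation can take up to ~2 minutes,
# so use a generous timeout.
resp = requests.post(
    "https://api.scorable.ai/v1/judges/generate/",
    headers={
        "accept": "application/json",
        "content-type": "application/json",
        "Authorization": f"Api-Key {SCORABLE_API_KEY}",
    },
    json={
        "visibility": "unlisted",
        "intent": "An email automation system that creates summary emails ... (detailed intent as above)",
        "generating_model_params": {"temperature": 0.2, "reasoning_effort": "medium"},
    },
    timeout=180,
)
result = resp.json()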

Handling API Responses

1. missing_context_from_system_goal - Additional context needed:

{
  "missing_context_from_system_goal": [
    {
      "form_field_name": "target_audience",
      "form_field_description": "The intended audience for the content"
    }
  ]
}

Ask the user for these details (if not evident from the codebase), then call /v1/judges/generate/ again with:

{
  "judge_id": "existing-judge-id",
  "stage": "Stage name",
  "extra_contexts": {
    "target_audience": "Enterprise customers"
  }
}

2. multiple_stages - Judge detected multiple evaluation points:

{
  "error_code": "multiple_stages",
  "stages": ["Stage 1", "Stage 2", "Stage 3"]
}

Ask the user which stage to focus on, or if they have a custom stage name. Each judge evaluates one stage. You can create additional judges later for other stages.

3. Success - Judge created:

{
  "judge_id": "abc123...",
  "evaluator_details": [...]
}

Proceed to integration.
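
A minimal sketch of handling the three cases in Python, assuming result is the parsed JSON from the generate call above; it branches only on the fields shown in the documented responses:

# Branch on the documented response shapes.
if "missing_context_from_system_goal" in result:
    # Gather the requested details (from the user or the codebase), then call
    # /v1/judges/generate/ again with judge_id, stage and extra_contexts.
    for field in result["missing_context_from_system_goal"]:
        print("Need context:", field["form_field_name"], "-", field["form_field_description"])
elif result.get("error_code") == "multiple_stages":
    # One judge per stage: ask the user which stage to focus on.
    print("Choose a stage:", ", ".join(result["stages"]))
else:
    judge_id = result["judge_id"]
    print("Judge created:", judge_id)  # proceed to Step 4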

Step 4: Integrate Judge Execution

Add code to evaluate LLM outputs at the appropriate execution point(s).

Check whether the codebase uses a framework that has integration instructions in the Scorable docs (use curl to fetch https://docs.scorable.ai/llms.txt if needed).

Language-Specific Integration

Choose the appropriate integration guide based on the codebase's language.

Integration Points

  • Insert evaluation code where LLM outputs are generated (e.g., after OpenAI response calls)
  • response parameter: The text you want to evaluate
  • request parameter: The input that prompted the response
  • messages parameter: The multi-turn conversation history, if applicable (see the sketch below)
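
A hypothetical Python sketch of where the evaluation call sits, assuming an OpenAI-based codebase. evaluate_with_judge is a placeholder stub, not a Scorable API; the real execution call should come from the Scorable docs for your language:

from openai import OpenAI

client = OpenAI()

def evaluate_with_judge(judge_id, request, response, messages=None):
    # Placeholder stub: replace the body with the judge execution call from
    # the Scorable docs (https://docs.scorable.ai/llms.txt) for your language.
    print(f"TODO: run judge {judge_id} on the response above")

user_prompt = "Summarize this week's sales figures for the customer."
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": user_prompt}],
)
llm_output = completion.choices[0].message.content

# Evaluate immediately after the LLM output is generated.
evaluate_with_judge(judge_id="abc123", request=user_prompt, response=llm_output)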

...

