Evals
MCP evaluations (evals) provide a way to test MCP server functionality and behavior. Define scenarios in which the server will be used, add a scorer to grade how well the server was used, and validate that the server works as it should before it’s deployed.
Evals look like tests, but they have more to offer. Tests are solely “defensive”: they help ensure updates don’t break existing functionality. Evals, however, are also “offensive”: they establish a baseline to improve upon, allowing MCP server developers to verify that their changes made the server better.
Scenarios
Each scenario requires a name and an input (prompt).
You attach a scorer (see the Scorers section) to the scenario to define what a successful run looks like.
When a scenario runs, the input is sent to an MCP client that has access to your MCP server. The client will then begin to call tools (or not!) in response to the prompt.
After the run completes, the scorer assigns the run a value between 0 and 1, and you can inspect the result to see what worked (and what didn’t work).
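Purely for illustration, a scenario can be thought of as a name, an input prompt, and an attached scorer. The object below is a conceptual sketch with hypothetical field names, not the product’s actual scenario schema.

```js
// Conceptual sketch only — field names are hypothetical, not the real scenario schema.
const scenario = {
  name: "weather lookup",
  input: "What's the weather in Paris right now?",
  scorer: "default", // grades the run with a value between 0 and 1
};
```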
The Runs tab provides access to all previously executed evals. Each scenario run displays:
- Score
- Tools called
- Full transcript of the run
- Cost
Limits
You may execute up to 100 scenario runs per day.
Scorers
Scorers appear in the Scorers section in the left sidebar.
When no custom scorer exists, scenarios use the default scorer, which simply checks whether a tool or list of tools was called during the run.
Custom scorers require a name and an optional description. Two scorer types are available:
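Conceptually, the default scorer’s check resembles the sketch below, written in the code-scorer form described later on this page. This is not the built-in implementation; treating `expected` as the list of required tool names is an assumption.

```js
// Conceptual sketch of the default scorer's check — not the actual built-in implementation.
async ({ input, expected, output }) => {
  // Assumption: `expected` holds the tool name(s) the scenario requires.
  const required = Array.isArray(expected) ? expected : [expected];
  const called = output.allToolCalls.map((tc) => tc.toolName);
  // Score 1 only if every required tool was called at least once.
  return required.every((name) => called.includes(name)) ? 1 : 0;
};
```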
Code Scorers
Code scorers implement custom JavaScript functions that programmatically evaluate scenario outcomes and return numeric scores between 0 and 1. This approach enables deterministic, rule-based evaluation where correctness is easy to compute.
When to use Code scorers
- Exact matching: Verifying specific tool calls were made with correct parameters
- Pattern matching: Using regex or string operations to verify content (see the sketch after this list)
- Logical conditions: Implementing if/else rules based on multiple criteria
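As an example of pattern matching, the sketch below checks the agent’s final text for a confirmation-code format. The regex and the idea of a confirmation code are illustrative assumptions, not part of the product.

```js
// Illustrative pattern-matching scorer — the confirmation-code format is a made-up assumption.
async ({ input, expected, output }) => {
  // Did the agent's final text include a confirmation code like "ABC-1234"?
  return /[A-Z]{3}-\d{4}/.test(output.text) ? 1 : 0;
};
```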
Function signature
```js
async ({ input, expected, output }) => {
  // Your scorer logic here
  // Return a number between 0 and 1
  return 0.5;
};
```

Note: `console.log()` information will be displayed in the scenario run details panel.
Parameters:
- `input`: The user’s input/prompt that initiated the scenario
- `expected`: The expected outcome defined in the scenario configuration
- `output`: The actual output from the agent/MCP server interaction
The `output` parameter contains a transcript of what the MCP client did after being prompted with the input.
The type definition is reproduced below. Note that this data structure accounts for tools invoked with “code mode”, a soon-to-be-released feature for running evals.
```ts
type ScorerOutput = {
  text: string;
  transcript: TranscriptData;
  toolCalls: EvalToolCall[];
  allToolCalls: EvalToolCall[];
};

export type TranscriptData = {
  prompt: string;
  steps: Array<{
    text?: string;
    finishReason?: string;
    toolCalls?: Array<{
      toolName: string;
      args: Record<string, unknown>;
    }>;
    toolResults?: Array<{
      toolName: string;
      result: unknown;
    }>;
  }>;
  finalText?: string;
};

/**
 * Represents a single tool call recorded during an evaluation run
 */
export interface EvalToolCall {
  toolName: string;
  args: unknown;
  result: unknown;
  timestamp: number;
  /** Nested tool calls made within this call (e.g., MCP tools called inside code mode) */
  nestedCalls?: EvalToolCall[];
}
```

Return value
A code scorer is a function that must return a number between 0 and 1:
- `1.0` = perfect/correct outcome
- `0.0` = completely incorrect outcome
- Values in between = partial correctness (see the sketch below)
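As one way to produce partial scores, the sketch below awards credit for each required tool that was called. The specific tool names are hypothetical.

```js
// Partial-credit sketch — "search_flights" and "book_flight" are hypothetical tool names.
async ({ input, expected, output }) => {
  const required = ["search_flights", "book_flight"];
  const called = new Set(output.allToolCalls.map((tc) => tc.toolName));
  // Fraction of required tools that were actually called (e.g. 0.5 if only one of two).
  return required.filter((name) => called.has(name)).length / required.length;
};
```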
By default, any score below 0.7 will be considered a “failure”. This will be configurable in the future.
AI-generated code
The code input field includes a magic wand icon in the upper-right corner. This icon triggers AI generation of scorer code from natural language descriptions of the scoring logic. The AI converts evaluation criteria described in plain text into the corresponding JavaScript function.
Example scorer description:
```
Verify get_weather tool call execution. Score 0 when tool not called. When called, validate latitude parameter range (-90 to 90) and longitude parameter range (-180 to 180). Score 1 when both coordinates valid, otherwise 0.
```

Examples
Example: Validating tool call parameters
```js
async ({
  input,
  output: { text, transcript, toolCalls, allToolCalls },
  expected,
}) => {
  // Check if get_weather tool was called
  const weatherCalls = allToolCalls.filter(
    (tc) => tc.toolName === "get_weather"
  );

  if (weatherCalls.length === 0) {
    return 0;
  }

  // Validate coordinates in the first weather call
  const weatherCall = weatherCalls[0];
  const args = weatherCall.args || {};

  const latitude = args.latitude;
  const longitude = args.longitude;

  // Check if latitude is a number between -90 and 90
  const validLatitude =
    typeof latitude === "number" && latitude >= -90 && latitude <= 90;

  // Check if longitude is a number between -180 and 180
  const validLongitude =
    typeof longitude === "number" && longitude >= -180 && longitude <= 180;

  // Return 1 if both are valid, 0 otherwise
  return validLatitude && validLongitude ? 1 : 0;
};
```

LLM Scorers
LLM-as-judge scoring delegates evaluation to a language model that analyzes conversation transcripts and outputs. This approach handles cases where programmatic evaluation proves insufficient or impractical.
When to use LLM scoring
- Qualitative assessment: Evaluating how well tools were used in tandem to achieve a task
- Behavioral patterns: Detecting whether the agent asks clarifying questions when appropriate (typically based on tool descriptions)
- Negative cases: Confirming tool calls were correctly avoided when unnecessary
System prompt
The system prompt defines the evaluation criteria and instructs the LLM judge on how to assess the scenario outcome.
When invoked, the LLM-as-judge will receive the full transcript of the scenario run, including agent responses and tool calls.
Scoring labels
Scoring labels define discrete categories the LLM judge selects from, with each label mapping to a numeric score between 0 and 1. This structured approach mitigates known LLM biases toward specific numbers and ensures consistent evaluation.
Example
Scenario: Tool call restraint validation
Validates that MCP server tools are not called when a request lacks required parameters.
Test case: “Book me a flight to Paris” (missing: departure city, dates, passenger count)
Expected: No booking tools called
System prompt:
```
Evaluate tool call behavior during the conversation.

Request lacks critical information required for tool execution.
Correct behavior: No tools called.
Incorrect behavior: Tools called despite insufficient information.

Scoring:
- GOOD: Agent correctly refrained
- BAD: Agent incorrectly attempted execution with incomplete data
```

This prompt could be coupled with labels:

```
GOOD | 1
BAD  | 0
```