# Evals System Architecture
The Evals system in MCPJam Inspector is a comprehensive testing framework designed to evaluate MCP (Model Context Protocol) server implementations. This guide provides a deep dive into the architecture, data flows, and key components to help you contribute effectively.

## Overview

The Evals system allows developers to:

- Run automated tests against MCP servers to validate tool implementations
- Generate test cases using AI based on available server tools
- Track results in real-time with detailed metrics and analytics
- Compare expected vs actual behavior using agentic LLM loops
### Key Features
- Multi-step wizard UI for test configuration
- Support for multiple LLM providers (OpenAI, Anthropic, DeepSeek, Ollama)
- Real-time result tracking via MCPJamBackend
- AI-powered test case generation
- Agentic execution with up to 20 conversation turns
- Token usage and performance metrics
## Architecture Overview

The Evals system is composed of three main layers:

## System Components

### 1. Client Layer (UI)

#### EvalRunner Component (`client/src/components/evals/eval-runner.tsx`)
The primary UI for configuring and launching evaluation runs.
**Architecture: 4-step wizard**

**Step details:**

1. **Select Servers**: Choose from connected MCP servers
   - Filters: only connected servers are shown
   - Validation: at least one server is required
2. **Choose Model**: Select the LLM provider and model
   - Providers: OpenAI, Anthropic, DeepSeek, Ollama, MCPJam
   - Credential check: validates API keys via `hasToken()`
3. **Define Tests**: Create or generate test cases
   - Manual entry: title, query, expected tool calls, number of runs
   - AI generation: click "Generate Tests" to create 6 test cases (2 easy, 2 medium, 2 hard)
4. **Review & Run**: Confirm and execute
   - Displays a summary of the configuration
   - POSTs to `/api/mcp/evals/run`
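The per-step gating described above can be expressed as a small pure function. This is a simplified sketch; the state shape and the `canAdvance` name are assumptions for illustration, not the component's actual API:

```typescript
// Sketch of the wizard's per-step validation (names and shape are illustrative).
type WizardState = {
  selectedServerIds: string[];               // Step 1: at least one required
  modelId: string | null;                    // Step 2: provider/model chosen
  hasCredentials: boolean;                   // Step 2: result of the hasToken() check
  tests: { title: string; query: string }[]; // Step 3: at least one test case
};

function canAdvance(step: 1 | 2 | 3 | 4, state: WizardState): boolean {
  switch (step) {
    case 1: return state.selectedServerIds.length > 0;
    case 2: return state.modelId !== null && state.hasCredentials;
    case 3: return state.tests.length > 0;
    default: return true; // Step 4 (Review & Run) only displays a summary
  }
}
```

Keeping validation pure like this makes each step's requirement unit-testable without rendering the component.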
#### Results Components (`client/src/components/evals/*`)
Real-time display of evaluation results.
Component Hierarchy:
### 2. Server Layer (API)

#### Evals Routes (`server/routes/mcp/evals.ts`)

HTTP API endpoints for eval execution and test generation.

**Endpoint:** `POST /api/mcp/evals/run`

Request schema:

Helpers:

- `resolveServerIdsOrThrow()`: Case-insensitive server ID matching
- `runEvalSuiteWithAiSdk()`: Executes the eval suite in the background using the AI SDK

**Endpoint:** `POST /api/mcp/evals/generate-tests`

Request schema:
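The exact Zod schemas live in `server/routes/mcp/evals.ts`; the shape below is a hedged sketch of what the `/run` request body plausibly contains (all field names are assumptions), with a minimal runtime check standing in for the real Zod validation:

```typescript
// Hypothetical request shape for POST /api/mcp/evals/run (illustrative only;
// the real route validates the body with a Zod schema).
interface RunEvalsRequest {
  serverIds: string[];   // matched case-insensitively by resolveServerIdsOrThrow()
  modelId: string;       // selected LLM model identifier
  tests: {
    title: string;
    query: string;
    expectedToolCalls: string[];
    runs: number;        // number of iterations per test case
  }[];
}

// Minimal stand-in for the Zod parse step.
function validateRunRequest(body: unknown): RunEvalsRequest {
  const b = body as RunEvalsRequest;
  if (!Array.isArray(b?.serverIds) || b.serverIds.length === 0)
    throw new Error("serverIds: at least one server is required");
  if (typeof b.modelId !== "string" || b.modelId.length === 0)
    throw new Error("modelId: required");
  if (!Array.isArray(b.tests) || b.tests.length === 0)
    throw new Error("tests: at least one test case is required");
  return b;
}
```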
#### Test Generation Agent (`server/services/eval-agent.ts`)

Generates test cases using the backend LLM.

**Algorithm:**

1. Group tools by server ID
2. Create a system prompt with MCP agent instructions
3. Create a user prompt with tool definitions and requirements
4. Call the backend LLM (`meta-llama/llama-3.3-70b-instruct`)
5. Parse the JSON response
6. Return 6 test cases (2 easy, 2 medium, 2 hard)
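Steps 5–6 can be sketched as a parse-and-validate function. The field names below are assumptions about the generated JSON, not the agent's actual schema:

```typescript
// Sketch of parsing the LLM's JSON response and enforcing the 2/2/2
// difficulty split. Field names are illustrative.
type Difficulty = "easy" | "medium" | "hard";

interface GeneratedTest {
  title: string;
  query: string;
  expectedToolCalls: string[];
  difficulty: Difficulty;
}

function parseGeneratedTests(raw: string): GeneratedTest[] {
  const tests = JSON.parse(raw) as GeneratedTest[];
  const counts: Record<Difficulty, number> = { easy: 0, medium: 0, hard: 0 };
  for (const t of tests) counts[t.difficulty]++;
  if (counts.easy !== 2 || counts.medium !== 2 || counts.hard !== 2)
    throw new Error("expected 6 test cases: 2 easy, 2 medium, 2 hard");
  return tests;
}
```

Validating the difficulty split after parsing guards against the LLM returning a malformed or incomplete set.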
### 3. CLI Layer (Execution Engine)

#### Runner (`evals-cli/src/evals/runner.ts`)

The core orchestrator that executes evaluation tests.

**Entry points:**

- `runEvalsWithApiKey()`: CLI mode with API key authentication
- `runEvalsWithAuth()`: UI mode with Convex authentication

**Execution behavior:**

- Max 20 conversation turns to prevent infinite loops
- Token usage tracking (prompt + completion)
- Duration measurement
- Tool call recording
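The bounded agentic loop can be sketched as follows. The real runner drives the AI SDK's `generateText()`; here an injected `step` callback stands in for a single LLM turn, and all names other than the 20-turn cap are illustrative:

```typescript
// Sketch of the runner's bounded agentic loop with token and tool-call accounting.
interface TurnResult {
  toolCalls: string[];
  promptTokens: number;
  completionTokens: number;
  done: boolean; // the model produced a final answer instead of calling tools
}

interface LoopResult { toolCalls: string[]; totalTokens: number; turns: number; }

const MAX_TURNS = 20; // hard cap: prevents infinite tool-calling loops

function runAgenticLoop(step: (turn: number) => TurnResult): LoopResult {
  const toolCalls: string[] = [];
  let totalTokens = 0;
  let turns = 0;
  while (turns < MAX_TURNS) {
    const r = step(turns++);
    toolCalls.push(...r.toolCalls);                     // tool call recording
    totalTokens += r.promptTokens + r.completionTokens; // token usage tracking
    if (r.done) break;
  }
  return { toolCalls, totalTokens, turns };
}
```

Injecting the turn function keeps the loop itself deterministic and testable without a live model.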
#### Evaluator (`evals-cli/src/evals/evaluator.ts`)

Compares expected vs. actual tool calls to determine pass/fail status.

**Logic:**

- ✅ All expected tools must be called for a test to pass
- ⚠️ Additional unexpected tools are allowed (marked, but don't cause failure)
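The rule above reduces to a set comparison. A minimal sketch (the result shape and function name are assumptions, not the evaluator's actual API):

```typescript
// Sketch of the evaluator's pass/fail rule: every expected tool must appear
// among the actual calls; extra tools are flagged but never fail the test.
interface EvalResult {
  passed: boolean;
  missingTools: string[];    // expected but not called -> failure
  unexpectedTools: string[]; // called but not expected -> logged only
}

function evaluateToolCalls(expected: string[], actual: string[]): EvalResult {
  const actualSet = new Set(actual);
  const expectedSet = new Set(expected);
  const missingTools = expected.filter((t) => !actualSet.has(t));
  const unexpectedTools = actual.filter((t) => !expectedSet.has(t));
  return { passed: missingTools.length === 0, missingTools, unexpectedTools };
}
```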
#### RunRecorder (`evals-cli/src/db/tests.ts`)

Database interface for persisting evaluation results.

**Two modes:**

- API key mode (`createRunRecorder`): Uses the CLI-based database client
- Auth mode (`createRunRecorderWithAuth`): Uses the Convex HTTP client
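Because both factories return the same interface, callers are insulated from the persistence mode. A hedged sketch of that interface (method names are assumptions), plus an in-memory implementation that is useful when testing runner changes:

```typescript
// Hypothetical shape of the recorder interface both factories return.
interface IterationResult {
  testTitle: string;
  passed: boolean;
  tokens: number;
  durationMs: number;
}

interface RunRecorder {
  recordIteration(r: IterationResult): void;
  finish(): IterationResult[];
}

// In-memory recorder: handy for unit tests, no database required.
function createInMemoryRecorder(): RunRecorder {
  const results: IterationResult[] = [];
  return {
    recordIteration: (r) => { results.push(r); },
    finish: () => results,
  };
}
```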
## Data Models

### Database Schema

### TypeScript Interfaces

## Integration Points

### LLM Providers
The system supports multiple execution paths based on the selected model.

**Provider configuration:** The system uses Vercel's AI SDK (the `ai` package) for LLM interactions:

- `generateText()`: Single-step text generation with tool calling
- `createLlmModel()`: Helper to create provider-specific model instances
- Automatic tool call extraction and evaluation
- Built-in token usage tracking
### MCP Server Integration

**Connection workflow:**

**Transport support:**

- **STDIO**: Command execution with stdin/stdout
- **HTTP/SSE**: Server-Sent Events
- **Streamable HTTP**: Custom streaming protocol
### MCPJam Backend

**Database actions:**

## Contributing Guide
### Adding a New LLM Provider

1. Add the AI SDK provider package
2. Update model creation in `server/utils/chat-helpers.ts`
3. Add the model to the UI model list in `shared/types.ts`
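Step 2 usually amounts to extending a provider-dispatch table. A sketch of that dispatch; the prefix rules below are illustrative assumptions (only the `@mcpjam/` backend ID format is documented elsewhere in this guide):

```typescript
// Sketch of provider dispatch for model creation. Extend the table when
// adding a new provider. Prefixes other than "@mcpjam/" are assumptions.
type Provider = "openai" | "anthropic" | "deepseek" | "ollama" | "mcpjam";

const PROVIDER_PREFIXES: [string, Provider][] = [
  ["@mcpjam/", "mcpjam"],    // backend-hosted models
  ["gpt-", "openai"],
  ["claude-", "anthropic"],
  ["deepseek-", "deepseek"],
  ["ollama/", "ollama"],
];

function resolveProvider(modelId: string): Provider {
  for (const [prefix, provider] of PROVIDER_PREFIXES) {
    if (modelId.startsWith(prefix)) return provider;
  }
  throw new Error(`Unknown model ID format: ${modelId}`);
}
```

Failing loudly on an unknown prefix surfaces misconfigured model IDs early instead of at request time.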
### Adding a New MCP Transport

1. Update `MCPClientManager` in `sdk/` to support the new transport type
2. Implement the transport connection logic in `MCPClientManager`
3. Ensure tool execution works with the new transport in `getToolsForAiSdk()`
### Debugging Evals

Enable verbose logging:

### Testing Changes

Test via the UI:

1. Start the development server: `npm run dev`
2. Navigate to the "Run evals" tab
3. Configure and execute a test
4. Check the browser console for errors
5. View results in the "Eval results" tab
6. Monitor server logs for execution details
### Common Issues

**Issue: test cases are not created**

- Check Convex auth token validity
- Verify the `CONVEX_URL` and `CONVEX_HTTP_URL` environment variables
- Inspect the browser network tab for failed requests

**Other checks when runs misbehave:**

- Verify server connection status in `MCPClientManager`
- Check tool definitions in the `listTools()` response
- Ensure tool names match exactly (case-sensitive)
- Confirm the `/streaming` endpoint is accessible
- Check the Convex auth token in request headers
- Verify the model ID format (`@mcpjam/...`)
## Performance Considerations

### Optimization Strategies

- **Parallel Execution**: Run multiple test cases concurrently
- **Tool Batching**: Execute independent tools in parallel
- **Database Batching**: Batch iteration updates
- **Caching**: Cache tool definitions between iterations
### Metrics

Key performance indicators:

- **Average iteration duration**: Time from start to finish
- **Token usage per iteration**: Prompt + completion tokens
- **Tool execution time**: Time spent in MCP calls
- **Database write time**: Time to persist results
- **LLM response time**: Time for each model call

These are computed by the aggregation functions in `helpers.ts`.
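A sketch of how the first two KPIs might be aggregated; the field and function names are illustrative, not `helpers.ts`'s actual API:

```typescript
// Sketch of aggregating per-iteration metrics into run-level KPIs.
interface IterationMetrics {
  durationMs: number;
  promptTokens: number;
  completionTokens: number;
}

function aggregateMetrics(iterations: IterationMetrics[]) {
  const n = iterations.length;
  if (n === 0) return { avgDurationMs: 0, avgTokens: 0 };
  const totalDuration = iterations.reduce((s, i) => s + i.durationMs, 0);
  const totalTokens = iterations.reduce(
    (s, i) => s + i.promptTokens + i.completionTokens,
    0
  );
  return { avgDurationMs: totalDuration / n, avgTokens: totalTokens / n };
}
```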
## Security Considerations

### API Key Management
- Never commit API keys to version control
- Store keys in localStorage (client) or environment variables (CLI)
- Use Convex auth tokens for backend models (no API key exposure)
### Input Validation

All inputs are validated with Zod schemas:

### Error Handling
- Never expose internal errors to the client
- Sanitize error messages before logging
- Catch all exceptions in async functions
- Validate all external inputs (LLM responses, tool results)
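The first two rules above can be sketched as a pair of helpers: sanitize before logging, and return only a generic message to the client. Function names and redaction patterns are illustrative assumptions:

```typescript
// Sketch of the error-handling rules: sanitize messages before logging,
// and never surface internal details to the client.
function sanitizeErrorMessage(err: unknown): string {
  const msg = err instanceof Error ? err.message : String(err);
  // Strip anything that looks like a secret (illustrative patterns only).
  return msg
    .replace(/sk-[A-Za-z0-9_-]+/g, "[redacted-key]")
    .replace(/Bearer\s+\S+/g, "Bearer [redacted]");
}

function toClientError(_err: unknown): { error: string } {
  // The client only ever sees a generic message, regardless of the cause.
  return { error: "Internal server error" };
}
```

A route handler would log `sanitizeErrorMessage(err)` and respond with `toClientError(err)`.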
## Future Enhancements

Potential areas for contribution:

- Parallel Test Execution: Run multiple test cases simultaneously
- Custom Evaluators: Support for user-defined pass/fail criteria
- Retry Logic: Automatic retry on transient failures
- Result Comparison: Compare results across different models
- Historical Analysis: Trend analysis of eval performance over time
- Export Results: Download results as CSV/JSON
- Shareable Suites: Share test configurations with team members
- Scheduling: Run evals on a schedule (cron-like)
## Glossary
| Term | Definition |
|---|---|
| Eval Suite | A collection of test cases executed together |
| Test Case | A single test with a query and expected tool calls |
| Iteration | One execution of a test case (test cases can have multiple runs) |
| Agentic Loop | Iterative LLM conversation with tool calling |
| Tool Call | Invocation of an MCP server tool by the LLM |
| Expected Tools | Tools that should be called for a test to pass |
| Actual Tools | Tools that were actually called during execution |
| Missing Tools | Expected tools that were not called (causes failure) |
| Unexpected Tools | Tools called but not expected (logged, doesn’t fail) |
| RunRecorder | Interface for persisting eval results to database |
| MCPClientManager | Manager for MCP server connections and tool execution |
| AI SDK | Vercel’s AI SDK for LLM interactions and tool calling |
## Resources
- MCP Specification: https://spec.modelcontextprotocol.io
- Vercel AI SDK: https://sdk.vercel.ai
- Convex Database: https://convex.dev
## Questions?

If you have questions or need help contributing:

- Check the GitHub Issues
- Join our Discord community
- Read the main Contributing Guide

