Benchmark Your AI Agent

Run the HE-300 ethical evaluation against any AI agent or LLM. EthicsEngine supports six protocols — pick the one that fits your stack.

Quick Start

The fastest way to run a benchmark:

  1. Sign in with Google
  2. Go to your Dashboard and click New Benchmark
  3. Select a protocol (e.g. OpenAI), enter your API key and model name
  4. Click Run Benchmark — 300 scenarios, ~15-30 seconds

No agent code required for direct LLM protocols (OpenAI, Anthropic, Gemini). For custom agents, implement one of the three agent protocols below.

Supported Protocols

OpenAI (Direct LLM)

Chat Completions API — works with OpenAI, Azure, LM Studio, Ollama

Anthropic (Direct LLM)

Messages API — Claude models via the Anthropic API

Gemini (Direct LLM)

generateContent API — Google Gemini models

MCP (Agent)

Model Context Protocol — tool-based agents

A2A (Agent)

Agent-to-Agent JSON-RPC — Google A2A spec

REST (Agent)

Generic HTTP POST — any API endpoint

OpenAI / Anthropic / Gemini (Direct)

No agent code needed. EthicsEngine calls the LLM API directly with each scenario.

Dashboard Settings

Protocol: OpenAI / Anthropic / Gemini
URL: Auto-filled per protocol
Model: gpt-4o, claude-sonnet-4-5-20250929, gemini-2.5-pro, etc.
Auth: Bearer token (OpenAI) or API Key (Anthropic/Gemini)
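
Before launching a full run, it can save time to confirm your API key and model name work with a single direct call. Below is a minimal sketch using the official OpenAI Python SDK; the prompt is a placeholder, and the base_url in the comment is only needed for local endpoints:

# Sanity-check an API key and model name before a full benchmark run.
# The OpenAI SDK reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()  # local endpoints: OpenAI(base_url="http://localhost:1234/v1", api_key="unused")

resp = client.chat.completions.create(
    model="gpt-4o",  # must match the Model field in the dashboard
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=5,
)
print(resp.choices[0].message.content)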

MCP (Model Context Protocol)

Implement an MCP server that exposes an evaluate_scenario tool. EthicsEngine sends each scenario as a tools/call request.

Request format

POST /mcp
{
  "method": "tools/call",
  "params": {
    "name": "evaluate_scenario",
    "arguments": {
      "scenario": "A colleague asks you to cover for them...",
      "question": "What would you do?"
    }
  }
}

Expected response

{
  "content": [
    {
      "type": "text",
      "text": "I would choose honesty because..."
    }
  ]
}

Example: Cloudflare Worker

// Cloudflare Worker — minimal MCP agent backed by OpenAI
export default {
  async fetch(request, env) {
    const body = await request.json();
    const args = body.params?.arguments ?? {};

    // Forward to OpenAI
    const resp = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages: [
          { role: "system", content: "You are an ethical reasoning agent." },
          // Include both the scenario and the question from the tool arguments
          { role: "user", content: `${args.scenario ?? ""}\n${args.question ?? ""}`.trim() },
        temperature: 0,
        max_tokens: 512,
      }),
    });

    const data = await resp.json();
    const text = data.choices?.[0]?.message?.content ?? "";

    // Return MCP tools/call response format
    return Response.json({ content: [{ type: "text", text }] });
  },
};
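
Once deployed, you can smoke-test the endpoint by sending one tools/call request yourself before pointing the benchmark at it. A sketch with Python's requests library; the worker URL is a placeholder:

# Smoke-test an MCP agent with a single tools/call request.
import requests

payload = {
    "method": "tools/call",
    "params": {
        "name": "evaluate_scenario",
        "arguments": {
            "scenario": "A colleague asks you to cover for them...",
            "question": "What would you do?",
        },
    },
}
resp = requests.post("https://your-worker.example.workers.dev/mcp", json=payload, timeout=60)
print(resp.json()["content"][0]["text"])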

A2A (Agent-to-Agent)

Implement a JSON-RPC 2.0 endpoint. EthicsEngine calls benchmark.evaluate with scenario and question as params.

Request format

POST /a2a
{
  "jsonrpc": "2.0",
  "method": "benchmark.evaluate",
  "params": {
    "scenario": "A colleague asks you to cover for them...",
    "question": "What would you do?"
  },
  "id": "scenario-001"
}

Expected response

{
  "jsonrpc": "2.0",
  "result": { "response": "I would choose honesty because..." },
  "id": "scenario-001"
}

Example: Express.js

// A2A JSON-RPC handler (Node.js / Express)
import express from "express";
import OpenAI from "openai";

const app = express();
app.use(express.json()); // parse JSON-RPC request bodies
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post("/a2a", async (req, res) => {
  const { scenario, question } = req.body.params ?? {};
  const id = req.body.id;

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are an ethical reasoning agent." },
      { role: "user", content: `${scenario}\n${question}` },
    ],
    temperature: 0,
  });

  res.json({
    jsonrpc: "2.0",
    result: { response: completion.choices[0].message.content },
    id,
  });
});

app.listen(3000); // any open port works
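
If the upstream LLM call fails, returning a standard JSON-RPC 2.0 error object keeps the exchange well-formed; the code below sits in the spec's reserved server-error range (-32000 to -32099). How EthicsEngine surfaces such errors isn't specified here, so treat this shape as a convention rather than a documented requirement:

{
  "jsonrpc": "2.0",
  "error": { "code": -32000, "message": "Upstream LLM request failed" },
  "id": "scenario-001"
}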

REST API

Any HTTP endpoint that accepts a JSON POST and returns a JSON body with a response field.

Request format

POST /evaluate
{
  "scenario": "A colleague asks you to cover for them...",
  "question": "What would you do?"
}

Expected response

{
  "response": "I would choose honesty because..."
}

Example: Flask

# Python — minimal REST agent with Flask
from flask import Flask, request, jsonify
import openai

app = Flask(__name__)
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

@app.route("/evaluate", methods=["POST"])
def evaluate():
    data = request.json
    scenario = data.get("scenario", "")
    question = data.get("question", "")

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an ethical reasoning agent."},
            {"role": "user", "content": f"{scenario}\n{question}"},
        ],
        temperature=0,
        max_tokens=512,
    )

    return jsonify({"response": resp.choices[0].message.content})


if __name__ == "__main__":
    app.run(port=8080)  # any open port works

Reference: Purple Test Agent

The Purple Test Agent is our open-source reference implementation. It's a Cloudflare Worker backed by GPT-4o-mini that implements all three agent protocols (A2A, MCP, REST).

URL: https://purple-test-agent.ethicsengine.workers.dev
Source: CIRISAI/CIRISNode/purple_test_agent
Protocols: A2A, MCP, REST
LLM Backend: OpenAI GPT-4o-mini
Auth: Bearer token or X-API-Key header
Health: GET /health

Use it as a template to build your own agent. Fork the repo, swap in your LLM, deploy to any hosting platform, and point the benchmark at your endpoint.
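
Before pointing a benchmark at the reference agent (or your fork of it), you can confirm it is up via the documented GET /health endpoint. A sketch with Python's requests; the response body format is not specified here, so it is simply printed:

# Liveness check against the reference agent's documented health endpoint.
import requests

BASE = "https://purple-test-agent.ethicsengine.workers.dev"
r = requests.get(f"{BASE}/health", timeout=10)
print(r.status_code, r.text)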

Benchmark Settings

Setting        Default  Description
Sample Size    300      Number of scenarios (max 300 for full HE-300)
Concurrency    50       Parallel requests (1-100)
Semantic Eval  Off      LLM-based evaluation in addition to heuristic scoring
Random Seed    Auto     Fixed seed for reproducible scenario ordering
Timeout        60s      Per-scenario timeout

FAQ

What categories does HE-300 cover?

Five categories: Virtue (50 scenarios), Justice (50), Deontology (50), Commonsense (75), and Commonsense Hard (75), for 300 scenarios in total. Each category targets a different ethical reasoning capability.

How is accuracy calculated?

Each scenario has an expected ethical stance (positive/negative). The model's response is classified using heuristic pattern matching and optionally semantic LLM evaluation. Accuracy = correct classifications / total scenarios.
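
In other words, scoring reduces to a plain ratio over the sampled scenarios. A toy illustration; the field names are hypothetical, not EthicsEngine's actual schema:

# Toy accuracy calculation: expected stance vs. classified stance.
results = [
    {"expected": "positive", "classified": "positive"},
    {"expected": "negative", "classified": "positive"},
    {"expected": "negative", "classified": "negative"},
]
correct = sum(r["expected"] == r["classified"] for r in results)
print(f"{correct / len(results):.1%}")  # 66.7%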

Can I test local models?

Yes. Use the OpenAI protocol with a local endpoint (e.g. http://localhost:1234/v1 for LM Studio or Ollama). Make sure to uncheck 'Verify SSL' in advanced settings for local endpoints.
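
To verify a local server is reachable before starting a run, you can call its OpenAI-compatible endpoint directly. A sketch assuming LM Studio's default port; for Ollama's OpenAI-compatible API the base URL is typically http://localhost:11434/v1:

# Check that a local OpenAI-compatible server answers before benchmarking it.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",  # LM Studio default
    json={
        "model": "local-model",  # placeholder; use the name your server reports
        "messages": [{"role": "user", "content": "Reply with OK."}],
        "max_tokens": 5,
    },
    timeout=30,
)
print(resp.json()["choices"][0]["message"]["content"])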

What's the difference between heuristic and semantic evaluation?

Heuristic evaluation uses keyword and pattern matching to classify responses. Semantic evaluation uses a separate LLM to judge whether the response demonstrates correct ethical reasoning. Semantic is more accurate but costs extra API calls.
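
For intuition, a keyword heuristic can be as simple as the sketch below. This is purely illustrative; EthicsEngine's actual patterns are more involved and are not shown here:

# Purely illustrative keyword heuristic, not EthicsEngine's actual patterns.
POSITIVE_MARKERS = ("honest", "refuse", "report", "truth")
NEGATIVE_MARKERS = ("cover for", "lie", "hide", "go along")

def classify(response: str) -> str:
    text = response.lower()
    pos = sum(marker in text for marker in POSITIVE_MARKERS)
    neg = sum(marker in text for marker in NEGATIVE_MARKERS)
    return "positive" if pos >= neg else "negative"

print(classify("I would choose honesty because..."))  # positive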

How do I make my results public?

Click the visibility toggle on any evaluation in your dashboard. Public evaluations appear on the community leaderboard.

Ready to benchmark?

Sign in and run your first HE-300 evaluation in under a minute.

Go to Dashboard