Benchmark Your AI Agent
Run the HE-300 ethical evaluation against any AI agent or LLM. EthicsEngine supports six protocols — pick the one that fits your stack.
Quick Start
The fastest way to run a benchmark:
- Sign in with Google
- Go to your Dashboard and click New Benchmark
- Select a protocol (e.g. OpenAI), enter your API key and model name
- Click Run Benchmark — 300 scenarios, ~15-30 seconds
No agent code required for direct LLM protocols (OpenAI, Anthropic, Gemini). For custom agents, implement one of the three agent protocols below.
Supported Protocols
- Chat Completions API — works with OpenAI, Azure, LM Studio, Ollama
- Messages API — Claude models via the Anthropic API
- generateContent API — Google Gemini models
- Model Context Protocol — tool-based agents
- Agent-to-Agent JSON-RPC — Google A2A spec
- Generic HTTP POST — any API endpoint
OpenAI / Anthropic / Gemini (Direct)
No agent code needed. EthicsEngine calls the LLM API directly with each scenario.
Dashboard Settings
| Setting | Value |
|---|---|
| Protocol | OpenAI / Anthropic / Gemini |
| URL | Auto-filled per protocol |
| Model | gpt-4o, claude-sonnet-4-5-20250929, gemini-2.5-pro, etc. |
| Auth | Bearer token (OpenAI) or API Key (Anthropic/Gemini) |
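Before pointing the dashboard at a model, it can help to confirm that your API key and model name work with a single direct call. A minimal sketch using the official openai Python package; the model name and prompt here are illustrative placeholders, not what EthicsEngine sends internally:
# Python — sanity-check an OpenAI-protocol key and model before benchmarking
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o",  # use the same model name you enter in the dashboard
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    max_tokens=5,
)
print(resp.choices[0].message.content)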
MCP (Model Context Protocol)
Implement an MCP server that exposes an evaluate_scenario tool. EthicsEngine sends each scenario as a tools/call request.
Request format
POST /mcp
{
  "method": "tools/call",
  "params": {
    "name": "evaluate_scenario",
    "arguments": {
      "scenario": "A colleague asks you to cover for them...",
      "question": "What would you do?"
    }
  }
}
Expected response
{
  "content": [
    {
      "type": "text",
      "text": "I would choose honesty because..."
    }
  ]
}
Example: Cloudflare Worker
// Cloudflare Worker — minimal MCP agent backed by OpenAI
export default {
  async fetch(request, env) {
    const body = await request.json();
    const args = body.params?.arguments ?? {};
    // Forward the scenario (and question, if present) to OpenAI
    const resp = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages: [
          { role: "system", content: "You are an ethical reasoning agent." },
          { role: "user", content: [args.scenario, args.question].filter(Boolean).join("\n") },
        ],
        temperature: 0,
        max_tokens: 512,
      }),
    });
    const data = await resp.json();
    const text = data.choices?.[0]?.message?.content ?? "";
    // Return MCP tools/call response format
    return Response.json({ content: [{ type: "text", text }] });
  },
};
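Once the worker is deployed, you can exercise it with the same tools/call payload shown above before pointing the benchmark at it. A minimal sketch using the requests library; the URL is a placeholder for your own deployment:
# Python — send one MCP tools/call request to your deployed agent (URL is a placeholder)
import requests

payload = {
    "method": "tools/call",
    "params": {
        "name": "evaluate_scenario",
        "arguments": {
            "scenario": "A colleague asks you to cover for them...",
            "question": "What would you do?",
        },
    },
}
r = requests.post("https://your-agent.example.workers.dev/mcp", json=payload, timeout=60)
r.raise_for_status()
print(r.json()["content"][0]["text"])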
A2A (Agent-to-Agent)
Implement a JSON-RPC 2.0 endpoint. EthicsEngine calls benchmark.evaluate with scenario and question as params.
Request format
POST /a2a
{
  "jsonrpc": "2.0",
  "method": "benchmark.evaluate",
  "params": {
    "scenario": "A colleague asks you to cover for them...",
    "question": "What would you do?"
  },
  "id": "scenario-001"
}
Expected response
{
  "jsonrpc": "2.0",
  "result": { "response": "I would choose honesty because..." },
  "id": "scenario-001"
}
Example: Express.js
// A2A JSON-RPC handler (Node.js / Express)
import express from "express";
import OpenAI from "openai";
const app = express();
app.use(express.json()); // parse JSON-RPC request bodies
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
app.post("/a2a", async (req, res) => {
  const { scenario, question } = req.body.params ?? {};
  const id = req.body.id;
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are an ethical reasoning agent." },
      { role: "user", content: `${scenario}\n${question}` },
    ],
    temperature: 0,
  });
  res.json({
    jsonrpc: "2.0",
    result: { response: completion.choices[0].message.content },
    id,
  });
});
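As with MCP, you can hit the endpoint yourself with the JSON-RPC payload shown above before running the benchmark. A minimal sketch with requests; the URL assumes the Express app above is listening locally on port 3000, so adjust it to wherever your agent actually runs:
# Python — send one A2A JSON-RPC request to your agent (local port is an assumption)
import requests

payload = {
    "jsonrpc": "2.0",
    "method": "benchmark.evaluate",
    "params": {
        "scenario": "A colleague asks you to cover for them...",
        "question": "What would you do?",
    },
    "id": "scenario-001",
}
r = requests.post("http://localhost:3000/a2a", json=payload, timeout=60)
r.raise_for_status()
print(r.json()["result"]["response"])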
REST API
Any HTTP endpoint that accepts a JSON POST and returns a response field.
Request format
POST /evaluate
{
  "scenario": "A colleague asks you to cover for them...",
  "question": "What would you do?"
}
Expected response
{
  "response": "I would choose honesty because..."
}
Example: Flask
# Python — minimal REST agent with Flask
from flask import Flask, request, jsonify
import openai
app = Flask(__name__)
client = openai.OpenAI()
@app.route("/evaluate", methods=["POST"])
def evaluate():
    data = request.json
    scenario = data.get("scenario", "")
    question = data.get("question", "")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an ethical reasoning agent."},
            {"role": "user", "content": f"{scenario}\n{question}"},
        ],
        temperature=0,
        max_tokens=512,
    )
    return jsonify({"response": resp.choices[0].message.content})
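You can smoke-test the Flask agent with the request format above before registering it in the dashboard. A minimal sketch; the URL assumes Flask's default development port 5000:
# Python — smoke-test the local REST agent (Flask's default port is an assumption)
import requests

payload = {
    "scenario": "A colleague asks you to cover for them...",
    "question": "What would you do?",
}
r = requests.post("http://localhost:5000/evaluate", json=payload, timeout=60)
r.raise_for_status()
print(r.json()["response"])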
Reference: Purple Test Agent
The Purple Test Agent is our open-source reference implementation. It's a Cloudflare Worker backed by GPT-4o-mini that implements all three agent protocols (A2A, MCP, REST).
| Property | Value |
|---|---|
| URL | https://purple-test-agent.ethicsengine.workers.dev |
| Source | CIRISAI/CIRISNode/purple_test_agent |
| Protocols | A2A, MCP, REST |
| LLM Backend | OpenAI GPT-4o-mini |
| Auth | Bearer token or X-API-Key header |
| Health | GET /health |
Use it as a template to build your own agent. Fork the repo, swap in your LLM, deploy to any hosting platform, and point the benchmark at your endpoint.
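A quick way to confirm the reference agent (or your own fork of it) is reachable is its documented health endpoint; a minimal sketch (the protocol endpoints themselves expect a Bearer token or X-API-Key per the table above):
# Python — ping the Purple Test Agent's documented health endpoint
import requests

r = requests.get("https://purple-test-agent.ethicsengine.workers.dev/health", timeout=30)
print(r.status_code, r.text)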
Benchmark Settings
| Setting | Default | Description |
|---|---|---|
| Sample Size | 300 | Number of scenarios (max 300 for full HE-300) |
| Concurrency | 50 | Parallel requests (1-100) |
| Semantic Eval | Off | LLM-based evaluation in addition to heuristic scoring |
| Random Seed | Auto | Fixed seed for reproducible scenario ordering |
| Timeout | 60s | Per-scenario timeout |
FAQ
What categories does HE-300 cover?
Five categories: Virtue (50 scenarios), Justice (50), Deontology (50), Commonsense (75), and Commonsense Hard (75). Each scenario tests a different ethical reasoning capability.
How is accuracy calculated?
Each scenario has an expected ethical stance (positive/negative). The model's response is classified using heuristic pattern matching and optionally semantic LLM evaluation. Accuracy = correct classifications / total scenarios.
Can I test local models?
Yes. Use the OpenAI protocol with a local endpoint (e.g. http://localhost:1234/v1 for LM Studio or http://localhost:11434/v1 for Ollama's OpenAI-compatible API). Make sure to uncheck 'Verify SSL' in advanced settings for local endpoints.
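To confirm a local server is reachable before benchmarking, you can point the same openai client at it. A minimal sketch assuming LM Studio's default port; the model name is whatever your local server exposes:
# Python — verify a local OpenAI-compatible endpoint (LM Studio default port assumed)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")  # local servers usually accept any key
resp = client.chat.completions.create(
    model="your-local-model-name",  # placeholder: the model name served locally
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(resp.choices[0].message.content)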
What's the difference between heuristic and semantic evaluation?
Heuristic evaluation uses keyword and pattern matching to classify responses. Semantic evaluation uses a separate LLM to judge whether the response demonstrates correct ethical reasoning. Semantic is more accurate but costs extra API calls.
How do I make my results public?
Click the visibility toggle on any evaluation in your dashboard. Public evaluations appear on the community leaderboard.