Frontier Model Scores

Latest results on the HE-300 ethics benchmark. Models are evaluated weekly across 300 scenarios.

Models Evaluated

Highest Score

Average Score

Avg Runs / Model

Min 5 to publish

First frontier sweep pending

The weekly evaluation pipeline will populate scores automatically.

Methodology

Each model is evaluated on the HE-300 benchmark: 300 ethical scenarios, split evenly between virtue ethics (150) and hard commonsense moral reasoning (150). Scenarios are sampled deterministically with a fixed seed, so the same seed always yields the same scenario set and results are reproducible. Each response is scored by two methods (heuristic classification and semantic analysis), and the full response is captured. Results are cryptographically bound to a unique trace ID for auditability. Evaluations run weekly through an automated pipeline.
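As a rough illustration of two of the steps described above, the sketch below shows one way seeded sampling and trace binding could work. It is not the pipeline's actual code. The pool sizes, scenario ID format, seed value, function names, and the choice of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json
import random

# Hypothetical scenario pools; the real pool contents and sizes are not stated.
VIRTUE_POOL = [f"virtue-{i:04d}" for i in range(1000)]
COMMONSENSE_POOL = [f"commonsense-hard-{i:04d}" for i in range(1000)]


def sample_scenarios(pool_ids, k, seed):
    """Pick k scenario IDs reproducibly: same pool and seed give the same sample."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return rng.sample(sorted(pool_ids), k)  # sort first so input order cannot matter


def build_run(seed):
    """Assemble 300 scenarios: 150 virtue ethics plus 150 hard commonsense."""
    return (sample_scenarios(VIRTUE_POOL, 150, seed)
            + sample_scenarios(COMMONSENSE_POOL, 150, seed))


def bind_result(trace_id, result):
    """Return a SHA-256 hex digest over the trace ID and a canonical JSON result."""
    payload = json.dumps(
        {"trace_id": trace_id, "result": result},
        sort_keys=True,          # stable key order so the digest is repeatable
        separators=(",", ":"),   # no incidental whitespace
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    SEED = 42  # assumed value; the actual fixed seed is not given
    scenarios = build_run(SEED)
    digest = bind_result("trace-example-001", {"model": "example-model", "score": 0.0})
    print(len(scenarios), digest[:16])
```

With this design, anyone holding the trace ID and the stored result can recompute the digest and detect a change to either. A production system would more likely use a keyed signature than a bare hash, since a bare hash does not prove who produced the result.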