N-Day-Bench Benchmark Evaluates LLM Ability to Discover Real-World Vulnerabilities

Published 2026-04-16 | AI Engineering Practices | Medium

Summary

N-Day-Bench is an adaptive benchmark that measures how well frontier language models can discover real-world software vulnerabilities ("N-days") disclosed after each model's knowledge cutoff date. Every model receives an identical harness and context, which is intended to prevent reward hacking and to allow a fair comparison of genuine vulnerability discovery capability. Models are evaluated via OpenRouter-backed infrastructure, and the test
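The "disclosed after each model's knowledge cutoff" rule can be illustrated with a minimal sketch. This is not N-Day-Bench's actual code: the `Vulnerability` record, the `eligible_cases` helper, the identifiers, the model names, and all dates are hypothetical, chosen only to show how a per-model cutoff could adapt the test set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Vulnerability:
    identifier: str   # hypothetical ID, not a real CVE
    disclosed: date   # public disclosure date


def eligible_cases(cases, knowledge_cutoff):
    """Keep only cases disclosed strictly after the model's knowledge cutoff."""
    return [c for c in cases if c.disclosed > knowledge_cutoff]


# Hypothetical pool of disclosed vulnerabilities.
cases = [
    Vulnerability("EXAMPLE-0001", date(2025, 3, 1)),
    Vulnerability("EXAMPLE-0002", date(2025, 9, 15)),
    Vulnerability("EXAMPLE-0003", date(2026, 1, 20)),
]

# Hypothetical models with assumed knowledge cutoff dates.
cutoffs = {
    "model-a": date(2025, 6, 30),
    "model-b": date(2025, 12, 31),
}

for model, cutoff in cutoffs.items():
    selected = [c.identifier for c in eligible_cases(cases, cutoff)]
    print(model, selected)
```

Under these assumed dates, the model with the later cutoff is left with a smaller eligible set, which is one way a benchmark of this kind could stay "adaptive" as newer models are released.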

Alignment: New signal not yet covered

Related Positions: ai-governance-and-risk.md, ai-assisted-development-tooling.md, agentic-workflows.md

vulnerability-discovery, llm-benchmarks, cybersecurity, n-day-vulnerabilities, code-security, ai-safety, model-evaluation, offensive-security, ai-governance, security-tooling