Cognition Launches FrontierCode, Benchmarking Whether AI Code Would Actually Be Merged

Published 2026-06-09 · AI Engineering Practices · High · ⭐ Timeline Candidate

Summary

Cognition (maker of Devin) introduced FrontierCode, a coding benchmark explicitly modeled on FrontierMath that measures whether AI-generated code would actually be *merged into production*, not merely whether it passes unit tests. Each task required 40+ hours of work by open-source maintainers, and submissions are scored on regression safety, cleanliness, scope discipline, test correctness, and maintainability. The headline result: the top model (Opus 4.8) reached only ~13% on the hardest pro

Radar Context

Devin
Alignment: Reinforces current position
Related Positions: agentic-workflows, ai-assisted-development-tooling, ai-governance-and-risk
Related Partnerships: Cognition (Windsurf / Devin)
Tags: cognition, frontiercode, coding-benchmark, code-quality, evaluation, agentic-coding, swe-bench, merge-readiness
Intelligence · Agentic Developer Tools Radar · Signal