ACES v2 — Effective April 2026 · Anchors recalibrated June 2026
A systematic framework for assessing agentic developer tools across six dimensions of enterprise relevance.
ACES v2 is the evaluation methodology powering Agentic Tools Radar — every tracked tool is scored on six dimensions using a 5-band comparative rubric, then assigned a signal level based on the quality and recency of supporting evidence.
Each of the six dimensions is scored on a 1–20 scale; the six scores are averaged and multiplied by 5 to produce a 0–100 rating. A signal level (Validated, Assessed, Tracked, or Detected) reflects the quality of supporting evidence. An evidence grade (A–D) summarizes recency, depth, and hands-on testing.
Each tool is assessed across six equally-weighted dimensions. Scores run 1–20 across five comparative bands.
| Dimension | What It Measures |
|---|---|
| Autonomy | Ability to plan and execute tasks independently |
| Integration | VCS, CI/CD, project management, and team connectivity |
| Context | Depth of codebase and ecosystem understanding |
| Compliance | Enterprise governance: certs, access controls, audit, deployment |
| Viability | Vendor sustainability and developer experience |
| Interface | Interaction surface breadth and maturity |
Each dimension uses the same five comparative bands on the 1–20 scale. Bands are defined relative to the category cohort, not by counting features: the middle band must be earned by clearing named downward triggers, and the top band requires hands-on or independent third-party evidence (vendor-only claims cap a dimension at 16). Detailed per-dimension anchors live in the scoring guide.
| Band | Description |
|---|---|
| 1 (lowest) | Capability essentially missing, or an active disqualifier for this dimension |
| 2 (Below-Baseline) | Present but below what a serious tool in this category is expected to offer; named gaps pull a tool down here |
| 3 (middle) | Meets the expected baseline for a credible tool in its cohort. This is not a default: a tool must clear the Below-Baseline downward triggers to be here |
| 4 | Demonstrably better than the cohort median on this dimension, with specific differentiators named relative to ≥2 named competitors |
| 5 (top) | Sets the bar the category measures against. Evidence-gated: requires hands-on or independent third-party evidence; vendor-only claims cap this dimension at 16 |
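The evidence gate on the top band behaves like a simple clamp. The sketch below is illustrative only: the function name and the boolean flag are hypothetical, and only the cap value of 16 comes from the rubric.

```python
VENDOR_ONLY_CAP = 16  # from the rubric: vendor-only claims cap a dimension at 16


def gate_dimension_score(score: int, has_independent_evidence: bool) -> int:
    """Clamp a 1-20 dimension score when only vendor claims support it.

    `has_independent_evidence` is a hypothetical flag standing in for
    "hands-on or independent third-party evidence".
    """
    if not 1 <= score <= 20:
        raise ValueError("dimension scores run from 1 to 20")
    if has_independent_evidence:
        return score
    return min(score, VENDOR_ONLY_CAP)
```

Under this reading, a proposed score of 19 backed only by vendor claims is recorded as 16, while scores at or below 16 are unaffected by the gate.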
The Rating is the pure capability score (0–100) — what the tool can do when working as intended. It is an internal sort key. What you see in the UI is the tier derived from that score.
Formula:
Rating = avg(Autonomy, Integration, Context, Compliance, Viability, Interface) × 5
Each dimension is scored 1–20. The average across 6 dimensions is multiplied by 5 to produce a 0–100 scale (where 20 × 5 = 100). In practice, no tool scores 20 in every dimension, so the practical ceiling for well-rounded tools is in the mid-to-high 70s.
Example Calculation:
Autonomy: 12, Integration: 15, Context: 16, Compliance: 12, Viability: 15, Interface: 17
Average = (12 + 15 + 16 + 12 + 15 + 17) ÷ 6 = 87 ÷ 6 = 14.5
Rating = 14.5 × 5 = 72.5 → displayed as 73 → tier: Proven
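The calculation above can be reproduced in a few lines of Python. This is a minimal sketch, not the Radar's actual implementation: the function names are hypothetical, and the round-half-up display rule is an assumption inferred from the example (72.5 shown as 73).

```python
import math

DIMENSIONS = ("Autonomy", "Integration", "Context",
              "Compliance", "Viability", "Interface")


def aces_rating(scores: dict[str, int]) -> float:
    """Average the six 1-20 dimension scores and multiply by 5."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS) * 5


def displayed_rating(rating: float) -> int:
    """Round half up (assumed display rule).

    Python's built-in round() rounds halves to the nearest even integer,
    so round(72.5) gives 72; floor(x + 0.5) gives 73 instead.
    """
    return math.floor(rating + 0.5)


example = {"Autonomy": 12, "Integration": 15, "Context": 16,
           "Compliance": 12, "Viability": 15, "Interface": 17}
rating = aces_rating(example)
print(rating, displayed_rating(rating))  # 72.5 73
```

Because every dimension scores at least 1, the lowest rating this formula can produce is 5, not 0.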
Rating Tiers
The numeric rating is an internal sort key; the tier is the primary label shown in the UI. Scores are currently research-grade for most tools — treat a Tracked or Detected tool's tier as directional until hands-on testing improves evidence quality.
| Tier | Description |
|---|---|
| Leading | Top-performing tools with strong capabilities across most dimensions |
| Proven | Solid, enterprise-suitable tools with well-rounded scores |
| Emerging | Capable in specific areas; not yet broad enough for all enterprise contexts |
| Watch | Early-stage, limited scope, or constrained by active score caps |
Signal levels describe how much evidence supports a tool's rating. A high-rated tool at Detected carries more risk than a moderate-rated tool at Validated. Signal level is independent of rating.
| Signal Level | Meaning | Evidence Required |
|---|---|---|
| Validated | Production-validated in enterprise environments. Multiple independent deployments confirmed. | Named enterprise customers, production metrics, or independent audits. No critical open blockers. |
| Assessed | Active evaluation with substantial evidence. Hands-on testing completed or underway. | Internal evaluation data, multiple independent sources, documented trial findings. |
| Tracked | Known tool being monitored. Research-based assessment without hands-on testing. | Official documentation, community consensus, analyst coverage. No direct testing. |
| Detected | Recently identified. Minimal evaluation completed; scored from public signals only. | Vendor documentation, initial public signals. Evaluation has not yet commenced. |
Each evaluation carries an evidence grade (A–D) derived from three factors. The grade signals how much to trust the score — stronger evidence means less interpretation needed when comparing tools.
| Grade | Evidence Score | Profile |
|---|---|---|
| A | ≥ 0.75 | Recent (<30d), thorough evaluation, hands-on tested |
| B | 0.50–0.74 | Recent or thorough, with partial hands-on evidence |
| C | 0.25–0.49 | Moderate age or depth; limited or no hands-on testing |
| D | < 0.25 | Stale (>90d), minimal depth, no hands-on testing |
Evidence Factors: recency, depth of evaluation, and hands-on testing.
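The grade thresholds in the table can be expressed as a small lookup. The sketch below assumes the evidence score is a number between 0 and 1 and treats the table's ranges as half-open intervals (for example, 0.745 falls in B); how the score itself is computed from the three factors is not specified here, so it is taken as an input. The function name is hypothetical.

```python
def evidence_grade(evidence_score: float) -> str:
    """Map an evidence score (assumed to lie in 0..1) to a grade A-D."""
    if not 0.0 <= evidence_score <= 1.0:
        raise ValueError("evidence score is assumed to lie between 0 and 1")
    if evidence_score >= 0.75:
        return "A"
    if evidence_score >= 0.50:
        return "B"
    if evidence_score >= 0.25:
        return "C"
    return "D"
```

For instance, `evidence_grade(0.62)` returns `"B"` and `evidence_grade(0.10)` returns `"D"`.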
Fourteen score caps limit dimension scores when specific conditions are documented. Caps fall into five categories: security, capability, enterprise, trust, and stability. Temporary caps are removed when the triggering condition is resolved; permanent caps reflect fundamental design limitations.
| Cap | Trigger | Impact |
|---|---|---|
| Critical Security Vuln | Unpatched critical CVE or active security incident | Compliance ≤ 5 |
| No Codebase Indexing | No codebase indexing or semantic search capability | Context ≤ 11 |
| Single IDE Only | Only works in one IDE with no CLI or web option | Interface ≤ 10 |
| No Automation Mode | No CLI, API, or headless mode for CI/CD integration | Interface ≤ 12 |
| No Enterprise Features | No enterprise customers/features (SSO, RBAC, audit, compliance certs) | All dims ≤ 14 |
| Pricing Opacity | No public pricing or highly opaque pricing model | Compliance ≤ 15 |
| Pricing Volatility | Frequent pricing changes causing budget unpredictability | Compliance ≤ 12 |
| Severe Negative Sentiment | Widespread negative sentiment (sentiment score 1–2) | Status impact only |
| Reliability Complaints | Widespread reliability complaints (breaks often, unreliable output) | Autonomy ≤ 12 |
| Unvalidated Benchmarks | Vendor-only benchmark claims with no independent validation | Autonomy ≤ 14 |
| Community Exodus | Documented user exodus or mass migration away from the tool | All dims ≤ 12 |
| Stalled Development | No releases or meaningful updates in 90+ days | All dims ≤ 12 |
| Acquisition Uncertainty | Acquisition with unclear product roadmap | Compliance ≤ 12 |
| Funding Concerns | Funding or runway concerns | Compliance ≤ 10 |
Four capability caps were removed in June 2026 (no tools used them): manual-acceptance-required, no-git-integration, file-level-only, small-context-window. Cap count: 18 → 14.
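One way to read the cap table is as a set of per-dimension ceilings applied after scoring. The sketch below encodes a subset of the caps for illustration; the cap identifiers and the function name are hypothetical, and it assumes that when several caps touch the same dimension the lowest ceiling wins. Severe Negative Sentiment is omitted because its impact is on status, not on a dimension score.

```python
ALL_DIMENSIONS = ("Autonomy", "Integration", "Context",
                  "Compliance", "Viability", "Interface")

# Subset of the cap table: hypothetical cap id -> {dimension: ceiling}.
CAPS = {
    "critical-security-vuln": {"Compliance": 5},
    "no-codebase-indexing": {"Context": 11},
    "single-ide-only": {"Interface": 10},
    "no-automation-mode": {"Interface": 12},
    "stalled-development": {d: 12 for d in ALL_DIMENSIONS},
}


def apply_caps(scores: dict[str, int], active_caps: list[str]) -> dict[str, int]:
    """Return a copy of `scores` with every active cap's ceiling applied."""
    capped = dict(scores)
    for cap in active_caps:
        for dimension, ceiling in CAPS[cap].items():
            # min() means the strictest ceiling wins when caps overlap.
            capped[dimension] = min(capped[dimension], ceiling)
    return capped


raw = {"Autonomy": 12, "Integration": 15, "Context": 16,
       "Compliance": 12, "Viability": 15, "Interface": 17}
capped = apply_caps(raw, ["single-ide-only", "no-automation-mode"])
print(capped["Interface"])  # 10
```

Here two Interface caps are active at once, and the stricter one (Interface ≤ 10) determines the result; the other five dimensions are unchanged.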
A standardized 60-minute first-contact testing protocol is under development. It will cover installation, a defined task scenario, error recovery, and a structured exit-interview rubric. Currently, most evaluations are research-based, with hands-on testing conducted opportunistically. Once formal developer-experience (DX) testing is completed for a tool, it will unlock Grade A evidence and improve signal-level confidence.
The same tool presents different risk profiles depending on use case. A tool assessed as Validated for frontend prototyping carries different implications than one writing data migrations or touching production infrastructure. Scores reflect general enterprise-engineering capability; teams should layer their own use-case risk assessment on top of signal levels.
Evaluations are published on a monthly release cycle. The AI engineering landscape moves too fast for quarterly cycles — significant security incidents, pricing changes, and major releases can shift a tool's position within days.
Core framework effective April 2026 (ACES v2): 6-dimension model and three-signal model (Rating / Signal Level / Evidence Grade).
Phase A (June 2026): the 0–100 composite rating is presented as a 4-tier band (Leading / Proven / Emerging / Watch); the numeric score is retained as an internal sort key.
Phase B (June 2026): the per-dimension rubric anchors were recalibrated to be comparative, forced-distribution, and evidence-gated, and the 7-level rubric was consolidated to 5 bands. The dimension count is unchanged at 6.