ACES v2 — Effective April 2026 · Anchors recalibrated June 2026

Evaluation Methodology

A systematic framework for assessing agentic developer tools across six dimensions of enterprise relevance.

The ACES Framework

ACES v2 is the evaluation methodology powering Agentic Tools Radar — every tracked tool is scored on six dimensions using a 5-band comparative rubric, then assigned a signal level based on the quality and recency of supporting evidence.

Each of the six dimensions is scored on a 1–20 scale; the six scores are averaged and multiplied by 5 to produce a rating on a 0–100 scale. A signal level (Validated, Assessed, Tracked, or Detected) reflects the quality of supporting evidence. An evidence grade (A–D) summarizes recency, depth, and hands-on testing.

Six Evaluation Dimensions

Each tool is assessed across six equally-weighted dimensions. Scores run 1–20 across five comparative bands.

Dimension	What It Measures	Key Questions
Autonomy	Ability to plan and execute tasks independently	Can it self-correct? Execute multi-step workflows? Operate in background?
Integration	VCS, CI/CD, project management, and team connectivity	Git/GitHub depth? PR workflows? CI/CD hooks? Issue tracking?
Context	Depth of codebase and ecosystem understanding	Full repo indexing? Cross-repo awareness? Long-context retention?
Compliance	Enterprise governance: certs, access controls, audit, deployment	SOC 2 Type II? SSO/SAML? Audit logs? Data residency?
Viability	Vendor sustainability and developer experience	Pricing transparency? Financial health? Community? DX quality?
Interface	Interaction surface breadth and maturity	IDE breadth? CLI/terminal? Web UI? API/headless?

Scoring Scale — 5-Band Rubric

Each dimension uses the same five comparative bands on the 1–20 scale. Bands are defined relative to the category cohort, not by counting features: the middle band must be earned by clearing named downward triggers, and the top band requires hands-on or independent third-party evidence (vendor-only claims cap a dimension at 16). Detailed per-dimension anchors live in the scoring guide.

1–5 · Absent / Disqualifying

Capability essentially missing, or an active disqualifier for this dimension

6–10 · Below Baseline

Present but below what a serious tool in this category is expected to offer; named gaps pull a tool down here

11–13 · Meets Baseline

Meets the expected baseline for a credible tool in its cohort — not a default; a tool must clear the Below-Baseline downward triggers to be here

14–16 · Above Cohort

Demonstrably better than the cohort median on this dimension, with specific differentiators named relative to ≥2 named competitors

17–20 · Best-in-Class

Sets the bar the category measures against — evidence-gated: requires hands-on or independent third-party evidence; vendor-only claims cap this dimension at 16
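The band boundaries and the evidence gate can be sketched in a few lines of Python. This is an illustrative sketch only: the names `BANDS`, `band_for`, and `apply_evidence_gate` are hypothetical and not part of ACES, and the gate is modeled as a simple clamp at 16 when no hands-on or independent third-party evidence exists.

```python
# Band boundaries as listed in the rubric: (low, high, label).
BANDS = (
    (1, 5, "Absent / Disqualifying"),
    (6, 10, "Below Baseline"),
    (11, 13, "Meets Baseline"),
    (14, 16, "Above Cohort"),
    (17, 20, "Best-in-Class"),
)

VENDOR_ONLY_CEILING = 16  # vendor-only claims cap a dimension at 16


def band_for(score: int) -> str:
    """Return the rubric band label for a 1-20 dimension score."""
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"score must be in 1..20, got {score}")


def apply_evidence_gate(score: int, has_independent_evidence: bool) -> int:
    """Clamp a score to 16 unless hands-on or third-party evidence exists.

    Assumption: the gate is a plain ceiling; the scoring guide may apply
    additional judgment that is not modeled here.
    """
    if has_independent_evidence:
        return score
    return min(score, VENDOR_ONLY_CEILING)
```

For example, a proposed score of 18 backed only by vendor claims would be clamped to 16 and land in the Above Cohort band rather than Best-in-Class.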

Rating & Tiers

The Rating is the pure capability score (0–100) — what the tool can do when working as intended. It is an internal sort key. What you see in the UI is the tier derived from that score.

Formula:

Rating = avg(Autonomy, Integration, Context, Compliance, Viability, Interface) × 5

Each dimension is scored 1–20. The average across the six dimensions is multiplied by 5 to map it onto a 0–100 scale (20 × 5 = 100); because the minimum dimension score is 1, the lowest attainable rating is 5. In practice, no tool scores 20 in every dimension, and the practical ceiling for well-rounded tools is in the mid-to-high 70s.

Example Calculation:

Autonomy: 12, Integration: 15, Context: 16, Compliance: 12, Viability: 15, Interface: 17
Average = (12 + 15 + 16 + 12 + 15 + 17) ÷ 6 = 87 ÷ 6 = 14.5
Rating = 14.5 × 5 = 72.5 → displayed as 73 → tier: Proven
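The calculation above can be reproduced with a short Python sketch. The function names are hypothetical, and the rounding rule (halves round up) is an assumption inferred from the example, where 72.5 is displayed as 73. Python's built-in `round` uses round-half-to-even and would return 72, so the sketch rounds explicitly.

```python
import math

DIMENSIONS = (
    "autonomy", "integration", "context",
    "compliance", "viability", "interface",
)


def aces_rating(scores: dict[str, int]) -> float:
    """Average the six 1-20 dimension scores and scale by 5."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 20:
            raise ValueError(f"{name} must be in 1..20, got {scores[name]}")
    # sum * 5 / 6 is the same as avg * 5, with a single division.
    return sum(scores[name] for name in DIMENSIONS) * 5 / 6


def displayed_rating(rating: float) -> int:
    """Round half up (assumed from the 72.5 -> 73 example)."""
    return math.floor(rating + 0.5)


example = {
    "autonomy": 12, "integration": 15, "context": 16,
    "compliance": 12, "viability": 15, "interface": 17,
}
print(aces_rating(example), displayed_rating(aces_rating(example)))  # 72.5 73
```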

Rating Tiers

As noted above, the numeric rating is an internal sort key and the tier is the primary label shown in the UI. Scores are currently research-grade for most tools, so treat a Tracked or Detected tool's tier as directional until hands-on testing improves evidence quality.

Leading · ≥ 76

Top-performing tools with strong capabilities across most dimensions

Proven · 64–75

Solid, enterprise-suitable tools with well-rounded scores

Emerging · 50–63

Capable in specific areas; not yet broad enough for all enterprise contexts

Watch · < 50

Early-stage, limited scope, or constrained by active score caps
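As a rough illustration, the tier thresholds map onto a simple lookup. The function name `tier_for` is hypothetical, and the sketch assumes the thresholds are applied to the rounded, displayed rating (an integer) rather than to the unrounded value; the source does not state which is used.

```python
def tier_for(displayed_rating: int) -> str:
    """Map a displayed 0-100 rating to its tier label.

    Assumption: tiers are derived from the rounded rating, so the
    integer ranges (64-75, 50-63) have no gaps between them.
    """
    if not 0 <= displayed_rating <= 100:
        raise ValueError(f"rating must be in 0..100, got {displayed_rating}")
    if displayed_rating >= 76:
        return "Leading"
    if displayed_rating >= 64:
        return "Proven"
    if displayed_rating >= 50:
        return "Emerging"
    return "Watch"
```

Under this assumption, the worked example's displayed rating of 73 falls in the Proven tier, matching the result shown in the example calculation.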

Signal Levels

Signal levels describe how much evidence supports a tool's rating. A high-rated tool at Detected carries more risk than a moderate-rated tool at Validated. Signal level is independent of rating.

Validated

Production-validated in enterprise environments. Multiple independent deployments confirmed.

Evidence required: Named enterprise customers, production metrics, or independent audits. No critical open blockers.

Assessed

Active evaluation with substantial evidence. Hands-on testing completed or underway.

Evidence required: Internal evaluation data, multiple independent sources, documented trial findings.

Tracked

Known tool being monitored. Research-based assessment without hands-on testing.

Evidence required: Official documentation, community consensus, analyst coverage. No direct testing.

Detected

Recently identified. Minimal evaluation completed; scored from public signals only.

Evidence required: Vendor documentation, initial public signals. Evaluation has not yet commenced.

Evidence Grades

Each evaluation carries an evidence grade (A–D) derived from three factors. The grade signals how much to trust the score — stronger evidence means less interpretation needed when comparing tools.

Grade	Evidence Score	Profile
A	≥ 0.75	Recent (<30d), thorough evaluation, hands-on tested
B	0.50–0.74	Recent or thorough, with partial hands-on evidence
C	0.25–0.49	Moderate age or depth; limited or no hands-on testing
D	< 0.25	Stale (>90d), minimal depth, no hands-on testing

Evidence Factors:

Recency (30%)
<30 days = 1.0, 30–90 days = 0.5, >90 days = 0.0

Evaluation Depth (35%)
Thorough = 1.0, Moderate = 0.5, Minimal = 0.0

Hands-On Testing (35%)
Tested = 1.0, Demo = 0.5, Not Tested = 0.0
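The weighted evidence score and its grade can be sketched as follows. All identifiers are hypothetical. The sketch works in integer percentages to avoid floating-point drift at the grade boundaries, and it assumes that an evaluation exactly 30 or 90 days old falls in the middle recency bucket.

```python
# Weights in percent, as listed under Evidence Factors.
RECENCY_WEIGHT, DEPTH_WEIGHT, HANDS_ON_WEIGHT = 30, 35, 35

# Factor values in percent (100 = 1.0, 50 = 0.5, 0 = 0.0).
DEPTH_LEVELS = {"thorough": 100, "moderate": 50, "minimal": 0}
HANDS_ON_LEVELS = {"tested": 100, "demo": 50, "not_tested": 0}


def recency_level(age_days: int) -> int:
    """<30 days = 100, 30-90 days = 50, >90 days = 0 (percent)."""
    if age_days < 30:
        return 100
    if age_days <= 90:
        return 50
    return 0


def evidence_score(age_days: int, depth: str, hands_on: str) -> float:
    """Weighted sum of the three factors, returned on a 0.0-1.0 scale."""
    total = (
        RECENCY_WEIGHT * recency_level(age_days)
        + DEPTH_WEIGHT * DEPTH_LEVELS[depth]
        + HANDS_ON_WEIGHT * HANDS_ON_LEVELS[hands_on]
    )  # integer in 0..10000
    return total / 10000


def evidence_grade(score: float) -> str:
    """Map an evidence score to a grade using the table's lower bounds."""
    if score >= 0.75:
        return "A"
    if score >= 0.50:
        return "B"
    if score >= 0.25:
        return "C"
    return "D"
```

With these weights, a recent and thorough evaluation without any hands-on testing scores 0.30 + 0.35 = 0.65, which is Grade B. Because every factor takes one of three values, possible scores are multiples of 0.025, so nothing falls between the table's printed ranges (for example between 0.74 and 0.75).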

Score Caps

14 score caps limit dimension scores when specific conditions are documented. Caps are grouped across security, capability, enterprise, trust, and stability categories. Temporary caps are removed when the triggering condition is resolved; permanent caps reflect fundamental design limitations.

Cap	Trigger	Impact	Type
Critical Security Vuln	Unpatched critical CVE or active security incident	Compliance ≤ 5	Temporary
No Codebase Indexing	No codebase indexing or semantic search capability	Context ≤ 11	Conditional
Single IDE Only	Only works in one IDE with no CLI or web option	Interface ≤ 10	Permanent
No Automation Mode	No CLI, API, or headless mode for CI/CD integration	Interface ≤ 12	Conditional
No Enterprise Features	No enterprise customers/features (SSO, RBAC, audit, compliance certs)	All dims ≤ 14	Conditional
Pricing Opacity	No public pricing or highly opaque pricing model	Compliance ≤ 15	Conditional
Pricing Volatility	Frequent pricing changes causing budget unpredictability	Compliance ≤ 12	Temporary
Severe Negative Sentiment	Widespread negative sentiment (sentiment score 1–2)	Status impact only	Temporary
Reliability Complaints	Widespread reliability complaints (breaks often, unreliable output)	Autonomy ≤ 12	Temporary
Unvalidated Benchmarks	Vendor-only benchmark claims with no independent validation	Autonomy ≤ 14	Conditional
Community Exodus	Documented user exodus or mass migration away from the tool	All dims ≤ 12	Temporary
Stalled Development	No releases or meaningful updates in 90+ days	All dims ≤ 12	Temporary
Acquisition Uncertainty	Acquisition with unclear product roadmap	Compliance ≤ 12	Temporary
Funding Concerns	Funding or runway concerns	Compliance ≤ 10	Temporary

Four capability caps were removed in June 2026 (no tools used them): manual-acceptance-required, no-git-integration, file-level-only, small-context-window. Cap count: 18 → 14.
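One way to picture how caps interact with dimension scores is as a per-dimension ceiling applied after scoring. The sketch below is an assumption about mechanics, not the Radar's implementation: the `ScoreCap` type, the `CAPS` registry keys, and `apply_caps` are hypothetical, and only three caps from the table are modeled.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "autonomy", "integration", "context",
    "compliance", "viability", "interface",
)


@dataclass(frozen=True)
class ScoreCap:
    name: str
    max_score: int
    # Dimensions the cap applies to; defaults to all six ("All dims").
    dimensions: tuple[str, ...] = DIMENSIONS


# Three caps copied from the table; the dictionary keys are illustrative.
CAPS = {
    "single_ide_only": ScoreCap("Single IDE Only", 10, ("interface",)),
    "no_enterprise_features": ScoreCap("No Enterprise Features", 14),
    "stalled_development": ScoreCap("Stalled Development", 12),
}


def apply_caps(scores: dict[str, int], active_caps: list[str]) -> dict[str, int]:
    """Return a copy of scores with every active cap applied as a ceiling."""
    capped = dict(scores)
    for key in active_caps:
        cap = CAPS[key]
        for dimension in cap.dimensions:
            capped[dimension] = min(capped[dimension], cap.max_score)
    return capped


raw = {
    "autonomy": 12, "integration": 15, "context": 16,
    "compliance": 12, "viability": 15, "interface": 17,
}
capped = apply_caps(raw, ["single_ide_only", "no_enterprise_features"])
```

Because each cap is a ceiling, the order in which caps are applied does not matter, and lifting a temporary cap amounts to dropping it from the active list and recomputing from the raw scores.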

DX Testing Protocol (Planned)

A standardized 60-minute first-contact testing protocol is under development. It will cover installation, a defined task scenario, error recovery, and a structured exit-interview rubric. Currently, most evaluations are research-based with hands-on testing conducted opportunistically. When formal DX testing is completed, it will unlock Grade A evidence and improve signal-level confidence.

Risk Context

The same tool presents different risk profiles depending on use case. A tool assessed as Validated for frontend prototyping carries different implications than one writing data migrations or touching production infrastructure. Scores reflect general enterprise-engineering capability; teams should layer their own use-case risk assessment on top of signal levels.

Evaluation Cadence

Evaluations are published on a monthly release cycle. The AI engineering landscape moves too fast for quarterly cycles — significant security incidents, pricing changes, and major releases can shift a tool's position within days.

Monthly re-scoring uses AI-assisted analysis of new releases, community signals, and market intelligence
Human spot-checks review AI-proposed score changes above a defined threshold before publication
Critical security incidents trigger immediate out-of-cycle updates
Quarterly data quality audits review completeness, score validity, and cap consistency across the full catalog

ACES v2 + Phase A tiers + Phase B anchor recalibration

Core framework effective April 2026 (ACES v2): 6-dimension model, three-signal model (Rating / Signal Level / Evidence Grade).
Phase A (June 2026): the 0–100 composite rating is presented as a 4-tier band (Leading / Proven / Emerging / Watch); the numeric score is retained as an internal sort key.
Phase B (June 2026): the per-dimension rubric anchors were recalibrated to be comparative, forced-distribution, and evidence-gated, and the 7-level rubric was consolidated to 5 bands; the dimension count is unchanged at 6.

June 2026
