Factory

Autonomous Agents

Proven

75.0/100

Enterprise-focused autonomous agent with strong, though no longer #1, benchmark standing and a broad governance and interface surface. Differentiators: LLM-agnostic design (GPT-5.x, Claude, Gemini, o-series); agent scaffolding that beats OpenAI's own Codex agent by 2.2pts on the same model on Terminal-Bench 2.0; Custom Droids for specialized subagents; and 40+ MCP integrations.

Benchmark note (May 2026): Droid + GPT-5.3-Codex scores 77.3% on Terminal-Bench 2.0 but has been overtaken: Codex CLI + GPT-5.5 (~82%), Claude Mythos (~82%), and others now lead. The #1 claim from late 2025 (63.1%) is retired.

The Agent Readiness framework assesses a codebase systematically across 8 technical pillars and 5 maturity levels, with automated remediation to improve agent effectiveness.

Enterprises wanting async agent capability with a strong compliance posture should evaluate Factory. Teams needing multi-model flexibility (switching between GPT-5, Claude, and Gemini mid-task) and interface-agnostic operation will find it a compelling option. Pricing is transparent, and the Free tier with BYOK (bring your own key) makes evaluation accessible. BAA availability would enable healthcare sector adoption, but it was not confirmed this cycle (see Risks & Limitations).

AI Autonomy: 14/20
Integration: 15/20
Contextual Understanding: 15/20
Compliance: 15/20
Viability: 15/20
User Interface: 16/20
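
The six sub-scores sum to 90 out of a possible 120, which matches the 75.0/100 headline if the composite is an equal-weight sum rescaled to 100. The scorecard does not state its weighting, so the sketch below is an assumption that happens to reproduce the published number, not a description of the actual scoring method. The names SUBSCORES and composite_score are illustrative.

```python
# Sub-scores copied from the scorecard; each category is scored out of 20.
SUBSCORES = {
    "AI Autonomy": 14,
    "Integration": 15,
    "Contextual Understanding": 15,
    "Compliance": 15,
    "Viability": 15,
    "User Interface": 16,
}
MAX_PER_CATEGORY = 20


def composite_score(subscores, max_per_category=MAX_PER_CATEGORY):
    """Equal-weight composite rescaled to 0-100 (assumed method)."""
    total = sum(subscores.values())
    maximum = max_per_category * len(subscores)
    return round(100 * total / maximum, 1)


print(composite_score(SUBSCORES))  # 75.0
```

Under this reading, a one-point change in any single category moves the headline score by about 0.8 points.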

Adoption & Proof Points

Customers: Nvidia, Morgan Stanley, Adobe, EY, Palo Alto Networks, Adyen, MongoDB, Bayer, Zapier; hundreds of thousands of developers use Droids daily.
Benchmarks:
Terminal-Bench 2.0: 77.3% (Droid + GPT-5.3-Codex), ~#6 as of May 2026 — leadership lost to Codex CLI+GPT-5.5 (~82%) and Claude Mythos (~82%)
Beats OpenAI's own Codex agent by 2.2pts on the same model (scaffolding advantage)
Growth: revenue reportedly doubled month over month for the 6 months before the Series C; $129M revenue cited for 2025
Funding: $150M Series C at a $1.5B valuation (Apr 2026, Khosla-led), 5x the $300M valuation of the $50M Series B (Sep 2025). Board: Keith Rabois (Khosla).
Pricing: Free (BYOK), Pro ($20/mo), Max ($200/mo), Ultra ($2k/mo), Enterprise (contact sales); overage ~$2.70 per 1M standard tokens, cached tokens 90% cheaper.
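
To make the overage figures concrete, here is a rough estimator built only from the two numbers quoted above: ~$2.70 per 1M standard tokens and a 90% discount for cached tokens (about $0.27 per 1M). It is a sketch under those assumptions. It ignores whatever token allowance each plan includes and any per-model multipliers, neither of which is given here, and the function name overage_cost is illustrative.

```python
STANDARD_RATE_PER_M = 2.70  # approx. USD per 1M standard tokens (quoted above)
CACHED_DISCOUNT = 0.90      # cached tokens are quoted as 90% cheaper


def overage_cost(standard_tokens, cached_tokens=0):
    """Estimate overage in USD, ignoring any plan-included allowance."""
    cached_rate = STANDARD_RATE_PER_M * (1 - CACHED_DISCOUNT)
    cost = (
        standard_tokens * STANDARD_RATE_PER_M + cached_tokens * cached_rate
    ) / 1_000_000
    return round(cost, 2)


# Example: a long-running task using 10M standard and 40M cached tokens.
print(overage_cost(10_000_000, 40_000_000))  # 37.8
```

The example shows why long-running or large-context tasks are the ones flagged for surprise charges: cost scales linearly with tokens, and the cached discount only helps when most of the context is actually reused.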

Recommended Use Cases

Enterprises wanting multi-model flexibility (GPT-5, Claude, Gemini, o3 in one subscription)
Teams with compliance requirements (ISO 42001, SOC 2, HIPAA with BAA)
Organizations needing interface-agnostic agents (CLI/IDE/Slack/Browser)
Large-scale migration, refactoring, and CI/CD automation
Teams wanting to capture tribal knowledge as Custom Droids
Organizations seeking to improve codebase agent-readiness systematically
Healthcare organizations requiring HIPAA-compliant AI development tools (confirm BAA availability first; see Risks & Limitations)

Risks & Limitations

Benchmark leadership lost: Terminal-Bench 2.0 #1 ceded — Droid+GPT-5.3-Codex (77.3%) now ~#6 vs ~82% leaders (Codex CLI+GPT-5.5, Claude Mythos)
SOC 2 Type I only: Type II, which attests to operational effectiveness over time, is unconfirmed at the primary source
No FedRAMP; HIPAA/BAA unconfirmed this cycle
Token-cost unpredictability: the token-billing model can produce surprise charges on large-context, long-running, or multi-model tasks (reported by 5+ independent sources)
Reliability/support complaints: documented stuck-session and slow-support reports (e.g., 2-month resolution)
Code quality depends on repo discipline: best results require a strong CI and review culture
Always-on/background agents are on the roadmap, not yet shipped
Hands-on testing needed: this assessment remains based on documentation and reviews
Customer-reported metrics (31x faster, 96.1% shorter migrations) need independent verification

Capabilities & Integration

Agentic depth: Terminal-Bench 2.0 score of 77.3% with Droid + GPT-5.3-Codex (May 2026, ~#6; leadership lost to Codex CLI+GPT-5.5 at ~82% and Claude Mythos at ~82%, though Factory still beats OpenAI's own Codex agent by 2.2pts on the same model). Missions orchestrates long-running work (median ~2hr, 14% >24hr, longest 16 days) through an orchestrator/workers/validators structure with 10+ Droids in parallel. Custom Droids let teams create specialized subagents with custom prompts, tool access, and model selection. Headless mode (droid exec) covers CI/CD, migrations, and batch scripts. Factory claims 31x faster feature delivery and 96.1% shorter migrations (vendor-reported, unverified).
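
The orchestrator/workers/validators structure can be pictured with a generic fan-out pattern. The sketch below is not Factory's implementation or API. The names worker, validator, and orchestrate are hypothetical, and the worker is a stand-in for an agent rather than a real Droid call. It only shows the shape: an orchestrator fans tasks out to parallel workers, then routes each result through a validator.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(task):
    # Hypothetical worker: stands in for one agent handling one unit of work.
    output = f"done: {task}" if task else ""
    return {"task": task, "output": output}


def validator(result):
    # Hypothetical validator: accepts only results with non-empty output.
    return bool(result["output"])


def orchestrate(tasks, max_workers=10):
    """Fan tasks out to parallel workers, then split results by validation."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, tasks))
    accepted = [r for r in results if validator(r)]
    rejected = [r for r in results if not validator(r)]
    return accepted, rejected


accepted, rejected = orchestrate(["migrate users table", "update CI config", ""])
print(len(accepted), len(rejected))  # 2 1
```

In a real multi-day run, rejected results would presumably be retried or escalated rather than just collected; that loop is omitted here.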

Agent Readiness (Jan 2026): Framework for measuring and improving a codebase's readiness for autonomous development. It covers 8 technical pillars (Style/Validation, Build System, Testing, Documentation, Dev Environment, Debugging/Observability, Security, Task Discovery) and 5 maturity levels with gated progression (80% threshold). Available via CLI (/readiness-report), Web Dashboard, and API. Automated remediation fixes failing criteria. Applied to open-source repos: CockroachDB (L4, 74%), FastAPI (L3, 53%), Express (L2, 28%). Evaluation variance was reduced from 7% to 0.6% through a grounding methodology.
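
One plausible reading of "gated progression (80% threshold)" is that a repository reaches level N only if it passes at least 80% of the criteria at level N and has already cleared every lower level. The sketch below encodes that reading. It is an assumption, not the framework's published algorithm; the function name maturity_level and the per-level pass rates are hypothetical, and how the real framework labels a repository that misses the first gate is not stated here.

```python
THRESHOLD = 0.80  # gate quoted above; its exact application is assumed


def maturity_level(pass_rates, threshold=THRESHOLD):
    """Return the highest level whose gate, and every lower gate, is cleared.

    pass_rates: fraction of criteria passed at levels 1..5, in order.
    Returns 0 if even the first gate is missed (assumed convention).
    """
    level = 0
    for rate in pass_rates:
        if rate < threshold:
            break  # gated: a failed level blocks all higher levels
        level += 1
    return level


# Hypothetical repo: clears levels 1-3, fails the level-4 gate.
print(maturity_level([1.0, 0.9, 0.85, 0.6, 0.2]))  # 3
```

Under this reading, strong results at a higher level cannot compensate for a missed lower gate, which is what makes the progression "gated" rather than a simple average.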

Context handling: Org-wide codebase "mental model" with real-time indexing. 40+ MCP integrations with OAuth registry for one-click setup. Session persistence across interfaces. Native integrations: GitHub/GitLab, Jira, Slack, PagerDuty, Datadog, Sentry, Google Drive.

Integration surface: CLI, VS Code, JetBrains, Vim, Slack, Linear, Browser. Multi-interface continuity: the same context follows the user across terminal, IDE, browser, and productivity tools.

Extensibility: LLM-agnostic (GPT-5, Claude Sonnet 4, OpenAI o3, Gemini 2.5 Pro, Claude Opus 4.1, GLM-4.6). Custom triggers and scripts via headless mode. Custom Droids as version-controlled team knowledge.