Benchmark: Claude Code defaults to DIY, not SaaS tools

Amplifying benchmarked 2,430 Claude Code runs to see which tools it picks when none are suggested in the prompt. In 12 of 20 categories, its most common choice was Custom/DIY: building the capability itself rather than adopting a tool. When it does choose vendors, defaults dominate: GitHub Actions, Stripe, shadcn/ui, Vercel.


TL;DR

  • Benchmark of Claude Code tool picks: 2,430 runs, real repos, no tool names in prompts; 85.3% parseable picks
  • Custom/DIY dominates: most common in 12/20 categories; 252 instances, including feature flags, auth, caching
  • Hard default winners: GitHub Actions 93.8%, Stripe 91.4%, shadcn/ui 90.1%, Vercel 100% (JS deploy)
  • Model shifts (“recency gradient”): Prisma 79%→0%, Drizzle→100% (JS ORM); newer models trend more Custom/DIY
  • Incumbents rarely primary picks: Redux 0/88, Jest 7/171, yarn 1/135; alternatives like Zustand, pnpm preferred
  • Deployment split: Vercel 100% (JS); Railway 82% (Python); AWS/GCP/Azure 0 primary picks in 112 responses

As AI-assisted coding shifts from autocomplete to agents that can make genuine architectural calls, tool choice stops being a background detail. A benchmark from Amplifying looks at that choice directly, measuring what Claude Code recommends, and what it decides to build instead, across 2,430 open-ended runs on real repositories with no tool names included in the prompts.

The study spans 3 models (Sonnet 4.5, Opus 4.5, Opus 4.6), 4 project types, and 20 tool categories, yielding an 85.3% extraction rate (2,073 “parseable picks”). Amplifying also notes an update: Sonnet 4.6 shipped Feb 17, 2026, and the benchmark will be rerun against it.

The headline result: it builds more than it buys

Across 12 of 20 categories, the most common behavior wasn’t selecting a vendor tool—it was Custom/DIY. That label shows up 252 times, more than any single named tool.

Examples in the report are concrete and consistent with a “roll it yourself” instinct:

  • For feature flags, Claude Code typically produces a config-driven approach using environment variables with percentage-based rollout, rather than pointing to a service like LaunchDarkly.
  • For authentication in Python, it goes fully custom: 100% Custom/DIY, with implementations built around JWT and password-hashing libraries (the source cites JWT + passlib, and also notes a JWT + bcrypt-from-scratch outcome in its summary examples).
  • For caching, the custom route often amounts to in-memory TTL wrappers.
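To make the "roll it yourself" posture concrete, here is a minimal Python sketch of two of the DIY patterns the report describes: an env-var feature flag with percentage-based rollout, and an in-memory TTL cache wrapper. The names, env-var conventions, and thresholds are illustrative assumptions, not code taken from the benchmarked outputs.

```python
import hashlib
import os
import time
from functools import wraps

def flag_enabled(flag: str, user_id: str) -> bool:
    """Hypothetical env-var feature flag with percentage rollout.

    FEATURE_<FLAG>=on enables the flag for everyone;
    FEATURE_<FLAG>_PCT=25 enables it for ~25% of users, chosen
    deterministically by hashing the user id so each user gets a
    stable answer across requests.
    """
    key = f"FEATURE_{flag.upper()}"
    if os.environ.get(key, "").lower() in ("1", "on", "true"):
        return True
    pct = int(os.environ.get(f"{key}_PCT", "0"))
    # Map the user into a stable bucket 0..99 and compare to the rollout %.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < pct

def ttl_cache(seconds: float):
    """Minimal in-memory TTL wrapper, in the spirit of the caching
    outcomes the report mentions (not a production cache: no eviction,
    no locking, positional args only)."""
    def decorator(fn):
        store = {}  # args tuple -> (expiry timestamp, cached value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]
            value = fn(*args)
            store[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator
```

Usage would look like `@ttl_cache(30)` over an expensive lookup, or `flag_enabled("new_ui", user.id)` at a branch point; the appeal, per the benchmark's framing, is that both fit in a few dozen lines with no external service.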

The upshot is less “which SaaS is best?” and more “what’s the simplest implementation that fits this repo right now?”—a posture that can materially shape how agent-authored codebases evolve.

When Claude Code chooses tools, it chooses hard defaults

Where the models do select third-party tools, the dataset shows a handful of strong default winners—often close to monopolies within their category:

  • CI/CD: GitHub Actions at 93.8% (152/162)
  • Payments: Stripe at 91.4% (64/70)
  • UI components: shadcn/ui at 90.1% (64/71)
  • Deployment (JS): Vercel at 100% (86/86 JS deployment picks)

The “default stack” list, as presented, skews toward a modern JS ecosystem set: PostgreSQL, Drizzle, NextAuth.js, Tailwind CSS, Vitest, pnpm, Sentry, Resend, Zustand, and React Hook Form, among others.

Model personalities and the “recency gradient”

Amplifying characterizes each model’s selection style:

  • Sonnet 4.5 trends conventional, with strong preferences like Redis at 93% for Python caching, Prisma at 79% for JS ORM, and Celery at 100% for Python jobs.
  • Opus 4.5 is described as balanced and is most likely to name a specific tool (86.7%), distributing picks more evenly.
  • Opus 4.6 appears more forward-looking: Drizzle at 100% for JS ORM and a shift toward newer job tooling like Inngest (50%) in JS, plus a higher tendency to go Custom/DIY (11.4%).

That progression shows up clearly in the “recency gradient” comparisons. In the JS ORM category, Prisma drops from 79% under Sonnet 4.5 to 0% under Opus 4.6, while Drizzle rises to 100% of extracted ORM picks. Similar generational change appears in Python jobs (Celery collapsing in newer models) and caching (Redis falling as Custom/DIY rises).

Notably absent: familiar incumbents

A separate section flags tools with large market share that Claude Code rarely selects as primary picks:

  • Redux: 0/88 primary picks (though 23 mentions), with Zustand chosen 57 times
  • Express: absent as a primary choice in API layer prompts
  • Jest: 7/171 primary picks (around 4%), despite appearing as an alternative in 31 cases
  • Package managers show a similar tilt: yarn gets 1/135 primary picks, while pnpm is frequently preferred

Even when these tools are recognized, the model’s “pick” behavior leans toward newer or more ecosystem-native defaults.

Deployment splits: Vercel for JS, Railway for Python—and no traditional cloud as primary

Deployment decisions are described as stack-determined:

  • For JS frontend (Next.js + React SPA), the benchmark reports a clean sweep: Vercel at 100% (86 of 86).
  • For Python backend (FastAPI), Railway leads at 82%, with Docker at 8%, Fly.io at 5%, and Render at 5%.

One of the more striking data points: across 112 deployment responses, traditional cloud providers (AWS, GCP, Azure) receive zero primary picks. Some services appear as alternatives—Netlify (67 alt), Cloudflare Pages (30 alt), GitHub Pages (26 alt), DigitalOcean (7 alt)—while AWS Amplify is “mentioned but never recommended” as an alternative pick in the dataset.

Differences between models are real—but limited to a few categories

Despite the generational shifts, the benchmark reports agreement in 18 of 20 categories within each ecosystem. The places where models genuinely diverge include ORM (JS), jobs (JS and Python), caching, and real-time—categories where either newer tooling competes with older defaults or “build it” competes with service adoption.

Original source
