§ BUILD-OFF · ROUND 005

TIER I

SAMPLE · IN-REPO FALLBACK

§ ROUND

3D rotating planet

Same prompt. Every tool. Cost in dollars, time in seconds, output in receipts. We're not trying to win — we're showing what each tool is best at, then routing your job to it.

§ SAMPLE DATA — NOT YET A LIVE VERIFIED RUN

Numbers below are plausible estimates from public pricing and reported behavior, shown to demonstrate the methodology. The first verified Build-Off — same prompt run live through every tool, with screen captures and token receipts — drops shortly. Get notified at the bottom of the page.

§ THE PROMPT

BRIEF

Tier I visual proficiency challenge. One HTML file, Three.js from CDN, no build step. Tests whether each tool can produce something genuinely impressive under tight, portable constraints — the lowest common denominator that every participant can hit.

DATE · 2026-05-10

PROMPT (VERBATIM)

Produce a single index.html with no build step. Load Three.js r165 from the jsDelivr CDN. Render a 3D sphere with a realistic surface: use MeshStandardMaterial with a roughness map procedurally generated from canvas noise — no external texture files. Add a soft atmospheric rim-glow using an additive backface sphere. Rotate the planet on its axis continuously. Support click-and-drag orbit. Background: deep space with at least 300 randomly placed stars rendered as Points. The result must look impressive at 600×400. No frameworks, no bundlers, no npm.
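The prompt leans on two pieces of procedural generation: a canvas-noise roughness map and a field of 300+ random stars. Below is a minimal sketch of both as pure TypeScript helpers with no Three.js dependency, so they can be checked outside a browser. The names `mulberry32`, `makeNoiseRGBA`, and `makeStarPositions`, the seeded PRNG, the octave count, and the star-shell radius are our assumptions. They are not part of the brief and not taken from any tool's submission.

```typescript
// Seeded PRNG (mulberry32): same seed, same texture and star field on every load.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

const lerp = (a: number, b: number, t: number): number => a + (b - a) * t;
const smoothstep = (t: number): number => t * t * (3 - 2 * t);

// Grayscale fractal value noise as an RGBA buffer (width * height * 4 bytes).
// The lattice wraps in x so the texture tiles around the sphere's U seam.
function makeNoiseRGBA(width: number, height: number, seed = 1, octaves = 4): Uint8ClampedArray {
  const rand = mulberry32(seed);
  const field = new Float32Array(width * height);
  let amplitude = 1;
  let totalAmplitude = 0;
  for (let o = 0; o < octaves; o++) {
    const cells = 4 << o; // lattice cells across the texture: 4, 8, 16, ...
    const lattice = new Float32Array(cells * (cells + 1)); // cells columns, cells + 1 rows
    for (let i = 0; i < lattice.length; i++) lattice[i] = rand();
    const at = (ix: number, iy: number): number => lattice[iy * cells + (ix % cells)];
    for (let y = 0; y < height; y++) {
      const fy = (y / height) * cells;
      const iy = Math.floor(fy);
      const ty = smoothstep(fy - iy);
      for (let x = 0; x < width; x++) {
        const fx = (x / width) * cells;
        const ix = Math.floor(fx);
        const tx = smoothstep(fx - ix);
        const top = lerp(at(ix, iy), at(ix + 1, iy), tx);
        const bottom = lerp(at(ix, iy + 1), at(ix + 1, iy + 1), tx);
        field[y * width + x] += amplitude * lerp(top, bottom, ty);
      }
    }
    totalAmplitude += amplitude;
    amplitude *= 0.5; // each finer octave contributes half as much
  }
  const out = new Uint8ClampedArray(width * height * 4);
  for (let i = 0; i < field.length; i++) {
    const v = Math.round((field[i] / totalAmplitude) * 255);
    out[i * 4] = v;       // R
    out[i * 4 + 1] = v;   // G (the channel Three.js reads for roughnessMap)
    out[i * 4 + 2] = v;   // B
    out[i * 4 + 3] = 255; // opaque
  }
  return out;
}

// `count` star positions as x, y, z triples: uniform in direction,
// between `radius` and 2 * radius from the origin.
function makeStarPositions(count = 300, radius = 50, seed = 2): Float32Array {
  const rand = mulberry32(seed);
  const pos = new Float32Array(count * 3);
  for (let i = 0; i < count; i++) {
    const z = rand() * 2 - 1; // uniform z yields a uniform direction on the sphere
    const theta = rand() * 2 * Math.PI;
    const ring = Math.sqrt(1 - z * z);
    const dist = radius * (1 + rand());
    pos[i * 3] = dist * ring * Math.cos(theta);
    pos[i * 3 + 1] = dist * ring * Math.sin(theta);
    pos[i * 3 + 2] = dist * z;
  }
  return pos;
}
```

To use them in the single index.html, strip the type annotations. The noise buffer would go into `new ImageData(buffer, width, height)`, onto a canvas via `ctx.putImageData(imageData, 0, 0)`, then into `new THREE.CanvasTexture(canvas)` assigned to the material's `roughnessMap`. The star positions fit `geometry.setAttribute('position', new THREE.BufferAttribute(positions, 3))` on a `THREE.Points` object. One common way to build the rim glow is a slightly larger sphere whose material uses `side: THREE.BackSide` and `blending: THREE.AdditiveBlending`.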

§ VISUAL SHOWCASE

What each tool built.

Same prompt. Same constraints. Submissions appear as tools complete — rankings unlock once everyone is in.

Tier I — Single file

One index.html · CDN deps only · no build step

1 submitted · 1 withdrew · 1 not entered · 4 pending

Soupy Together

harness-served

Lovable Pro

Not entered · this tier

Bolt

Awaiting submission

v0 by Vercel

Awaiting submission

Cursor

Withdrew

Declined this round

Replit Agent

Awaiting submission

Claude Code

Awaiting submission

Rankings unlock when the round launches. Operators can launch early if a tool won't submit — missing tools appear as "did not submit."

§ LEADERBOARD

Composite ranking

Each measure is min-max normalized to 0–100 within this round (best tool = 100, worst = 0), then weighted. See methodology below.
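The per-tool scores in this round are consistent with plain min-max normalization: the best raw value maps to 100, the worst to 0, and everything else lands proportionally in between. A minimal sketch, assuming rounding to the nearest integer and an all-ties rule the page doesn't specify; `minMaxNormalize` is our name, not a published API.

```typescript
// Min-max normalization within one round: best tool scores 100, worst scores 0.
// Set lowerIsBetter for Cost, Time to first preview, and Bundle size.
function minMaxNormalize(raw: number[], lowerIsBetter: boolean): number[] {
  const min = Math.min(...raw);
  const max = Math.max(...raw);
  // Assumption: if every tool ties, everyone gets 100 (the page does not say).
  if (max === min) return raw.map(() => 100);
  return raw.map((x) => {
    const t = lowerIsBetter ? (max - x) / (max - min) : (x - min) / (max - min);
    return Math.round(t * 100);
  });
}

// Leaderboard columns in table order: Soupy Together, Claude Code, Lovable Pro,
// Cursor, v0 by Vercel, Bolt, Replit Agent.
const costUsd = [0.0, 0.38, 0.28, 0.45, 0.22, 0.32, 0.55];
const ttfpSeconds = [8, 38, 22, 54, 14, 19, 48];

console.log(minMaxNormalize(costUsd, true));     // 100, 31, 49, 18, 60, 42, 0
console.log(minMaxNormalize(ttfpSeconds, true)); // 100, 35, 70, 0, 87, 76, 13
```

The same function with `lowerIsBetter = false` turns the raw Correct and Honest columns into the per-tool Code correctness and Honesty scores.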

§ HOW TO READ THIS

Soupy Together is the daily driver, not the top scorer on every measure. v0 will out-design us on visuals. Cursor and Claude Code will out-correct us on hairy refactors. That's fine: when your job needs one of those, we route to it and you pay for that call. The rest of the time, the cheapest tool that can finish the job finishes the job. Look at the Cost column, then look at Composite.
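A toy version of that routing rule, using this round's sample numbers: each job sets optional score floors, and the cheapest tool that clears them wins. `route`, the floor values, and treating within-round normalized scores as a capability measure are all hypothetical; this is not Soupy Together's actual router.

```typescript
// Capability scores are this round's normalized values (sample data, not live runs).
interface Tool {
  name: string;
  costUsd: number;
  visual: number;   // Visual fidelity, normalized 0–100
  refactor: number; // Refactor reliability, normalized 0–100
}

// Hypothetical per-job floors, e.g. { refactor: 80 } for a hairy refactor.
type Floors = { visual?: number; refactor?: number };

// Cheapest tool that clears every floor the job sets; null if none does.
function route(tools: Tool[], floors: Floors): Tool | null {
  let pick: Tool | null = null;
  for (const t of tools) {
    const clears =
      (floors.visual === undefined || t.visual >= floors.visual) &&
      (floors.refactor === undefined || t.refactor >= floors.refactor);
    if (clears && (pick === null || t.costUsd < pick.costUsd)) pick = t;
  }
  return pick;
}

const round005: Tool[] = [
  { name: "Soupy Together", costUsd: 0.0, visual: 0, refactor: 60 },
  { name: "Claude Code", costUsd: 0.38, visual: 44, refactor: 100 },
  { name: "Lovable Pro", costUsd: 0.28, visual: 88, refactor: 35 },
  { name: "Cursor", costUsd: 0.45, visual: 49, refactor: 85 },
  { name: "v0 by Vercel", costUsd: 0.22, visual: 100, refactor: 0 },
  { name: "Bolt", costUsd: 0.32, visual: 73, refactor: 10 },
  { name: "Replit Agent", costUsd: 0.55, visual: 59, refactor: 20 },
];

console.log(route(round005, {})?.name);               // Soupy Together
console.log(route(round005, { refactor: 80 })?.name); // Claude Code
console.log(route(round005, { visual: 90 })?.name);   // v0 by Vercel
```

With no floors the $0.00 tool wins. A refactor floor of 80 leaves only Claude Code and Cursor, and Claude Code is the cheaper of the two.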

#	Tool	Composite	Cost	TTFP	Correct	Honest
01	Soupy Together	68	$0.00	8s	100	60
02	Claude Code	56	$0.38	38s	93	86
03	Lovable Pro	49	$0.28	22s	84	70
04	Cursor	46	$0.45	54s	92	82
05	v0 by Vercel	42	$0.22	14s	76	66
06	Bolt	34	$0.32	19s	78	62
07	Replit Agent	19	$0.55	48s	80	65

TTFP = time-to-first-preview (wall-clock, prompt → reachable preview). TTFT = time-to-first-token from the model API, shown when the tool exposes it; not weighted in the composite. Time→Green = wall-clock until the preview is reachable AND tests are green. TTFT and Time→Green are populated only by harness-driven runs, and neither appears in this sample table.
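A sketch of how a harness could derive those three timings from a timestamped event log. The event names and the `timings` helper are hypothetical. The only rule taken from the definitions above is that Time→Green needs both a reachable preview and green tests, which we read as the later of the two first occurrences.

```typescript
// Hypothetical event names; a real harness may log something different.
type EventKind = "prompt_submitted" | "first_token" | "preview_reachable" | "tests_green";
interface HarnessEvent { kind: EventKind; tMs: number } // wall-clock milliseconds

// Earliest timestamp of a given kind, or null if it never happened.
function firstAt(events: HarnessEvent[], kind: EventKind): number | null {
  let t: number | null = null;
  for (const e of events) {
    if (e.kind === kind && (t === null || e.tMs < t)) t = e.tMs;
  }
  return t;
}

function secondsSince(startMs: number, tMs: number | null): number | null {
  return tMs === null ? null : (tMs - startMs) / 1000;
}

function timings(events: HarnessEvent[]) {
  const start = firstAt(events, "prompt_submitted");
  if (start === null) throw new Error("run has no prompt_submitted event");
  const preview = firstAt(events, "preview_reachable");
  const green = firstAt(events, "tests_green");
  // Assumes neither condition flips back once reached, so "reachable AND green"
  // first holds at the later of the two timestamps.
  const both = preview === null || green === null ? null : Math.max(preview, green);
  return {
    ttfp: secondsSince(start, preview),
    ttft: secondsSince(start, firstAt(events, "first_token")), // null if the tool hides tokens
    timeToGreen: secondsSince(start, both),
  };
}

const run: HarnessEvent[] = [
  { kind: "prompt_submitted", tMs: 1000 },
  { kind: "first_token", tMs: 1500 },
  { kind: "preview_reachable", tMs: 15000 },
  { kind: "tests_green", tMs: 21000 },
];
console.log(timings(run)); // { ttfp: 14, ttft: 0.5, timeToGreen: 20 }
```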

§ VISUALS

The shape of the round.

Three views of the same data: composite ranking, every normalized score in one matrix, and the cost-vs-correctness frontier.

FIG. 01 — COMPOSITE SCORE, ALL TOOLS

FIG. 02 — NORMALIZED SCORE MATRIX (0–100, PER MEASURE)

FIG. 03 — COST vs. CORRECTNESS · BUBBLE = COMPOSITE
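The cost-vs-correctness frontier can be read as a Pareto frontier: a tool is on it if no other tool is both at least as cheap and at least as correct. We're assuming that's what "frontier" means in FIG. 03. A minimal sketch against the Cost and raw Correct columns; `paretoFrontier` is our name.

```typescript
interface Entry { name: string; costUsd: number; correct: number }

// Keep every tool that no other tool beats on both axes: at most the cost,
// at least the correctness, and strictly better on one of the two.
function paretoFrontier(entries: Entry[]): Entry[] {
  return entries.filter((p) =>
    !entries.some((q) =>
      q !== p &&
      q.costUsd <= p.costUsd &&
      q.correct >= p.correct &&
      (q.costUsd < p.costUsd || q.correct > p.correct)));
}

// Cost and raw Correct columns from the leaderboard, in table order.
const leaderboard: Entry[] = [
  { name: "Soupy Together", costUsd: 0.0, correct: 100 },
  { name: "Claude Code", costUsd: 0.38, correct: 93 },
  { name: "Lovable Pro", costUsd: 0.28, correct: 84 },
  { name: "Cursor", costUsd: 0.45, correct: 92 },
  { name: "v0 by Vercel", costUsd: 0.22, correct: 76 },
  { name: "Bolt", costUsd: 0.32, correct: 78 },
  { name: "Replit Agent", costUsd: 0.55, correct: 80 },
];

console.log(paretoFrontier(leaderboard).map((e) => e.name));
// [ 'Soupy Together' ]
console.log(paretoFrontier(leaderboard.slice(1)).map((e) => e.name));
// [ 'Claude Code', 'Lovable Pro', 'v0 by Vercel' ]
```

With the sample numbers, Soupy Together's $0.00 / 100 dominates everyone, so it sits on the frontier alone. Drop it and the frontier becomes Claude Code, Lovable Pro, and v0 by Vercel.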

§ FULL RESULTS

Per-tool breakdown

RANK 01

Soupy Together

Could have a little more depth.

COMPOSITE · 68

Cost per output

$0.00

100/100

Time to first preview

8s

100/100

Visual fidelity

0/100

Code correctness

100

100/100

Refactor reliability

60/100

Honesty under uncertainty

0/100

Bundle size

6 kB

100/100

RANK 02

Claude Code

Error-free, structured geometry code. Visuals minimal — no rim glow, basic star field.

COMPOSITE · 56

Cost per output

$0.38

31/100

Time to first preview

38s

35/100

Visual fidelity

44/100

Code correctness

71/100

Refactor reliability

100/100

Honesty under uncertainty

100/100

Bundle size

12 kB

0/100

RANK 03

Lovable Pro

Strong visual output. Attempted to scaffold React — needed a prompt nudge to stay single-file.

COMPOSITE · 49

Cost per output

$0.28

49/100

Time to first preview

22s

70/100

Visual fidelity

88/100

Code correctness

33/100

Refactor reliability

35/100

Honesty under uncertainty

38/100

Bundle size

12 kB

0/100

RANK 04

Cursor

Cleanest Three.js code. Visually plain — prioritized correctness over impressiveness.

COMPOSITE · 46

Cost per output

$0.45

18/100

Time to first preview

54s

0/100

Visual fidelity

49/100

Code correctness

67/100

Refactor reliability

85/100

Honesty under uncertainty

85/100

Bundle size

12 kB

0/100

RANK 05

v0 by Vercel

Best-looking sphere. Surface texture richest of the set. Drag orbit slightly jumpy.

COMPOSITE · 42

Cost per output

$0.22

60/100

Time to first preview

14s

87/100

Visual fidelity

100/100

Code correctness

0/100

Refactor reliability

0/100

Honesty under uncertainty

23/100

Bundle size

12 kB

0/100

RANK 06

Bolt

Fast. Surface detail thinner, glow present.

COMPOSITE · 34

Cost per output

$0.32

42/100

Time to first preview

19s

76/100

Visual fidelity

73/100

Code correctness

8/100

Refactor reliability

10/100

Honesty under uncertainty

8/100

Bundle size

12 kB

0/100

RANK 07

Replit Agent

Ran end-to-end. Stars sparse, atmosphere thin.

COMPOSITE · 19

Cost per output

$0.55

0/100

Time to first preview

48s

13/100

Visual fidelity

59/100

Code correctness

17/100

Refactor reliability

20/100

Honesty under uncertainty

19/100

Bundle size

12 kB

0/100

§ METHODOLOGY

How we measure.

Cost per output · WEIGHT 22%

Dollars to ship the same feature, end to end.

Unit: USD · Lower is better

Time to first preview · WEIGHT 12%

Seconds from prompt submitted to a running preview.

Unit: seconds · Lower is better

Visual fidelity · WEIGHT 16%

Design quality of the generated UI vs. the spec brief.

Unit: score · Higher is better

Code correctness · WEIGHT 20%

Runtime errors, type safety, test outcomes.

Unit: score · Higher is better

Refactor reliability · WEIGHT 14%

Preservation of correctness across follow-up changes.

Unit: score · Higher is better

Honesty under uncertainty · WEIGHT 10%

Scored from the rate of confabulation on ambiguous input: the less a tool confabulates, the higher its score.

Unit: score · Higher is better

Bundle size · WEIGHT 6%

Bytes shipped per equivalent output, gzipped.

Unit: kB · Lower is better
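Putting the weights together: the composite is the weighted sum of the seven normalized scores, divided by 100 and rounded. The rounding step is our assumption, and `composite` is our name for the helper, but this sketch reproduces every Composite value in the leaderboard from the per-tool breakdown.

```typescript
// Weights from the methodology (percent; they sum to 100).
const WEIGHTS = {
  cost: 22, ttfp: 12, visual: 16, correct: 20, refactor: 14, honest: 10, bundle: 6,
};
type Measure = keyof typeof WEIGHTS;
type Scores = Record<Measure, number>; // normalized 0–100 per measure

// Weighted sum of normalized scores, scaled back to 0–100 and rounded.
function composite(s: Scores): number {
  let sum = 0;
  for (const m of Object.keys(WEIGHTS) as Measure[]) sum += WEIGHTS[m] * s[m];
  return Math.round(sum / 100);
}

// Normalized scores from the per-tool breakdown.
const soupy: Scores = { cost: 100, ttfp: 100, visual: 0, correct: 100, refactor: 60, honest: 0, bundle: 100 };
const claudeCode: Scores = { cost: 31, ttfp: 35, visual: 44, correct: 71, refactor: 100, honest: 100, bundle: 0 };

console.log(composite(soupy), composite(claudeCode)); // 68 56
```

Because Cost and Correctness alone carry 42% of the weight, a free tool with clean code can top the composite while scoring 0 on Visual fidelity.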

§ EDITORIAL NEUTRALITY

No tool can pay for placement. Soupy Together is itself in every Build-Off and competes on the same terms. We publish the prompts, the raw outputs, the screen captures, and the receipts. If we lose, we publish that too.

Get notified when the first verified round drops.