Claude Haiku vs Local Models: The Real Tradeoff

Share

27.6 seconds vs 2.8 seconds. That gap isn't a benchmark footnote — it's the difference between a product people use and one they abandon.

I was building AskS1.com, a RAG system for querying the SpaceX S-1. The generation step — taking retrieved chunks and producing a cited answer — needed to be fast enough that someone would actually wait for it. I benchmarked five models to find out which one earned that spot: Claude Haiku, and four 7-14B local models running on a Mac Mini M4 via Ollama.

The overall numbers looked like a rounding error. The category breakdown told a different story.


How I Evaluated

15 questions across three categories, same retrieved context for every model.

Factual recall — can the model extract a specific number correctly?

"What is SpaceX's total revenue for 2025?"
"How many Starlink subscribers does SpaceX have as of Q1 2026?"
"What is SpaceX's total debt as of Q1 2026?"

Multi-step reasoning — does the model connect information across sections and form a judgment?

"Why is SpaceX's AI segment consuming 62-76% of capex but generating 
only 17% of revenue? Is this a concern?"
"Why can't Elon Musk be removed as CEO without his own approval?"
"How does SpaceX's vertical integration give it an advantage?"

Structured output — can the model follow formatting instructions precisely?

"Summarize SpaceX's three business segments in a markdown table 
with columns: Segment, Revenue, Operating Income, Key Product."
"List the top 5 risk factors in order of severity."
"Summarize Elon Musk's compensation structure in exactly 4 bullets."

One factual question was a deliberate curveball — "What RL algorithm does DeepSeek use?" — unrelated to SpaceX entirely, testing whether models would admit "I don't know" or hallucinate an answer just because the context was about a tech company.


Scoring

Two methods for two question types.

Factual recall — scored against ground-truth figures pulled directly from the filing. Exact numbers, keyword matching — does the answer contain the correct revenue figure, subscriber count, debt number.

Reasoning and structured output — scored 1-5 by Claude Sonnet as an LLM judge, evaluating coherence, accuracy, and instruction-following. These don't have single correct answers — "is this sustainable?" requires judgment, not pattern matching.


The Results

Model            Overall  Factual  Reasoning  Structured  Latency
claude-haiku        4.7     5.0       4.8        4.4       2.8s
phi4:14b            4.5     4.4       4.5        4.6      27.6s
qwen2.5:14b         4.4     4.4       4.2        4.6      26.9s
mistral:7b          4.4     4.4       4.0        4.6       9.0s
deepseek-r1:14b     4.3     4.4       3.8        4.6     102.8s

A 0.2-4.4 point spread on a 5-point scale looks like noise. It isn't — it's three different stories stacked on top of each other.


Where the Gap Actually Lives

Structured output: local models win. Every local model scored 4.6, ahead of Haiku's 4.4. Following "exactly 4 bullets" or "markdown table with these columns" doesn't require deep reasoning, and the local models were if anything slightly more literal about compliance.

Reasoning: this is where the real gap is. Haiku scored 4.8. deepseek-r1:14b scored 3.8 — a full point lower, despite taking 102.8 seconds per question, 37x Haiku's latency. These questions asked models to connect numbers across sections and form a judgment — "ARPU is declining but revenue is growing — is this sustainable, and why?" This is where size and training quality actually show up. Interestingly, phi4:14b (4.5) and qwen2.5:14b (4.2) — both 14B — outperformed deepseek-r1:14b (3.8) despite being the same size class. Reasoning quality isn't just a parameter-count story.

Factual recall: one question did almost all the damage. Four of five factual questions, every model scored a perfect 5.0. The entire gap traces to one question — "How many Starlink subscribers does SpaceX have as of Q1 2026?" All four local models answered "10,300 thousand (or 10.3 million)" — numerically correct, but the "10,300 thousand" phrasing tripped the keyword scorer. Haiku said "10.3 million" cleanly and scored full marks. Not a knowledge gap. A units-formatting quirk that cost 2.3 points on one question out of fifteen.

So the honest summary: for structured tasks, local models are competitive or better. For reasoning, there's a real gap, and it scales with model quality more than raw size. For factual recall, the "gap" was mostly an artifact of how I scored one question.

(And for the DeepSeek curveball — Haiku, phi4, qwen2.5, and deepseek-r1 all correctly said "I don't know." mistral:7b confidently described "DeepSeak, a spacecraft navigation autonomous docking system developed by SpaceX" — a system that does not exist. A small reminder that "I don't know" is sometimes the only correct answer, and not every model knows that.)


The Cost Angle

Estimating cost per query for both:

Claude Haiku — roughly 2,900 input tokens (context + system prompt + question) and ~400 output tokens per query comes to about $0.004 per query.

Mac Mini M4 electricity — 27.6 seconds at ~25W draw works out to about $0.00006 per query — roughly 65x cheaper than the API call, in pure electricity terms.

Neither number matters at the scale of a side project. The Mac Mini is "free" because I already own it. The API cost is "free" because it's a fraction of a cent. Cost only becomes the deciding factor at high query volume — thousands of requests per day, where $0.004 × 10,000 = $40/day starts to add up against hardware you already paid for once.


So When Do Local Models Make Sense?

Not "Haiku wins, always." Local models make sense when:

  • Privacy matters — documents that can't leave your machine
  • Offline access is required — no network dependency
  • Volume is high enough that per-query API cost compounds meaningfully
  • Latency tolerance is high — batch processing, overnight jobs, anything where 27 seconds vs 2.8 seconds doesn't matter to a human waiting

For AskS1.com — a public tool where someone types a question and waits — 2.8 seconds is the only viable answer. But for the private Google Drive knowledge base I built on the same Mac Mini, the calculus flips entirely: nothing leaves my machine, nobody's waiting in real time, and the documents are mine. Local models there aren't a compromise — they're the right tool.


Full results