AI-system-design

How I Built a RAG System on the SpaceX S-1 in One Weekend

SpaceX filed a 308-page S-1 on May 20, 2026. I wanted to ask it questions and get cited answers — not summaries, not someone else's interpretation. The actual text, with a page reference I could verify.

So I built AskS1.com. Here's what that actually involved — including the parts that didn't work.

Why RAG, Not Just Upload to Claude

The obvious approach is uploading the PDF to Claude or ChatGPT and asking questions. It works, mostly. But it has three problems.

First, the SpaceX S-1 was filed after most model training cutoffs. For specific figures the model has no training data — it either says "I don't know" or hallucinates a plausible number. I benchmarked this: asking Claude directly about SpaceX's 2025 revenue without context produces a confident wrong answer.

Second, a 308-page document strains context windows. Models start losing details from the middle of the document when they're trying to hold everything at once. Important disclosures on pages 80-200 get deprioritized for content near the beginning and end.

Third, citations are vague. "According to the filing" isn't useful when you're trying to verify a specific governance claim before an IPO.

RAG solves all three. You precompute the embeddings once, retrieve only the relevant chunks at query time, and the model sees focused context rather than 308 pages of noise.

The Architecture

SpaceX S-1 PDF (308 pages → 869 chunks)
    ↓ pdfplumber — extract text page by page
    ↓ sliding window chunker — 400 words, 100 overlap
    ↓ all-MiniLM-L6-v2 — embed chunks → 384-dim vectors
    ↓ Qdrant Cloud — store 869 vectors + page metadata

User question
    ↓ all-MiniLM-L6-v2 — embed query
    ↓ cosine similarity → top 15 candidates
    ↓ re-rank — penalize summary and financial statement pages
    ↓ Claude Haiku — generate cited answer
    ↓ ±20 page range citation

Four components. Each does one thing.

Two separate models — intentional design. all-MiniLM-L6-v2 handles embeddings only. Claude Haiku handles generation only. Embedding models are optimized for semantic similarity — small, fast, deterministic, 384 dimensions. Generation models are optimized for instruction following and text quality. Using the same model for both would mean either a slow embedding step or a weak generation step. Keeping them separate is standard RAG practice and worth being explicit about.

Why Qdrant. Qdrant's free tier is generous enough for a single filing (869 chunks, 384 dimensions). The HNSW index makes similarity search fast at this scale. Local Qdrant works for development — Qdrant Cloud for production without managing infrastructure.

308 pages → 869 chunks. Average 2.8 chunks per page after the sliding window. Total vectors stored: 869 × 384 dimensions. Each chunk stores text, page number, and end page in the payload — retrieved alongside the vector for citation generation.

The Chunking Decision

400 words per chunk with 100-word overlap. Why these numbers?

Smaller chunks (200 words) lose context for multi-sentence financial disclosures. A revenue figure appears on one line; the explanation — segment breakdown, YoY comparison, key drivers — spans the next five sentences. Split at 200 words, you retrieve the number without the context.

Larger chunks (800 words) reduce retrieval precision. You retrieve more text than you need and dilute the relevant signal with adjacent content.

The 100-word overlap ensures no fact gets cut at a chunk boundary without appearing in an adjacent chunk. Any sentence that spans two chunks will be fully retrievable from either side.

Why Claude Haiku for Generation

I benchmarked five models on 15 SpaceX S-1 questions — factual recall, multi-step reasoning, and structured output — with RAG context injected each time:

Model	Overall	Latency
Claude Haiku	4.7/5	2.8s
phi4:14b (local)	4.5/5	27.6s
qwen2.5:14b (local)	4.4/5	26.9s
mistral:7b (local)	4.4/5	9.0s
deepseek-r1:14b (local)	4.3/5	102.8s

The quality gap between Haiku and local 14B models is 0.2 points. The latency gap is 10x. For a web product where users are waiting for an answer, Haiku wins decisively.

One interesting finding: structured output scores were nearly identical across all models (4.4-4.6). The differentiation came entirely from factual accuracy and reasoning — where Haiku's training data and instruction following consistently outperformed locally-run open models.

The Challenges

The summary and financial statement pages problem.

The executive summary (pages 1-30) mentions every major topic at a high level — consistently scoring highest in semantic similarity for almost any query, even when detailed content existed 100+ pages later. The financial statements appendix (pages 230+) has the opposite problem — dense tables that score high for any numerical query regardless of relevance.

Fix: retrieve 15 candidates, then re-rank with penalties at both ends — 0.15 penalty for pages under 30 (executive summary) and 0.15 for pages over 230 (financial statements). Most substantive disclosures live in the middle of the filing. Penalizing both ends keeps retrieval focused on the narrative sections where specific claims and governance details actually appear.

The page citation problem.

This was the hardest part. The S-1 exists as HTML only on SEC EDGAR — no official PDF. I saved it to PDF via Chrome's print function, which creates a fundamental mismatch: Chrome reflows HTML to fit the page during rendering, so the text layer in the PDF doesn't align with what you see visually.

I tried four approaches:

Standalone number regex — looking for numbers at the bottom of each page. Failed because financial tables have numbers everywhere.

Position-based extraction — using pdfplumber's word coordinates to find numbers in the center-bottom 10% of each page. Same problem — numbers appear throughout the page content.

WeasyPrint HTML→PDF conversion — would produce properly aligned text/visual layers but requires GTK libraries that are painful to install on macOS. Abandoned after dependency hell.

paged.js — a JavaScript library specifically designed for HTML pagination. Worth exploring but the 11.8MB HTML filing with separate image assets made this complex.

The solution: Chrome adds N/308 page indicators in the footer during printing. This pattern is unique — it can't appear elsewhere in the filing. Regex extracts it reliably. Citations show a ±8 page range (~p.68-86) to account for the Chrome rendering offset, anchored to the extracted number. Honest about the uncertainty, still directionally useful.

pypdf missing page 1.

The initial implementation used pypdf. It silently skipped page 1 on the Chrome-printed PDF — a known limitation with complex layouts. Switched to pdfplumber, which handles all pages correctly.

Stale demo card citations.

The demo cards on the landing page showed p.1 citations after re-ingestion because Next.js was caching the /api/demo route. Fixed with cache: 'no-store' on the Qdrant fetch and export const dynamic = 'force-dynamic' on the route. A cached wrong citation on the landing page is the first thing any engineer evaluating the product sees — worth fixing before launch.

The filing is a moving target.

SpaceX filed two amendments after the original S-1 — S-1/A #1 on June 1 and S-1/A #2 on June 3 — with updated financials and the IPO price range ($135/share). The RAG pipeline re-ingests any filing version in under 5 minutes. When Anthropic and OpenAI file their S-1s later this year, the same pipeline handles them.

Conversation Memory

The app maintains conversation history across turns. Follow-up questions work without re-explaining context — "which segment is most profitable?" after asking about revenue breakdown uses the prior exchange. History is passed as the Anthropic messages array, capped at the last 10 exchanges to keep context window usage bounded.

Stack

Frontend: Next.js 14 on Railway. Vector storage: Qdrant Cloud. Generation: Anthropic API (Claude Haiku). Embeddings: @xenova/transformers running all-MiniLM-L6-v2 in Node.js. Domain on Cloudflare.

Ingestion is separated from retrieval — embeddings are computed locally and pushed to Qdrant Cloud once. No embedding API calls at query time, which reduces latency and cost per query.

What's Next

Anthropic and OpenAI S-1s are expected later this year. asks1.com will be there when they file.

Planned: paragraph-level citations, financial table extraction via pdfplumber, cross-filing comparisons across historical IPOs.

Built with Claude API, Qdrant, Next.js, sentence-transformers, and pdfplumber. Deployed on Railway.

I'm a software engineer working on large-scale ads infrastructure. This was a weekend project to learn RAG engineering by applying it to something real.