Methodology, status, and next-pass plan

This dataset is the first pass at the brief in research-plan.md: build a complete, source-cited census of every Y Combinator company in the public directory, with cohort and cross-year analysis. The full methodology document is committed alongside the data; this page is the practical summary.

Companies in directory: 5,909 (matches YC's own published count)
Snapshot: 2026-05-16 (re-runnable in under 1 minute)
Census fields: 100% (filled for every row)
Long-tail fields: ~25 prominent alumni deeply researched

Sources & priority

The base census comes from the official YC Startup Directory, mirrored as JSON by the yc-oss/api project (snapshot 2026-05-16, the same day this dataset was built). The seeded financial overlay (IPOs, big acquisitions, the largest private funding rounds) was hand-cited against the priority order from research-plan.md:

Official YC company profile — used for every row.
Official company website / press / SEC EDGAR — used for the High-confidence overlay.
Reputable databases (Crunchbase, Dealroom) — not yet used; flagged in data_gaps.csv.
Reputable news — not yet used.

Confidence model

Confidence is recorded per field, not per row.

High: confirmed by a primary source (SEC filing, company press release, or the YC directory itself for fields such as batch). Used for base YC fields other than status, every overlay row, and every public-company financial.
Medium: backed by one reputable third-party source, or taken from YC's status field, which lags real-world status. Used for the bulk of status_confidence and for inferred sector/category fields.
Low / Unknown: not yet populated; reserved for the next pass on quietly defunct companies and unverified rumors.
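
Because confidence is recorded per field rather than per row, each value has to carry its own confidence level and source. The following is a minimal sketch of one way to model that. The class names, the `source` key, and the sample row are illustrative assumptions, not the structures used by the build scripts.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class FieldValue:
    """One field of one company, with its own confidence and source."""
    value: str
    confidence: Confidence
    source: str  # hypothetical: a URL or a key into sources.csv


# Hypothetical row: batch comes straight from the YC directory (High),
# while status is YC's lagging status field (Medium).
row = {
    "batch": FieldValue("Summer 2009", Confidence.HIGH, "yc-directory"),
    "status": FieldValue("Active", Confidence.MEDIUM, "yc-directory"),
}


def weakest(fields):
    """Return the lowest confidence level present among a row's fields."""
    order = [Confidence.UNKNOWN, Confidence.LOW, Confidence.MEDIUM, Confidence.HIGH]
    return min((f.confidence for f in fields.values()), key=order.index)


print(weakest(row).value)
```

A helper like `weakest` is useful when a downstream claim combines several fields and should inherit the confidence of the least certain one.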

What's done

Full census. All 5,909 YC companies, every batch from Summer 2005 to Winter 2027 (in-progress).
Normalized status. Active / Inactive / Acquired / Public mapped 1:1 from YC's directory.
Sector categorization. One of 32 normalized sectors per company plus pipe-delimited secondary sectors, derived from YC tags + industry/subindustry.
Vertical flags. AI, Fintech, Healthcare, Biotech, DevTools, Climate, Crypto/Web3, Marketplace, Hardtech, plus B2B/B2C/Consumer/Enterprise mapping.
Geography. City / state / country parsed; US vs non-US trended over time.
Cohort + cross-year reports. Markdown deliverables and this site's drill-down views.
Hand-verified overlay. 9 flagship companies: 8 that have gone public (Airbnb, Coinbase, Dropbox, DoorDash, Instacart, Reddit, PagerDuty, GitLab) plus privately held Stripe; 7 verified acquisitions (Twitch, Cruise, Heroku, Parse, Segment, Omni, iCracked); 28 funding rounds; 11 financial metrics; 42 founders. Every row sourced.
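
The pipe-delimited secondary sectors mentioned in the list above can be split and counted along these lines. This is a minimal sketch: the column names `sector` and `secondary_sectors` and the sample sector labels are assumptions for illustration, not the dataset's confirmed schema.

```python
from collections import Counter


def split_sectors(primary, secondary):
    """Return the primary sector followed by secondary sectors,
    dropping blanks and duplicates while preserving order."""
    seen = []
    for name in [primary, *secondary.split("|")]:
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return seen


# Hypothetical rows; real values would come from the master CSV.
rows = [
    {"sector": "Fintech", "secondary_sectors": "Crypto/Web3|B2B SaaS"},
    {"sector": "Healthcare", "secondary_sectors": ""},
    {"sector": "B2B SaaS", "secondary_sectors": "Fintech| DevTools"},
]

# Count every sector a company touches, primary or secondary.
counts = Counter(
    name
    for r in rows
    for name in split_sectors(r["sector"], r["secondary_sectors"])
)
print(counts.most_common())
```

Counting primary and secondary sectors together answers "how many companies touch this sector"; counting only the primary column answers "how many companies are classified as this sector". The two should not be mixed in one chart.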

What's partial / what's a gap

The YC public JSON dump does not include:

Founder names: present in the YC profile HTML but not in the JSON. Filled for 42 founders across the prominent alumni.
Funding rounds and valuations: Crunchbase-style data is absent. Filled for the top ~25 alumni.
Revenue, ARR, customer counts: filled only from public-company SEC filings and Stripe's annual letter.
Acquirer and acquisition date: missing for 772 of the 779 acquired companies, the single highest-leverage gap.
Status freshness: many 2020–2022 companies still tagged Active appear to have quietly stopped operating; YC's directory does not update status promptly.

Every long-tail gap is in data_gaps.csv with a recommended_next_step.
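
One way to turn data_gaps.csv into a work queue is to tally rows by recommended_next_step. The sketch below runs on an in-memory sample. Only the `recommended_next_step` column is named on this page; the `company_slug` and `missing_field` columns and all sample values are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the real file's other columns may differ.
SAMPLE = """company_slug,missing_field,recommended_next_step
example-co-a,acquirer,press archive search
example-co-b,acquirer,press archive search
example-co-c,founders,scrape YC profile HTML
"""


def tally_next_steps(handle):
    """Count gap rows per recommended_next_step value."""
    reader = csv.DictReader(handle)
    return Counter(row["recommended_next_step"] for row in reader)


steps = tally_next_steps(io.StringIO(SAMPLE))
for step, n in steps.most_common():
    print(f"{n:>4}  {step}")
```

Against the real file, the same function would be called on `open("data_gaps.csv", newline="", encoding="utf-8")`.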

Next-pass plan

In order of leverage:

Scrape YC profile HTML for founder names and LinkedIn URLs. The public site exposes them; the JSON doesn't. One throttled fetch per company (5,909 pages) is a multi-hour but tractable job.
Crunchbase/Dealroom export keyed on YC slug to populate funding_rounds.csv + company_metrics.csv for the long tail.
Acquirer + date sweep for the 772 unattributed acquisitions. Press archives, Wikipedia, Crunchbase.
HTTP-HEAD probe of every "Active" website to catch obvious shutdowns/rebrands.
Manual recategorization of the ~140 "Other"-sector rows — small absolute count, high value for taxonomy quality.
AI-flag calibration — current heuristic is noisy on borderline cases. Sample-based manual review would tighten precision/recall.
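
The HTTP-HEAD probe in the plan above could be sketched as follows, using only the standard library. This is an illustration under assumptions, not the planned implementation: the labels, the `www.` normalization, and the 10-second timeout are arbitrary choices, and `str.removeprefix` requires Python 3.9 or later.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def classify(status, requested_host, final_host):
    """Map a probe result to a coarse label for manual review."""
    if status is None:
        return "unreachable"
    if status >= 400:
        return "error"
    # Treat example.com and www.example.com as the same site.
    if final_host.removeprefix("www.") != requested_host.removeprefix("www."):
        return "redirected"
    return "ok"


def probe(url, timeout=10.0):
    """Send a HEAD request and classify the outcome."""
    requested_host = urlsplit(url).hostname or ""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            final_host = urlsplit(response.url).hostname or ""
            return classify(response.status, requested_host, final_host)
    except urllib.error.HTTPError as exc:
        # Raised for 4xx/5xx responses; must be caught before OSError.
        return classify(exc.code, requested_host, requested_host)
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, TLS error, or bad URL.
        return classify(None, requested_host, "")
```

A label other than "ok" is a prompt for follow-up, not proof of a shutdown: some servers reject HEAD requests outright, and a redirect to a different host may be a rebrand or an acquirer's site rather than a dead company.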

Reproducibility

The build is fully scripted and uses no API keys. To reproduce on a fresh YC dump:

git clone --depth 1 https://github.com/yc-oss/api.git /tmp/yc-api
python3 build_master.py     # writes the 10 CSVs
python3 build_reports.py    # writes the xlsx + markdown reports
python3 build_site_data.py  # bakes JSON for this site
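
After a rebuild, a quick sanity check is to confirm that the master CSV still has the expected number of rows. The sketch below is an assumed post-build check, not part of the committed scripts. It counts records with the csv module rather than counting lines, because a quoted field may contain a newline.

```python
import csv
import io

EXPECTED_ROWS = 5909  # row count stated for the 2026-05-16 snapshot


def count_data_rows(handle):
    """Count CSV records, excluding the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


def check_master(path="yc_company_master.csv"):
    """Raise if the master CSV does not have the expected row count."""
    with open(path, newline="", encoding="utf-8") as handle:
        found = count_data_rows(handle)
    if found != EXPECTED_ROWS:
        raise ValueError(f"expected {EXPECTED_ROWS} rows, found {found}")


# Demonstration on an in-memory sample with hypothetical columns:
# the quoted field spans two lines but is still one record.
sample = 'slug,one_liner\nalpha,"line one\nline two"\nbeta,plain\n'
print(count_data_rows(io.StringIO(sample)))
```

On a fresh YC dump the directory will have grown, so `EXPECTED_ROWS` should be updated to the new published count before running `check_master()`.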

Files & raw data

The full dataset lives next to this site source as CSVs and a single XLSX workbook:

yc_company_master.xlsx — combined workbook (10 sheets).
yc_company_master.csv — 5,909 rows.
crawl_manifest.csv — every YC profile URL with parse status.
categories.csv — sector / business-model / regulated-industry / AI-native flags.
funding_rounds.csv, company_metrics.csv, founders.csv, status_events.csv, exits.csv, sources.csv, data_gaps.csv.
yc_yearly_reports.md, yc_cross_year_analysis.md, methodology_and_data_dictionary.md, executive_summary.md.

Any quantitative claims downstream should be qualified as: "Based on the official YC directory snapshot 2026-05-16, this dataset contains 5,909 launched YC company profiles. Some companies may be missing if YC has hidden or removed profiles. Funding, revenue, and founder details are sparse outside ~25 prominent alumni."