Methodology, status, and next-pass plan

This dataset is the first pass at the brief in research-plan.md: build a complete, source-cited census of every Y Combinator company in the public directory, with cohort and cross-year analysis. The full methodology document is committed alongside the data; this page is the practical summary.

Companies in directory: 5,909 (matches YC's own published count)
Snapshot: 2026-05-16 (re-runnable in <1 min)
Census fields: 100% (filled for every row)
Long-tail fields: ~25 (prominent alumni deeply researched)

Sources & priority

The base census comes from the official YC Startup Directory, mirrored as JSON by the yc-oss/api project (snapshot 2026-05-16, the same day this dataset was built). The seeded financial overlay (IPOs, big acquisitions, the largest private funding rounds) was hand-cited against the priority order from research-plan.md:

  1. Official YC company profile — used for every row.
  2. Official company website / press / SEC EDGAR — used for the High-confidence overlay.
  3. Reputable databases (Crunchbase, Dealroom) — not yet used; flagged in data_gaps.csv.
  4. Reputable news — not yet used.
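
The priority order above can be applied mechanically when several sources report the same field. The following is a minimal sketch under stated assumptions, not the build scripts' actual logic: the `Claim` type, the `source_type` labels, and `pick_claim` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Priority order from the list above; a lower number wins.
# The string labels are hypothetical, not column values from the dataset.
SOURCE_PRIORITY = {
    "yc_profile": 1,        # official YC company profile
    "company_official": 2,  # company website, press, SEC EDGAR
    "database": 3,          # Crunchbase, Dealroom
    "news": 4,              # reputable news
}


@dataclass(frozen=True)
class Claim:
    value: str
    source_type: str
    source_url: str


def pick_claim(claims: Iterable[Claim]) -> Optional[Claim]:
    """Return the claim from the highest-priority known source, or None."""
    known = [c for c in claims if c.source_type in SOURCE_PRIORITY]
    if not known:
        return None
    # min() keeps the first claim seen when two share the same priority.
    return min(known, key=lambda c: SOURCE_PRIORITY[c.source_type])
```

Claims from an unrecognized source type are ignored rather than ranked last, so an unvetted source never silently fills a field.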

Confidence model

Confidence is recorded per field, not per row.
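
To make the per-field idea concrete, here is a small illustrative sketch. It assumes a layout in which each field carries its own confidence label and source URL. `FieldValue`, `CompanyRow`, and the "Low" label are hypothetical; only "High" is named in this summary.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FieldValue:
    value: Any
    confidence: str       # e.g. "High"; other labels are assumptions
    source_url: str = ""


@dataclass
class CompanyRow:
    slug: str
    fields: Dict[str, FieldValue] = field(default_factory=dict)

    def record(self, name: str, value: Any, confidence: str,
               source_url: str = "") -> None:
        """Store a value together with its own confidence and source."""
        self.fields[name] = FieldValue(value, confidence, source_url)

    def confidence_of(self, name: str) -> Optional[str]:
        """Confidence of one field, or None if the field was never recorded."""
        fv = self.fields.get(name)
        return fv.confidence if fv is not None else None

    def fields_at(self, confidence: str) -> List[str]:
        """Names of all fields recorded at the given confidence level."""
        return [n for n, fv in self.fields.items()
                if fv.confidence == confidence]
```

With this shape, a single row can hold a well-sourced batch label next to a weakly sourced funding figure without one dragging the other's confidence up or down.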

What's done

What's partial / what's a gap

The YC public JSON dump does not include:

Every long-tail gap is in data_gaps.csv with a recommended_next_step.

Next-pass plan

In order of leverage:

  1. Scrape YC profile HTML for founder names and LinkedIn URLs. The public site exposes them; the JSON mirror does not. About 5,909 page fetches, throttled: a multi-hour but tractable job.
  2. Crunchbase/Dealroom export keyed on YC slug to populate funding_rounds.csv + company_metrics.csv for the long tail.
  3. Acquirer + date sweep for the 772 unattributed acquisitions. Press archives, Wikipedia, Crunchbase.
  4. HTTP-HEAD probe of every "Active" website to catch obvious shutdowns/rebrands.
  5. Manual recategorization of the ~140 "Other"-sector rows — small absolute count, high value for taxonomy quality.
  6. AI-flag calibration — current heuristic is noisy on borderline cases. Sample-based manual review would tighten precision/recall.

Reproducibility

The build is fully scripted and uses no API keys. To reproduce on a fresh YC dump:

git clone --depth 1 https://github.com/yc-oss/api.git /tmp/yc-api
python3 build_master.py     # writes the 10 CSVs
python3 build_reports.py    # writes the xlsx + markdown reports
python3 build_site_data.py  # bakes JSON for this site

Files & raw data

The full dataset lives next to this site source as CSVs and a single XLSX workbook:

Any quantitative claims downstream should be qualified as: "Based on the official YC directory snapshot 2026-05-16, this dataset contains 5,909 launched YC company profiles. Some companies may be missing if YC has hidden or removed profiles. Funding, revenue, and founder details are sparse outside ~25 prominent alumni."