Methodology, status, and next-pass plan
This dataset is the first pass at the brief in research-plan.md: build a complete, source-cited census of every Y Combinator company in the public directory, with cohort and cross-year analysis. The full methodology document is committed alongside the data; this page is the practical summary.
Sources & priority
The base census comes from the official YC Startup Directory, mirrored as JSON by the yc-oss/api project (snapshot 2026-05-16, the same day this dataset was built). The seeded financial overlay (IPOs, big acquisitions, the largest private funding rounds) was hand-cited against the priority order from research-plan.md:
- Official YC company profile — used for every row.
- Official company website / press / SEC EDGAR — used for the High-confidence overlay.
- Reputable databases (Crunchbase, Dealroom) — not yet used; flagged in data_gaps.csv.
- Reputable news — not yet used.
Confidence model
Confidence is recorded per field, not per row.
- High — confirmed by a primary source (SEC filing, company press, or the YC directory itself for batch/status). Used for every base YC field, every overlay row, and every public-company financial figure.
- Medium — one reputable third-party source, or YC's status field (which lags real-world status). Used for the bulk of status_confidence and inferred sector/category fields.
- Low / Unknown — not yet populated; reserved for the next pass on quietly-defunct companies and unverified rumors.
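As a rough illustration of per-field (rather than per-row) confidence, the sketch below attaches a confidence level and a source to each value. The names `Confidence`, `SourcedValue`, and `fields_below`, the example row, and the split of "Low / Unknown" into two levels are all assumptions made for illustration; the dataset itself stores confidence in CSV columns such as status_confidence.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Optional


class Confidence(IntEnum):
    # Ordered so that levels can be compared with < and >.
    UNKNOWN = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class SourcedValue:
    """One field of one company: the value, its confidence, and its source."""
    value: Optional[str]
    confidence: Confidence
    source: Optional[str] = None  # hypothetical: citation key or URL


def fields_below(row: Dict[str, SourcedValue], floor: Confidence) -> List[str]:
    """Return the field names whose confidence is strictly below `floor`."""
    return sorted(name for name, cell in row.items() if cell.confidence < floor)


# Hypothetical company row, not taken from the dataset.
example_row = {
    "batch": SourcedValue("Winter 2015", Confidence.HIGH, "YC directory"),
    "status": SourcedValue("Active", Confidence.MEDIUM, "YC directory"),
    "acquirer": SourcedValue(None, Confidence.UNKNOWN),
}

needs_review = fields_below(example_row, Confidence.HIGH)
```

Keeping confidence on the field lets a single row mix a High-confidence batch with a Medium-confidence status, which a per-row flag could not express.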
What's done
- Full census. All 5,909 YC companies, every batch from Summer 2005 to Winter 2027 (in-progress).
- Normalized status. Active / Inactive / Acquired / Public mapped 1:1 from YC's directory.
- Sector categorization. One of 32 normalized sectors per company plus pipe-delimited secondary sectors, derived from YC tags + industry/subindustry.
- Vertical flags. AI, Fintech, Healthcare, Biotech, DevTools, Climate, Crypto/Web3, Marketplace, Hardtech, plus B2B/B2C/Consumer/Enterprise mapping.
- Geography. City / state / country parsed; US vs non-US trended over time.
- Cohort + cross-year reports. Markdown deliverables and this site's drill-down views.
- Hand-verified overlay. 9 IPOs (Airbnb, Coinbase, Dropbox, DoorDash, Instacart, Reddit, PagerDuty, GitLab) plus Stripe; 7 verified acquisitions (Twitch, Cruise, Heroku, Parse, Segment, Omni, iCracked); 28 funding rounds; 11 financial metrics; 42 founders. Every row sourced.
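For readers consuming the pipe-delimited secondary sectors, a minimal parsing sketch follows. The column names `primary_sector` and `secondary_sectors` and the sample rows are hypothetical; check the header of categories.csv for the real names before reusing this.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the real file and column names may differ.
SAMPLE = """name,primary_sector,secondary_sectors
ExampleCo,Fintech,Payments|B2B
OtherCo,Healthcare,
ThirdCo,Fintech,B2B
"""


def secondary_sector_counts(fh) -> Counter:
    """Count how often each secondary sector appears across all rows."""
    counts = Counter()
    for row in csv.DictReader(fh):
        for sector in row["secondary_sectors"].split("|"):
            sector = sector.strip()
            if sector:  # an empty cell means no secondary sectors
                counts[sector] += 1
    return counts


counts = secondary_sector_counts(io.StringIO(SAMPLE))
```

To run it against a real file, pass `open("categories.csv", newline="", encoding="utf-8")` in place of the in-memory sample.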
What's partial / what's a gap
The YC public JSON dump does not include:
- Founder names — present on YC profile HTML but not in JSON. Filled for 42 founders across the prominent alumni.
- Funding rounds, valuations — Crunchbase-style data is absent. Filled for the top ~25 alumni.
- Revenue, ARR, customer counts — only public-company SEC data + Stripe's annual letter.
- Acquirer + acquisition date for 772 of the 779 acquired companies — single highest-leverage missing field.
- Status freshness — many 2020–2022 companies still tagged Active have quietly stopped operating. YC doesn't update promptly.
Every long-tail gap is in data_gaps.csv with a recommended_next_step.
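One way to turn data_gaps.csv into a work queue is to group the gaps by their recommended_next_step. This is a sketch only: `recommended_next_step` is the column named above, while `company_slug`, `missing_field`, and the sample rows are assumed for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; only recommended_next_step is a documented column.
SAMPLE = """company_slug,missing_field,recommended_next_step
example-co,acquirer,press archive search
other-co,founders,scrape profile HTML
third-co,acquirer,press archive search
"""


def gaps_by_next_step(fh) -> dict:
    """Group (company, missing field) pairs under each recommended step."""
    grouped = defaultdict(list)
    for row in csv.DictReader(fh):
        step = row["recommended_next_step"]
        grouped[step].append((row["company_slug"], row["missing_field"]))
    return dict(grouped)


queue = gaps_by_next_step(io.StringIO(SAMPLE))
```

Sorting the resulting groups by size gives a quick view of which next step would close the most gaps.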
Next-pass plan
In order of leverage:
- Scrape YC profile HTML for founder names + LinkedIn URLs. The public site exposes them; the JSON doesn't. 5,909 page fetches, throttled: a multi-hour but tractable job.
- Crunchbase/Dealroom export keyed on YC slug to populate funding_rounds.csv + company_metrics.csv for the long tail.
- Acquirer + date sweep for the 772 unattributed acquisitions. Press archives, Wikipedia, Crunchbase.
- HTTP-HEAD probe of every "Active" website to catch obvious shutdowns/rebrands.
- Manual recategorization of the ~140 "Other"-sector rows — small absolute count, high value for taxonomy quality.
- AI-flag calibration — current heuristic is noisy on borderline cases. Sample-based manual review would tighten precision/recall.
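The throttled scrape and the HTTP-HEAD probe from the list above could be sketched as follows, using only the Python standard library. Nothing here is the project's actual tooling: the function names, the labels returned by `classify`, and the assumed rate of one request every two seconds are illustrative choices.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def estimate_hours(n_pages: int, seconds_per_request: float) -> float:
    """Wall-clock hours for a strictly sequential, throttled crawl."""
    return n_pages * seconds_per_request / 3600


def probe(url: str, timeout: float = 10.0):
    """Send one HEAD request; return (status or None, final URL after redirects)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.url
    except urllib.error.HTTPError as err:  # 4xx/5xx responses
        return err.code, url
    except (urllib.error.URLError, OSError):  # DNS failure, refusal, timeout
        return None, url


def _host(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify(status, requested_url: str, final_url: str) -> str:
    """Map a probe result to a coarse label (labels are assumptions)."""
    if status is None:
        return "unreachable"
    if status in (404, 410):
        return "gone"
    if status >= 400:
        # Some servers reject HEAD (e.g. 405), so this is not proof of shutdown.
        return "check-manually"
    if _host(requested_url) != _host(final_url):
        return "moved-host"  # possible rebrand or parked domain
    return "ok"
```

At the assumed two seconds per request, `estimate_hours(5909, 2.0)` comes to roughly 3.3 hours, consistent with the "multi-hour" estimate. A real run would call `time.sleep` between `probe` calls and treat every label other than "ok" as a prompt for manual review, not as a verdict.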
Reproducibility
The build is fully scripted and uses no API keys. To reproduce on a fresh YC dump:
git clone --depth 1 https://github.com/yc-oss/api.git /tmp/yc-api
python3 build_master.py # writes the 10 CSVs
python3 build_reports.py # writes the xlsx + markdown reports
python3 build_site_data.py # bakes JSON for this site
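After a rebuild, a quick sanity check is to count the data rows in yc_company_master.csv and compare against the snapshot figure. The helper below is a sketch and not part of the build scripts; `count_rows` and `SNAPSHOT_ROWS` are hypothetical names, and a fresh YC dump will legitimately produce a different count.

```python
import csv

# Row count reported for the 2026-05-16 snapshot; newer dumps will differ.
SNAPSHOT_ROWS = 5909


def count_rows(fh) -> int:
    """Count data rows in a CSV, excluding the header line."""
    reader = csv.reader(fh)  # csv handles quoted fields containing newlines
    next(reader, None)       # skip the header if present
    return sum(1 for _ in reader)


# Usage (not executed here):
#   with open("yc_company_master.csv", newline="", encoding="utf-8") as fh:
#       print(count_rows(fh), "rows; snapshot had", SNAPSHOT_ROWS)
```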
Files & raw data
The full dataset lives next to this site source as CSVs and a single XLSX workbook:
- yc_company_master.xlsx — combined workbook (10 sheets).
- yc_company_master.csv — 5,909 rows.
- crawl_manifest.csv — every YC profile URL with parse status.
- categories.csv — sector / business-model / regulated-industry / AI-native flags.
- funding_rounds.csv, company_metrics.csv, founders.csv, status_events.csv, exits.csv, sources.csv, data_gaps.csv.
- yc_yearly_reports.md, yc_cross_year_analysis.md, methodology_and_data_dictionary.md, executive_summary.md.
Any quantitative claims downstream should be qualified as: "Based on the official YC directory snapshot 2026-05-16, this dataset contains 5,909 launched YC company profiles. Some companies may be missing if YC has hidden or removed profiles. Funding, revenue, and founder details are sparse outside ~25 prominent alumni."