Key Highlights
- End-to-end pipeline: 10-K download → Item 1A extraction → market outcome labeling
- Deterministic section parsing (Item 1A→Item 1B) with audit-ready artifacts
- Supervised dataset in which a filing is labeled high-risk when the stock draws down ≥30% within 90 days after filing
Overview
An end-to-end pipeline that downloads SEC 10-K filings, extracts Item 1A (Risk Factors) text, and joins each filing to post-filing stock market outcomes to produce a supervised dataset for risk modeling.
Problem
Quantitative analysis of SEC filings requires structured data, but 10-K documents are unstructured HTML/text with inconsistent formatting across companies and years.
Solution
Built a reproducible pipeline that extracts Item 1A with deterministic boundary detection (from the Item 1A heading to the Item 1B heading), retrieves daily adjusted close prices from Alpha Vantage, and labels a filing as high-risk when the stock experiences a ≥30% drawdown within 90 days after filing.
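The labeling rule can be illustrated with a minimal sketch. It rests on assumptions the description does not settle: drawdown is read here as the largest peak-to-trough decline inside the window, the 90 days are counted as calendar days starting on the filing date, and the names `max_drawdown` and `label_high_risk` are hypothetical.

```python
from datetime import date, timedelta


def max_drawdown(prices):
    """Return the largest peak-to-trough decline as a fraction of the peak."""
    peak = None
    worst = 0.0
    for price in prices:
        if peak is None or price > peak:
            peak = price  # new running high
        elif peak > 0:
            worst = max(worst, (peak - price) / peak)
    return worst


def label_high_risk(filing_date, closes, window_days=90, threshold=0.30):
    """Label one filing from a {date: adjusted close} mapping.

    Assumption: the window is [filing_date, filing_date + window_days],
    counted in calendar days. Returns None when fewer than two prices
    fall inside the window, so the filing can be dropped or reviewed.
    """
    end = filing_date + timedelta(days=window_days)
    window = [closes[d] for d in sorted(closes) if filing_date <= d <= end]
    if len(window) < 2:
        return None
    return max_drawdown(window) >= threshold
```

If the pipeline instead measures the decline from the filing-date close, only `max_drawdown` would need to change; the windowing stays the same.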
My Contributions
Dataset & Evaluation
Data Sources: SEC EDGAR (data.sec.gov) for filings, Alpha Vantage for daily adjusted close prices
Output: Supervised dataset with one row per filing; a filing is labeled high-risk when the stock draws down ≥30% within the 90-day post-filing window
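As a rough sketch of the price side of the join, the function below turns an already-downloaded Alpha Vantage JSON payload into a date-to-price mapping. It assumes the payload has the shape returned by Alpha Vantage's `TIME_SERIES_DAILY_ADJUSTED` function (a `Time Series (Daily)` object whose entries carry a `5. adjusted close` field); the function name `parse_adjusted_closes` is hypothetical, and no network call is made here.

```python
from datetime import date


def parse_adjusted_closes(payload):
    """Convert a daily-adjusted JSON payload to {date: adjusted close}."""
    series = payload.get("Time Series (Daily)")
    if series is None:
        # Assumption: error and rate-limit replies carry a message key
        # instead of the series, so fail loudly rather than return {}.
        raise ValueError(f"no daily series in response; keys: {sorted(payload)}")
    return {
        date.fromisoformat(day): float(fields["5. adjusted close"])
        for day, fields in series.items()
    }
```

Keying the result by `date` objects rather than strings makes it straightforward to slice the post-filing window by comparison against the filing date.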
Challenges & Tradeoffs
Challenge: Inconsistent HTML structure across different companies' 10-K filings made section extraction unreliable.
Solution: Used deterministic boundary parsing (Item 1A→Item 1B markers) with fallback heuristics, validated against a sample of manually reviewed filings.
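One way such boundary parsing might look is sketched below. The regular expressions, the Item 2 fallback, and the longest-span rule are illustrative assumptions, not the pipeline's actual heuristics; the input is assumed to be plain text with HTML already stripped.

```python
import re

# Hypothetical heading patterns; real filings vary in spacing and punctuation.
ITEM_1A = re.compile(r"item\s*1a\s*[\.:\-]?\s*risk\s+factors", re.IGNORECASE)
ITEM_1B = re.compile(
    r"item\s*1b\s*[\.:\-]?\s*unresolved\s+staff\s+comments", re.IGNORECASE
)
ITEM_2 = re.compile(r"item\s*2\s*[\.:\-]?\s*properties", re.IGNORECASE)


def extract_item_1a(text):
    """Return the Item 1A body, or None if no start/end pair is found."""
    starts = [m.end() for m in ITEM_1A.finditer(text)]
    # Primary end marker is Item 1B; fall back to Item 2 when 1B is absent.
    ends = [m.start() for m in ITEM_1B.finditer(text)]
    if not ends:
        ends = [m.start() for m in ITEM_2.finditer(text)]
    best = None
    for start in starts:
        following = [e for e in ends if e > start]
        if not following:
            continue
        candidate = text[start:min(following)].strip()
        # The table of contents usually lists both headings too,
        # so keep the longest span rather than the first one.
        if best is None or len(candidate) > len(best):
            best = candidate
    return best
```

Returning `None` instead of an empty string keeps unparseable filings distinguishable from filings with a genuinely short section, which helps when checking results against manually reviewed samples.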