Financial Downturn Prediction Pipeline from SEC Risk Filings

Key Highlights

End-to-end pipeline: 10-K download → Item 1A extraction → market outcome labeling
Deterministic section parsing (Item 1A→Item 1B) with audit-ready artifacts
Supervised dataset: each filing labeled high-risk if the stock draws down ≥30% within 90 days after filing

Overview

An end-to-end pipeline that downloads SEC 10-K filings, extracts Item 1A (Risk Factors) text, and joins each filing to post-filing stock market outcomes to produce a supervised dataset for risk modeling.

Problem

Quantitative analysis of SEC filings requires structured data, but 10-K documents are unstructured HTML/text with inconsistent formatting across companies and years.

Solution

Built a reproducible pipeline that parses Item 1A sections using deterministic boundary detection (Item 1A→Item 1B), retrieves daily adjusted close prices via Alpha Vantage, and labels each filing as high-risk when the stock experienced a ≥30% drawdown within 90 days after filing.
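As a rough illustration of the Item 1A→Item 1B boundary detection, the sketch below finds the two headings with regular expressions and returns the text between them. The patterns, the function name `extract_item_1a`, and the rule of keeping the longest candidate span (to skip the table-of-contents entry) are assumptions made for this example, not the pipeline's actual code.

```python
import re
from typing import Optional

# Hypothetical heading patterns, e.g. "Item 1A. Risk Factors" and "Item 1B."
ITEM_1A = re.compile(r"item\s*1a\b", re.IGNORECASE)
ITEM_1B = re.compile(r"item\s*1b\b", re.IGNORECASE)


def extract_item_1a(text: str) -> Optional[str]:
    """Return the text between an Item 1A heading and the next Item 1B heading.

    A 10-K usually shows both headings twice (table of contents and body),
    so each start marker is paired with the first end marker after it and
    the longest span wins. That tie-break is an assumption of this sketch.
    """
    best: Optional[str] = None
    for start in ITEM_1A.finditer(text):
        end = ITEM_1B.search(text, start.end())
        if end is None:
            continue  # no closing marker after this start; skip it
        span = text[start.start():end.start()].strip()
        if best is None or len(span) > len(best):
            best = span
    return best
```

The function expects plain text, so HTML filings would need their tags stripped first. Because the result depends only on the input text, the same filing always yields the same section, which is what makes the output auditable.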

My Contributions

Built download and parsing pipeline for SEC EDGAR 10-K filings

Implemented deterministic section extraction for Item 1A (Risk Factors)

Created time-series labeling logic using post-filing market data

Produced audit-ready artifacts with one row per filing (CIK, ticker, filing date, accession, risk text)

Designed fixed event window definition for reproducible labeling
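The one-row-per-filing artifact listed above could be represented along these lines. The class name `FilingRecord`, the helper `write_records`, and the choice of CSV as the output format are hypothetical; only the five fields come from the description.

```python
import csv
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass(frozen=True)
class FilingRecord:
    cik: str            # kept as a string so leading zeros survive
    ticker: str
    filing_date: date
    accession: str
    risk_text: str


def write_records(records, handle) -> None:
    """Write one CSV row per filing, with a header taken from the dataclass."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(FilingRecord)])
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["filing_date"] = record.filing_date.isoformat()  # stable ISO date
        writer.writerow(row)
```

Keeping the accession number in every row lets a reviewer trace any label back to the exact filing it came from.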

Dataset & Evaluation

Data Sources: SEC EDGAR (data.sec.gov) for filings, Alpha Vantage for daily adjusted close prices
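For the price side, a minimal sketch of turning an Alpha Vantage daily-adjusted JSON response into a date-to-price mapping is shown below. It assumes the `TIME_SERIES_DAILY_ADJUSTED` response layout (a "Time Series (Daily)" object with the adjusted close under "5. adjusted close"); the function name is hypothetical and the key names should be checked against the current API documentation.

```python
from datetime import date
from typing import Dict


def parse_adjusted_closes(payload: dict) -> Dict[date, float]:
    """Map trading date -> adjusted close from a daily-adjusted JSON payload.

    Assumed layout: payload["Time Series (Daily)"][ISO date]["5. adjusted close"].
    """
    series = payload.get("Time Series (Daily)")
    if series is None:
        # Error and rate-limit responses do not contain the series object.
        raise ValueError(f"no daily series in payload; keys: {sorted(payload)}")
    return {
        date.fromisoformat(day): float(bar["5. adjusted close"])
        for day, bar in series.items()
    }
```

Failing loudly on a missing series keeps a throttled or malformed response from silently producing an unlabeled filing.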

Output: Supervised dataset with one row per filing, each labeled high-risk on a ≥30% drawdown within the 90-day post-filing window
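The labeling rule can be sketched as follows. The description does not say how the drawdown is measured, so this example makes two assumptions: the window is 90 calendar days, and the drawdown is the drop from the first adjusted close on or after the filing date to the lowest adjusted close in the window. The function name and parameters are illustrative.

```python
from datetime import date, timedelta
from typing import Mapping, Optional


def label_high_risk(
    closes: Mapping[date, float],
    filing_date: date,
    window_days: int = 90,
    threshold: float = 0.30,
) -> Optional[bool]:
    """Return True if the post-filing drawdown meets the threshold.

    Returns None when no prices fall inside the window, so that missing
    market data is not confused with a low-risk label.
    """
    window_end = filing_date + timedelta(days=window_days)
    window = sorted(
        (day, price) for day, price in closes.items()
        if filing_date <= day <= window_end
    )
    if not window:
        return None
    reference = window[0][1]                    # first close on/after filing
    trough = min(price for _, price in window)  # lowest close in the window
    drawdown = (reference - trough) / reference
    return drawdown >= threshold
```

Fixing the window length and threshold as explicit parameters means the same inputs always produce the same label, and alternative definitions (for example, trading days or peak-to-trough drawdown) can be compared by changing one function.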

Challenges & Tradeoffs

Challenge: Inconsistent HTML structure across different companies' 10-K filings made section extraction unreliable.

Solution: Used deterministic boundary parsing (Item 1A→Item 1B markers) with fallback heuristics, validated against a sample of manually reviewed filings.