demo, ML/AI, ml, data

Financial Downturn Prediction Pipeline from SEC Risk Filings

End-to-end NLP pipeline to extract SEC 10-K Item 1A risk factors and label filings by post-filing market outcomes for downstream supervised modeling.

2026-01 to 2026-04

Key Highlights

  • End-to-end pipeline: 10-K download → Item 1A extraction → market outcome labeling
  • Deterministic section parsing (Item 1A→Item 1B) with audit-ready artifacts
  • Supervised dataset: each filing labeled high-risk when the stock draws down ≥30% within 90 days post-filing

Overview

An end-to-end pipeline that downloads SEC 10-K filings, extracts Item 1A (Risk Factors) text, and joins each filing to post-filing stock market outcomes to produce a supervised dataset for risk modeling.

Problem

Quantitative analysis of SEC filings requires structured data, but 10-K documents are unstructured HTML/text with inconsistent formatting across companies and years.

Solution

Built a reproducible pipeline that parses Item 1A sections using deterministic boundary detection (Item 1A→Item 1B), retrieves daily adjusted close prices via Alpha Vantage, and labels each filing as high-risk when the stock experienced a ≥30% drawdown within 90 days after filing.
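
The labeling rule can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `label_high_risk`, the use of calendar days, and measuring the drawdown from the first close on or after the filing date are all assumptions, since the description above fixes only the 30% threshold and the 90-day window.

```python
from datetime import date, timedelta


def label_high_risk(prices, filing_date, window_days=90, threshold=0.30):
    """Label one filing from a {date: adjusted_close} mapping.

    Returns True/False, or None when there is not enough price data.
    Assumption: the baseline is the first available close on or after
    the filing date, and the window is counted in calendar days.
    """
    window_end = filing_date + timedelta(days=window_days)
    in_window = sorted(
        (day, close)
        for day, close in prices.items()
        if filing_date <= day <= window_end
    )
    if len(in_window) < 2:
        return None  # no baseline, or no post-baseline observations

    baseline = in_window[0][1]
    if baseline <= 0:
        return None  # guard against bad price data

    # Lowest adjusted close after the baseline day, inside the window.
    trough = min(close for _, close in in_window[1:])
    drawdown = (baseline - trough) / baseline
    return drawdown >= threshold
```

A peak-to-trough drawdown inside the window would be an equally plausible reading of the rule; whichever definition is used, keeping it in one small function keeps the labels reproducible.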

My Contributions

  • Built download and parsing pipeline for SEC EDGAR 10-K filings
  • Implemented deterministic section extraction for Item 1A (Risk Factors)
  • Created time-series labeling logic using post-filing market data
  • Produced audit-ready artifacts with one row per filing (CIK, ticker, filing date, accession, risk text)
  • Defined a fixed event window (90 days post-filing) for reproducible labeling

Dataset & Evaluation

Data Sources: SEC EDGAR (data.sec.gov) for filings, Alpha Vantage for daily adjusted close prices
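
As a rough sketch of how the two sources might be addressed: the helper names below are hypothetical, and the endpoint paths and JSON field names reflect the public API documentation as I understand it rather than anything confirmed by this project. No network call is made here; the last helper only parses an already-downloaded Alpha Vantage payload.

```python
from datetime import date
from urllib.parse import urlencode


def edgar_submissions_url(cik):
    """Filing index for one company; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def alpha_vantage_daily_url(ticker, api_key):
    """Query URL for the daily adjusted time series (assumed parameters)."""
    params = {
        "function": "TIME_SERIES_DAILY_ADJUSTED",
        "symbol": ticker,
        "outputsize": "full",
        "apikey": api_key,
    }
    return "https://www.alphavantage.co/query?" + urlencode(params)


def parse_adjusted_closes(payload):
    """Convert the decoded JSON payload into {date: adjusted_close}."""
    series = payload.get("Time Series (Daily)", {})
    return {
        date.fromisoformat(day): float(bar["5. adjusted close"])
        for day, bar in series.items()
    }
```

Requests to SEC hosts are expected to carry a descriptive User-Agent header, so a real downloader would set one and throttle its request rate.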

Output: Supervised dataset with one row per filing, each labeled high-risk when the stock draws down ≥30% within the 90-day post-filing window
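
One possible shape for the output rows is sketched below. The field names follow the artifact columns listed under My Contributions (CIK, ticker, filing date, accession, risk text) plus the label; the class name `FilingRow`, the `high_risk` column name, and the CSV format are assumptions made for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class FilingRow:
    cik: str
    ticker: str
    filing_date: str  # ISO date, e.g. "2024-02-01"
    accession: str    # assumed to uniquely identify a filing
    risk_text: str    # extracted Item 1A text
    high_risk: int    # 1 if labeled high-risk, else 0


def rows_to_csv(rows):
    """Serialize rows to CSV text, enforcing one row per accession."""
    seen = set()
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(FilingRow)])
    writer.writeheader()
    for row in rows:
        if row.accession in seen:
            raise ValueError(f"duplicate filing: {row.accession}")
        seen.add(row.accession)
        writer.writerow(asdict(row))
    return buf.getvalue()
```

Rejecting duplicate accession numbers at write time is one simple way to keep the one-row-per-filing guarantee checkable.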

Challenges & Tradeoffs

Challenge: Inconsistent HTML structure across different companies' 10-K filings made section extraction unreliable.

Solution: Used deterministic boundary parsing (Item 1A→Item 1B markers) with fallback heuristics, validated against a sample of manually reviewed filings.
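
A minimal sketch of this kind of boundary parsing is shown below. It assumes the HTML has already been reduced to plain text, and both heuristics are illustrative guesses rather than the project's actual rules: falling back to the Item 2 heading when no Item 1B marker follows, and keeping the longest candidate span so that the short table-of-contents entry is skipped.

```python
import re

# Headings are matched at the start of a line, case-insensitively.
_FLAGS = re.IGNORECASE | re.MULTILINE
START = re.compile(r"^\s*item\s+1a\b", _FLAGS)
END_PRIMARY = re.compile(r"^\s*item\s+1b\b", _FLAGS)
END_FALLBACK = re.compile(r"^\s*item\s+2\b", _FLAGS)  # hypothetical fallback


def extract_item_1a(text):
    """Return the Item 1A section text, or None if no boundaries are found."""
    text = text.replace("\xa0", " ")  # normalize non-breaking spaces
    best = ""
    for start in START.finditer(text):
        end = (END_PRIMARY.search(text, start.end())
               or END_FALLBACK.search(text, start.end()))
        if end is None:
            continue
        candidate = text[start.start():end.start()].strip()
        # Table-of-contents hits yield very short spans; keep the longest.
        if len(candidate) > len(best):
            best = candidate
    return best or None
```

Because the function is pure and regex-driven, the same input always yields the same section, which makes it straightforward to compare its output against manually reviewed filings.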