Sources & Methodology

This page documents what DisclosureLens does to a regulator-published filing between the moment it appears on a public portal and the moment it appears in the API. Eight pipeline stages, one schema, four model-grade accuracy targets.

Accuracy targets

These are service-level objectives — the accuracy the pipeline is designed and tuned to hit — not independently measured or audited performance figures. Field-level accuracy varies by source and field; report discrepancies via the corrections process below.

Victim entity resolution: ≥ 99% accuracy
Dates: ≥ 98% accuracy (±1 day)
Numeric counts: ≥ 95% accuracy
Attack vector classification: ≥ 90% accuracy (the vector is often not disclosed in the filing)
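As an illustration of what a tolerance-based target such as "±1 day" means in practice, here is a minimal sketch of scoring extracted dates against reference dates. The function name, the sample pairs, and the idea of a hand-labelled reference set are assumptions for the example, not a description of how DisclosureLens evaluates its pipeline.

```python
from datetime import date

def date_accuracy(pairs: list[tuple[date, date]], tolerance_days: int = 1) -> float:
    """Share of (extracted, reference) date pairs that agree within the tolerance."""
    if not pairs:
        raise ValueError("need at least one labelled pair")
    hits = sum(
        abs((extracted - reference).days) <= tolerance_days
        for extracted, reference in pairs
    )
    return hits / len(pairs)

# Hypothetical sample: (extracted date, reference date)
sample = [
    (date(2024, 3, 1), date(2024, 3, 1)),   # exact match
    (date(2024, 3, 2), date(2024, 3, 1)),   # off by one day: still a hit
    (date(2024, 3, 5), date(2024, 3, 1)),   # off by four days: a miss
    (date(2024, 6, 30), date(2024, 7, 1)),  # off by one day across a month boundary
]
print(date_accuracy(sample))  # 0.75
```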

Source list

We ingest primary regulatory sources plus two declared low-trust supplements (leak-site claims and press coverage), each labelled as such on every record. The authoritative, always-current list — with per-source freshness SLOs — is the live source-health table. Current coverage:

US SEC EDGAR — 8-K Item 1.05, 8-K Item 8.01, 10-K Item 1C
US state Attorneys General — 17 states, including California, Texas, Washington, Maine, Maryland, Vermont, and Oregon (breach-notification portals; Maine is currently dark at the source and Hawaii's portal prunes its own archive — see the coverage table)
US HHS Office for Civil Rights — HIPAA breach report ("Wall of Shame")
EU data-protection authorities — enforcement decisions via GDPRhub
Ransomware leak sites (via ransomware.live) — threat-actor claims, always labelled unverified until a filing corroborates
Press coverage — journalistic reports, labelled as non-regulatory

Source kinds — every source, one grammar

DisclosureLens carries six source kinds, from statutory filings to attacker self-reports. Each has a fixed badge, a fixed render behavior, and — crucially — a fixed honest ceiling on what it can populate. The badge tells a reader how much to trust a record before they read a word of it; clicking any source badge on a disclosure page opens this reference in place.

Source kind	Source body	Populates	Compliance clock	Typical confidence
SEC filings (Federal · material events)	Inline render — full filing embedded (fair-report §4.5)	discovery + materiality dates · attack vector · threat actor; rarely affected counts	SEC 4-business-day (hard)	62–87%
HHS OCR (Federal · health, HIPAA)	Inline render — full filing embedded (fair-report §4.5)	affected count (required) · PHI data types · breach type	HHS 60-day (submission date only — the portal omits the discovery date)	67–97%
State AG (State breach notices)	Link-out — body stays at the regulator; we link + extract (§4.5)	per-state resident counts · data types · discovery date	State 30/60-day · indicative (web-form dates run systematically late)	58–95%
EU DPA (GDPR supervisory authorities)	Link-out — body stays at the regulator; we link + extract (§4.5)	data categories · cross-border scope · outcome + articles (decisions)	GDPR 72-hour (indicative)	95%
Press (Media / market postings)	Link-out — journalistic coverage links to the source URL	incident type + narrative only (may be machine-translated)	not assessable	60%
Leak site (Threat-actor extortion blogs)	Aggregated via ransomware.live — claims, not filings	actor name · victim claim · ransom/leak status	not assessable	50%

Color roles are consistent across the feed, incident, and disclosure views: teal — confirmed · on-time · high confidence; amber — claim · indicative · caveat; rust — unverified · late · discrepancy. Confidence bands describe what we observe in the current corpus, not a guarantee.

Threat-actor leak sites

We additionally ingest ransomware-group leak-site claims via ransomware.live, a public aggregator of structured listings from threat-actor extortion blogs (data used under the ransomware.live API PRO terms; Source: Ransomware.live). These rows are segregated into a dedicated, publicly browsable “Claims” feed, kept separate from confirmed regulatory filings so that a threat-actor assertion is never presented as a filed disclosure. They render with a danger-tone “🦹 Leak Site” chip and a claim banner on the detail view.

Why this source exists: leak-site postings often predate the victim's regulatory filing by weeks to months, and that gap is meaningful — it's the “knew but didn't disclose” window that informs governance, insurance, and plaintiff-bar analysis. Every regulatory disclosure carries a graduated pre_disclosure_leak_gt_30d / gt_90d / gt_180d compliance flag when a same-victim leak-site listing predates it by the corresponding threshold.
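The graduated flag can be sketched as a simple date-gap check. This is a minimal illustration under assumptions: the function name is hypothetical, the comparison is taken to be strictly greater than each threshold (following the "gt" in the flag names), and every exceeded threshold is returned rather than only the highest.

```python
from datetime import date

THRESHOLDS_DAYS = (30, 90, 180)

def pre_disclosure_leak_flags(leak_listed: date, filed: date) -> list[str]:
    """Flags for every threshold the leak-to-filing gap exceeds (assumed semantics)."""
    gap_days = (filed - leak_listed).days  # negative if the filing came first
    return [
        f"pre_disclosure_leak_gt_{threshold}d"
        for threshold in THRESHOLDS_DAYS
        if gap_days > threshold
    ]

# Leak-site listing on 10 Jan 2024, regulatory filing on 1 May 2024: a 112-day gap.
print(pre_disclosure_leak_flags(date(2024, 1, 10), date(2024, 5, 1)))
# ['pre_disclosure_leak_gt_30d', 'pre_disclosure_leak_gt_90d']
```

A filing that predates the leak-site listing yields a negative gap and therefore no flags.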

Treat leak-site rows as the threat actor's assertion, not a confirmed breach. We do not link to .onion claim URLs from the UI even when ransomware.live exposes them. For groups with US sanctions exposure — where OFAC has designated the group or its operators (Evil Corp as an entity; LockBit operators, Feb 2024; Trickbot-gang members tied to Conti, 2023) — we surface a sanctions-caution callout. Public reporting on sanctioned-group activity is permitted (informational-materials exemption, 50 U.S.C. § 1702(b)(3)); payment or material support to those groups is not.

Cross-source linkage

When a ransomware group publicly claims a victim, and the same victim later files a regulatory breach notice, we link the two filings so the disclosure timeline reads as one incident. Linked filings render in a “Linked disclosures” card on each side's detail page with a confidence chip and the gap-days between the leak-site claim and the regulatory filing.

Signals used. The matcher requires the same resolved victim entity on both sides (a one-side-empty or unresolved-name pair is not a candidate). Within a 180-day window, candidate pairs are scored on: entity-id match (always 1.0 in the candidate pool by construction), narrative similarity via Voyage-3-large embeddings (cosine), threat-actor cross-reference (1.0 when both sides resolve to the same threat-actor entity-id, 0.6 for a lower-cased string match), data-type overlap, geographic overlap, malware-family overlap, and affected-count order-of-magnitude. Weights are documented in packages/python/df-core/df_core/incident_match.py; the full matcher source is open in the repo.
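The signals above could combine into a single score roughly as follows. This is a sketch, not the code in incident_match.py: the weights are invented for illustration, Jaccard overlap is an assumed stand-in for the overlap signals, and the order-of-magnitude check is one plausible reading of that signal. Only the 1.0 / 0.6 threat-actor values come from the description above.

```python
import math

# Hypothetical weights, for illustration only. The real values are the ones
# documented in incident_match.py.
WEIGHTS = {
    "entity": 0.30,
    "narrative": 0.25,
    "threat_actor": 0.15,
    "data_types": 0.10,
    "geography": 0.05,
    "malware": 0.05,
    "count_magnitude": 0.10,
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def threat_actor_signal(id_a, id_b, name_a, name_b) -> float:
    """1.0 for the same resolved entity id, 0.6 for a lower-cased name match."""
    if id_a is not None and id_a == id_b:
        return 1.0
    if name_a and name_b and name_a.lower() == name_b.lower():
        return 0.6
    return 0.0

def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap (an assumed definition of the overlap signals)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def same_magnitude(count_a: int, count_b: int) -> float:
    """1.0 when both positive counts share a power of ten, else 0.0 (assumed)."""
    if count_a <= 0 or count_b <= 0:
        return 0.0
    return float(math.floor(math.log10(count_a)) == math.floor(math.log10(count_b)))

def match_score(signals: dict[str, float]) -> float:
    """Weighted sum of the per-signal scores; missing signals count as 0.0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

signals = {
    "entity": 1.0,  # always 1.0 inside the candidate pool
    "narrative": cosine([0.6, 0.8], [0.8, 0.6]),
    "threat_actor": threat_actor_signal(None, None, "Akira", "akira"),
    "data_types": overlap({"ssn", "email"}, {"ssn", "dob"}),
    "geography": overlap({"US-CA"}, {"US-CA"}),
    "malware": 0.0,
    "count_magnitude": same_magnitude(12_000, 48_500),
}
print(round(match_score(signals), 2))
```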

Confidence tiers. Each rendered link carries a tier chip:

Verified by operator — a named DisclosureLens operator clicked Confirm on the pair via the admin review queue. The strongest tier, independent of the numeric score.
Verified — matcher score ≥ 0.85 (auto-merge threshold).
Likely — matcher score ≥ 0.60 and below 0.85 (0.60 is the auto-merge threshold for leak↔regulatory pairs).
Candidate — matcher score below the auto-merge thresholds.
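Read as a decision rule, the tiers map from a score to a chip as sketched below. The function name is hypothetical, and the sketch assumes operator confirmation overrides the numeric score, as the first tier describes.

```python
def link_tier(score: float, operator_confirmed: bool = False) -> str:
    """Map a matcher score to the tier chip shown on a rendered link."""
    if operator_confirmed:
        return "Verified by operator"  # strongest tier, independent of the score
    if score >= 0.85:
        return "Verified"
    if score >= 0.60:
        return "Likely"
    return "Candidate"

for score in (0.91, 0.72, 0.40):
    print(score, link_tier(score))
# 0.91 Verified
# 0.72 Likely
# 0.4 Candidate
```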

Operator review. Borderline pairs default to a human review queue rather than auto-merging silently. An operator reviews the side-by-side comparison and the per-signal score breakdown, then Confirms (links both sides into one incident), Rejects (writes a sticky negative record so the matcher won't re-propose the pair), or Defers. The operator-led policy is intentional: the public-facing surface carries only links that a human has approved or that the matcher has scored above the high-confidence threshold.

False-positive handling. A wrongly linked pair can be unlinked by an operator via the admin merge tool; the unlink writes a sticky record so the matcher won't re-link the same pair. Customers who spot a false positive in their own filings can email contact@disclosurelens.com; we prioritize first-party correction requests over the standard review queue.
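The "sticky" behaviour of a rejection or unlink can be pictured as an order-independent lookup that the matcher consults before proposing a pair. This is a sketch only: the in-memory set, the function names, and the disclosure ids are hypothetical, and a real system would persist the records.

```python
# Sticky negative records: unordered pairs of disclosure ids that an operator
# has rejected or unlinked. Ids and storage are hypothetical.
rejected_pairs: set[frozenset[str]] = set()

def reject(id_a: str, id_b: str) -> None:
    """Record that these two disclosures must not be linked again."""
    rejected_pairs.add(frozenset((id_a, id_b)))

def may_propose(id_a: str, id_b: str) -> bool:
    """The matcher may propose a pair only if no sticky record exists for it."""
    return frozenset((id_a, id_b)) not in rejected_pairs

reject("leak-123", "sec-456")
print(may_propose("sec-456", "leak-123"))  # False: argument order does not matter
print(may_propose("leak-123", "hhs-789"))  # True
```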

Limitations. A threat-actor leak claim is not proof a breach occurred — some leak-site listings are exaggerated, recycled from old breaches, or fabricated. The linkage we surface is “same victim entity, same time window, narrative similarity”; the chip language reflects that confidence rather than a forensic finding. The matcher does not cross sources outside the same victim entity — perpetrator-only linkage (e.g., “all the Akira victims”) lives in the threat-actor pages, not in the disclosure-level cross-source links.

Extraction pipeline

Each source document passes through:

Sanitization — Unicode normalization, dangerous-HTML stripping, bidi override removal, prompt-injection guard.
Triage — the self-hosted Qwen3.6-35B-A3B model (served via MLX) classifies whether the document is a breach disclosure and selects an extraction template.
Structured extraction — the same self-hosted Qwen3.6-35B-A3B model, with Instructor + Pydantic schema validation, produces a BreachDisclosure v1 object with per-field confidence scores.
Hard-pass review — high-stakes fields carry per-field escalation thresholds (threat-actor and malware attribution at 0.85; affected counts, industry tags, and materiality dates at 0.66; body-ungrounded discovery dates at 0.50). A below-threshold field triggers a harder re-extraction pass, and named attributions additionally face an adversarial verify pass before persistence.
Entity resolution — name + LEI/CIK lookup via the GLEIF and SEC EDGAR registries.
Cross-jurisdiction dedup — same incident filed across multiple jurisdictions (SEC + state AG + OCR) is grouped under a canonical incident id.
PII redaction — structured PII (email addresses, SSNs, phone numbers, and payment-card numbers) appearing in incident narratives is replaced with type-tags before customer-visible fields are emitted. Natural-person names are not yet redacted.
Human review queue — any record below the confidence threshold is reviewed by a DisclosureLens operator.
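Three of the stages above lend themselves to a short sketch: part of sanitization, the hard-pass threshold check, and PII redaction. Everything here is illustrative. The NFKC normalization form, the exact set of bidi control characters, the field names, the regular expressions, and the tag format are assumptions; only the threshold values and the four PII types come from the list above. HTML stripping and the prompt-injection guard are not covered.

```python
import re
import unicodedata

# Bidi embedding/override controls (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")

def sanitize(text: str) -> str:
    """Normalize Unicode (NFKC assumed) and drop bidi control characters."""
    return BIDI_CONTROLS.sub("", unicodedata.normalize("NFKC", text))

# Per-field escalation thresholds from the hard-pass stage; field names are hypothetical.
ESCALATION_THRESHOLDS = {
    "threat_actor": 0.85,
    "malware_family": 0.85,
    "affected_count": 0.66,
    "industry_tags": 0.66,
    "materiality_date": 0.66,
    "discovery_date_ungrounded": 0.50,
}

def needs_hard_pass(field: str, confidence: float) -> bool:
    """True when a high-stakes field falls below its escalation threshold."""
    threshold = ESCALATION_THRESHOLDS.get(field)
    return threshold is not None and confidence < threshold

# Simplified patterns; applied in this order so long card numbers are tagged first.
PII_PATTERNS = [
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]

def redact_pii(narrative: str) -> str:
    """Replace structured PII with type-tags such as [EMAIL]."""
    for tag, pattern in PII_PATTERNS:
        narrative = pattern.sub(f"[{tag}]", narrative)
    return narrative

text = "Contact jane@example.com or 555-123-4567; SSN 123-45-6789; card 4111 1111 1111 1111."
print(redact_pii(sanitize(text)))
# Contact [EMAIL] or [PHONE]; SSN [SSN]; card [CARD].
```

The patterns are deliberately narrow (US-style phone and SSN formats), so they would miss many real-world variants. Consistent with the list above, nothing in the sketch attempts to redact natural-person names.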

Extraction audit trail

Every extracted record carries:

extraction.model — the exact model and version (e.g. mlx-community/Qwen3.6-35B-A3B-4bit)
extraction.system_prompt_version — the prompt revision in effect at extraction time
extraction.extracted_at — UTC timestamp; cost_usd, input/output/cache token counts
extraction.overall_confidence — and per-field confidence scores
ai_assisted: true on every record; meta.ai_assisted: true on every API envelope

GET /v1/disclosures/{id}/extraction returns the full envelope. Re-extraction with a newer prompt or model produces a new envelope, not a mutation; the record carries the version chain.
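The "new envelope, not a mutation" rule can be modelled as an append-only chain of immutable records. The sketch below is an assumption about shape, not the BreachDisclosure schema: the class, the helper, the prompt-version strings, the timestamps, and the confidence values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: an envelope cannot be changed after creation
class ExtractionEnvelope:
    model: str
    system_prompt_version: str
    extracted_at: datetime
    overall_confidence: float

def append_extraction(
    chain: tuple[ExtractionEnvelope, ...], envelope: ExtractionEnvelope
) -> tuple[ExtractionEnvelope, ...]:
    """Return a new chain with the envelope appended; earlier entries are untouched."""
    return chain + (envelope,)

first = ExtractionEnvelope(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    system_prompt_version="prompt-v1",  # hypothetical version label
    extracted_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    overall_confidence=0.81,
)
second = ExtractionEnvelope(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    system_prompt_version="prompt-v2",  # re-extraction under a newer prompt
    extracted_at=datetime(2025, 2, 1, tzinfo=timezone.utc),
    overall_confidence=0.90,
)

chain = append_extraction((first,), second)
print([e.system_prompt_version for e in chain])  # ['prompt-v1', 'prompt-v2']
```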

Compliance note: this also satisfies EU AI Act Article 50, which takes effect August 2, 2026. We built it because customers — security analysts, GRC reviewers, plaintiff-bar researchers — need to defend their use of an extracted value in front of someone who will challenge it. The regulation arrived later.

Corrections

Email corrections@disclosurelens.com — 48-hour SLA from receipt.

Provenance

Every record carries:

source.url — the originating regulator-published URL
source.filed_at — the regulator-recorded filing timestamp
source.raw_hash — SHA-256 of the source body at ingestion time
GET /v1/disclosures/{id}/extraction → model, provider, extracted_at, system_prompt_version, token counts, cost_usd
overall_confidence on the same response (per-field confidence is on the roadmap; tracked at extraction time but not yet persisted)
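A reader who re-fetches the source body can compare it with source.raw_hash to see whether the regulator's page has changed since ingestion. The sketch assumes the hash is stored as a lowercase hex digest of the raw response bytes; the encoding and the function names are not confirmed by this page.

```python
import hashlib

def raw_hash(body: bytes) -> str:
    """SHA-256 of the source body, as a lowercase hex digest (assumed encoding)."""
    return hashlib.sha256(body).hexdigest()

def unchanged_since_ingestion(body: bytes, recorded_hash: str) -> bool:
    """True when the freshly fetched body still matches the recorded hash."""
    return raw_hash(body) == recorded_hash.lower()

body = b"<html>example filing body</html>"  # hypothetical source body
recorded = raw_hash(body)                   # what would have been stored at ingestion
print(unchanged_since_ingestion(body, recorded))                # True
print(unchanged_since_ingestion(body + b" (edited)", recorded)) # False
```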

§4.5 Editorial claim — Fair Report Privilege

DisclosureLens publishes structured extracts of public records. Raw filings from federal public-domain regulators (SEC EDGAR, HHS OCR) are rendered inline on the disclosure detail page from the originating regulator’s public record, under the Fair Report Privilege. Raw filings from state attorney-general portals, EU DPA registers, UK ICO datasets, Australia OAIC reports, and threat-actor leak-site aggregators have their structured extracts published (and indexed) on our public surfaces, while their raw source bodies are linked to rather than redistributed in full. The editorial work — schema design, entity resolution, cross-jurisdiction linking, confidence scoring — is the original contribution that anchors the FRP claim across all sources.

SEC + HHS inline rendering. SEC 8-K / 10-K filings and HHS OCR breach-report rows are federal public-domain records with no third-party copy restriction and no defamation surface (regulator-filed, not third-party-asserted). Inline rendering surfaces the original document alongside our structured extract so a reader can verify or correct any field. The rendered document is server-side sanitized (script / iframe / form / style stripped) and embedded in a sandboxed iframe with no script execution; the dashboard never modifies or paraphrases the original body, only the structural rendering.
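The two layers described here, tag stripping plus a sandboxed iframe, might look like the following standard-library sketch. It is not the production sanitizer: the class and function names are hypothetical, the stripper drops each listed element together with its contents (one reading of "stripped"), and it leaves attributes alone.

```python
import html
from html.parser import HTMLParser

STRIPPED_TAGS = {"script", "iframe", "form", "style"}

class _Stripper(HTMLParser):
    """Re-emits the document, skipping stripped elements and everything inside them."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.out: list[str] = []
        self.skip_depth = 0  # > 0 while inside a stripped element

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in STRIPPED_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in STRIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")

def strip_active_content(document: str) -> str:
    parser = _Stripper()
    parser.feed(document)
    parser.close()
    return "".join(parser.out)

def sandboxed_iframe(clean_html: str) -> str:
    """Embed via srcdoc; an empty sandbox attribute disables scripts and forms."""
    return f'<iframe sandbox srcdoc="{html.escape(clean_html, quote=True)}"></iframe>'

doc = '<p>Item 1.05 &amp; more</p><script>alert(1)</script><form><input name="q"></form><style>p{}</style>'
clean = strip_active_content(doc)
print(clean)  # <p>Item 1.05 &amp; more</p>
print(sandboxed_iframe(clean))
```

Inline event-handler attributes such as onclick pass through this sketch unchanged. In this arrangement, the sandbox is what prevents them from running, since a sandboxed iframe without allow-scripts executes no script.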

State AG + leak-site posture (unchanged). State AG portals carry varying per-portal terms-of-use; threat-actor claims carry defamation risk by their nature (the claim is the threat actor’s accusation, not a confirmed breach). Both surfaces retain the “structured extract + link to source” posture so the editorial transformation remains the public-facing contribution.