DisclosureLens

Sources & Methodology

This page documents what DisclosureLens does to a regulator-published filing between the moment it appears on a public portal and the moment it appears in the API. Eight pipeline stages, one schema, four model-grade SLOs.

Accuracy SLOs

Source list

We ingest only primary regulatory sources plus declared press supplementation. Current Tier-1 sources (V1):

Threat-actor leak sites

We additionally ingest ransomware-group leak-site claims via ransomware.live, a public aggregator of structured listings from threat-actor extortion blogs (data used under the ransomware.live API PRO terms; Source: Ransomware.live). These rows are hidden from the default disclosure feed. They only appear when the user opts in via the “Show threat-actor claims” toggle. They render with a danger-tone “🦹 Leak Site” chip and a claim banner on the detail view.

Why this source exists: leak-site postings often predate the victim's regulatory filing by weeks to months, and that gap is meaningful — it's the “knew but didn't disclose” window that informs governance, insurance, and plaintiff-bar analysis. Every regulatory disclosure carries a graduated pre_disclosure_leak_gt_30d / gt_90d / gt_180d compliance flag when a same-victim leak-site listing predates it by the corresponding threshold.
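The graduated flags can be sketched as a date comparison. This is a minimal illustration, not the production code: the function name and arguments are hypothetical, and it assumes "gt" means strictly greater than the threshold and that the flags are cumulative (a 105-day gap sets both the 30d and 90d flags).

```python
from datetime import date

# Thresholds in days, taken from the flag names in the text above.
THRESHOLDS = (30, 90, 180)


def pre_disclosure_flags(leak_listed: date, filing_published: date) -> dict[str, bool]:
    """Return one boolean per threshold for a leak listing that predates a filing.

    Hypothetical helper: assumes a strict greater-than comparison on whole days.
    """
    gap_days = (filing_published - leak_listed).days
    return {f"pre_disclosure_leak_gt_{t}d": gap_days > t for t in THRESHOLDS}


flags = pre_disclosure_flags(date(2024, 1, 1), date(2024, 4, 15))
```

A listing that appears after the filing yields a negative gap, so every flag stays false.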

Treat leak-site rows as the threat actor's assertion, not a confirmed breach. We do not link to .onion claim URLs from the UI even when ransomware.live exposes them. For groups designated under OFAC sanctions (Conti, Evil Corp, LockBit operators designated Feb 2024, ALPHV/BlackCat), we surface a sanctions-caution callout. Public reporting on sanctioned-group activity is permitted under OFAC general license D; payment or material support to those groups is not.

Cross-source linkage

When a ransomware group publicly claims a victim, and the same victim later files a regulatory breach notice, we link the two filings so the disclosure timeline reads as one incident. Linked filings render in a “Linked disclosures” card on each side's detail page with a confidence chip and the gap-days between the leak-site claim and the regulatory filing.

Signals used. The matcher requires the same resolved victim entity on both sides; a pair in which either side has an empty or unresolved victim name is not a candidate. Within a 180-day window, candidate pairs are scored on: entity-id match (always 1.0 in the candidate pool by construction), narrative similarity (cosine similarity of Voyage-3-large embeddings), threat-actor cross-reference (1.0 when both sides resolve to the same threat-actor entity-id, 0.6 for a lower-cased string match), data-type overlap, geographic overlap, malware-family overlap, and affected-count order-of-magnitude agreement. Weights are documented in packages/python/df-core/df_core/incident_match.py; the full matcher source is open in the repo.
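The candidate filter and weighted score described above might look roughly like the sketch below. The weights are placeholders invented for illustration (the real values are in incident_match.py), the function names are hypothetical, and the 180-day window is assumed to apply in either direction.

```python
from datetime import date
from typing import Optional

WINDOW_DAYS = 180  # from the text above

# ASSUMED weights, for illustration only; not the values in incident_match.py.
ASSUMED_WEIGHTS = {
    "entity": 0.30,
    "narrative": 0.25,
    "threat_actor": 0.20,
    "data_types": 0.10,
    "geography": 0.05,
    "malware_family": 0.05,
    "affected_count": 0.05,
}


def is_candidate(leak_entity: Optional[str], filing_entity: Optional[str],
                 leak_date: date, filing_date: date) -> bool:
    """Both sides must resolve to the same victim entity inside the window."""
    if not leak_entity or leak_entity != filing_entity:
        return False
    return abs((filing_date - leak_date).days) <= WINDOW_DAYS


def threat_actor_signal(a_id: Optional[str], b_id: Optional[str],
                        a_name: Optional[str], b_name: Optional[str]) -> float:
    """1.0 for a shared entity-id, 0.6 for a lower-cased name match, else 0.0."""
    if a_id and a_id == b_id:
        return 1.0
    if a_name and b_name and a_name.lower() == b_name.lower():
        return 0.6
    return 0.0


def combined_score(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores; missing signals count as 0.0."""
    return sum(w * signals.get(name, 0.0) for name, w in ASSUMED_WEIGHTS.items())
```

Keeping each signal as its own number, rather than only the sum, is what makes a per-signal breakdown available to a reviewer.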

Confidence tiers. Each rendered link carries a tier chip:

Operator review. Borderline pairs go to a human review queue by default rather than being merged automatically. An operator reviews the side-by-side comparison and the per-signal score breakdown, then confirms (links both sides into one incident), rejects (writes a sticky negative record so the matcher won't re-propose the pair), or defers. The operator-led policy is intentional: the public-facing surface carries only links that a human has approved or that the matcher has scored above the high-confidence threshold.

False-positive handling. A wrongly linked pair can be unlinked by an operator via the admin merge tool; the unlink writes a sticky record so the matcher won't re-link the same pair. Customers who spot a false positive in their own filings can email contact@disclosurelens.com; we prioritize first-party correction requests over the standard review queue.
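The review and unlink rules can be read as a small routing function. The sketch below is illustrative only: both thresholds are assumed values (the page does not state them), and the function names and return labels are hypothetical.

```python
# ASSUMED thresholds for illustration; the real cut-offs are not stated here.
HIGH_CONFIDENCE = 0.85
REVIEW_FLOOR = 0.60


def pair_key(leak_id: str, filing_id: str) -> frozenset:
    """Order-independent key so (a, b) and (b, a) are the same pair."""
    return frozenset((leak_id, filing_id))


def route(leak_id: str, filing_id: str, score: float, sticky_negatives: set) -> str:
    """Decide what happens to a scored candidate pair."""
    if pair_key(leak_id, filing_id) in sticky_negatives:
        return "suppressed"      # an operator rejected or unlinked this pair
    if score >= HIGH_CONFIDENCE:
        return "publish"         # above the high-confidence threshold
    if score >= REVIEW_FLOOR:
        return "review_queue"    # borderline: a human decides
    return "discard"


sticky = {pair_key("leak-7", "filing-42")}
```

Checking the sticky set before the score is what keeps a rejected pair from reappearing, however high it scores on a later run.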

Limitations. A threat-actor leak claim is not proof a breach occurred — some leak-site listings are exaggerated, recycled from old breaches, or fabricated. The linkage we surface is “same victim entity, same time window, narrative similarity”; the chip language reflects that confidence rather than a forensic finding. The matcher does not cross sources outside the same victim entity — perpetrator-only linkage (e.g., “all the Akira victims”) lives in the threat-actor pages, not in the disclosure-level cross-source links.

Extraction pipeline

Each source document passes through:

  1. Sanitization — Unicode normalization, dangerous-HTML stripping, bidi override removal, prompt-injection guard.
  2. Triage — Claude Haiku 4.5 classifier decides if the document is a breach disclosure and selects an extraction template.
  3. Structured extraction — Claude Sonnet 4.6 with Instructor + Pydantic schema validation produces a BreachDisclosure v1 object with per-field source-span citations and per-field confidence scores.
  4. Hard-pass review — Claude Opus 4.7 with extended thinking re-extracts any document where the Pass-2 (structured extraction) overall confidence is < 0.80 or any single field is < 0.70.
  5. Entity resolution — name + LEI/CIK lookup via the GLEIF and SEC EDGAR registries.
  6. Cross-jurisdiction dedup — same incident filed across multiple jurisdictions (SEC + state AG + OCR) is grouped under a canonical incident id.
  7. PII redaction — natural-person names appearing in incident narratives are replaced with type-tags before customer-visible fields are emitted.
  8. Human review queue — any record below the confidence threshold is reviewed by a DisclosureLens operator.
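The escalation rule in step 4 reduces to a single predicate. In this minimal sketch the 0.80 and 0.70 floors come from the list above, while the function name and the shape of the per-field confidence mapping are assumptions.

```python
OVERALL_FLOOR = 0.80  # from step 4 above
FIELD_FLOOR = 0.70    # from step 4 above


def needs_hard_pass(overall: float, field_confidence: dict[str, float]) -> bool:
    """True when the overall score or any single field falls below its floor."""
    return overall < OVERALL_FLOOR or any(
        score < FIELD_FLOOR for score in field_confidence.values()
    )
```

A document with a high overall score is still escalated if one field is weak, for example needs_hard_pass(0.92, {"affected_count": 0.55}).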

Extraction audit trail

Every extracted record carries:

GET /v1/disclosures/{id}/extraction returns the full envelope. Re-extraction with a newer prompt or model produces a new envelope, not a mutation; the record carries the version chain.
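The "new envelope, not a mutation" rule can be illustrated with immutable records linked backwards. Every field name below is hypothetical, since the actual envelope schema is not reproduced on this page; the sketch only shows the append-only version chain.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class ExtractionEnvelope:
    # All field names are illustrative assumptions, not the real schema.
    envelope_id: str
    disclosure_id: str
    model: str
    prompt_version: str
    supersedes: Optional[str] = None  # id of the envelope this one replaces


def re_extract(previous: ExtractionEnvelope, envelope_id: str,
               model: str, prompt_version: str) -> ExtractionEnvelope:
    """Create a new envelope that points back at the previous one."""
    return ExtractionEnvelope(envelope_id, previous.disclosure_id,
                              model, prompt_version,
                              supersedes=previous.envelope_id)


def version_chain(latest: ExtractionEnvelope,
                  by_id: dict[str, ExtractionEnvelope]) -> list[str]:
    """Walk from the newest envelope back to the first, newest first."""
    chain, current = [], latest
    while current is not None:
        chain.append(current.envelope_id)
        current = by_id.get(current.supersedes) if current.supersedes else None
    return chain


v1 = ExtractionEnvelope("env-1", "disc-9", "model-a", "prompt-v1")
v2 = re_extract(v1, "env-2", "model-b", "prompt-v2")
store = {e.envelope_id: e for e in (v1, v2)}
```

Because the dataclass is frozen, assigning to a field of an existing envelope raises FrozenInstanceError, so the only way to change a record is to append a new envelope.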

Compliance note: the audit trail also satisfies EU AI Act Article 50, which takes effect August 2, 2026. We built it because customers (security analysts, GRC reviewers, plaintiff-bar researchers) need to defend their use of an extracted value in front of someone who will challenge it. The regulation arrived later.

Corrections

Email corrections@disclosurelens.com — 48-hour SLA from receipt.

Provenance

Every record carries:

Editorial claim: Fair Report Privilege

DisclosureLens publishes structured extracts of public records. Raw filings from federal public-domain regulators (SEC EDGAR, HHS OCR) are rendered inline on the disclosure detail page from the originating regulator’s public record, under the Fair Report Privilege. Raw filings from state attorney-general portals, EU DPA registers, UK ICO datasets, Australia OAIC reports, and threat-actor leak-site aggregators are linked to but not redistributed in full from our public surfaces. The editorial work — schema design, entity resolution, cross-jurisdiction linking, confidence scoring — is the original contribution that anchors the FRP claim across all sources.

SEC + HHS inline rendering. SEC 8-K / 10-K filings and HHS OCR breach-report rows are federal public-domain records with no third-party copy restriction and no defamation surface (regulator-filed, not third-party-asserted). Inline rendering surfaces the original document alongside our structured extract so a reader can verify or correct any field. The rendered document is server-side sanitized (script / iframe / form / style stripped) and embedded in a sandboxed iframe with no script execution; the dashboard never modifies or paraphrases the original body, only the structural rendering.
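One way to embed an already-sanitized document without script execution is an iframe with an empty sandbox attribute, which applies every sandbox restriction. The helper below is a hypothetical sketch of that embedding step only; it assumes sanitization has already happened and does not reflect the dashboard's actual rendering code.

```python
import html


def sandboxed_iframe(sanitized_html: str, title: str) -> str:
    """Wrap sanitized HTML in an iframe that blocks scripts, forms, and navigation.

    An empty sandbox attribute enables all restrictions. The document is passed
    via srcdoc, so it must be attribute-escaped.
    """
    return (
        f'<iframe sandbox="" title="{html.escape(title, quote=True)}" '
        f'srcdoc="{html.escape(sanitized_html, quote=True)}"></iframe>'
    )


markup = sandboxed_iframe('<p>"Form 8-K"</p>', "Original filing")
```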

State AG + leak-site posture (unchanged). State AG portals carry varying per-portal terms-of-use; threat-actor claims carry defamation risk by their nature (the claim is the threat actor’s accusation, not a confirmed breach). Both surfaces retain the “structured extract + link to source” posture so the editorial transformation remains the public-facing contribution.