Empire Data Solutions is building a cross-source entity resolution layer across 40+ U.S. public record sources — FMCSA, SEC EDGAR, state corporate registries, CMS NPPES, USAspending, EPA ECHO, and state professional licensing boards.
Output: a unified entity_id and per-field confidence score attached to every record. Work in progress. Prototype targeting Q4 2026.
The public-record ecosystem generates hundreds of millions of records per year across federal, state, and municipal sources, with no unique cross-source identifier. Existing open-source tooling (Dedupe.io, Splink, Zingg) is strongest at deduplication and linkage when records share a schema, typically within a single source. Cross-schema, cross-authority resolution remains open. We're working on three tightly coupled problems.
Resolve "same company" across FMCSA (DOT number), SEC EDGAR (CIK), IRS Pub 78 (EIN), state corp registries (state-specific IDs), USAspending (recipient UEI), and EPA ECHO (facility ID). No shared primary key. Name variants, historical addresses, DBA/parent-subsidiary lineage. Approach: locality-sensitive hashing for candidate generation + LightGBM pairwise classifier trained on 20,000 hand-labeled cross-source pairs. Target: precision ≥ 0.95 at recall ≥ 0.85 on held-out validation.
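A minimal sketch of the candidate-generation step, assuming MinHash signatures over character trigrams of the entity name with banded LSH buckets. The function names, the 32-hash signature, and the 16x2 banding are illustrative assumptions, not the production configuration. The pairwise LightGBM classifier is not shown; it would consume the surviving pairs along with features such as the Jaccard similarity computed here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32              # signature length (assumed for illustration)
BANDS = 16                   # 16 bands of 2 rows each (assumed)
ROWS = NUM_HASHES // BANDS

def shingles(name, k=3):
    """Character k-grams of a lowercased, whitespace-collapsed name."""
    s = " ".join(name.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(tokens):
    """MinHash signature: for each seed, the minimum hash over all tokens."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))

def candidate_pairs(records):
    """records maps record_id -> name. Returns the set of id pairs that
    share at least one LSH band bucket."""
    buckets = defaultdict(set)
    for rid, name in records.items():
        sig = minhash(shingles(name))
        for band_idx in range(BANDS):
            band = sig[band_idx * ROWS:(band_idx + 1) * ROWS]
            buckets[(band_idx, band)].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def jaccard(name_a, name_b):
    """Trigram Jaccard similarity, one possible pairwise feature."""
    sa, sb = shingles(name_a), shingles(name_b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical records from three sources, keyed by source-prefixed IDs.
records = {
    "fmcsa:1": "Acme Freight LLC",
    "sec:9": "ACME  Freight LLC",
    "epa:3": "Zenith Chemical Corp",
}
pairs = candidate_pairs(records)
```

Only pairs that collide in at least one band reach the classifier, so the quadratic comparison over all cross-source records is avoided. Fewer rows per band raises recall at the cost of more candidates.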
Binary "match/no-match" outputs fail downstream compliance workflows. Every resolved record carries a continuous per-field confidence score, produced by a secondary classifier trained to predict whether the primary resolver's outputs are correct. Approach: meta-labeling framework (López de Prado, 2018) adapted from quantitative finance and, to our knowledge, not previously applied to record linkage. Commercial hook: first data vendor that tells you which rows to trust.
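The meta-labeling idea can be sketched in a few lines: the meta-label records whether the primary resolver's decision agreed with the hand label, and a secondary model estimates the probability of that agreement. The secondary model below is a deliberately simple stand-in (a smoothed per-bin correctness rate over the primary score), not the classifier planned for the product. All names and data are hypothetical, and a per-field version would presumably fit one such model per field.

```python
from collections import defaultdict

def meta_labels(primary_decisions, truth):
    """1 where the primary resolver's decision agreed with the hand label, else 0."""
    return [int(d == t) for d, t in zip(primary_decisions, truth)]

class BinnedConfidence:
    """Toy secondary model: rate at which the primary resolver was correct,
    estimated separately within each primary-score bin."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def _bin(self, score):
        # Scores are assumed to lie in [0, 1]; a score of 1.0 falls in the top bin.
        return min(int(score * self.n_bins), self.n_bins - 1)

    def fit(self, primary_scores, labels):
        for score, label in zip(primary_scores, labels):
            b = self._bin(score)
            self.correct[b] += label
            self.total[b] += 1
        return self

    def predict(self, score):
        b = self._bin(score)
        # Laplace smoothing: an empty bin yields 0.5 instead of dividing by zero.
        return (self.correct[b] + 1) / (self.total[b] + 2)

# Hypothetical primary-resolver scores and hand labels (1 = same entity).
scores = [0.95, 0.92, 0.55, 0.52, 0.05, 0.08]
decisions = [int(s >= 0.5) for s in scores]
truth = [1, 1, 0, 1, 0, 0]

labels = meta_labels(decisions, truth)
model = BinnedConfidence(n_bins=5).fit(scores, labels)
confident = model.predict(0.93)    # bin where the primary was always right
borderline = model.predict(0.50)   # bin where the primary was right half the time
```

In this toy data the primary resolver is reliable at the extremes and unreliable near its decision threshold, which is the pattern the confidence score is meant to surface to downstream users.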
Refreshing a 200M+ row inventory on a flat schedule wastes roughly 85% of the compute budget on records that have not changed. Instead, predict per-field change probability between refresh cycles and target the 15% of records most likely to have changed. Approach: survival analysis + gradient-boosted regression on 90 days of held-out update events from state corporate registries. Target: 5-10x infrastructure cost reduction versus flat-schedule baseline.
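As a rough illustration of change-probability scheduling, the sketch below uses the simplest survival model, a constant hazard (exponential) rate estimated from observed update events, in place of the survival analysis and gradient-boosted regression named above. The record fields and the `budget_fraction` parameter are hypothetical.

```python
import math

def change_rate(n_changes, days_observed):
    """Maximum-likelihood hazard under a constant-rate (exponential) model."""
    return n_changes / days_observed

def change_probability(rate, days_since_refresh):
    """P(at least one change in the interval) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * days_since_refresh)

def select_for_refresh(records, budget_fraction=0.15):
    """Return the ids of the records most likely to have changed,
    limited to budget_fraction of the inventory (at least one record)."""
    scored = []
    for r in records:
        rate = change_rate(r["n_changes"], r["days_observed"])
        prob = change_probability(rate, r["days_since_refresh"])
        scored.append((prob, r["id"]))
    scored.sort(reverse=True)
    k = max(1, math.ceil(budget_fraction * len(scored)))
    return [rid for _, rid in scored[:k]]

# Hypothetical 90-day observation window for four registry records.
records = [
    {"id": "a", "n_changes": 9, "days_observed": 90, "days_since_refresh": 30},
    {"id": "b", "n_changes": 0, "days_observed": 90, "days_since_refresh": 30},
    {"id": "c", "n_changes": 1, "days_observed": 90, "days_since_refresh": 30},
    {"id": "d", "n_changes": 3, "days_observed": 90, "days_since_refresh": 30},
]
to_refresh = select_for_refresh(records, budget_fraction=0.5)
```

If refresh cost scales with the number of rows refreshed, touching 15% of the inventory per cycle is about 1 / 0.15 ≈ 6.7x fewer refresh operations than a flat schedule, which falls inside the stated 5-10x target range.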
Phase I targets a working entity-resolution pipeline with a public benchmark release. Phase II scales to all 50 states and adds sector-specialized resolvers (healthcare, financial services, freight/logistics).
Hand-label 20,000 cross-source pairs spanning FMCSA↔SEC, FMCSA↔state corp, NPI↔state license. Release as open benchmark.
LightGBM pairwise resolver + LSH blocking. Validate precision/recall targets on held-out set. Open-source reference implementation.
"Resolved Entity" product tier on empiredatasolution.com. Cross-source linked records for enterprise insurance, freight, compliance buyers.
Refresh scheduler + confidence framework in production. Technical paper targeting SIGKDD or CIKM.
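Validating the resolver's stated target (precision ≥ 0.95 at recall ≥ 0.85 on held-out pairs) amounts to sweeping the decision threshold and checking whether any operating point satisfies both bounds. A minimal sketch, with hypothetical function names and toy data:

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are called matches."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def meets_target(scores, labels, min_precision=0.95, min_recall=0.85):
    """Return (threshold, precision, recall) for the highest threshold that
    satisfies both bounds, or None if no threshold does."""
    for threshold in sorted(set(scores), reverse=True):
        p, r = precision_recall_at(scores, labels, threshold)
        if p >= min_precision and r >= min_recall:
            return threshold, p, r
    return None

# Toy held-out set: classifier scores and hand labels (1 = same entity).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 1, 0, 1, 0]

strict = meets_target(scores, labels)                    # default 0.95 / 0.85 target
relaxed = meets_target(scores, labels, min_recall=0.8)   # looser recall bound
```

On this toy data the default target is not met, because reaching the fifth true match requires admitting a false positive first; relaxing the recall bound to 0.8 yields an operating point at threshold 0.6.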
What exists today: Empire Data Solutions ships commercial data products sourced from 40+ federal and state public-record agencies. 211M source rows ingested, 60M+ normalized records in the active catalog. Paying customers. Full source provenance per delivery.
What does not exist yet: The entity-resolution layer described above. We are in seed-corpus construction. NSF SBIR Phase I pitch filing Q2 2026. Full benchmark and open-source reference implementation targeted by end of 2026. We don't claim capabilities we don't have.
Academic collaboration welcome. We're actively seeking record-linkage researchers for advisor roles on the Phase I award. A labeled seed corpus, held-out validation set, and compute budget are available to qualifying collaborators.
Academic partnerships for Phase I advisors. Enterprise partnerships for early access to resolved-entity product tier. Two different conversations — one email address.