NSF SBIR Applicant · Spring 2026

Research-grade data, not scraped dumps.

Empire Data Solutions is building a cross-source entity resolution layer across 40+ U.S. public-record sources, including FMCSA, SEC EDGAR, state corporate registries, CMS NPPES, USAspending, EPA ECHO, and state professional licensing boards. Output: a unified entity_id and a per-field confidence score attached to every record. Work in progress. Prototype targeting Q4 2026.

UEI HMMYRMPX6XK3 · EIN 42-1863044 · LLC KY 1567795.06.99999 · NAICS 518210 · 541512
// METHODOLOGY

Three open research problems, one commercial thesis.

The public-record ecosystem generates hundreds of millions of records per year across federal, state, and municipal sources, with no unique cross-source identifier. Existing open-source tooling (Dedupe.io, Splink, Zingg) handles homogeneous deduplication within a single source. Cross-schema, cross-authority resolution remains open. We're working on three tightly coupled problems.

01

Cross-Source Entity Resolution

Resolve "same company" across FMCSA (DOT number), SEC EDGAR (CIK), IRS Pub 78 (EIN), state corp registries (state-specific IDs), USAspending (recipient UEI), and EPA ECHO (facility ID). There is no shared primary key, and matching has to cope with name variants, historical addresses, and DBA/parent-subsidiary lineage. Approach: locality-sensitive hashing (LSH) for candidate generation, followed by a LightGBM pairwise classifier trained on 20,000 hand-labeled cross-source pairs. Target: precision ≥ 0.95 at recall ≥ 0.85 on held-out validation.

Record Linkage · LSH Blocking · LightGBM · Active Learning
STATUS: Prototype · seed corpus in labeling
BENCHMARK: Targeting public release Q4 2026
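
A minimal sketch of the candidate-generation step in problem 01, assuming MinHash signatures over character trigrams of the entity name with banded LSH buckets. The function names, the trigram choice, and the `num_perm`/`bands` values are illustrative assumptions, not the production design; the LightGBM pairwise classifier that would score the resulting pairs is not shown.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(name, k=3):
    """Character k-grams of a whitespace- and case-normalized name."""
    s = " ".join(name.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def minhash(shingle_set, num_perm=32):
    """MinHash signature: for each seed, the smallest hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{sh}".encode(), digest_size=8).digest(),
                "big",
            )
            for sh in shingle_set
        ))
    return tuple(signature)


def candidate_pairs(records, num_perm=32, bands=16):
    """Return record-id pairs that share at least one LSH band bucket.

    `records` maps a hypothetical source-qualified id to an entity name.
    """
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rid, name in records.items():
        sig = minhash(shingles(name), num_perm)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records; the ids and names are made up for illustration.
records = {
    "fmcsa:1": "ACME Freight LLC",
    "sec:2": "acme  freight llc",
    "epa:3": "Zenith Chemical Corp",
}
pairs = candidate_pairs(records)
print(pairs)
```

Only pairs that share a bucket go on to the pairwise classifier, which keeps the number of comparisons far below the all-pairs count. More bands with fewer rows per band raise recall at the cost of more candidates.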
02

Record-Level Confidence Scoring

Binary "match/no-match" outputs fail downstream compliance workflows. Every resolved record carries a continuous confidence score per field, produced by a secondary classifier trained to predict whether the primary resolver's output is correct. Approach: the meta-labeling framework (López de Prado, 2018), adapted from quantitative finance and, to our knowledge, not previously applied to record linkage. Commercial hook: the first data vendor that tells you which rows to trust.

Meta-Labeling · Calibration · Uncertainty
STATUS: Literature review · method adaptation in progress
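
The meta-labeling idea in problem 02 can be sketched as follows, under loud assumptions: the secondary model here is a tiny hand-rolled logistic regression, the two features (primary score, field similarity) and all data values are invented for illustration, and the real method adaptation is still in progress. The point is only the structure: the secondary model is trained on the primary resolver's predicted matches, with the label "was the primary correct?".

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent on log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def confidence(model, x):
    """Estimated probability that the primary resolver's match is correct."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical primary-resolver output:
# ((primary_score, field_similarity), primary_says_match, hand_label_is_match)
primary_output = [
    ((0.95, 0.90), True, True),
    ((0.90, 0.80), True, True),
    ((0.85, 0.95), True, True),
    ((0.60, 0.30), True, False),
    ((0.55, 0.20), True, False),
    ((0.70, 0.10), True, False),
    ((0.10, 0.05), False, False),  # primary negatives are not meta-labeled
]

# Meta-labels: 1 if the primary's predicted match was correct, else 0.
meta_X = [list(x) for x, predicted, _ in primary_output if predicted]
meta_y = [int(truth) for _, predicted, truth in primary_output if predicted]

model = fit_logistic(meta_X, meta_y)
high = confidence(model, [0.90, 0.90])
low = confidence(model, [0.60, 0.20])
print(f"confidence(high)={high:.3f} confidence(low)={low:.3f}")
```

In practice the raw scores would still need calibration against a held-out set before being reported as per-field confidence.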
03

Predictive Refresh Scheduling

Refreshing a 200M+ row inventory on flat schedules wastes 85% of the compute budget. Instead, predict per-field change probability between refresh cycles and target the 15% of records most likely to have changed. Approach: survival analysis plus gradient-boosted regression on 90 days of held-out update events from state corporate registries. Target: 5-10x infrastructure cost reduction versus a flat-schedule baseline.

Survival Analysis · Change Prediction · Cost Optimization
STATUS: Scoped · Phase II candidate
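
To make the scheduling idea in problem 03 concrete, here is a deliberately simplified sketch. It swaps the planned survival-analysis and gradient-boosted model for a constant-hazard (exponential) assumption per record, so the change probability over a horizon is 1 - exp(-hazard × horizon). The function names, the `history` layout, and all numbers are hypothetical.

```python
import math


def change_probability(n_changes, days_observed, horizon_days):
    """P(at least one change within the horizon) under a constant hazard."""
    hazard = n_changes / days_observed  # observed changes per day
    return 1.0 - math.exp(-hazard * horizon_days)


def select_for_refresh(history, horizon_days, budget_fraction=0.15):
    """Pick the records most likely to have changed, up to the budget.

    `history` maps record id -> (observed change count, days observed).
    """
    scored = sorted(
        ((change_probability(c, d, horizon_days), rid)
         for rid, (c, d) in history.items()),
        reverse=True,
    )
    k = max(1, round(budget_fraction * len(scored)))
    return [rid for _, rid in scored[:k]]


# Hypothetical history: record r<i> changed i times over one year.
history = {f"r{i}": (i, 365) for i in range(20)}
selected = select_for_refresh(history, horizon_days=30)
print(selected)
```

Refreshing 15% of records per cycle instead of 100% means roughly 1 / 0.15 ≈ 6.7x fewer fetches, which sits inside the 5-10x target before accounting for model cost or missed changes.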
// ROADMAP

Where this goes.

Phase I targets a working entity-resolution pipeline with public benchmark release. Phase II scales to all 50 states + sector-specialized resolvers (healthcare, financial services, freight/logistics).

Q2 2026

Seed corpus + benchmark

Hand-label 20,000 cross-source pairs spanning FMCSA↔SEC, FMCSA↔state corp, NPI↔state license. Release as open benchmark.

Q3 2026

Resolution prototype

LightGBM pairwise resolver + LSH blocking. Validate precision/recall targets on held-out set. Open-source reference implementation.
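
One way the validation step could be checked against the stated targets (precision ≥ 0.95 at recall ≥ 0.85), sketched with invented scores and labels. The helper names and the threshold sweep are assumptions for illustration, not the benchmark's evaluation protocol.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are called matches."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def threshold_meeting_target(scores, labels, min_precision=0.95, min_recall=0.85):
    """Lowest score threshold that satisfies both targets, or None."""
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision and r >= min_recall:
            return t
    return None


# Hypothetical held-out classifier scores and hand labels (1 = same entity).
scores = [0.99, 0.97, 0.90, 0.80, 0.40, 0.20]
labels = [1, 1, 1, 0, 0, 0]
print(threshold_meeting_target(scores, labels))
```

Returning `None` signals that no single operating point meets both targets on the held-out set.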

Q4 2026

Commercial integration

"Resolved Entity" product tier on empiredatasolution.com. Cross-source linked records for enterprise insurance, freight, compliance buyers.

2027

Phase II scale-out

Refresh scheduler + confidence framework in production. Technical paper targeting SIGKDD or CIKM.

// Honest status, April 2026

What exists today: Empire Data Solutions ships commercial data products sourced from 40+ federal and state public-record agencies. 211M source rows ingested, 60M+ normalized records in the active catalog. Paying customers. Full source provenance per delivery.

What does not exist yet: The entity-resolution layer described above. We are in seed-corpus construction. NSF SBIR Phase I pitch filing planned for Q2 2026. Full benchmark and open-source reference implementation targeted by end of 2026. We don't claim capabilities we don't have.

Academic collaboration welcome. We're actively seeking record-linkage researchers for advisor roles on the Phase I award. Labeled seed corpus, held-out validation set, and compute budget available to qualifying collaborators.

// COLLABORATE

Research or commercial.

Academic partnerships for Phase I advisors. Enterprise partnerships for early access to resolved-entity product tier. Two different conversations — one email address.