NSF SBIR Applicant · Topic AA4 · Spring 2026

The unsolved problem behind government data.

Empire Data Solutions has aggregated 60M+ records from 40+ U.S. federal and state public-record sources as preliminary data for an open ML research problem: cross-source entity resolution across schema-diverse government registries with no shared identifier and no labeled training data. Phase I research scope: cross-source entity resolution, record-level confidence scoring, and LLM-driven schema inference for administrative records. Output: a unified entity_id + per-field confidence + reproducible benchmark dataset, targeting NSF SBIR Phase I 2026.

UEI HMMYRMPX6XK3 · EIN 42-1863044 · LLC KY 1567795 · CAGE 1ZNV9 · NAICS 518210 · 541512

// METHODOLOGY

Three open research problems, one commercial thesis.

The public-record ecosystem generates hundreds of millions of records per year across federal, state, and municipal sources, with no unique cross-source identifier. Existing open-source tooling (Dedupe.io, Splink, Zingg) targets deduplication and linkage over records that have already been mapped to a common schema. Cross-schema, cross-authority resolution remains open. We're working on three tightly coupled problems.

Cross-Source Entity Resolution

Resolve "same company" across FMCSA (DOT number), SEC EDGAR (CIK), IRS Pub 78 (EIN), state corp registries (state-specific IDs), USAspending (recipient UEI), and EPA ECHO (facility ID). These sources share no primary key, and matching must contend with name variants, historical addresses, and DBA/parent-subsidiary lineage. Resolving entities across schema-diverse government registries without labeled ground truth is an active open ML problem; recent NSF SBIR awards have funded analogous heterogeneous-source linkage work in healthcare and scientific data domains. Approach: locality-sensitive hashing for candidate generation + gradient-boosted pairwise classifier with active-learning loop over a hand-labeled seed corpus drawn from our existing 30-source aggregation. Target: precision ≥ 0.95 at recall ≥ 0.85 on held-out cross-source validation; benchmark dataset published as Phase I deliverable.

Record Linkage · LSH Blocking · LightGBM · Active Learning

STATUS: Phase I research target · seed corpus in construction
BENCHMARK: Public release targeted Q1 2027
RELEVANT NSF TOPIC: AA4 (Knowledge & Data Management Technologies)
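The candidate-generation step can be sketched with MinHash, one common LSH family for set similarity. This is a minimal, standard-library-only illustration and not the project's pipeline: the record IDs, the character 3-gram shingles, the name normalization, and the band/row settings are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(name, k=3):
    """Character k-grams of a name after lowercasing and dropping punctuation."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    s = " ".join(kept.split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def minhash(tokens, num_perm):
    """MinHash signature: for each seeded hash, keep the minimum over all tokens."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(), "big"
            )
            for t in tokens
        ))
    return sig

def candidate_pairs(records, bands=16, rows=2):
    """Band the signatures; records sharing any band bucket become a candidate pair."""
    buckets = defaultdict(set)
    for rec_id, name in records.items():
        sig = minhash(shingles(name), num_perm=bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical records from three registries (IDs and names are illustrative).
records = {
    "fmcsa:1": "Acme Trucking LLC",
    "sec:2": "ACME TRUCKING, L.L.C.",
    "npi:3": "Blue Ridge Medical Group",
}
print(candidate_pairs(records))
```

The two Acme spellings normalize to the same string, so they share every band and always surface as a candidate pair; the unrelated name shares no 3-grams with them and is left out. Only the surviving pairs would be passed to the more expensive pairwise classifier, which is the point of blocking.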

Record-Level Confidence Scoring

Binary "match/no-match" outputs fail downstream compliance workflows where regulated buyers (insurance underwriters, healthcare credentialing, legal due-diligence) need to know which rows are trustworthy before acting on them. In our design, every resolved record carries a continuous per-field confidence score, driven by a secondary classifier trained to predict whether the primary resolver's outputs are correct. Approach: meta-labeling framework (López de Prado, 2018) adapted from quantitative finance and, to our knowledge, not previously applied to record linkage. Commercial hook: first data vendor that tells you which rows to trust, before you stake money on them.

Meta-Labeling · Calibration · Uncertainty

STATUS: Phase I research target · method adaptation in progress
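The meta-labeling idea can be illustrated with a deliberately small sketch. Here the primary resolver is reduced to a threshold on a similarity score, and a binned frequency estimate stands in for the secondary classifier; a real adaptation would use a trained model and score individual fields rather than whole pairs. All function names, the bin count, and the toy labels are assumptions for illustration.

```python
from collections import defaultdict

def primary_match(score, threshold=0.5):
    """Primary resolver: a hard match/no-match call from a similarity score."""
    return score >= threshold

def fit_meta_model(scores, truths, threshold=0.5, n_bins=5):
    """Secondary model: among pairs the primary calls a match, estimate per
    score bin how often that call was correct on labeled seed pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for score, truth in zip(scores, truths):
        if not primary_match(score, threshold):
            continue  # meta-labels are defined only on primary positives
        b = min(int(score * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(truth)
    # Laplace smoothing keeps sparsely populated bins away from 0 and 1.
    return {b: (hits[b] + 1) / (totals[b] + 2) for b in totals}

def confidence(score, meta_model, threshold=0.5, n_bins=5):
    """Estimated probability that a primary 'match' call is correct."""
    if not primary_match(score, threshold):
        return 0.0  # primary did not assert a match, so nothing to trust
    b = min(int(score * n_bins), n_bins - 1)
    return meta_model.get(b, 0.5)  # unseen bin: fall back to maximal uncertainty

# Toy labeled seed: borderline scores are often wrong, high scores rarely are.
seed_scores = [0.55, 0.58, 0.52, 0.57, 0.95, 0.97, 0.92, 0.99, 0.30]
seed_truths = [1, 0, 0, 0, 1, 1, 1, 1, 0]
model = fit_meta_model(seed_scores, seed_truths)
print(confidence(0.56, model), confidence(0.96, model))
```

Two pairs that both clear the primary threshold end up with very different confidence values, which is the information a binary match flag discards.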

LLM-Driven Schema Inference

Each new state and federal source ships records in a different format, with different field names, ordering, encoding conventions, and missingness semantics. Traditional ETL requires weeks of human inspection per source. We're researching whether a language model can infer the schema, semantic field types, and quality conventions of a previously unseen administrative dataset from a small sample of rows + minimal metadata, with no labeled training data. Approach: few-shot structure induction + cross-source semantic-type alignment, validated empirically against our 30-source corpus where the ground-truth normalization is already known. Target: reduce new-source onboarding from weeks to hours; published methodology + benchmark.

LLM · Schema Inference · Few-Shot Learning · Semantic Types

STATUS: Active research · Phase I deliverable
RELEVANT NSF TOPIC: AA4 (Knowledge & Data Management Technologies)
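One way to frame the few-shot setup is as prompt construction plus strict validation of whatever the model returns. The sketch below assumes a small, made-up semantic-type vocabulary and made-up column names, and it leaves the model call out entirely: the prompt would be sent to whichever LLM is under study, and only the reply parsing is shown.

```python
import json

# Hypothetical semantic-type vocabulary; a real one would be far larger.
SEMANTIC_TYPES = [
    "entity_name", "street_address", "state_code",
    "tax_id", "registry_id", "date", "unknown",
]

def build_prompt(sample_rows, examples):
    """Few-shot prompt: labeled columns from known sources, then the unseen sample."""
    lines = [
        "Assign each column one semantic type from: " + ", ".join(SEMANTIC_TYPES) + ".",
        "Answer with a JSON object mapping column name to type.",
        "",
    ]
    for ex_rows, ex_mapping in examples:
        lines.append("Rows: " + json.dumps(ex_rows))
        lines.append("Types: " + json.dumps(ex_mapping))
        lines.append("")
    lines.append("Rows: " + json.dumps(sample_rows))
    lines.append("Types:")
    return "\n".join(lines)

def parse_response(text, columns):
    """Accept only known columns and known types; anything else becomes 'unknown'."""
    try:
        raw = json.loads(text)
    except json.JSONDecodeError:
        raw = {}
    if not isinstance(raw, dict):
        raw = {}
    return {c: raw[c] if raw.get(c) in SEMANTIC_TYPES else "unknown" for c in columns}

# One already-normalized source serves as the few-shot example (illustrative data).
examples = [(
    [{"LEGAL_NM": "Acme Trucking LLC", "PHY_ST": "KY"}],
    {"LEGAL_NM": "entity_name", "PHY_ST": "state_code"},
)]
unseen = [{"corp_name": "Blue Ridge Medical Group", "filing_dt": "2019-04-02", "misc": "x"}]
prompt = build_prompt(unseen, examples)

# A canned reply stands in for the model's output.
reply = '{"corp_name": "entity_name", "filing_dt": "date", "misc": "favorite_color"}'
print(parse_response(reply, list(unseen[0])))
```

Restricting the output to a closed vocabulary makes inferred schemas directly comparable to a known normalization, so accuracy on held-out sources can be scored column by column.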

// ROADMAP

Where this goes.

Phase I targets a working entity-resolution pipeline with public benchmark release. Phase II scales to all 50 states + sector-specialized resolvers (healthcare, financial services, freight/logistics).

Q2 2026

NSF SBIR Phase I pitch

Project Pitch submission to NSF Topic AA4 (Knowledge & Data Management Technologies). Pre-pitch consultation with Program Director. Seed corpus construction in parallel.

Q3 2026

Phase I award + benchmark build

Hand-label 20,000 cross-source pairs spanning FMCSA↔SEC, FMCSA↔state corp, NPI↔state license. LLM schema inference prototype on 5 unseen state sources.
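Which pairs get hand-labeled matters as much as how many. In an active-learning loop, a common choice is uncertainty sampling: send annotators the pairs the current classifier is least sure about. The following is a minimal sketch of that selection step only; the pair identifiers and probabilities are invented, and the selection rule is an assumption, not the project's stated policy.

```python
def select_for_labeling(pair_probs, batch_size):
    """Uncertainty sampling: pick the unlabeled pairs whose predicted match
    probability is closest to 0.5 (ties broken by pair ID for determinism)."""
    ranked = sorted(pair_probs.items(), key=lambda kv: (abs(kv[1] - 0.5), kv[0]))
    return [pair for pair, _ in ranked[:batch_size]]

# Hypothetical classifier outputs for four unlabeled cross-source pairs.
pair_probs = {
    ("fmcsa:1", "sec:9"): 0.98,
    ("fmcsa:2", "sec:4"): 0.52,
    ("npi:7", "lic:3"): 0.41,
    ("fmcsa:5", "corp:8"): 0.03,
}
print(select_for_labeling(pair_probs, batch_size=2))
```

Confident matches and confident non-matches are skipped, so each labeled pair is spent where the classifier has the most to learn.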

Q4 2026 / Q1 2027

Resolution prototype + open benchmark

LightGBM pairwise resolver + LSH blocking, validated against held-out cross-source pairs. Open-source reference implementation + public benchmark dataset.
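Validation against the stated target (precision ≥ 0.95 at recall ≥ 0.85) amounts to asking whether any decision threshold on held-out pairs satisfies both constraints at once. A minimal sketch of that check, with invented scores and labels and hypothetical function names:

```python
def precision_recall_at(scored_pairs, threshold):
    """Precision and recall when pairs scoring >= threshold are called matches."""
    tp = sum(1 for s, y in scored_pairs if s >= threshold and y)
    fp = sum(1 for s, y in scored_pairs if s >= threshold and not y)
    fn = sum(1 for s, y in scored_pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def meets_target(scored_pairs, min_precision=0.95, min_recall=0.85):
    """Return the lowest threshold meeting both constraints, or None if none does."""
    for t in sorted({s for s, _ in scored_pairs}):
        p, r = precision_recall_at(scored_pairs, t)
        if p >= min_precision and r >= min_recall:
            return t
    return None

# Toy held-out set: (classifier score, true match label).
held_out = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.4, 1), (0.2, 0)]
print(precision_recall_at(held_out, 0.7))
print(meets_target(held_out))
```

On this toy set no threshold reaches both targets, because the one true match scored at 0.4 can only be recovered by also admitting a false positive. That is the kind of trade-off the held-out validation is meant to expose.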

2027

Phase II scale-out

Production confidence framework + LLM schema-inference pipeline applied to all 50 states. Technical paper targeting SIGKDD or CIKM.

// Honest status, April 2026

What exists today (preliminary data for Phase I): 211M source rows ingested across 40+ federal and state public-record agencies. 60M+ records aggregated and normalized within each source. Paying customers on per-dataset products. Full source provenance per delivery. SAM.gov ACTIVE (UEI HMMYRMPX6XK3, CAGE 1ZNV9), SBIR Firm registered (SBC_002676599).

Applied artifacts already shipping: Two products demonstrating that the approach works in a narrow, well-defined scope:
- Contractor Influence Map: cross-reference of SAM contractors × FEC donors × USAspending awards. 148K contractor-row dataset showing first-order entity resolution applied to government accountability data. Joins SAM legal_name to FEC donor_employer with documented match precision.
- NPI Provider Stability v1: predictive model (AUC 0.750, lift 4.07× at top decile). Demonstrates honest-methodology ML on government data: leakage-audited, modest published metrics, full feature attribution per prediction. Trained on 9.4M records.
Both are limited demonstrations of the broader research thesis (cross-source resolution + LLM schema inference at scale) and inform the Phase I work.
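For readers unfamiliar with the lift metric quoted for the NPI model, it is the positive rate among the highest-scored fraction of records divided by the positive rate overall. A minimal sketch of that definition follows; the data is invented and this is not the evaluation code behind the published figure.

```python
def lift_at_top(scores, labels, fraction=0.10):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    n = len(scores)
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_rate = sum(labels[i] for i in order[:k]) / k
    return top_rate / base_rate

# Toy example: 20 records, 4 positives, 2 of them ranked in the top decile.
scores = [1 - i / 20 for i in range(20)]
labels = [1, 1] + [0] * 8 + [1, 1] + [0] * 8
print(lift_at_top(scores, labels))
```

A lift of 1.0 means the ranking is no better than random selection; values above 1.0 mean the top-scored slice is enriched for positives by that factor.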

What does not exist yet (the Phase I research scope): Cross-source entity resolution across schema-diverse government registries. LLM-driven schema inference. Record-level confidence scoring. These are open ML research problems with no published commercial-grade solution. Seed corpus construction underway. We don't claim capabilities we don't have.

Why this is fundable: Academic record-linkage benchmarks (Wachs et al. 2021, Nature Human Behaviour; TED-data European procurement linkage) report AUC in the 0.71-0.85 range, which we treat as the realistic ceiling for honest performance on real heterogeneous-source linkage. Cross-source government-records resolution at scale, with no shared identifier across federal/state systems and no labeled training data, has not been solved publicly. Recent NSF SBIR Phase I awards (#2324507, #2125909, knowledge-graph for scientific data) suggest this falls within the agency's funding lane.

Academic collaboration welcome. Actively seeking record-linkage and LLM-schema-inference researchers for advisor roles on Phase I. Labeled seed corpus, held-out validation set, and compute budget available to qualifying collaborators. Email research@empiredatasolution.com.