Dedup at network scale: how we keep the same applicant from being sold twice

Dedup is the boring problem that everyone asks about and nobody publishes the architecture for. Here's ours.

The naïve version

Every time a lead comes in, search the database for a matching email or phone in the last 30 days. If there's a hit, decline.

This breaks fast. Applicants change phone format mid-form. They re-submit with Mr. prefixed to their name. Some networks accept country code with +, some without. Some normalize whitespace, some don't. After a year of "why didn't this match?" tickets, you give up and add a fuzzy matcher.

What we actually do

We compute three independent identity hashes per applicant at submission time:

Strong identity — normalized national ID + DOB. If two applicants share this, they're the same person, full stop.
Contact identity — normalized phone (E.164) + email (lowercased, plus-stripped). High confidence; sometimes shared (couples, roommates).
Profile identity — first name + last name + birth year + postcode. Lower confidence; useful for catching applicants who change their contact details between submissions.

Each hash is computed once and stored alongside the lead. On every new submission, we look up all three hashes against submissions from the last 30 days.
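As a rough illustration of the three hashes described above, here is a minimal sketch. The field names, the keyed SHA-256 construction, the `HASH_KEY` value, and the helper functions are all assumptions made for the example, not our production schema. It also assumes the phone number has already been normalized to E.164 and the date of birth is an ISO date string.

```python
import hashlib
import hmac
import re

# Hypothetical secret key. Using a keyed hash (HMAC) is an assumption of this
# sketch, not something the post specifies.
HASH_KEY = b"replace-with-a-real-secret"


def _digest(kind: str, *parts: str) -> str:
    """Keyed SHA-256 over the hash kind plus the normalized parts."""
    message = "\x1f".join((kind, *parts)).encode("utf-8")
    return hmac.new(HASH_KEY, message, hashlib.sha256).hexdigest()


def normalize_email(email: str) -> str:
    """Lowercase the address and drop any +tag from the local part."""
    local, _, domain = email.strip().lower().partition("@")
    return f"{local.split('+', 1)[0]}@{domain}"


def normalize_text(value: str) -> str:
    """Collapse whitespace and casefold. Real name handling is more involved."""
    return re.sub(r"\s+", " ", value).strip().casefold()


def identity_hashes(applicant: dict) -> dict:
    """Return the strong, contact and profile hashes for one submission.

    Assumes applicant["phone"] is already E.164 and applicant["dob"] is an
    ISO date string such as "1990-01-01".
    """
    national_id = re.sub(r"\W", "", applicant["national_id"]).upper()
    dob = applicant["dob"]
    return {
        "strong": _digest("strong", national_id, dob),
        "contact": _digest(
            "contact", applicant["phone"], normalize_email(applicant["email"])
        ),
        "profile": _digest(
            "profile",
            normalize_text(applicant["first_name"]),
            normalize_text(applicant["last_name"]),
            dob[:4],  # birth year only
            re.sub(r"\s+", "", applicant["postcode"]).upper(),
        ),
    }
```

Including the hash kind in the digest input keeps the three hash spaces separate, so a contact hash can never collide with a profile hash by accident.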

What we do with matches

A match on strong identity within 30 days = automatic block. Lender gets one notification of the match; they don't see the lead twice.

A match on contact identity within 7 days = pass through with a "contact-recently-seen" flag. The lender decides whether to underwrite or not.

A match on profile identity within 30 days = pass through with a softer flag and the prior submission ID for the lender to cross-reference.
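The three rules above can be sketched as a single decision function. The `Match` and `Decision` types, the `profile-possibly-seen` flag name, the inclusive window boundaries, and the precedence when several hashes match (strong, then contact, then profile) are assumptions for illustration; only the windows and the `contact-recently-seen` flag come from the rules themselves.

```python
from dataclasses import dataclass
from typing import Optional

# Look-back windows in days, as described above.
WINDOWS = {"strong": 30, "contact": 7, "profile": 30}
# Precedence when several hashes match is an assumption: strongest first.
PRECEDENCE = ("strong", "contact", "profile")


@dataclass(frozen=True)
class Match:
    kind: str                 # "strong", "contact" or "profile"
    age_days: int             # days since the prior submission
    prior_submission_id: str


@dataclass(frozen=True)
class Decision:
    action: str                                # "block" or "pass"
    flag: Optional[str] = None
    prior_submission_id: Optional[str] = None


def decide(matches: list[Match]) -> Decision:
    """Apply the strongest rule whose look-back window still covers a match."""
    live = [m for m in matches if m.age_days <= WINDOWS[m.kind]]
    for kind in PRECEDENCE:
        hits = [m for m in live if m.kind == kind]
        if not hits:
            continue
        newest = min(hits, key=lambda m: m.age_days)
        if kind == "strong":
            return Decision("block", prior_submission_id=newest.prior_submission_id)
        if kind == "contact":
            return Decision("pass", flag="contact-recently-seen")
        # Profile match: softer flag (hypothetical name) plus the prior ID.
        return Decision(
            "pass",
            flag="profile-possibly-seen",
            prior_submission_id=newest.prior_submission_id,
        )
    return Decision("pass")
```

For example, `decide([Match("contact", 10, "S2"), Match("profile", 20, "S3")])` ignores the contact match because it falls outside the 7-day window, and passes the lead with the profile flag and `S3` as the cross-reference.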

Why three hashes, not one

Because applicants are messy. A real example from one customer service ticket: the same person made three submissions across two brokers, and two of the three matched only on profile identity (they moved house and changed their phone). Matching on strong or contact identity alone would have missed both of those.

What we publish

The audit log shows: which hash matched, when, what we did. Both the inbound broker and the original receiving lender can query their own slice of this. Regulators see the full thing on request. We never expose the hashes themselves to partners — they're internal — but we expose the matched submission IDs.
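To make the partner-facing slice concrete, here is a small sketch of an audit entry and the view a broker or lender might get back. The `AuditEntry` fields and the `partner_view` function are hypothetical names chosen for the example; the point is only that the hash value stays internal while the matched submission ID is exposed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    submission_id: str
    matched_submission_id: str
    hash_kind: str        # which hash matched: strong, contact or profile
    hash_value: str       # internal only, never shown to partners
    action: str           # what we did, e.g. "block" or "pass"
    matched_at: datetime
    broker_id: str        # inbound broker of the new submission
    lender_id: str        # lender that received the original lead


def partner_view(entries: list[AuditEntry], partner_id: str) -> list[dict]:
    """Return the entries one partner may see, with the hash value removed."""
    return [
        {
            "submission_id": e.submission_id,
            "matched_submission_id": e.matched_submission_id,
            "hash_kind": e.hash_kind,
            "action": e.action,
            "matched_at": e.matched_at.isoformat(),
        }
        for e in entries
        if partner_id in (e.broker_id, e.lender_id)
    ]
```

A regulator-facing export would presumably skip the partner filter; that path is not sketched here.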

What's hard

The actual hard part isn't computing hashes. It's normalizing inputs: phone-number country detection, name canonicalization across Nordic languages, postcode formats per country. We maintain that normalization pipeline as a first-class system, separate from routing, separate from storage. It gets unit-tested with millions of historical records and a weekly canary against a known-bad input set.
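A deliberately small sketch of two of those normalizers shows where the difficulty lives. Everything here is an assumption for illustration: the default country code, the optional trunk prefix, the honorific list, and the letter-folding table are placeholders, and a production pipeline would lean on a proper phone-number library and per-country rules rather than these few lines.

```python
import re
import unicodedata

# Letters that do not decompose under NFKD and need an explicit fold.
# The chosen foldings are illustrative, not an agreed standard.
FOLD = str.maketrans({"æ": "ae", "ø": "o"})
# Hypothetical honorific list; real data needs a longer, per-language one.
HONORIFICS = {"mr", "mrs", "ms", "dr", "hr", "fru"}


def normalize_phone(raw: str, default_country_code: str = "45",
                    trunk_prefix: str = "") -> str:
    """Rough E.164 formatting for +CC, 00CC and national-format input."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        pass  # already carries a country code
    elif digits.startswith("00"):
        digits = digits[2:]  # international dialling prefix
    else:
        if trunk_prefix and digits.startswith(trunk_prefix):
            digits = digits[len(trunk_prefix):]
        digits = default_country_code + digits
    return "+" + digits


def normalize_name(raw: str) -> str:
    """Casefold, fold Nordic letters, strip accents and drop honorifics."""
    text = raw.casefold().translate(FOLD)
    # Decompose accented letters (å becomes a + combining ring), then drop
    # the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in HONORIFICS)
```

With these rules, `+45 12 34 56 78`, `0045 12345678` and `12 34 56 78` all normalize to the same string, and `Mr. Søren Åberg` matches `søren åberg`. The sketch does not reconcile spelling variants such as `Aaberg` and `Åberg`, which is exactly the kind of case the known-bad input set exists to catch.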

Where we're going

We'd like to publish the hash schemas openly so other networks can interoperate at the dedup layer. The conversations are ongoing. The biggest blocker is alignment on normalization — once everyone agrees on how to canonicalize a Danish phone number, the rest is mechanical.
