Why matching people across different datasets is so hard

This article explores the challenges of identifying the same person in different datasets and why matchmaking in disconnected databases is so hard!

Some of the most important data that organisations gather is data about people. When all the data about a single person is neatly connected by a shared ID, the full value of the data can be realised - but when it isn’t, what can you do? Connecting data about your users’ behaviour across different parts of your service is vital for truly understanding those users. But unless a common user ID was associated with all the data when it was collected, it’s impossible to match people with 100% reliability from data about them alone.

How data gets disconnected

Datasets will always be disconnected, unless something has been done to connect them. A single service will usually ask users to log in or otherwise identify themselves in some way, and so will have some kind of ID number or code for users, and all the data gathered about the user will be tied to that ID. But when services are developed independently within an organisation, or organisations merge or partner, that won’t be the case.

That can and will happen with all sorts of data, not just data about people - but it’s most damaging in this case, and there are unique challenges in connecting personal data, so that’s where we will focus in this post.

The end result is that you have multiple tables with rows for people - and no comparable unique ID field in common between them.

Matching data records

The crux of the problem is matching records from different tables that refer to the same person.

You might think this is easy if you hold common personal data such as names and email addresses. Matching on that will indeed find you a set of candidate connections between records, but this set will be both incomplete and unreliable because:

Multiple people can have the same name. If you have two John Smiths in one database and three in another, how do you know which is which? Even if you just have one John Smith in each database, can you be sure they’re the same person?
People can have multiple names. Is Joe Smith in one database the same as Joseph Smith in another? Not to mention thornier cases where one database stores “first / lastname”, the other stores “first / middle / lastname”, and we have a “Mary” “Smith Jones” in one and a “Mary” “Smith” “Jones” in another.
Email addresses are a bit better as they’re usually distinct. But not always. Email addresses are often shared among less-technical people, either within a family or a team, or between successive holders of a role. Even more commonly, one person has multiple addresses - either due to changing jobs, or through having separate work and personal addresses. And the more diverse the services you’re combining, or the longer the time period over which records were created, the higher the chance of one individual having different email addresses.
Things like National Insurance numbers are even better than email - but there’s less chance of such specific personal data having been collected, and there are still cases where people end up with more than one NI number.
Anything about a person can change - name, address, gender, sex, email address - potentially even date of birth in the rare cases where records are found to have been erroneous. Even biometrics like fingerprints and retinal patterns can change, perhaps due to injury.
And in all cases, some fields might be missing - or, worse, misspelt.
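
The pitfalls above can be sketched in a few lines of Python. This is only an illustration, not a recommended matching algorithm: the field names (`first`, `middle`, `last`, `email`), the tiny `NICKNAMES` table and the 0.85 threshold are all assumptions made up for the example, and the standard library's `difflib.SequenceMatcher` stands in for whatever string-similarity measure you might really use.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table for illustration; a real one would be much larger.
NICKNAMES = {"joe": "joseph", "bill": "william", "liz": "elizabeth"}


def normalise_name(*parts):
    """Join name parts, lower-case them, and expand known nicknames."""
    tokens = " ".join(p for p in parts if p).lower().split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)


def full_name(record):
    # Missing fields (e.g. no middle name column) are simply skipped.
    return normalise_name(record.get("first"), record.get("middle"), record.get("last"))


def same_email(a, b):
    """True only when both records have an email and they match, ignoring case."""
    ea = (a.get("email") or "").strip().lower()
    eb = (b.get("email") or "").strip().lower()
    return bool(ea) and ea == eb


def candidate_pairs(table_a, table_b, threshold=0.85):
    """Return (id_a, id_b, name_score, email_match) for every plausible pair."""
    pairs = []
    for a in table_a:
        for b in table_b:
            score = SequenceMatcher(None, full_name(a), full_name(b)).ratio()
            email = same_email(a, b)
            if score >= threshold or email:
                pairs.append((a["id"], b["id"], round(score, 2), email))
    return pairs


table_a = [
    {"id": "A1", "first": "Joe", "last": "Smith", "email": "joe@example.com"},
    {"id": "A2", "first": "Mary", "last": "Smith Jones"},
]
table_b = [
    {"id": "B1", "first": "Joseph", "last": "Smith", "email": "JOE@example.com"},
    {"id": "B2", "first": "Joseph", "last": "Smith"},
    {"id": "B3", "first": "Mary", "middle": "Smith", "last": "Jones"},
]

for pair in candidate_pairs(table_a, table_b):
    print(pair)
```

Note that record A1 comes back paired with both B1 and B2. The name alone cannot say which Joseph Smith is the right one, and even the matching email address is only evidence, not proof.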

So if all you can get by matching values is an idea of some pairs of records that might be connected, what can you actually do?

Handling uncertainty

If the only reason you’re linking records is for analytics, you can decide to go with “fuzzy”, “good enough” matches. A small fraction of incorrect matches will not distort the data too much - and if you’re lucky, the errors will roughly cancel each other out.

But you can’t afford to use those approximate matches for transactional uses. Suppose you link records in ways that users can see - perhaps they log into service A and see their data from service B inside service A. If you incorrectly match a record from service B that actually belongs to a different user, the user logging into service A will see somebody else’s service B data. At best, that is embarrassing to you. At worst, it is a data breach with legal penalties.
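
One way to keep the two uses apart is to make the purpose an explicit input when selecting matches. The sketch below is a minimal illustration under assumed names: each candidate match is a dict with a similarity `score` and a `verified` flag (meaning the link was confirmed by something stronger than field similarity), and the 0.9 threshold is arbitrary.

```python
def usable_matches(candidates, purpose, analytics_threshold=0.9):
    """Select which candidate matches are safe to use for a given purpose."""
    if purpose == "analytics":
        # Approximate matches are tolerable: a few errors only add noise.
        return [c for c in candidates
                if c["verified"] or c["score"] >= analytics_threshold]
    if purpose == "transactional":
        # Users will see the linked data, so only verified links are allowed.
        return [c for c in candidates if c["verified"]]
    raise ValueError(f"unknown purpose: {purpose!r}")


candidates = [
    {"a": "A1", "b": "B1", "score": 0.97, "verified": False},
    {"a": "A2", "b": "B3", "score": 0.80, "verified": True},
    {"a": "A3", "b": "B4", "score": 0.60, "verified": False},
]

print(len(usable_matches(candidates, "analytics")))
print(len(usable_matches(candidates, "transactional")))
```

The design choice here is that a high score never promotes a match into the transactional set; only verification does.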

The situation is pretty dire - but all is not lost.

Involve the users

Perhaps you can migrate your users from two or more separate login systems for different services to a single one. That will, in the end, certainly result in having a consistent user ID across your services - all new users will sign up with the single system, so as soon as you introduce that, the situation will slowly start improving. But what about existing records? You still need to match them to create new unified accounts for the existing users.

However, here is where connecting data about actual users of your system has a unique characteristic: you can interactively involve the users.

Imagine an existing user goes to log into one of the services you have now unified to a single sign-on system. When their username is entered, your system can recognise that this is a user from the service’s old database. But there are two cases to consider - they might have used another of your services and have a unified account, or they might not. In the former case, the user needs to be asked to log into their unified account as well as the legacy account, so we can link the two (we need the user to log into both so that attackers can’t just “claim” user accounts that aren’t really theirs). Once that is done, we can migrate the legacy account data into the new unified account.

In the latter case, where the user hasn’t already got a unified account, we need to take them through an abbreviated account creation process that mainly just migrates their legacy account data to a unified account, informing them of any differences that matter to them and letting them confirm their details are still up to date.
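
The two cases can be sketched as a single function. Everything here is hypothetical scaffolding for illustration: the `LegacyAccount` and `UnifiedAccount` classes, the boolean flags standing in for real authentication checks, and the way data is copied across are all assumptions, not a description of any particular sign-on product.

```python
from dataclasses import dataclass, field


@dataclass
class LegacyAccount:
    service: str
    legacy_id: str
    data: dict


@dataclass
class UnifiedAccount:
    user_id: str
    linked_legacy_ids: set = field(default_factory=set)
    data: dict = field(default_factory=dict)


def migrate_legacy_account(legacy, legacy_login_ok, unified=None,
                           unified_login_ok=False, new_user_id=None):
    """Link a legacy account to a unified one, creating the latter if needed."""
    if not legacy_login_ok:
        raise PermissionError("legacy login failed")
    if unified is not None:
        # Former case: the user must prove ownership of BOTH accounts,
        # so an attacker cannot claim an account that is not theirs.
        if not unified_login_ok:
            raise PermissionError("user must also log into the unified account")
        target = unified
    else:
        # Latter case: abbreviated sign-up that creates a fresh unified account.
        target = UnifiedAccount(user_id=new_user_id)
    target.linked_legacy_ids.add((legacy.service, legacy.legacy_id))
    target.data.setdefault(legacy.service, {}).update(legacy.data)
    return target


legacy = LegacyAccount("shop", "u-1001", {"orders": 3})
existing = UnifiedAccount("U-42")
linked = migrate_legacy_account(legacy, True, unified=existing, unified_login_ok=True)
created = migrate_legacy_account(legacy, True, new_user_id="U-43")
print(linked.user_id, created.user_id)
```

A real flow would also show the user what is being migrated and let them confirm their details, which this sketch leaves out.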

Authenticating without passwords

However, things are a lot less clear-cut when services that do not authenticate users with a login are involved. Perhaps some of your services date back to postal or telephone-based systems; if a user produces an account ID that they once put on a paper order form, can you be sufficiently sure it’s really them, without ever having agreed anything like a password with them? In such cases, you generally need to validate against account history. Perhaps you can consult the history of orders placed with that account ID and see if the user can confirm them. This can be challenging as the user might be relying on memory - you can’t ask them a question and expect an exact match to be typed in, but perhaps you can display the details of one of their orders alongside nine plausible randomly-generated sets of details and ask them which one they recognise. Do this five times and if they get four or more right, that might be a sufficiently high bar.
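
To judge whether "four out of five" is a high enough bar, it helps to work out how often someone guessing blindly would pass. The sketch below assumes the decoys are indistinguishable from the real order, so an impostor picks the right one with probability 1 in 10 each round, and that rounds are independent; the function name and parameters are made up for this example.

```python
from math import comb


def chance_of_guessing(rounds=5, options=10, required=4):
    """Probability that uniform random guessing gets at least `required` rounds right."""
    p = 1 / options
    return sum(
        comb(rounds, k) * p**k * (1 - p) ** (rounds - k)
        for k in range(required, rounds + 1)
    )


print(f"{chance_of_guessing():.5f}")
```

Under those assumptions the chance works out at roughly 0.046%, or about 1 in 2,200. If the decoys are less plausible than the real order, or the impostor knows something about the victim, the true figure will be higher.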

Conclusion

The key to identifying different records about the same person is how you handle the inherent uncertainty. False positives (where you incorrectly match two records that are about different people) and false negatives (where you fail to match records that ARE about the same person) each have a cost, and thorough matching processes involving manual intervention or talking to the user have their own costs; and the magnitude of those costs depends very much on your situation.

If your goal is to match records for analytical purposes, the cost of false positives is usually low, so a purely automated system that fuzzily matches records will probably be fine. On the other hand, if you are supporting someone's decision to hire the user into a position of responsibility and need to match against records to verify their identity and qualifications, the cost of a false positive could be catastrophic, and the cost of manual verification on a case by case basis is entirely justified.
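
The trade-off can be made concrete with a back-of-the-envelope calculation. Every number below is invented purely for illustration, and the model is a deliberate simplification: it assumes error rates and costs apply uniformly per candidate pair.

```python
def expected_cost(n_pairs, false_positive_rate, false_negative_rate,
                  cost_false_positive, cost_false_negative,
                  review_cost_per_pair=0.0):
    """Expected total cost of a matching process over n_pairs candidate pairs."""
    per_pair = (false_positive_rate * cost_false_positive
                + false_negative_rate * cost_false_negative
                + review_cost_per_pair)
    return n_pairs * per_pair


# Hypothetical processes: automated is error-prone but free to run,
# manual review is more accurate but costs 5 units per pair.
automated = {"false_positive_rate": 0.02, "false_negative_rate": 0.05}
manual = {"false_positive_rate": 0.001, "false_negative_rate": 0.01,
          "review_cost_per_pair": 5.0}

# Analytics: a wrong match costs very little.
analytics_auto = expected_cost(10_000, cost_false_positive=1, cost_false_negative=1, **automated)
analytics_manual = expected_cost(10_000, cost_false_positive=1, cost_false_negative=1, **manual)

# Identity verification: a false positive is hugely expensive.
vetting_auto = expected_cost(10_000, cost_false_positive=100_000, cost_false_negative=1, **automated)
vetting_manual = expected_cost(10_000, cost_false_positive=100_000, cost_false_negative=1, **manual)

print("analytics: automated cheaper?", analytics_auto < analytics_manual)
print("vetting: manual cheaper?", vetting_manual < vetting_auto)
```

With these made-up figures the automated process wins for analytics and manual review wins for vetting; with your own figures the answer may differ.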

Assessing these costs, and weighing them against the reliability of different automated techniques for finding possible matches, takes skill and experience - so if you’re caught in a matchmaking dilemma, don’t feel that you have to struggle alone - let us help you! Our data experts have a wealth of experience in handling these types of challenges and thrive on finding the right solutions. Why not contact us to discuss your specific situation now and let’s find the right way forwards together! You can also explore our Data Management services to learn more about what we do.

Author

Alaric Snell-Pym

Alaric is an engineer specialising in understanding complex problems and producing simple solutions. They have a wide range of experience implementing everything from line of business systems to distributed databases comprising thousands of nodes.