Sunday, June 29, 2025
News PouroverAI

Why Probabilistic Linkage is More Accurate than Fuzzy Matching or Term Frequency based approaches | by Robin Linacre | Oct, 2023

October 26, 2023
in Data Science & ML



How effectively do different approaches to record linkage use information in the records to make predictions?

Wringing information out of data. Image created by the author using DALL·E 3

A pervasive data quality problem is having multiple records that refer to the same entity but no unique identifier that ties these records together.

In the absence of a unique identifier such as a Social Security number, we can use a combination of individually non-unique variables such as name, gender, and date of birth to identify individuals.

To get the best accuracy in record linkage, we need a model that wrings as much information from this input data as possible.

This article describes the three types of information that are most important in making an accurate prediction, and how all three are leveraged by the Fellegi-Sunter model as used in Splink.

It also describes how some alternative record linkage approaches throw away some of this information, leaving accuracy on the table.

The three types of information

Broadly, there are three categories of information that are relevant when trying to predict whether a pair of records match:

1. Similarity of the pair of records
2. Frequency of values in the overall dataset, and more broadly measuring how common different scenarios are
3. Data quality of the overall dataset

Let’s look at each in turn.

1. Similarity of the pairwise record comparison: Fuzzy matching

The most obvious way to predict whether two records represent the same entity is to measure whether the columns contain the same or similar information.

The similarity of each column can be measured quantitatively using fuzzy matching functions like Levenshtein or Jaro-Winkler for text, or numeric differences such as absolute or percentage difference.

For example, Hammond vs Hamond has a Jaro-Winkler similarity of 0.97 (1.0 is a perfect score). It’s probably a typo.
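To make the 0.97 figure concrete, here is a minimal pure-Python sketch of Jaro-Winkler similarity. It assumes the common defaults of a 0.1 prefix scaling factor and a maximum prefix length of 4, and the function names are illustrative rather than taken from any particular library. In practice you would normally rely on a tested library implementation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity, boosted when the strings share a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


print(round(jaro_winkler("Hammond", "Hamond"), 2))  # 0.97
```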

These measures could be assigned weights and summed together to compute a total similarity score.

The approach is sometimes known as fuzzy matching, and it is an important part of an accurate linkage model.

However, using this approach alone has a major drawback, namely that the weights are arbitrary:

– The importance of different fields has to be guessed at by the user. For example, what weight should be assigned to a match on age? How does this compare to a match on first name? How should we decide on the size of punitive weights when information does not match?
– The relationship between the strength of prediction and each fuzzy matching metric has to be guessed by the user, as opposed to being estimated. For example, how much should our prediction change if the first name is a Jaro-Winkler 0.9 fuzzy match as opposed to an exact match? Should it change by the same amount if the Jaro-Winkler score reduces to 0.8?
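The sketch below illustrates why arbitrary weights are a problem. It scores two candidate records against a query using a weighted sum of per-column similarities, and shows that two equally plausible hand-picked weightings pick different "best" candidates. The records, the weights, and the helper names are all made up for illustration, and `difflib.SequenceMatcher` from the standard library is used as a stand-in similarity measure (it is not Jaro-Winkler).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of column similarities, using user-supplied weights."""
    total = sum(weights.values())
    return sum(
        w * similarity(str(rec_a[col]), str(rec_b[col]))
        for col, w in weights.items()
    ) / total


# Hypothetical records, invented for this example.
query = {"first_name": "Robin", "surname": "Hammond", "dob": "1985-03-12"}
candidate_1 = {"first_name": "Robyn", "surname": "Hamond", "dob": "1991-07-30"}
candidate_2 = {"first_name": "Bob", "surname": "Smith", "dob": "1985-03-12"}

# Two guesses at the weights; nothing in the data says which is right.
equal_weights = {"first_name": 1, "surname": 1, "dob": 1}
dob_heavy_weights = {"first_name": 1, "surname": 1, "dob": 6}

for label, weights in [("equal", equal_weights), ("dob-heavy", dob_heavy_weights)]:
    s1 = weighted_score(query, candidate_1, weights)
    s2 = weighted_score(query, candidate_2, weights)
    best = "candidate_1" if s1 > s2 else "candidate_2"
    print(f"{label}: candidate_1={s1:.2f} candidate_2={s2:.2f} best={best}")
```

With equal weights the near-identical names win; with a heavier weight on date of birth the ranking flips. Neither choice is justified by the data, which is the gap that the estimated parameters of a probabilistic model are meant to fill.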

2. Frequency of values in the overall dataset, or more broadly measuring how common different scenarios are

We can improve on fuzzy matching by accounting for the frequency of values in the overall dataset (sometimes known as ‘term frequencies’).

For example, John vs John, and Joss vs Joss are both exact matches so have the same similarity score, but the latter is stronger evidence of a match than the former because Joss is an unusual name.

The relative term frequencies of John v Joss provide a data-driven estimate of the relative importance of these different names, which can be used to inform the weights.
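As a rough illustration, the following sketch computes term frequencies for a made-up list of first names and converts them into a simple evidence score, log2(1 / term frequency). The data and the scoring function are assumptions chosen for clarity; this is not how Splink implements its term frequency adjustments.

```python
import math
from collections import Counter

# Made-up sample of first names: John is common, Joss is rare.
first_names = ["John"] * 50 + ["Joss"] * 2 + ["Mary"] * 30 + ["Anna"] * 18

counts = Counter(first_names)
total = sum(counts.values())
term_frequency = {name: n / total for name, n in counts.items()}


def exact_match_evidence(name: str) -> float:
    """Illustrative evidence (in bits) from an exact match on this value.

    The rarer the value, the less likely two unrelated records share it
    by chance, so the larger the score.
    """
    return math.log2(1 / term_frequency[name])


for name in ("John", "Joss"):
    print(name, term_frequency[name], round(exact_match_evidence(name), 2))
```

An exact match on Joss scores several bits higher than an exact match on John, even though a plain similarity measure treats both as identical.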

This concept can be extended to encompass similar records that are not an exact match. Weights can be derived from an estimate of how common it is to observe fuzzy matches across the dataset. For example, if it’s really common to see fuzzy matches on first name at a Jaro-Winkler score of 0.7, even amongst non-matching records, then if we observe such a match, it doesn’t offer much evidence in favor of a match. In probabilistic linkage, this information is captured in parameters known as the u probabilities, which are described in more detail here.
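One simple way to picture a u probability is as the share of non-matching record pairs that fall into each comparison level. The sketch below assumes a tiny made-up dataset in which every record belongs to a different person, so every pair is a non-match. The level names, the 0.7 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler are all illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Made-up first names, assumed to belong to eight different people.
first_names = ["John", "John", "John", "Jon", "Joss", "Mary", "Maria", "Anna"]


def comparison_level(a: str, b: str) -> str:
    """Assign a pair of values to one of three comparison levels."""
    if a == b:
        return "exact"
    if SequenceMatcher(None, a, b).ratio() >= 0.7:
        return "fuzzy"
    return "different"


# Every pair here is a non-match by assumption, so level shares estimate u.
pairs = list(combinations(first_names, 2))
level_counts = Counter(comparison_level(a, b) for a, b in pairs)
u = {level: n / len(pairs) for level, n in level_counts.items()}

for level, value in u.items():
    print(level, round(value, 3))
```

The more often a level occurs among non-matching pairs, the larger its u probability and the less evidence an observation at that level provides in favor of a match.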

3. Data quality of the overall dataset: measuring the importance of non-matching information

We’ve seen that fuzzy matching and term frequency-based approaches can allow us to score the similarity between records and even, to some extent, weight the importance of matches on different columns.

However, none of these techniques help quantify the relative importance of non-matches to the predicted match probability.

Probabilistic methods explicitly estimate the relative importance of these scenarios by estimating data quality. In probabilistic linkage, this information is captured in the m probabilities, which are defined more precisely here.

For example, if the data quality in the gender variable is extremely high, then a non-match on gender would be strong evidence against the two records being a true match.

Conversely, if records have been observed over a number of years, a non-match on age wouldn’t be strong evidence against the two records being a match.
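The gender and age examples can be sketched numerically. In a simple two-level comparison (agree or disagree), the weight attached to a disagreement is log2((1 - m) / (1 - u)), where m is the probability that true matches agree on the column and u is the probability that non-matches agree. The m and u values below are invented for illustration only.

```python
import math


def non_match_weight(m_agree: float, u_agree: float) -> float:
    """Weight (in bits) for a disagreement in a two-level comparison.

    m_agree: probability that truly matching records agree on the column.
    u_agree: probability that non-matching records agree on the column.
    """
    m_disagree = 1 - m_agree
    u_disagree = 1 - u_agree
    return math.log2(m_disagree / u_disagree)


# Hypothetical values: gender is recorded very reliably, so true matches
# almost always agree; age drifts because records span several years.
gender_weight = non_match_weight(m_agree=0.99, u_agree=0.50)
age_weight = non_match_weight(m_agree=0.60, u_agree=0.02)

print(f"gender mismatch: {gender_weight:.2f} bits")
print(f"age mismatch:    {age_weight:.2f} bits")
```

Both weights are negative, but the gender mismatch is far more negative than the age mismatch: high data quality makes a disagreement much stronger evidence against a match.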

Probabilistic linkage

Much of the power of probabilistic models comes from combining all three sources of information in a way that is not possible in other models.

Not only is all of this information incorporated in the prediction, but the partial match weights in the Fellegi-Sunter model enable the relative importance of the different types of information to be estimated from the data itself and hence weighted together correctly to optimize accuracy.
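A minimal sketch of how partial match weights combine is shown below. Each observed comparison level contributes log2(m / u), these weights are added to the log2 prior odds, and the total is converted back to a probability. All m and u values, the prior, and the level names are hypothetical; in a real model they would be estimated from the data, and this is not Splink's API.

```python
import math

# Hypothetical (m, u) probabilities for each column and comparison level.
PARAMS = {
    "first_name": {
        "exact": (0.80, 0.01),
        "fuzzy": (0.15, 0.02),
        "different": (0.05, 0.97),
    },
    "gender": {"exact": (0.99, 0.50), "different": (0.01, 0.50)},
    "age": {"exact": (0.60, 0.02), "different": (0.40, 0.98)},
}

# Hypothetical prior: 1 in 1,000 candidate pairs is a true match.
PRIOR = 0.001


def match_probability(observed_levels: dict, params: dict = PARAMS, prior: float = PRIOR) -> float:
    """Combine partial match weights with the prior into a match probability."""
    weight = math.log2(prior / (1 - prior))  # prior odds, in bits
    for column, level in observed_levels.items():
        m, u = params[column][level]
        weight += math.log2(m / u)  # partial match weight for this column
    odds = 2 ** weight
    return odds / (1 + odds)


all_agree = {"first_name": "exact", "gender": "exact", "age": "exact"}
age_differs = {"first_name": "exact", "gender": "exact", "age": "different"}
gender_differs = {"first_name": "exact", "gender": "different", "age": "exact"}

for label, levels in [
    ("all agree", all_agree),
    ("age differs", age_differs),
    ("gender differs", gender_differs),
]:
    print(f"{label}: {match_probability(levels):.3f}")
```

Because positive and negative weights sit on the same scale, evidence from similarity, value frequency, and data quality can be added together directly rather than balanced by hand.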

Conversely, fuzzy matching techniques often use arbitrary weights and cannot fully incorporate information from all three sources. Term frequency approaches lack the ability to use information about data quality to negatively weight non-matching information or a mechanism to appropriately weight fuzzy matches.

The author is the developer of Splink, a free and open-source Python package for probabilistic linkage at scale.



Copyright © 2023 PouroverAI News.