I worked with a client who implemented their own Python version of this to deduplicate citizen entries in a big French government database. It worked well.
Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).
With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g.:
https://pypi.org/project/rensa/
It's basically a more specialized but faster version of datasketch's MinHash, written in Rust with a little Python on top.
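For anyone who hasn't used it, the basic datasketch pattern looks roughly like this - a minimal sketch, with made-up records and a naive 3-character shingling:

    # Near-duplicate candidates with datasketch MinHash + LSH (toy example)
    from datasketch import MinHash, MinHashLSH

    def minhash_of(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for shingle in {text[i:i + 3] for i in range(len(text) - 2)}:
            m.update(shingle.encode("utf8"))
        return m

    records = {
        "r1": "jean dupont 1980-01-02 paris",
        "r2": "jean dupond 1980-01-02 paris",  # likely duplicate of r1
        "r3": "marie curie 1867-11-07 warsaw",
    }

    minhashes = {key: minhash_of(text) for key, text in records.items()}
    lsh = MinHashLSH(threshold=0.7, num_perm=128)  # Jaccard threshold for candidates
    for key, m in minhashes.items():
        lsh.insert(key, m)

    for key, m in minhashes.items():
        print(key, "->", [c for c in lsh.query(m) if c != key])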
For deduplicating people, the Fellegi-Sunter model is also a powerful approach. Splink [0] is a free Python library that implements it for big datasets. You could probably combine parts of both approaches as well. Full disclosure: I am the lead author.
I've also written up some interactive tutorials on how the method works [1] if anyone's interested.
[0] https://github.com/moj-analytical-services/splink
[1] https://www.robinlinacre.com/intro_to_probabilistic_linkage/
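To give a flavour, a minimal dedupe with Splink looks something like the below. This is only a rough sketch from memory - the data is invented, and exact class and method names vary between Splink versions, so check the docs for the current syntax:

    # Rough sketch of a Splink dedupe (approximately the Splink 4 API;
    # data and column names are invented)
    import pandas as pd
    import splink.comparison_library as cl
    from splink import DuckDBAPI, Linker, SettingsCreator, block_on

    df = pd.DataFrame([
        {"unique_id": 1, "first_name": "Jean",  "surname": "Dupont", "dob": "1980-01-02", "city": "Paris"},
        {"unique_id": 2, "first_name": "Jean",  "surname": "Dupond", "dob": "1980-01-02", "city": "Paris"},
        {"unique_id": 3, "first_name": "Marie", "surname": "Curie",  "dob": "1867-11-07", "city": "Warsaw"},
    ])

    settings = SettingsCreator(
        link_type="dedupe_only",
        comparisons=[
            cl.ExactMatch("first_name"),
            cl.LevenshteinAtThresholds("surname", 2),
            cl.ExactMatch("dob"),
            cl.ExactMatch("city"),
        ],
        blocking_rules_to_generate_predictions=[
            block_on("dob"),
            block_on("surname"),
        ],
    )

    linker = Linker(df, settings, db_api=DuckDBAPI())

    # m and u probabilities (and hence the match weights) are estimated
    # from the data itself
    linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
    linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

    # Each candidate pair gets a match weight / match probability
    predictions = linker.inference.predict(threshold_match_probability=0.9)
    print(predictions.as_pandas_dataframe())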
Interesting.
What would you say are the main differences from other approaches?
The Fellegi-Sunter model is able to estimate the importance of different types of information from the data itself (i.e. unsupervised learning).
For instance, a match on a date of birth column lends a greater weight of evidence in favour of a match than a match on first name (since dob has higher cardinality).
The method is also able to estimate weights for fuzzy matches (how much evidence in favour of a match is a close match on dob with one character difference), as well as how much evidence against a match a mismatch provides.
For instance, if you have very high data quality on gender, then a match on gender doesn't tell you much, but a mismatch on gender is quite strong evidence against the idea that two records match.
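To make that concrete, here's a toy calculation of match weights from m and u probabilities (all the numbers are invented):

    # Toy Fellegi-Sunter match weights (all probabilities invented)
    # m = P(field agrees | records are a true match)
    # u = P(field agrees | records are not a match)
    # agreement weight    = log2(m / u)
    # disagreement weight = log2((1 - m) / (1 - u))
    import math

    fields = {
        "gender": {"m": 0.99, "u": 0.50},   # high quality, low cardinality
        "dob":    {"m": 0.95, "u": 0.001},  # occasional typos, high cardinality
    }

    for name, p in fields.items():
        agree = math.log2(p["m"] / p["u"])
        disagree = math.log2((1 - p["m"]) / (1 - p["u"]))
        print(f"{name}: agreement {agree:+.2f} bits, disagreement {disagree:+.2f} bits")

    # gender: agreement +0.99 bits, disagreement -5.64 bits
    # dob:    agreement +9.89 bits, disagreement -4.32 bits

So a dob match carries far more evidence than a gender match, but a gender mismatch is strongly negative - exactly the pattern described above.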
I have a blog post here that delves into this a bit more: https://www.robinlinacre.com/fellegi_sunter_accuracy/
OK, so you get better accuracy if your datasets have obviously weighted fields. Do you pay for that in performance, and if so, how much?
The performance is pretty good because the prediction is ultimately just adding up match weights. Much of the performance is dictated by:
(1) the blocking approach you choose (how wide you cast the net in searching for matches). This is actually somewhere where minhash can be used in conjunction - there's a toy sketch of blocking below.
(2) whether you choose to use complex fuzzy matching functions, and how many - this is chosen by the user.
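To illustrate (1), blocking just means you only score pairs that share some cheap-to-compute key, rather than all n*(n-1)/2 pairs. A toy version (nothing Splink-specific, records invented):

    # Toy blocking: only compare pairs that share a blocking key
    from collections import defaultdict
    from itertools import combinations

    records = [
        {"id": 1, "surname": "Dupont", "dob": "1980-01-02"},
        {"id": 2, "surname": "Dupond", "dob": "1980-01-02"},
        {"id": 3, "surname": "Curie",  "dob": "1867-11-07"},
    ]

    def blocking_key(rec):
        # Cast the net as wide or as narrow as you like;
        # here: dob + first letter of surname
        return (rec["dob"], rec["surname"][:1])

    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    candidate_pairs = [
        (a["id"], b["id"])
        for block in blocks.values()
        for a, b in combinations(block, 2)
    ]
    print(candidate_pairs)  # [(1, 2)] - 1 pair scored instead of all 3

The looser the blocking rules (or the more of them you use), the more candidate pairs you generate, and that is usually where the runtime goes.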
There's some benchmarking results here: https://www.robinlinacre.com/fast_deduplication/
Overall it's an approach which makes a pretty good trade-off between speed and accuracy. That's why it's used by many national stats institutes (UK, US, Australia, Germany etc.) - because it's capable of working on country-population-sized datasets.
It doesn't sound like the approaches are incompatible. You can use MinHash LSH to search a large set and get a top-k list for any individual, then use a weighted average with penalty rules to decide which of those qualify as dupes or not. Weighted MinHash can also be used to efficiently add repeats to give some additional weighting.
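As a rough sketch of that pipeline with datasketch (the fields, weights and penalty rule are all made up):

    # Top-k candidate retrieval with MinHash LSH Forest, then re-score the
    # candidates with crude field-level weights and a penalty rule
    from datasketch import MinHash, MinHashLSHForest

    people = {
        "r1": {"name": "jean dupont", "dob": "1980-01-02", "gender": "m"},
        "r2": {"name": "jean dupond", "dob": "1980-01-02", "gender": "m"},
        "r3": {"name": "marie curie", "dob": "1867-11-07", "gender": "f"},
    }

    def minhash_of(rec, num_perm=128):
        m = MinHash(num_perm=num_perm)
        text = f'{rec["name"]} {rec["dob"]}'
        for shingle in {text[i:i + 3] for i in range(len(text) - 2)}:
            m.update(shingle.encode("utf8"))
        return m

    minhashes = {k: minhash_of(v) for k, v in people.items()}
    forest = MinHashLSHForest(num_perm=128)
    for key, m in minhashes.items():
        forest.add(key, m)
    forest.index()  # must be called before querying

    def score(a, b):
        # Invented weights standing in for a proper weighted model
        s = 10.0 if a["dob"] == b["dob"] else -4.0
        s += 3.0 if a["name"].split()[0] == b["name"].split()[0] else -1.0
        s += -6.0 if a["gender"] != b["gender"] else 1.0  # penalty rule
        return s

    query_key = "r1"
    for cand in forest.query(minhashes[query_key], 3):  # top-3 candidates
        if cand != query_key:
            print(cand, score(people[query_key], people[cand]))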
Wonderful library and a really great set of explainers! Thanks for sharing this. Wish I had this for quite a few past projects…
There is also gaoya (I am the author), which is also written in Rust and has Python bindings. datasketch is great, but it is not performant enough for my use case. gaoya is used in a production system for large-scale clustering: https://github.com/serega/gaoya