The Dark Art of Deduplication

We all think we know what duplicates are. But do we really?

Yes, we know duplicates are two (or more) different records that hold the same information. But as many of us have found to our cost when dealing with data, most duplicates escape the net because they aren’t identical.

First you need to assess which items you normally have available within your data.

For a B2C company, that is typically name and address, an email and a telephone number.
For a B2B company, it might be name, address, email and maybe a phone number, but this could be a direct line, reception and/or a mobile.

Then you can choose from the different methodologies available:

1. Basic sorting on fields to position information together.

This is the starting point of any deduplication.

However, if the same information has been entered in different fields, sorting will not bring those records together.

For instance, ‘Flat 1, The House’ compared to ‘Flat 1 The House’.

Also, if two records have different information, then they will not be sorted together, such as ‘Hopewiser House, 1 High Street’ compared to ‘1 High Street’.

Clearly, items that are misspelt or mistyped will not be sorted together either, so they will not be found and matched.
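To make the limits of sorting concrete, here is a minimal sketch. The field names (line1, line2) and the extra ‘Zetland Court’ record are assumptions made up for illustration; only the other addresses come from the examples above.

```python
# Hypothetical records: the same address entered in different fields,
# and the same premise with and without its building name.
records = [
    {"id": 1, "line1": "Flat 1", "line2": "The House"},
    {"id": 2, "line1": "Flat 1 The House", "line2": ""},
    {"id": 3, "line1": "Hopewiser House", "line2": "1 High Street"},
    {"id": 4, "line1": "1 High Street", "line2": ""},
    {"id": 5, "line1": "Flat 1", "line2": "Zetland Court"},
]

def sort_on_fields(rows, fields):
    """Sort records on the given fields, ignoring case."""
    return sorted(rows, key=lambda r: tuple(r[f].lower() for f in fields))

ordered = sort_on_fields(records, ["line1", "line2"])
order = [r["id"] for r in ordered]
print(order)
```

Records 1 and 2 describe the same flat but end up separated by record 5, and records 3 and 4 land at opposite ends of the list, so a check of neighbouring rows would miss both pairs.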

2. Phonetic/Soundex Matching.

This can be performed across the whole record.

Phonetic/Soundex matching generally creates a four-character key for each word to match against, based on certain rules, such as ignoring certain letters (e.g. vowels).

This is useful, especially with limited amounts of data, but taking a full address, forename and surname creates a lot of information to compare.

If there are extra words, then the algorithm to find the matches becomes more and more complex and requires a lot more scanning of all the data.
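As an illustration, the sketch below implements the classic American Soundex rules, which is one common variant and not necessarily the one any given product uses. The function name and the sample names are chosen purely for the example.

```python
# Consonant groups used by classic American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return a four-character key: first letter plus up to three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (key + "000")[:4]  # pad with zeros, then cut to four characters

print(soundex("Smith"), soundex("Smyth"))
print(soundex("Robert"), soundex("Rupert"))
```

Each pair of spellings produces the same key, which is what lets misspelt names match. Run over every word of a full name and address, though, it yields one key per word, which is where the volume described above comes from.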

3. Elemental matching.

This looks at certain elements of the data, such as just taking postcode and premise.

However, if there is a typo or data has been put in the wrong field, then this will not work.

Just taking a postcode – a simple typo can make things very different.

For example, AB is the postcode area for Aberdeen, whilst BA is Bath.

Very different places and very far apart, but the same two characters, with a huge number of postcodes at each location.

Getting these postcodes wrong can lead to missed deliveries, confused customers, and logistical headaches.

In a country where postcodes like AB12 and BA12 are both perfectly valid but point to locations over 500 miles apart, one small slip can derail an entire operation.

This just highlights how vital accurate address data really is.
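A small sketch of elemental matching on postcode and premise shows both its strength and this weakness. The postcodes are made-up values for illustration, and the field names and the two-entry area lookup are assumptions.

```python
import re

# Only the two areas named in the text; a real lookup would cover every area.
POSTCODE_AREAS = {"AB": "Aberdeen", "BA": "Bath"}

def postcode_area(postcode):
    """Return the leading letters of a postcode, e.g. 'AB' from 'AB12 3CD'."""
    match = re.match(r"[A-Z]{1,2}", postcode.strip().upper())
    return match.group(0) if match else ""

def elements(record):
    """Reduce a record to the two elements being compared."""
    postcode = record["postcode"].replace(" ", "").upper()
    premise = record["premise"].strip().upper()
    return (postcode, premise)

correct = {"postcode": "AB12 3CD", "premise": "1"}
respaced = {"postcode": "ab123cd", "premise": "1"}
mistyped = {"postcode": "BA12 3CD", "premise": "1"}

# Differences in spacing and case are absorbed by the clean-up step.
print(elements(correct) == elements(respaced))
# Two transposed letters are not, and they point at a different area.
print(elements(correct) == elements(mistyped))
print(POSTCODE_AREAS[postcode_area(mistyped["postcode"])])
```

Tidying up case and spacing is enough to match the first pair, but no amount of tidying rescues the transposed letters, because the mistyped value is itself a plausible postcode.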

4. Keys Based Matching.

A keys based matching process uses some sort of defined key or keys to help find the potential duplicates.

In an ordinary database, a simple key is often added to each record for indexing purposes, which is just a number. Clearly, there should never be a duplicate of this number.

Alternatively, a key could be a simple identifier, such as the Unique Delivery Point Reference Number (UDPRN), from the Royal Mail, which will pull data together, as long as the address has been matched (correctly).

However, there are a myriad of other ways to build a key or multiple keys to help identify duplicates, such as taking the postcode + premise.

Any company or person building a keys based matching system needs to identify the elements they require in the keys, then how they want to process those elements (in full or reduced/standardised forms).

A key is a simplified version of the information, able to unlock the full record, which should be built in a consistent way.
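The postcode + premise idea can be sketched as below. This is only one way to build a key: the separator, the clean-up rules, the field names and the sample records are all assumptions made for the example.

```python
import re
from collections import defaultdict

def build_key(record):
    """Build a standardised key from the postcode and premise."""
    postcode = re.sub(r"\s+", "", record["postcode"]).upper()
    # Keep only letters and digits so punctuation and spacing cannot differ.
    premise = re.sub(r"[^A-Z0-9]", "", record["premise"].upper())
    return f"{postcode}|{premise}"

def potential_duplicates(records):
    """Group record ids by key and keep only keys shared by several records."""
    groups = defaultdict(list)
    for record in records:
        groups[build_key(record)].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

records = [
    {"id": 1, "postcode": "AB12 3CD", "premise": "Flat 1, The House"},
    {"id": 2, "postcode": "ab123cd", "premise": "Flat 1 The House"},
    {"id": 3, "postcode": "AB12 3CD", "premise": "2"},
]

duplicates = potential_duplicates(records)
print(duplicates)
```

Because every record is reduced to its key in the same way, finding candidates becomes a lookup on a short string rather than a comparison of whole records against each other.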

So how do Hopewiser do it?

Our deduplication is based on all of the above, using a Keys Based Matching system.

Using keys reduces the amount of data you are comparing, which is especially relevant in today’s ‘big data’ world.

The keys are not based on Soundex, but have similarities. By creating keys with data fixed in certain positions, with some elements taking the best bits of Soundex, the user can sort, potentially on individual elements, and compare the relevant components of the data rather than juggling huge sequences.

This Keys Based Matching allows for different types of deduplication, such as bringing together everyone at a single address for a brochure mailing, or at an individual level, if you want to create a single customer view using disparate databases.
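The difference between address-level and individual-level deduplication can be illustrated with a generic sketch. This is not Hopewiser's key format, which is not described here; the key layouts, field names and sample records are hypothetical.

```python
from collections import defaultdict

def address_key(record):
    """Hypothetical address-level key: postcode plus premise."""
    postcode = record["postcode"].replace(" ", "").upper()
    premise = record["premise"].strip().upper()
    return f"{postcode}|{premise}"

def individual_key(record):
    """Hypothetical individual-level key: the address key plus the full name."""
    name = f"{record['forename']} {record['surname']}".upper()
    return f"{address_key(record)}|{' '.join(name.split())}"

def cluster(records, key_fn):
    """Group record ids that share a key, in first-seen order."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record["id"])
    return list(groups.values())

records = [
    {"id": 1, "forename": "Jane", "surname": "Smith",
     "postcode": "AB12 3CD", "premise": "1"},
    {"id": 2, "forename": "John", "surname": "Smith",
     "postcode": "AB12 3CD", "premise": "1"},
    {"id": 3, "forename": " JANE", "surname": "smith",
     "postcode": "ab12 3cd", "premise": "1"},
]

households = cluster(records, address_key)
individuals = cluster(records, individual_key)
print(households)
print(individuals)
```

At address level all three records fall into one group, which suits a single brochure per household. At individual level the two Jane Smith records merge while John Smith stays separate, which is closer to what a single customer view needs.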

Hopewiser concentrate on the address as the main key, since this is a core component of having something delivered, gaining credit or being insured.

In summary…

On the face of it, the basic idea behind locating duplicates is fairly simple, but as you can see, there are a large number of potential ways to do it, even within a keys based system.

Depending on the data and the desired outcome, the best way to do it will vary, and it would be complex to do without relevant software tools.

Read the full white paper here.