UPRN address matching algorithm: Difference between revisions

From Discovery Data Service
Jump to navigation Jump to search
Line 26: Line 26:
It does not take long to show the fundamental problem with address matching.
It does not take long to show the fundamental problem with address matching.


Consider the following candidate and standard addresses:
Consider the following candidate and standard addresses
<pre style="color:blue">


Candidate:Flat 1,15 high street, YO15 5TG
Candidate:Flat 1,15 high street, YO15 5TG


Standard:Flat 1,15 high street, YO15 5TG
Standard:Flat 1,15 high street, YO15 5TG
</pre>


By any computable measure, e.g. length of string, character position match, it can be seen that these two addresses are identical. One can easily deduce from this that there will be a high level of confidence that they refer to the same location. Also, as it is not possible for any other address to be more similar, a sound judgement can be made that this is the most likely from all of the addresses, ''except'' for the other identical addresses in the set.
By any computable measure, e.g. length of string, character position match, it can be seen that these two addresses are identical. One can easily deduce from this that there will be a high level of confidence that they refer to the same location. Also, as it is not possible for any other address to be more similar, a sound judgement can be made that this is the most likely from all of the addresses, ''except'' for the other identical addresses in the set.

Revision as of 08:27, 4 June 2020

Overview

Address matching algorithm uses human semantic based judgements as it's matching technique supported by a few machine based algorithms such as the Levenshtein distance algorithm

Firstly, some definitions of terms used

Term Description
Candidate address The address string submitted by a user or a subscriber system for matching
Standard address An address that an organisation, considered to be an authority, has stated as referring to a real location.

Matching a candidate address to a standard address consists of a process whose objective is to reach a high level of confidence that the candidate address refers to the same location as the standard address.

When attempting to match a candidate address to one standard address from a set of standard addresses there are two objectives. If both objectives are achieved, the address is said to be matched. The objectives are

  1. To reach a high level of confidence that the candidate address refers to the same location as the standard address.
  2. To judge that the standard address that has been matched to, is the most likely from all of the addresses in the set to refer to the same location.

It can be seen that the objectives include two relative measures, confidence and judgement. A question arises as to whether it is possible for these measures to be mathematically or statistically based, or not.

It does not take long to show the fundamental problem with address matching.

Consider the following candidate and standard addresses


Candidate:Flat 1,15 high street, YO15 5TG

Standard:Flat 1,15 high street, YO15 5TG

By any computable measure, e.g. length of string, character position match, it can be seen that these two addresses are identical. One can easily deduce from this that there will be a high level of confidence that they refer to the same location. Also, as it is not possible for any other address to be more similar, a sound judgement can be made that this is the most likely from all of the addresses, except for the other identical addresses in the set.

Consider the following:

Candidate:Flat 1 ,15 high street YO15 5TG

Standard:Flat 1, 15 high street YO15 5TG

One can see that they are different. Whilst both have the same characters, the position of character 7 and 8 have been transposed. Yet a human being would almost certainly say they refer to the same place. Also, unless there is another address that is exactly the same as the candidate, the match is the most likely.

Consider the following

Candidate:Flat 1,15 high street YO15 5TG

Standard:Flat 11, 5 high street YO15 5TG

Like the above example there has been a transposition at position 7 and 8. Yet they are obviously different addresses. They are very unlikely to refer to the same location unless there were no other flats or numbers on the street. Even If there were no other numbers or flats on the street then the confidence level may still only be moderate. Not enough to match?

So it can be seen that in fact it is semantics that drive matching judgements and not just positional variations and character mismatches. Likewise:

Candidate :15 Flat a High Street YO15 5TG

Standard:Flat a 15 high Street YO15 5TG

These two addresses are very likely to be the same unless a closer fit can be found. A closer fit is NOT a closer fit simply by character matching.

Standard:1 Flat 5a high street YO15 5TG is also quite close but clearly semantically different.

So it can be seen that in nearly all cases, when comparing one address with another, or an address against a set of addresses, it is the semantic interpretation of the address that wins out. Semantic interpretation is still more reliable than AI for language-based judgements.

It follows that address matching rules are nothing more than trying different addresses on for size and see which one a human being thinks means the same or a similar thing.

What does the computer do?

Firstly, we know from our own experience that words that are similar or the same words in different orders often mean the same thing. We also know that letters and numbers can be transposed without too much loss of meaning. We also know that misspelled words when corrected usually mean the same thing.

For example, we make a judgment that the term flr, in the context of an address, is likely to mean ‘floor’. In a cookbook though it might mean “flour”. Also we know that the same meaning can have different words or spelling such as 1st, and first. Often different punctuations means more or less the same 5/6 and 5-6 etc.

Knowing these rules means we can use the computer to manipulate words without changing the meaning as long as the manipulation takes place in reasonable bounds.

Rules can be quite clever. Knowing that a “top floor flat” is more likely to be the same as “flat c” from a list of a, b, c than “flat a”, gets a preference from a set of options.

Addresses have more than a dozen semantically different components. Name of a flat, number of a flat, number and letter of a flat, range of numbers for a flat, a building name, a building number, a building number and letter, a range of numbers with or without letters, a dependent thoroughfare, a street, a dependent locality, a locality, a town, a city and a post code. On top of this there are the descriptions in relation to the front or back of the building, the level within a building or whether facing east or west, north or south. There are many words to describe flat like units, maisonnettes, studios, houses, apartments and so on.

With address labels with up to around 10-12 semantically different field meanings and with human tendency to place words in the wrong fields and in the wrong order in the wrong fields, or inadvertent field separators means that field allocations cannot be relied on.

With each manipulation comes a human judgement. It is only human judgements that can be used to match addresses.

There are a few useful mathematical techniques. Pluralisation and de-pluralisation. Use of Levenshtein distance algorithm with allowable distances roughly proportional to the length of the words can help. Partial matching of words with front part matching (as humans often get the end of phrases wrong rather than the start) helps also.

However, these are just techniques to speed up what would otherwise be a manual process relying in eyesight, concentration and a high boredom threshold.

Conclusion. A computer only speeds up the process of manipulation that would otherwise by done by a human being, each manipulation being specifically undertaken with a view to enable a visual check that would result in a match.

If a particular manipulation falls short of something that a human knows would match, the manipulation is abandoned and another one tried. There is no useful machine-based algorithm based on a purely mathematical function. Whilst a mathematical function like transpositions may highlight likely character matches (and there is an association between character matching and semantic meaning) relying only this approach would reduce possible semantic matches achieved by larger distance comparisons.

The most likely successful algorithms are likely to be those that will have used pure persistence to capture and try on as many of the fits as possible.

To use an analogy. If you care about getting the right dress to be married in, be prepared to spend a lot of time and a lot of money trying on many dresses for size. Take a friend to point out the obvious problems. There is no other way.

Terms used

This table lists the main terms used in this specification

Term Description
Subscriber An organisation (or system that is operating on behalf of an organisation) that is seeking to access data from the Discovery service
Candidate address The address string submitted by a user or a subscriber system for matching
UPRN Unique property reference number as supplied by the ordnance survey organisation
Address based premium The set of databases as provided by the Ordnance survey holding the UPRNs and several addresses for each
Equivalent match The UPRN is deemed to be equivalent to the property as described by the submitted address
Sibling match The UPRN is assigned to a property nearby e.g. next door
Sub-property match The UPRN is assigned to a “higher level property than the submitted address.

For example if the submitted address wasflat 1 and the UPRN was assigned to flat then this would be a supra-property match

Sub property match The UPRN is assigned to a lower levelproperty than the submitted address.

For example if the submitted address was flat 1 and the UPRN was assigned to flat 1 a then this would be a sub property match

DPA file The Post office Delivery point address file
LPI file The Local Property identifier file