UPRN address matching algorithm

From Discovery Data Service
Revision as of 09:00, 4 June 2020 by DavidStables (talk | contribs)
Jump to navigation Jump to search

Overview

Address matching algorithm uses human semantic based pattern recognition, manipulation and matching judgement as it's matching technique, supported by a few machine based algorithms such as the Levenshtein distance algorithm

Firstly, some definitions of terms used:

Term Description
Candidate address The address string submitted by a user or a subscriber system for matching
Standard address An address that an organisation, considered to be an authority, has stated as referring to a real location.

Matching a candidate address to a standard address consists of a process whose objective is to reach a high level of confidence that the candidate address refers to the same location as the standard address.

When attempting to match a candidate address to one standard address from a set of standard addresses there are two objectives. If both objectives are achieved, the address is said to be matched. The objectives are:

  1. To reach a high level of confidence that the candidate address refers to the same location as the standard address.
  2. To judge that the standard address that has been matched to, is the most likely from all of the addresses in the set to refer to the same location.

It can be seen that the objectives include two relative measures, confidence and judgement. A question arises as to whether it is possible for these measures to be mathematically or statistically based, or not.

It does not take long to show the fundamental problem with address matching.

Consider the following candidate and standard addresses

Candidate   :Flat 1,15 high street, YO15 5TG
Standard    :Flat 1,15 high street, YO15 5TG

By any computable measure, e.g. length of string, character position match, it can be seen that these two addresses are identical. One can easily deduce from this that there will be a high level of confidence that they refer to the same location. Also, as it is not possible for any other address to be more similar, a sound judgement can be made that this is the most likely from all of the addresses, except for the other identical addresses in the set.

Consider the following

Candidate  :Flat 1 ,15 high street YO15 5TG
Standard   :Flat 1, 15 high street YO15 5TG 

One can see that they are different. Whilst both have the same characters, the position of character 7 and 8 have been transposed. Yet a human being would almost certainly say they refer to the same place. Also, unless there is another address that is exactly the same as the candidate, the match is the most likely.

Consider the following:

Candidate  :Flat 1,15 high street YO15 5TG
Standard   :Flat 11, 5 high street YO15 5TG

Like the above example there has been a transposition at position 7 and 8. Yet they are obviously different addresses. They are very unlikely to refer to the same location unless there were no other flats or numbers on the street. Even If there were no other numbers or flats on the street then the confidence level may still only be moderate. Not enough to match?

So it can be seen that in fact it is semantics that drive matching judgements and not just positional variations and character mismatches. Likewise:

Candidate :15 Flat a High Street YO15 5TG
Standard  :Flat a 15 high Street YO15 5TG

These two addresses are very likely to be the same unless a closer fit can be found. A closer fit is NOT a closer fit simply by character matching.

Candidate :15 Flat a High Street YO15 5TG
Standard  :1 Flat 5a high street YO15 5TG

These are also quite close but clearly semantically different.

So it can be seen that in nearly all cases, when comparing one address with another, or an address against a set of addresses, it is the semantic interpretation of the address that determines the match. Human based semantic interpretation is still more reliable than AI for language-based judgements.

It follows that address matching rules are nothing more than trying different addresses on for size and see which one a human being thinks means the same or a similar thing.

Pattern recognition and manipulation

What does the computer do? Basically it computerises the process of human pattern recognition and human suggested string manipulation.

Firstly, we know from our own experience that words that are similar or the same words in different orders often mean the same thing. We also know that letters and numbers can be transposed without too much loss of meaning. We also know that misspelled words when corrected usually mean the same thing.

For example, we make a judgement that the word 'flr', in the context of an address, is likely to mean ‘floor’. In a cookbook though it might mean “flour”. Also we know that the same meaning can have different words or spelling such as 1st, and first. Often different punctuation means more or less the same '5/6' and '5-6' etc.

From this knowledge we can set pattern recognition rules. We can say that full stops can be removed or '/' replaced with '-' because we know that they are unlikely to affect the semantics.

However, there are always twists. It is reasonable to infer that 'st' is likely to be short for 'street'. However, the phrase 'St Katherine's way' implies a saint, not a street. Therefore the manipulation rules must also take account of potentially wrong manipulations. In this case, a 'street' can be recognised by checking the resulting expansion e.g. ('high st -> high street') against a standard street index, or the position of the 'st' in the string. These patterns and manipulations are coded and validated and adjusted if wrong manipulations are discovered.

Human judgement based manipulations can result in false positives. As the algorithms are developed the introduction of false positives must be checked regularly and the manipulation rules adjusted.

Rules,based on knowledge of the world can be quite clever. Knowing that a “top floor flat” is more likely to be the same as “flat c” from a list of a, b, c than “flat a”, gets a preference from a set of options.

Address components

Addresses have more than a dozen semantically different components. Name of a flat, number of a flat, number and letter of a flat, range of numbers for a flat, a building name, a building number, a building number and letter, a range of numbers with or without letters, a dependent thoroughfare, a street, a dependent locality, a locality, a town, a city and a post code. On top of this there are the descriptions in relation to the front or back of the building, the level within a building or whether facing east or west, north or south. There are many words to describe flat like units, maisonettes, studios, houses, apartments and so on.

With address labels with up to around 10-12 semantically different field meanings and with human tendency to place words in the wrong fields and in the wrong order in the wrong fields, or inadvertent field separators means that field allocations cannot be relied on. manipulation rules take account of wrong allocations. In some cases users submit address with everything in one field. Manipulation must take account of no user determined address fields at all.

With each manipulation comes a human judgement that the resulting string is still a true reflection of the candidate and that a match to a standard would be correct. It is only human judgements that can be used to match addresses.

There are a few useful mathematical techniques. Pluralisation and de-pluralisation. Use of Levenshtein distance algorithm with allowable distances roughly proportional to the length of the words can help. Partial matching of words with front part matching (as humans often get the end of phrases wrong rather than the start) helps also.

However, these are just techniques to speed up what would otherwise be a manual process relying in eyesight, concentration and a high boredom threshold.

Conclusion on approach

A computer only speeds up the process of pattern recognition and string manipulation that would otherwise by done by a human being, each manipulation being specifically undertaken with a view to enable a visual check that would result in a match.

If a particular manipulation falls short of something that a human knows would match, the manipulation is abandoned and another one tried. The usual approach is to start with part of the address that seems to match, make small manipulations initially, then if no matches are found, increase the degree of manipulation. There is no useful machine-based algorithm based on a purely mathematical functions. Whilst a mathematical function like transpositions may highlight likely character matches (and there is an association between character matching and semantic meaning) relying only this approach would reduce possible semantic matches achieved by larger distance comparisons using human judgement.

The most significant problem is what to do when manipulations build on manipulations that are already dubious. I this case the level of confidence drops at this point and the process can start again from a different starting point, repeated until the algorithm author has judged that one set of manipulations give higher confidence than another. This results in a series of rankings for each set of manipulations. If the rankings are in the wrong order, then a premature match might be made, resulting in a false positive match. Adjusting the ranking uses the same type of judgement as adjusting the manipulation.

Algorithms

Terms used

This table lists the main terms used in this specification

Term Description
Subscriber An organisation (or system that is operating on behalf of an organisation) that is seeking to access data from the Discovery service
Standard address An address provided by an authoratitive body that has got an assured link to a UPRN
Candidate address The address string submitted by a user or a subscriber system for matching
UPRN Unique property reference number as supplied by the ordnance survey organisation
Address based premium The set of databases as provided by the Ordnance survey holding the UPRNs and several addresses for each
Equivalent match The UPRN is deemed to be equivalent to the property as described by the submitted address
Sibling match The UPRN is assigned to a property nearby e.g. next door
Sub-property match The UPRN is assigned to a “higher level property than the submitted address.

For example if the submitted address wasflat 1 and the UPRN was assigned to flat then this would be a supra-property match

Sub property match The UPRN is assigned to a lower levelproperty than the submitted address.

For example if the submitted address was flat 1 and the UPRN was assigned to flat 1 a then this would be a sub property match

DPA file The Post office Delivery point address file
LPI file The Local Property identifier file