What’s address matching and why do we need it?
Departments, agencies and other service providers will often request and maintain their own address information. However, each organisation will capture slightly different information about an address or may store it in a particular format to suit their needs.
This means that address data is often inconsistent, is of varying quality, and can be held in multiple places. For example, Saltburn-by-the-Sea can be written in many different ways, including “Saltburn” and “Saltburn-by-sea” (with or without hyphens), and is often attributed to different counties.
Matching addresses is much easier if they’re written in the same way, so we need a process where lots of existing data can be matched against something definitive and trustworthy. That’s where address matching comes in.
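To make the idea concrete, here is a minimal sketch of matching differently written place names against a definitive list. It is an illustration only, not the approach any of the teams use: the reference list, the function names and the 0.5 similarity cutoff are all assumptions chosen for the example.

```python
import difflib
import re

# Hypothetical reference list of definitive place names (illustrative only).
REFERENCE_PLACES = ["Saltburn-by-the-Sea", "Staithes", "Redcar"]


def normalise(text: str) -> str:
    """Lower-case the text and turn hyphens and punctuation into single spaces."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def match_place(raw: str, cutoff: float = 0.5):
    """Return the closest reference place name, or None if nothing is close enough.

    The cutoff is an arbitrary assumption; a real service would need to tune it.
    """
    lookup = {normalise(name): name for name in REFERENCE_PLACES}
    candidates = difflib.get_close_matches(
        normalise(raw), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[candidates[0]] if candidates else None


print(match_place("Saltburn by sea"))
print(match_place("SALTBURN-BY-SEA"))
print(match_place("London"))
```

Normalising first means hyphenated and unhyphenated spellings are compared on equal terms before any fuzzy comparison happens. A loose cutoff like this one would produce false matches on real data, which is part of why matching is harder than it looks.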
Matching addresses is tricky. It’s carried out across government for a range of reasons, such as processing addresses found as text in existing data and documents, or exchanged between systems as part of using GOV.UK Verify.
Transforming government’s use of address data together
Since so many government services and operations use addresses, this isn’t something for us to tackle in isolation.
We’ve been working with colleagues from other departments and recently held a small workshop to share insights, highlight challenges and coordinate our efforts. We were joined by teams from the Office for National Statistics (ONS), HM Revenue and Customs, Valuation Office Agency, Ministry of Justice, and the Health and Safety Executive (HSE).
We kicked the day off with a show-and-tell of what each team was working on and why address matching is so important. It was fascinating to see the progress the HSE has made with their matching algorithm and learn about the extraordinary challenge the ONS are tackling in refining high-quality address information to support the 2021 national census.
The afternoon borrowed from the lean coffee method to give teams an opportunity to discuss the issues most pressing to them. We covered everything from the ways of validating an address match to the more philosophical point of what we mean when we talk about an ‘address’.
What we learnt
This takes us back to the address matching workshop.
First of all, we learnt that we’re all trying to achieve the same thing and we’re looking to go about it in a similar way. It’s much easier to match messy addresses to those that have already been cleansed and corrected. This means that part of our effort needs to be directed towards developing an authoritative set of address data.
As it currently stands, there’s a lot of manual verification involved in address matching and human beings find it much easier to spot errors in words (such as street and place names) than they do in numbers (such as postcodes). Not only will authoritative address data make this manual verification easier, but the workshop also highlighted that it will help the development of better automated processes to reduce the need for intensive manual effort.
We also realised that we need feedback loops to tidy up address data as efficiently as possible. This means that end users providing or searching for an address within a service can correct errors at the point of entry, and those corrections can then be fed back into an address database. With approximately 30 million addresses in the UK, this means the quality of address data will improve the more it is used.
Finally, we discovered that additional information can actually make addresses less specific and even more difficult to match. For instance, streets in villages are often also attributed to a nearby town to avoid having a blank “town” line in an address, such as “Staithes” being sometimes attributed to “Saltburn”, even though it’s 9 miles away and falls within a different local authority. Matching an address such as “High Street, Staithes, Saltburn” can be harder than the equally unique “High Street, Staithes” because there technically isn’t any such place. In addition to this, the more information requested or maintained, the more opportunity there is to introduce error.
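One way to picture the problem is a matcher that tries the address as written and, if that fails, retries with the trailing parts removed. This is a rough sketch under stated assumptions rather than a recommended technique: the reference set and the function name are hypothetical, and the comparison is a plain exact match on lower-cased text.

```python
# Hypothetical reference data: normalised "street, locality" strings (illustrative only).
REFERENCE_ADDRESSES = {"high street, staithes"}


def match_with_fallback(raw: str):
    """Try the address as given, then retry with trailing parts removed one at a time."""
    parts = [p.strip().lower() for p in raw.split(",") if p.strip()]
    while len(parts) >= 2:
        candidate = ", ".join(parts)
        if candidate in REFERENCE_ADDRESSES:
            return candidate
        parts = parts[:-1]  # drop the last component, e.g. a wrongly attributed town
    return None


print(match_with_fallback("High Street, Staithes, Saltburn"))
print(match_with_fallback("High Street, Staithes"))
print(match_with_fallback("High Street, Whitby"))
```

The extra “Saltburn” line only matches after it has been thrown away, so it added work without adding precision. Dropping parts blindly could also create false matches in real data, so this is a way of seeing the issue, not a fix for it.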
We’ve already started working on a shared vocabulary. ‘Address’ can mean slightly different things in different domains so it’s important to know whether we’re talking about a building, a Unique Property Reference Number (UPRN), a boundary, or the four lines and a postcode. We now have a collection of address terms that we’ll continue to iterate until we have agreed terminology across the group.
If we’re to crack address matching, we agreed that we’re going to need a comprehensive suite of test cases that exemplify some of the more difficult addresses to match. But this requires a high level of manual effort so we’ve created a public GitHub repository where each team can store and share these examples. It cuts down on the time and energy required and means that if one of the teams is able to successfully match one of the test cases, we can all learn from the approach.
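As an illustration of how shared test cases could be used, the sketch below runs a matcher over a small list of examples and reports the ones it gets wrong. The format of the shared repository isn’t described in this post, so the case structure, the function names and the toy matcher are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical test cases; the real repository may store examples differently.
TEST_CASES = [
    {"input": "High Street, Staithes, Saltburn", "expected": "High Street, Staithes"},
    {"input": "High Street, Staithes", "expected": "High Street, Staithes"},
]


def run_cases(matcher: Callable[[str], Optional[str]], cases):
    """Return the cases where the matcher's output differs from the expected address."""
    failures = []
    for case in cases:
        actual = matcher(case["input"])
        if actual != case["expected"]:
            failures.append({**case, "actual": actual})
    return failures


def naive_matcher(raw: str) -> Optional[str]:
    """Toy matcher that only tidies whitespace around commas."""
    return ", ".join(part.strip() for part in raw.split(","))


for failure in run_cases(naive_matcher, TEST_CASES):
    print(failure["input"], "->", failure["actual"])
```

Because the harness takes any function from a string to a matched address, each team could run its own matcher against the same cases and compare which of the difficult examples it handles.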
There’s still a lot of work to be done and we’re already thinking about what we can cover in a follow-up session. The address matching workshop helped establish a community within the field and now all participants have a great forum for sharing problems and coming up with solutions together. If you’re working in government and you’d like to get involved in the community then just send me an email.