https://data.blog.gov.uk/2016/10/28/technical-features-of-a-register/

Technical features of a register

On the registers team, we are building a product that provides teams with data that is good enough to build services on.  Registers are defined by certain characteristics such as being live data and being the only authoritative list for a thing. We thought it would be worthwhile exploring how these characteristics are actually implemented.

Technology and process

The operation of registers is underpinned by both technology and process. Not all of the characteristics of a register are linked to technical implementation.  Some are purely process-based.  For example, characteristic #1 is “Registers are canonical and have a clear reason for their existence”.  There is nothing stopping someone from building a system which has all the technical features of a register but which doesn’t have any reason to exist or is not canonical.  However, such a system would not satisfy the criteria for the register.gov.uk domain.  The requirement that registers have a purpose is upheld by the process for creating new registers.

On the other hand, characteristic #5 is “Registers are able to prove integrity of record”.  This characteristic cannot be fulfilled through policy or process alone: the underlying technology of registers must provide features to generate cryptographic proofs.

In this post, we will distinguish between register characteristics which are supported by business processes, and those which are supported by the features of the technical implementation.

Registers are lists of data

Since registers are lists of data, and most data technologies support storing lists of data, it seems very easy to imagine a register as a table in a relational database, or as a simple key/value store, or as a simple triplestore.  Your favourite data storage technology could easily store the contents of a register.

Furthermore, each register has a constrained set of fields that can be used.  For example, the country register has fields such as `country`, `name`, and `official-name`.
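
As a rough illustration of a constrained field set, here is a minimal Python sketch. It assumes the country register allows only the three fields named above (the real register may define more), and the helper names `unexpected_fields` and `validate` are hypothetical, not part of any registers software.

```python
# Assumption: only the three fields named in the post are listed here;
# the real country register may define others.
COUNTRY_FIELDS = frozenset({"country", "name", "official-name"})


def unexpected_fields(item, allowed=COUNTRY_FIELDS):
    """Return the field names in `item` that the register does not define."""
    return set(item) - set(allowed)


def validate(item, allowed=COUNTRY_FIELDS):
    """Return `item` unchanged, or raise if it uses an undefined field."""
    extra = unexpected_fields(item, allowed)
    if extra:
        raise ValueError(f"fields not defined for this register: {sorted(extra)}")
    return item


# Made-up values, for illustration only.
ok_item = {"country": "ZZ", "name": "Exampleland"}
bad_item = {"country": "ZZ", "capital": "Example City"}
```

Calling `validate(ok_item)` returns the item, while `validate(bad_item)` raises a `ValueError` because `capital` is not in the allowed set.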

Process: Registers are reliable

Our introductory guidance on registers says that each register is the most reliable list of its kind. Why might data not be reliable? What causes problems with reliability?

One problem is that data might not be authoritative: it might not be published by the entity responsible for maintaining that data.  For example, a Land Registry title might refer to a limited company by name, but the body officially responsible for incorporating (and dissolving) limited companies is Companies House. That makes Companies House the authoritative source of this data.

Non-authoritative data is, by definition, duplicated.  Whenever data is duplicated, it introduces possibilities for error: typos, transcription errors, and so on.  Furthermore, because the non-authoritative publisher is not directly responsible for this data, it is unlikely to notice or fix these errors.  Non-authoritative data tends to become unreliable over time.

To fix this, custodians should publish only data that they are directly responsible for: they should publish minimal datasets.  Where there is a desire to publish data over which another body has authority, this should be published as a link to the authoritative source rather than as a non-authoritative copy.  For example, rather than publishing the name of a limited company, you could publish a link to the Companies House URL for that company.  Or you could publish the company number, which users can then use as a unique identifier to look up the company records at Companies House.

Therefore, in order to fulfil the policy that registers are reliable, registers must support linking to data held by other organisations.

Feature: Registers support linking between organisations

Registers must support linking data together.  Links are a way of removing duplication.  For example, in the relational database world, duplication can be removed from a table by a process of normalisation: breaking the large table up into multiple small tables, linked to each other using foreign key references.

Registers must support linking to data held by another organisation.  For example, a register of approved Digital Marketplace suppliers (maintained by the Crown Commercial Service) might link to the register of limited companies (maintained by Companies House).  The register of Digital Marketplace suppliers is a list showing which companies may trade on Digital Marketplace, but it defers to the authority of the register of companies to show the official names and officers of a company.
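
The idea of deferring to another register can be sketched in a few lines of Python. This is only an illustration: the two dictionaries stand in for registers, and every identifier, field name and value in them is made up.

```python
# Stand-in for the authoritative register of companies (hypothetical data).
COMPANIES = {
    "01234567": {"company": "01234567", "name": "Example Widgets Ltd"},
}

# Stand-in for a supplier register. It stores only the company number,
# which acts as a link, and never a copy of the company name.
SUPPLIERS = {
    "S1": {"supplier": "S1", "company": "01234567"},
}


def resolve_company(supplier_id, suppliers=SUPPLIERS, companies=COMPANIES):
    """Follow the link from a supplier record to the authoritative company record."""
    company_number = suppliers[supplier_id]["company"]
    return companies[company_number]


name_before = resolve_company("S1")["name"]

# When the authoritative source changes, the supplier register needs no edit.
COMPANIES["01234567"] = {"company": "01234567", "name": "Example Widgets Holdings Ltd"}
name_after = resolve_company("S1")["name"]
```

Because the supplier record holds a link rather than a copy, there is no second copy of the name that could drift out of date.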

Feature: Registers provide guarantees of integrity

Registers have a cryptographic proof of integrity.  If you have a record from a register, you should have some way of verifying that this data comes from a register, and of demonstrating that nobody has tampered with the record, through malice or otherwise.

We have previously written about our exploration of integrity guarantees.
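
To give a flavour of what tamper detection can look like, here is a minimal Python sketch that compares a SHA-256 hash of a record with a hash obtained earlier from a trusted source. This is an assumption made for illustration, not a description of the proof mechanism registers actually use, which this post does not detail. The serialisation rule (sorted keys, compact JSON) and the function names are also hypothetical.

```python
import hashlib
import json


def item_hash(item):
    """Hash a record using a fixed serialisation so equal records hash equally."""
    # Assumption: sorted keys and compact separators are our own canonical form.
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return "sha-256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(item, trusted_hash):
    """Check a record against a hash that was obtained from a trusted source."""
    return item_hash(item) == trusted_hash


# Made-up record, for illustration only.
original = {"country": "ZZ", "name": "Exampleland"}
trusted = item_hash(original)

tampered = dict(original, name="Somewhere Else")
```

Here `is_untampered(original, trusted)` is true and `is_untampered(tampered, trusted)` is false. A bare hash only detects changes; showing that a record really comes from a given register also requires a trustworthy way to obtain the hash itself.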

Feature: Registers are append-only

An item in a register is never modified; instead, a new entry is added to the register marking an update to an existing record.

This also means that a register contains all of its historical data.  For example, at the time of this blog post, the Gambia has had three different versions in the country register.  The country register provides pages for the current version of the record as well as a list of all versions of the record.
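
The append-only behaviour can be sketched as a small in-memory Python class. This is a toy model under our own assumptions (entry numbers start at 1, records are plain dictionaries), with hypothetical class and method names and made-up data; it is not the registers implementation.

```python
class AppendOnlyRegister:
    """Toy register: entries are only ever added, never changed or removed."""

    def __init__(self):
        self._entries = []  # list of (entry_number, key, item) tuples

    def append(self, key, item):
        """Add a new entry for `key` and return its entry number."""
        entry_number = len(self._entries) + 1
        # Store a copy so later changes to the caller's dict cannot alter history.
        self._entries.append((entry_number, key, dict(item)))
        return entry_number

    def versions(self, key):
        """Return every version of the record for `key`, oldest first."""
        return [dict(item) for _, k, item in self._entries if k == key]

    def current(self, key):
        """Return the latest version of the record for `key`."""
        history = self.versions(key)
        if not history:
            raise KeyError(key)
        return history[-1]


register = AppendOnlyRegister()
register.append("ZZ", {"country": "ZZ", "name": "Exampleland"})
register.append("ZZ", {"country": "ZZ", "name": "New Exampleland"})
```

After the second `append`, `register.current("ZZ")` returns the updated record while `register.versions("ZZ")` still returns both versions.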

Not a feature: Registers do not provide ad hoc querying

One feature that registers do *not* provide is ad hoc querying.  Some database technologies provide rich query languages that allow making domain-specific data queries.

Making a general-purpose query API open to the public internet is fraught with danger.  A sufficiently advanced API will allow users, by accident or malice, to make queries that consume a large amount of resources. This can result in denial of service, or very slow response times, impacting the experience of other users.  Registers allow simple lookups, but for more complex domain-specific queries, users can download the dataset and put it into their own index.
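
As a sketch of the "download and index it yourself" approach, the Python below builds a local index over a dataset so that a domain-specific query runs on the user's own machine. The CSV text stands in for a downloaded register, its contents are made up, and `build_index` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a downloaded dataset; the rows are made up.
DOWNLOADED_CSV = """country,name
AA,Alphaland
AB,Alphabet Islands
BA,Betaland
"""


def build_index(csv_text, key_func):
    """Group rows by whatever key the user's own query needs."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        index[key_func(row)].append(row)
    return index


# A query the register itself would not offer: names by initial letter.
by_initial = build_index(DOWNLOADED_CSV, lambda row: row["name"][0])
codes_starting_with_a = [row["country"] for row in by_initial["A"]]
```

The cost of building and querying the index falls on the user who wants the query, not on the shared register service.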

Summary

There are many policies and many technical features which are required to create a register.  In this blog post, we’ve described some of the necessary technical features.

However, when we talk about the technical details, we often get asked whether a register is like a particular data technology such as a relational database, a graph database or a distributed ledger.  In the next post, we will compare registers with some of these other technologies.

1 comment

  1. exstat

    I do wonder sometimes whom the language in these blogs is aimed at. A register is "canonical"? Registers have a cryptographic proof of identity? I do not know what these phrases mean. To be fair, as someone well retired I do not need to know, but if you would like an audience beyond data scientists you could use a bit less jargon!

    I also worry that data science is redefining words to mean what you want them to mean and leaving more traditional interpretations behind. My usual example is the Interdepartmental Business Register at ONS. For confidentiality reasons, this used not to be (and probably still isn't) open data, which the blog almost seems to assume a register must be. If you mean that for an open data product to be regarded as a register it must meet certain criteria, that's fine, but that doesn't make everything else a non-register.

    The IDBR is not an authoritative list of businesses; no such register is ever likely to cover all the nanobusinesses too small to feature in VAT or PAYE systems. Not all of these are companies, of course. It includes trading (but not non-trading) companies. There is also the distinction between UK activity and overseas activity for a registered company to contend with. So there is a link between part of the IDBR and Companies House data but not necessarily one that gives you exactly what you need. CH, VAT and PAYE systems provide part of the means by which business structures are set up on the IDBR for statistical purposes.

    The implication of the blog is that the IDBR should not hold any information from these other data sources apart from the key field needed to link to them. So every time someone draws a sample of businesses, a huge number of calls to those other data sources would be set in train. With the scale of what is done, wouldn't that generate precisely the kind of overload that you say would rule out an analytical service?

    The "append only" aspect is also interesting. While it is usually sensible to keep records of the transactions that have updated a register, it may not always be worth it in business terms, or it may be sensible to keep those transaction details away from the main register. It depends partly on the purpose, the degree of openness and the sheer scale of the transactions. I would imagine that a business's record is updated on the IDBR each time it is sampled, for example, not least to meet various restrictions on how often businesses are sampled. That could be several updates a week! For an open register the uses cannot be predicted in advance (would anyone have guessed how fundamental postcodes would become when they were first introduced?) so having the history visible may well be necessary. If it is for an internal business purpose it may not be and I don't think it is up to data science definitions to dictate that it must be.

    I recognise that I am probably getting close to dinosaur status in this game. But my question is, would you chaps regard the IDBR as a register or not?