https://data.blog.gov.uk/2015/09/28/tidying-time-series-data/

Tidying time series data

Last week we had a blitz of tidying the data.gov.uk catalogue, focussing on data in time series. For a while we've been unhappy that data released as monthly updates is sometimes added to data.gov.uk as separate datasets. So we tracked down as many of these errant ones as we could and 'rolled up' each time series into a single dataset.

For example these are simply monthly updates:

Monthly updates as numerous datasets

When this data is split up like that, it clogs up the search results. It makes it hard to find a particular month or download a batch of them.

After our tidying there is only one dataset, and it looks like this:

Monthly updates in a single dataset

We've not only merged the records into one, we've also parsed the dates so that we can present the monthly files in chronological order and hide the older ones by default.

It's now clear that this publisher's spend data is up-to-date for this year, which is good.
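As a rough illustration of the date parsing and ordering described above, here is a minimal Python sketch. It is not the code data.gov.uk runs. It assumes each monthly file has a title containing a month name and a four-digit year (for example "Spend March 2015"), and the names `parse_month`, `split_resources` and the `visible` parameter are made up for this example.

```python
import re
from datetime import date

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]
MONTH_NUMBER = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}

# Matches e.g. "March 2015" anywhere in a title (assumed title format).
MONTH_YEAR = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{4})\b", re.IGNORECASE
)


def parse_month(title):
    """Return the first day of the month named in the title, or None."""
    match = MONTH_YEAR.search(title)
    if match is None:
        return None
    month = MONTH_NUMBER[match.group(1).lower()]
    return date(int(match.group(2)), month, 1)


def split_resources(titles, visible=3):
    """Sort dated titles chronologically and split them into
    (hidden_older, shown_latest). Titles with no recognisable date
    are left out of this sketch."""
    dated = [(parse_month(t), t) for t in titles]
    dated = [pair for pair in dated if pair[0] is not None]
    dated.sort(key=lambda pair: pair[0])
    ordered = [title for _, title in dated]
    cut = max(len(ordered) - visible, 0)
    return ordered[:cut], ordered[cut:]
```

Calling `split_resources(titles, visible=2)` on a list of monthly titles gives the older files first (to be hidden by default) and the two most recent months second (to be shown).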

In all we've merged almost 2000 datasets from about 60 publishers. The vast majority has been NHS bodies' spend transaction data, although that is still a small fraction of the 400-odd NHS bodies publishing this data on data.gov.uk. If we've missed any time series, then publishers can contact us and view the guidance.

One more thing - because each time series' files are in a single place now, we will shortly provide a one-click download to get a zip file of all the months' data together. For that new feature, watch this space...
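To give a flavour of what bundling a series into one zip file might involve, here is a small Python sketch using the standard `zipfile` module. It is only an illustration under assumptions: the function name `bundle_months` is invented, and the files are assumed to be already fetched and held in memory as a mapping of filename to bytes.

```python
import io
import zipfile


def bundle_months(files):
    """Pack a mapping of filename -> bytes into a single zip archive
    and return the archive as bytes. Files are added in filename order."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):
            archive.writestr(name, files[name])
    return buffer.getvalue()
```

If the filenames start with a sortable date such as `2015-01`, the archive lists the months in chronological order.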

3 comments

  1. Comment by exstat posted on

    I would query the use of the phrase "time series".  Strictly speaking, a time series is a long series of observations of something, not a series of lists of everything that happened in a month.  Rolling them up together is a good move for this sort of data but it must not be applied to "proper" time series, as back data are often revised when new ones come out.  It is important sometimes that people have easy access to the time series as published a year ago (say) rather than just the latest series, and if they cannot readily identify the date of release they cannot do that.

    What phrase would I use in place of time series for this sort of dataset?  Monthly report, perhaps, because that is all it is.

    • Replies to exstat>

      Comment by davidread posted on

      You're no doubt right in a strict statistical sense, but in general use "time series" can also mean "series of events":

      http://www.oed.com/view/Entry/304635

      Regarding revisions to data, there is no trouble for publishers to add extra URLs for this data, alongside the original version. For example:

      https://data.gov.uk/dataset/annual_survey_of_hours_and_earnings

      where under 2010 you'll see both the original figures and the revised ones.

      • Replies to davidread>

        Comment by exstat posted on

        David

        I actually quite like what you have done with ASHE, with the exception of the word "time series".  Is that something that has come from the original publishers (ONS) or is it your own embellishment?

        The OED does indeed give as an alternative meaning "the sequence of events which constitutes or is measured by time" but it is obvious from the associated quotes that it is being used in a philosophical or theological context (C S Lewis, Swinburne...).  That is not general use!  And it's nothing like a series of bundles of events which happen to fall in particular periods of time!  If that is creeping into general use, it should be resisted.

        Going back to ASHE, I think it is a misuse of the phrase "time series" to describe the 29 huge datasets.  Users may be able to build up thousands of different time series themselves by extracting data from each of the 29 datasets.  But I would argue strongly that a time series is a series of values obtained at successive points in time and presented as such. The Index of Average Earnings is a time series. ASHE is not.

        In my original reply, by the way, I was thinking more of a time series like GDP, which is frequently revised back to the year dot. What I would expect to see in that context is something which clearly describes and offers links to the GDP time series available now, the series available last month, the series available the month before that and so on.  What I actually get if I search for GDP on data.gov.uk is a set of links to different data sources. One such link is the second estimate of GDP. Under that I see 32 resources from Q1 2008 to Q2 2015.  It would not be clear to the casual user whether "Q1 2008" is offering data for Q1 2008 alone (as ASHE offers data for 2010 alone, say) or a set of time series as published in Q1 2008, or (and this is the most likely) a set of time series published with the first "second estimates" made for Q1 2008 (in May 2008?).  Nor would I have any idea what the significance of "second" is and how it relates to previous or subsequent estimates, which are presumably hiding under different headings. So I certainly do not get something that tells me about all the GDP time series that have been published, or something that warns me that the latest "second" estimate may already be out-of-date.

        I've got no idea whether it looks any better (or worse) now on the National Statistics website. My recollection from my working days is that you basically got a list of relevant releases, though not necessarily the lot.  So I'm not suggesting that it is easy to get this sort of thing right.  What does worry me a bit is that quite different sorts of datasets or data items are being forced into a straitjacket which could in some cases actually make life more difficult for users.  The purloining of well-defined phrases like "time series" and the tendency to badge the quality of data sources entirely on how open they are, without making it plain that other aspects of quality which matter a lot (such as accuracy!) are not covered, seem to tell me that data science is dominating what happens more than perhaps it should.

        Don't get me wrong; I remain enthusiastic about the open data agenda, and bringing together successive generations of outputs has to be a good thing in general but I think more care is needed on how it is presented.
