https://data.blog.gov.uk/2017/03/03/developing-a-government-data-science-sandbox/

Developing a government data science sandbox

Data science - the use of advanced techniques to understand information - is starting to take hold across government. The Better Use of Data team at GDS is helping departments grow their capability to use data and data science where they do not yet have the skills, tools or technology to do so.

Our support already includes the growing data science community of interest, our portfolio of collaborative data science projects, a better-defined career path and the data science accelerator programme. The first ever government data science conference will follow in a few months. The aim is to build capacity and demand, ensuring government uses data effectively to make better decisions. To achieve this we also need to remove barriers to doing more good-quality, impactful data science at scale. One such barrier, raised with us by data scientists and analysts, was the lack of easy access to data science tools - the software needed to perform complex analysis, present it in a digestible way and share the underlying code.

We wanted to know how widespread that experience was, and what could be done about it. So as part of an initial discovery phase, we visited data scientists and analysts in over a dozen departments and agencies to ask, “what [technological] barriers do you face, and how can GDS help remove them?” Colleagues openly discussed the challenges they experience, as well as their achievements, helping to build a picture of what helps and hinders them.

User stories

Several themes emerged from these conversations.

‘Alice,’ an analyst in a government agency:

I want to use a broader range of data science software than is currently available, in order to progress from management reporting and the routine production of official statistics to improve understanding of my agency’s operations.

‘Bob,’ an analyst in a large department:

I use R fairly regularly, but need access to other data science tools to answer more sophisticated questions about administrative data that my department holds.

‘Carole,’ leader of a department’s data science team:

I need my team - and many of the department’s 100+ analysts whose data science skills are currently under-used - to have access to a bigger range of open-source software. This would help overcome restrictions from our IT provider that prevent the development of greater insights, which could be shared with policymakers and frontline practitioners.

As this selection of user stories shows, a key theme was that many data scientists and analysts can’t easily access the software they need - including programming languages, databases, development environments and ways to share analysis - using their department’s computers and networks. Much of this software is open source, but cannot readily be used within departments’ existing IT infrastructure. We can sum up these varied experiences of accessing software in this user need:

As a civil servant wanting to use data science, I want to be able to try a variety of software tools using my government-issued devices, in order to learn about their capabilities and build a business case for their use.

Data scientists and analysts also told us of other barriers. These include uncertainty around data sharing and security, and the difficulties of sharing their code with other civil servants, or more publicly, to help with quality assurance.

A sandbox solution

To address many of the challenges that came up in our discovery, we propose to build a cloud-based sandbox environment in which government analysts and data scientists can test new tools, see what approaches their peers are taking, share their work, get help, and develop effective business cases to procure the relevant software.

During this discovery phase, data scientists and analysts described some essential features that would make such a sandbox useful:

  • Browser-based access - to get around hardware/network restrictions
  • Secure/trusted sign-on - assuring users they are working in a safe environment
  • A data staging area to prepare data for analysis
  • Support for multiple databases and languages - access to currently-unavailable software, in a way that keeps packages up to date
  • Basic sharing tools, to help analysts share their work, get help and see others’ approaches

From our engagement with potential users to date, features that could be useful in future iterations include:

  • Deployment apps - to integrate the sandbox into existing IT infrastructure
  • Advanced publishing tools
  • Support for a wider range of software
  • Tutorials and/or a wiki to encourage best practice

Next steps

Taking the user requirements we’ve collected into account, we have started developing the first version of an analyst sandbox. We plan to base our infrastructure on similar cloud-based environments in government, whether in the early stages of development (such as at the Ministry of Justice) or starting to be used (such as at the Met Office), and in academia, learning from colleagues about what works and what to avoid. We hope to have a simple alpha in use by the end of March 2017, with further iterations into the summer.

But we haven’t yet heard from all the data scientists or analysts who might make use of this sandbox, so email us and tell us what you would like the sandbox to do, and how else we can help support the growing use of data science in government.

1 comment

  1. narsing

    Data Science Sandbox is a good thought.
    Prateek Buch.
