Visualising high risk areas for Covid-19 mortality

Colin Angus recently shared on Twitter various visualisations he had created of Covid-19 mortality risk. Here he elaborates on his approach to this work.

 

 

Sometimes the best and most interesting ideas come from seeing a new application of other people’s work.

By mid-March, emerging data from Northern Italy clearly showed that COVID-19 fatality rates were substantially higher in older age groups, particularly for men. Demographers Ilya Kashnitsky and José Aburto combined this data with data from EUROSTAT on the age-sex distribution of the population across European regions and published a fascinating pre-print.

This displayed the potential risk that each area faced from a large-scale COVID-19 outbreak. Areas with large populations of older men, such as parts of former East Germany, faced an expected mortality rate more than four times greater than areas with younger, more female populations, such as south-eastern Turkey.

We are blessed in the UK with some wonderfully rich data, including estimates of the population structure at very low levels of geography, right down to Lower Super Output Area (LSOA) level. I thought it would be interesting to replicate this approach to calculate potential COVID-19 exposure for LSOAs in England.

This was relatively straightforward using ONS population data, age-specific Infection Fatality Rates (IFRs) for COVID-19 estimated by the Imperial College modelling study, and sex-specific Case Fatality Rates from Northern Italy published by the Italian National Institute of Health (ISS). There are a lot of LSOAs in the country (almost 35,000), so I decided to visualise my results for Sheffield, using an LSOA shapefile from the ONS Open Geography portal.
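
As a rough sketch of that calculation (not the actual script, which is linked at the end of this post), the expected deaths in each LSOA come from multiplying the age- and sex-specific population counts by the corresponding fatality rates and summing within each area. The column names below are hypothetical.

library(dplyr)

# Hypothetical inputs:
#   pop - one row per LSOA x age band x sex, with columns lsoa_code, age_band, sex, count
#   ifr - a fatality rate for each age band and sex, with columns age_band, sex, ifr
exposure <- pop %>%
  inner_join(ifr, by = c("age_band", "sex")) %>%
  group_by(lsoa_code) %>%
  summarise(expected_deaths = sum(count * ifr),                    # deaths if everyone were infected
            rate_per_100k   = expected_deaths / sum(count) * 1e5)  # expressed per 100,000 people

A sketch of the exposure calculation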

This immediately shows some huge variations in potential exposure: the central parts of the city, in the middle of the map, have expected mortality rates of less than 100/100,000 even if everyone were infected, while the leafy suburbs in the south-west have rates of 2,400/100,000.

These results seemed potentially useful to help plan local Public Health responses to the pandemic, but something jarred with me about the fact that many of the LSOAs showing high mortality risk are among the most affluent in the entire country, while many of the most deprived LSOAs were identified as low-risk.

In addition to the evidence on age-sex risks of COVID-19, it was becoming very clear that people with pre-existing health conditions were at considerably greater risk of death.

As part of the calculation of the Index of Multiple Deprivation, the Ministry of Housing, Communities & Local Government (MHCLG) calculate a ‘health and disability deprivation’ score which reflects the levels of ill health and rates of hospital admissions within each LSOA. I wondered how the COVID-19 mortality exposure measure might be related to this measure of health, and so I brought in IMD data from the MHCLG Open Data portal.

I expected the relationship between health deprivation and the exposure measure to be complex, since greater deprivation is associated with poorer health, but lower deprivation is associated with older age, which is also associated with poorer health. This complexity was borne out when I plotted the relationship between the two for every LSOA – with a clear correlation between lower health deprivation and higher age-sex risk, but enormous heterogeneity between LSOAs within each deprivation decile.
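
A plot of that kind takes only a few lines once the exposure measure and the IMD health decile have been joined into one LSOA-level data frame; the column names here are again hypothetical.

library(ggplot2)

# lsoa_data: one row per LSOA, with the exposure measure and the IMD health
# deprivation decile (1 = most deprived, 10 = least deprived)
ggplot(lsoa_data, aes(x = factor(health_decile), y = rate_per_100k)) +
  geom_jitter(alpha = 0.2, width = 0.2) +
  geom_boxplot(outlier.shape = NA, fill = NA) +   # summarise the spread within each decile
  labs(x = "IMD health deprivation decile",
       y = "Expected deaths per 100,000 if everyone were infected")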

This plot suggested that Public Health activities might be best concentrated in areas in the bottom right, where health is poorest (on average) and there are more older people, particularly men. But how could I best visualise these areas?

In the end, I decided this was a perfect candidate for a bivariate map. These are a great way of visualising the joint spatial distribution of two variables, which work best when you are particularly interested in picking out the outliers in your data – either areas with high levels of both variables, or high levels of one and low levels of the other.
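
A minimal way to build the classes behind such a map is to cut each variable into tertiles, combine them into a nine-way code and match that code to a 3 x 3 palette. The column names and colours below are illustrative rather than those used in the original script.

library(dplyr)
library(ggplot2)
library(sf)

# lsoa_shapes: an sf object with one row per LSOA containing both variables (hypothetical names)
biv <- lsoa_shapes %>%
  mutate(risk_t   = ntile(rate_per_100k, 3),   # 1 = lowest third, 3 = highest third
         health_t = ntile(health_score, 3),
         class    = paste(risk_t, health_t, sep = "-"))

# A 3 x 3 palette keyed by "risk-health" class (illustrative colours)
pal <- c("1-1" = "#e8e8e8", "2-1" = "#b5c0da", "3-1" = "#6c83b5",
         "1-2" = "#b8d6be", "2-2" = "#90b2b3", "3-2" = "#567994",
         "1-3" = "#73ae80", "2-3" = "#5a9178", "3-3" = "#2a5a5b")

ggplot(biv) +
  geom_sf(aes(fill = class), colour = NA) +
  scale_fill_manual(values = pal) +
  theme_void() +
  theme(legend.position = "none")   # the 3 x 3 legend is usually drawn and annotated separately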

In this case I wanted to pick out areas with high levels of health deprivation and age-sex-specific risk. For more background on this sort of map, there’s a nice overview here. Here’s what the bivariate map for Sheffield looks like:

This map matched my intuition much better – young areas with poor health in the north west are clearly picked out, as are older areas with good health in the south west.

At the same time we can identify a relatively small number of areas for specific concern where the mortality risks from a large-scale COVID-19 outbreak are particularly high. Because many people won’t be familiar with this kind of two-dimensional colour scale, I generally try and add a few annotations to bivariate maps to help guide people’s interpretations.

The final step was to work out how to make these maps accessible to people working in Local Authorities around the country.

I posted the R code that I had used to make the map, in as user-friendly a form as possible, on GitHub so that people could easily create their own maps, but that still felt quite limiting. So I built a Shiny app. This was quite an adventure, because I’ve not used Shiny before, but it ended up being a lot easier than I initially feared.

The trickiest thing was working out how to get the huge LSOA-level shapefile onto Shiny’s hosting platform. In the end I used the excellent mapshaper tool to simplify the polygons in the shapefile until the whole thing was small enough. That’s why the maps in the app look much ‘blockier’ than the ones from the original R script.
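
If you prefer to stay inside R, the rmapshaper package wraps the same simplification routine; the keep value below is only an illustration of the trade-off between file size and detail, and the file paths are hypothetical.

library(sf)
library(rmapshaper)

lsoa <- st_read("LSOA_boundaries.shp")          # full-resolution boundaries
lsoa_small <- ms_simplify(lsoa,
                          keep = 0.05,          # retain roughly 5% of the vertices
                          keep_shapes = TRUE)   # never drop small polygons entirely
st_write(lsoa_small, "LSOA_boundaries_simplified.shp")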

In order to satisfy my aesthetic sense, I also made some large composite maps for a few major cities which are spread across multiple Local Authorities, such as Greater Manchester.

At the suggestion of Ilya Kashnitsky I made these maps slightly transparent and added some background road features using Stamen maps to help place the various areas in context. I’ve shared these maps with Public Health colleagues in various parts of the country and hopefully they were useful in helping to plan the early phases of the pandemic response.

You can find the R code used to generate these maps and the other plots in this blog here.


Colin Angus (@VictimOfMaths) is a Senior Research Fellow in the Sheffield Alcohol Research Group within ScHARR.

His work focuses on the design, development and adaptation of complex health economic models and their use to appraise key policy questions in the field of alcohol research. The majority of his research is based on the development of the Sheffield Alcohol Policy Model to incorporate new methodological developments, new data and to answer new policy questions, both in the UK and internationally.

British Red Cross Covid-19 Vulnerability Index Map

The British Red Cross have pulled together a really interesting and relevant index to attempt to focus help on people who are most vulnerable to contracting Covid-19.

Their Covid-19 Vulnerability Index comprises these vulnerabilities:

  • Demographic
  • Clinical
  • Other health/wellbeing needs
  • Economic
  • Social
  • Physical/geographical isolation

Within these vulnerabilities, a range of indicators have been assessed using data from various providers. As part of this process, the researchers have created a bespoke version of the Index of Multiple Deprivation to cover the whole of the UK (normally Indices of Deprivation are calculated separately for the four individual countries of the UK). This is in itself an interesting project to have undertaken.

One of the datasets the researchers used was the Labour Force Survey. They also used 2011 Census data from the three separate UK census agencies, although they may not have been aware that harmonised UK data is available from the UK Data Service.

As this is a work in progress, the researchers behind the vulnerability index have also identified additional vulnerabilities which could be considered as part of the index.

The researchers have produced a detailed document which outlines the approaches they have taken to compile the vulnerability index.

In addition, they have produced maps at a number of geographic levels.

They have also made their code publicly available.

This is a fantastic resource and I certainly hope it will help in allocating resources where they are most needed to support those most at risk from Covid-19.

Covid-19 and data visualisations

The Covid-19 coronavirus and its impact on society are very much in the news and in the forefront of many people’s minds right now.

Amongst the coverage there has been debate on whether countries should be using quarantines, lockdowns or ‘social distancing’ as part of their approach to slowing the spread of the virus.

I came across this Washington Post article which uses animated visualisations to explore transmission and recovery rates under different approaches. The article admits that the visualisations simplify a complex social and health issue, but they still give an approximation of how the different approaches affect populations over time.

Snapshot of one of the simulations (c) Washington Post

The simulations are also randomised, so they will vary for each visitor to the web page, echoing the way that transmission and recovery will vary not only by the approach taken but also by a range of other factors.

We’d be very interested to hear of any visualisations or innovative use of data to represent and help people better understand the current situation. Let us know in the comments below if you find any!

Learning R or How learning to programme changed my research career

We’ve asked our #DataImpactFellows to share their ideas about ‘change’. 

 David Kingman tells us about how his experiences of learning to programme in R have changed his research career.

 

 

Over the past two years a very big change has occurred in my working life, and it seemed like a natural subject to write about: learning to programme using R. Prior to this I had virtually no experience of programming, whereas now I use programming to achieve things practically every day at work. Apologies in advance if any of what I write below is really obvious!

What is R?

R is an open-source programming language which is mainly designed to be used for data analysis and data visualisation. Being open-source, R is free for anyone to use, because all of its features have been developed by an army of thousands of volunteer developers who have created “packages” (collections of functions which extend R’s basic set of capabilities) over the past couple of decades. 
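
As a quick illustration, installing and loading a package is a one-line job each, after which its functions (and bundled datasets) are available; ggplot2 is used here purely as an example.

install.packages("ggplot2")                   # download the package from CRAN (needed once)
library(ggplot2)                              # load it at the start of each session
ggplot(mpg, aes(displ, hwy)) + geom_point()   # a first chart using a dataset bundled with the package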

More background information about R can be obtained from the R Foundation Website, or from R’s wikipedia page. R is generally regarded as one of the “big two” open-source programming languages which are currently being widely used for data science, alongside Python, which is its main competitor.

Although R itself is a language, most R users interact with it through RStudio, an open-source integrated development environment (IDE) that lets you write and execute code and store the datasets you’re working on in one place. Their open-source nature means that getting hold of R is as simple as downloading R and RStudio from their respective websites, and then learning how to use them.

My initial introduction to R was an example of necessity being the mother of invention. By about mid-2018 I had undertaken a number of successful research projects during my time at the Intergenerational Foundation  (IF), but the fact that we were a small charity meant that we were limited in the tools we had at our disposal, and therefore the kinds of analysis that I could undertake. 

Up to that point, my two main tools as a researcher had been Excel and the open-source GIS application QGIS, which I had used to visualise spatial data (for example, I used both to produce Generations Apart, my 2016 report which looked at the growth of age segregation in England and Wales over the previous 25 years). 

I was keen to undertake more ambitious research projects, but (like many small charities) IF didn’t have the budget to be able to afford the licenses for expensive proprietary statistical software such as SPSS or Stata. However, at about this time I happened to have a conversation with a friend who is a web-developer who encouraged me to learn a programming language, and I also heard about R during a presentation at an academic conference that I was attending on behalf of IF, which convinced me to give learning R a try.

How did I learn R?

The short answer is that I learnt R by practising using it, as would be the case with learning any other new skill. Although I had taken several statistics modules at university, these were restricted to looking at data using SPSS, so analysing data using code was completely new to me.

Although there seems to be a general perception among R users that R has quite a steep initial learning curve, I think that once you’ve got the hang of a set of basic concepts it actually becomes much easier to analyse data using code than it is via GUI (graphical user interface) software. 

Initially, I did a Lynda.com course on R programming, and since then I’ve read a number of books on R, of which I think the most useful was R for Data Science by Garrett Grolemund and Hadley Wickham. There are also a lot of very good online courses which teach you how to use R, particularly those provided by Datacamp.

However, I found that the way I learned fastest was just to try and do things with R, and to persevere with the obstacles I encountered until I got the desired result. There is a vibrant online community of R users who are an endless source of useful advice whenever you get stuck on something; a great place to find the answers to many R-related questions is Stackoverflow, an online forum where people post programming queries which the user community provides answers to.

What do I use R for?

As there are now over 10,000 packages available for R, it’s possible to do almost anything you want to do with data using R, from data wrangling to visualising data, to building models and working with machine learning and AI. 

Whenever I embark upon a research project, I now use R to manage my entire workflow from beginning to end (which now often includes writing-up and presenting findings from my analysis using R Markdown instead of Word). This has a number of advantages over how I was doing things previously, some of which I’ll outline briefly:

  • Reproducibility – I can copy code which I’ve already written from one project to another; this means that once I’ve worked out how to do something, I can do it again almost instantly, or with only minor alterations, which saves a lot of time and work. It’s a good idea to learn how to use a version control system such as Git, together with an online host like GitHub, so that you build up a repository of code that you can reuse for future projects.
  • Efficiency – I can now also work with much larger amounts of data, or with multiple datasets (for example, multiple years of data from a long-running social survey), more efficiently, because loops or apply functions in R make it easy to apply the same set of functions over multiple datasets rather than doing so one at a time (see the sketch after this list).
  • Error detection – Working with code makes it much easier to spot where I’ve made mistakes during my project workflow because I have a record of what I’ve done at each step of the process, which means I can go back and change things without having to laboriously repeat all of the subsequent steps in the analysis.
  • Encouraging experimentation – One of the biggest benefits of working with R is that I’ve found it encourages me to be much more experimental when it comes to interrogating my data for insights, because it’s so straightforward to go back a step in my analysis and re-write my code if I attempt something which doesn’t work or doesn’t produce the desired result. This is a huge improvement compared with having to laboriously repeat many steps in the way you would do with a GUI software programme, and greatly reduces the risk that new mistakes will creep into my work while I’m retracing my footsteps.
  • Visualisation – R is especially renowned for its data visualisation tools (particularly the famous ggplot2 package), which are so powerful that a number of major organisations now produce all of their data visualisations using them, including the BBC and the New York Times.
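
To give a flavour of the efficiency point above, the same cleaning function can be applied across several years of a survey in a single call; the file names and the cleaning rule here are purely illustrative.

library(dplyr)
library(readr)

files <- c("survey_2016.csv", "survey_2017.csv", "survey_2018.csv")   # hypothetical file names

clean_year <- function(path) {
  read_csv(path) %>%
    filter(!is.na(age)) %>%                 # the same (made-up) cleaning rule for every year
    mutate(source_file = basename(path))    # keep track of which file each row came from
}

all_years <- bind_rows(lapply(files, clean_year))   # one combined data frame across all years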

On the visualisation point, one of the first big things I did with R was to create a custom ggplot2 template for making data visualisations which carry IF branding (such as this one from our recent research into how what young adults spend their money on has shifted over time), which has enhanced our brand awareness and made it easier for us to publicise our research on social media.
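
A branded template of that kind usually amounts to little more than a small theme function and a set of default colours reused in every chart; the values below are placeholders rather than IF’s actual branding.

library(ggplot2)

if_colours <- c("#00573F", "#F2A900", "#6E6E6E")   # placeholder palette, not the real IF colours

theme_if <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(plot.title       = element_text(face = "bold"),
          panel.grid.minor = element_blank(),
          legend.position  = "bottom")
}

ggplot(mpg, aes(class, fill = drv)) +
  geom_bar() +
  scale_fill_manual(values = if_colours) +
  theme_if() +
  labs(title = "Example chart using the custom template")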

Although I still have much more to learn about R (and I’m also now learning SQL, another programming language), I’ve found the experience of learning how to programme absolutely fascinating, and I’ve already felt the benefits within my research career: I’ve recently started working part-time as a Senior Research and Statistical Analyst in the Demography Team at the Greater London Authority (alongside my role at IF), where the entire team uses R for the vast majority of its data analysis and visualisation projects.

Overall, I would thoroughly recommend learning R to anybody who works with data, as it’s now widely used within academia, the public sector and the commercial world, and because it is both freely accessible and tremendously powerful once you’ve overcome the initial learning curve.

 


David Kingman is one of the UK Data Service Data Impact Fellows 2019. He is the Senior Researcher at the Intergenerational Foundation and a Senior Research and Statistical Analyst in the Demography Team within the Greater London Authority Intelligence Unit.

David is a quantitative senior researcher and data analyst with a wide range of interests including population demography, economics, inequality, housing, pensions, higher education, political representation and wellbeing in his current role as Senior Researcher at the Intergenerational Foundation (IF). The IF is a non-party-political think tank which researches intergenerational fairness.

David is frequently invited to present IF’s work at conferences, seminars and roundtable discussions, and has appeared as an expert witness before select committees in both Houses of Parliament, at All-Party Parliamentary Groups and before the Low Pay Commission on two occasions. In June 2018, he addressed a large audience at the European Parliament as part of the 2018 European Youth Event.

 

The code underlying the Social Metric Commission’s new poverty measure

Emily Harris and Matthew Oakley from the Social Metrics Commission’s Secretariat introduce the release of the code underlying the Commission’s new poverty measure.

Who is the Social Metrics Commission?

The Social Metrics Commission is an independent commission founded in 2016 to develop a new approach to measuring poverty in the UK. Led by the CEO of the Legatum Institute, Baroness Stroud, the Commission’s membership draws on a group of top UK poverty thinkers from different political and professional backgrounds. Currently there is no agreed UK Government measure of poverty and the Commission’s goal is to provide a new consensus around poverty measurement that enables action and informs policy making to improve the lives of people in poverty.

Our landmark report in 2018 proposed a new measure of poverty for the UK and, since then, we have received support from across the political spectrum. The Government has now committed to developing experimental national statistics based on the Commission’s approach.

Aside from analysts in government, we want our measure to be used far and wide, from researchers to charities, media, and policymakers. That is why we have made the code that underpins the creation of the poverty metric freely available to download from our website.

What has the Commission produced?

The Commission’s measure of poverty goes beyond conventional metrics that look only at incomes by also accounting for the positive impact of people’s liquid assets (such as savings, stocks, and shares) on alleviating immediate poverty and the range of inescapable costs that reduce people’s spending power. These inescapable costs include rent or mortgage payments, childcare, and the extra costs of disability.
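
Very roughly, and leaving aside the Commission’s detailed equivalisation, sharing and asset rules (which are implemented in the published code and documented in the user guides), the arithmetic behind ‘total resources available’ looks something like the sketch below, with all figures invented.

# Illustrative only - the published Stata code implements the Commission's full definitions
net_income        <- 1500   # net income for the period (invented figure)
liquid_assets     <- 1200   # savings, stocks and shares available to draw on
inescapable_costs <- 700    # rent or mortgage, childcare, extra costs of disability

assets_contribution <- liquid_assets / 12   # one illustrative way of spreading savings over a year
total_resources <- net_income + assets_contribution - inescapable_costs
total_resources                             # compared against a poverty threshold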

As well as reporting on the number of families in poverty, our approach also seeks to understand more about the nature of that poverty and families’ experiences. It provides measures of the depth and persistence of poverty, as well as including a range of Lived Experience Indicators that capture issues such as mental and physical health, employment, isolation, and community engagement.

By taking all of these things into account, we believe the new metric reflects more accurately the realities and experiences of those living in poverty than previous measures. If taken up as an agreed measure, it would allow the Government to take meaningful steps to reduce poverty and improve outcomes for those who do experience it and to track the success of these policies.

How can you use the Commission’s work?

The code underlying the Commission’s measure of poverty, along with detailed user guides, can be easily downloaded from our website. It is written as a series of Stata do files and draws on data from the Family Resources Survey (FRS), Households Below Average Income (HBAI), and Understanding Society.

Once the folder for the code is downloaded, users will see that it is arranged in two separate collections of do files according to the data source being used. The first set of code draws on Family Resources Survey (FRS) and related Households Below Average Income (HBAI) data to operationalise the Commission’s core measure of poverty, measures of depth, and selected Lived Experience Indicators. The second set of code draws on Understanding Society data to operationalise the Commission’s measure of persistent poverty and the remaining Lived Experience Indicators.

Within each folder there is a series of individual do files coordinated by one master Command file. When you run the Command file it executes each individual do file, successively building the measure and then producing final results. These results are stored in an Excel spreadsheet that is automatically produced when the code is run.

There are also a number of modification options that users can explore in the Command file. You can specify, for example, the name of the data cut you are running so that files are named accordingly, and you can set which years of analysis, or which country within the UK, the measurement should apply to.

Our hope is that analysts and researchers will download our code and use it to replicate our analysis, but will also extend it to further analyse UK poverty based on the Commission’s approach. We look forward to seeing what additional insights others can discover by using our code and building on our analysis.


Matthew Oakley is the Head of Secretariat for the Social Metrics Commission and Director of WPI Economics, an economics and public policy consultancy. He is a respected economist and expert on welfare reform and the future of the welfare state.

Before founding WPI Economics Matthew had been Chief Economist and Head of Financial Services Policy at the consumer champion Which?, Head of Economics and Social Policy at the think tank Policy Exchange and an Economic Advisor at the Treasury. He has an MSc in Economics from University College London, where he specialised in microeconomics, labour markets, public policy and econometrics.

 

Emily Harris is a Senior Analyst for the Social Metrics Commission and is based at the Legatum Institute. She most recently worked in the Social Policy Section at the Commonwealth Secretariat.

Originally from South Africa, Emily was at the University of Cape Town’s Poverty and Inequality Initiative, where she managed a project developing indicators for youth well-being at the small area level. In this post she played a leading role in constructing an index for multidimensional youth poverty and setting up the Youth Explorer data portal. She has also worked as a data analyst on two cash transfer studies in India and consulted as a data manager at a private research company. Emily graduated cum laude with a Masters in Development Studies from the University of KwaZulu-Natal.

How can we measure how green our towns and cities are?

We’re all much more aware these days of the environment we live and work in, and recognise trees’ ability to help absorb some of the excess carbon dioxide from our atmosphere.

Areas including  Greater Manchester and Northumberland, and organisations like Trees for Cities are actively trying to increase the number of trees and people’s connection with them.

There are various approaches people have developed to work out how physically green a city or town is.

The OECD gathers international data on the Depletion and Growth of Forest Resources, which is available as open access data from the UK Data Service.

A couple of years ago, the BBC website highlighted a tool which had been devised to map four types of land use in local authorities: farmland, natural, built on and green urban.

The ONS Data Science Campus has also developed algorithms for mapping the urban forest at street level.

I was intrigued, then, when I read a recent article in the Guardian – Green streets: which city has the most trees? – about another approach to mapping how tree-filled cities are.

The two particularly interesting things? Firstly, that the work had been based on Google Maps Street View data and secondly that the team at the MIT Senseable City Lab made the code open source.

The researchers at MIT (in collaboration with the World Economic Forum) developed Treepedia, which currently analyses and maps the above-ground tree cover in 27 world cities (with more promised).

Treepedia map of Sydney, showing Green View Indicator percentage

Where other approaches have used satellite mapping, the MIT researchers calculated what they term the ‘Green View Index’ (GVI) using Google Street View (GSV) panoramas.

Taking this approach meant the researchers were able to represent how people perceive their environment at street level. Their GVI uses a scale from 0-100, representing the percentage of canopy coverage at any particular location.
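
In essence, the index at a sampling point is the share of vegetation pixels in the panoramas taken there, expressed as a percentage. A toy version of the sum is shown below; the real pipeline does the hard work of classifying vegetation pixels in each image.

# Invented pixel counts for three panorama images sampled at one street point
green_pixels <- c(12000, 9500, 14200)
total_pixels <- c(50000, 50000, 50000)

gvi <- mean(green_pixels / total_pixels) * 100   # Green View Index for that point, on a 0-100 scale
gvi
# [1] 23.8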

The team go into detail about their methodology for creating the GVI in several papers, including their most recent:  Mapping the spatial distribution of shade provision of street trees in Boston using Google Street View panoramas.

One particular quirk of the team’s chosen method for calculating a Green View Index for an area is that it is (as they acknowledge) limited by where Google Street View vehicles can access.

This leads to the interesting side-effect that their Treepedia map of the Manhattan part of New York has a large dark space where Central Park is. That said, this approach does highlight the amount (or lack thereof) of trees on town and city streets, which in itself is an opener for discussion.

Treepedia map of Manhattan, New York showing the 'empty' space of Central Park

While Treepedia currently covers a limited number of international cities, the team are keen to share their code and have released an open-source Python library that can be used to compute the Green View Index for any city or region. You will need a GIS file for the street network of your chosen area and a Google Street View API key. The library can be downloaded from GitHub.

Have you tried mapping how green your area is? Let us know in the comments below.

QAMyData: A health check for numeric data

Louise Corti reports on the QAMyData tool recently developed by the UK Data Service. The tool is a free easy-to-use open source tool that provides a health check for numeric data. It uses automated methods to detect and report on some of the most common problems in survey or numeric data, such as missing data, duplication, outliers and direct identifiers.

Why is a checking tool needed?

Social science research benefits from accountability and transparency, which can usefully be underpinned by high quality and trustworthy data.

It can be a challenge when curating data to locate appropriate tools for checking and cleaning data in order to make data FAIR. The tasks of checking, cleaning and documenting data by repository staff can be manual and time-consuming.

Disciplines that require high quality data to be created, prepared and shared benefit from more regularised ways to do this.

The objective of this grant was to develop a lightweight, open-source tool for quality assessment of research data.

This can be viewed as a ‘data health check’ that identifies the most common problems in data submitted by users in disciplines that utilise quantitative methods. We believe this could be appealing to a range of disciplines beyond social science where work involves surveys, clinical trials or other numeric data types. Furthermore, a tool that could be easily slotted into general data repository workflow would be appealing.

How were quality checks selected?

Requirements were gathered around the kinds of checks that could be included through a series of engagement exercises with the UK Data Service’s own data curation team, other data publishers, managers and quantitative researchers.

A number of repositories were invited to online meetings to discuss their data assessment methods. Common examples included:

  • column and row (and case) issues
  • missing values
  • missing or incomplete labels
  • odd characters
  • obviously disclosive personal information.

A comprehensive list of ‘tests’ was produced, covering those most commonly used when quality assessing numeric data files.

Next the team worked on appropriate methods of assessment, how to set up input controls, and reporting thresholds. For example, what threshold might constitute a ‘fail’?

A critical feature of the tool was that a user should be able to specify and set thresholds to indicate what they are prepared to accept, whether that is no missing data or a requirement that data must be fully labelled.

Issues would be identified in both a summary and detailed report, setting out where to find the error/issue (for example, the column/variable and case/row number).

This at-a-glance report aspect is appealing for data repositories, to help quickly assess data as it comes in, instead of relying on manual processes that are often a large part of the evaluation workflow. An early plan was also that the system must be extensible to add new tests.

The checks were broken down into four types:

File checks

  • File opens – checks whether the file is in an acceptable format
  • Bad filename check, via a regular expression (RegEx) pattern – the regex requires quotes, e.g. “[a-z]”; to use a special character, e.g. a backslash (\), a preceding backslash is required, e.g. \\

Metadata checks

  • Report on number of cases and variables – always run
  • Count of grouping variables
  • Missing variable labels – must be set to true; if set to false the test will not run
  • No label for user-defined missing values, e.g. -9 not labelled – SPSS only
  • ‘Odd’ characters in variable names and labels – user specifies the characters
  • ‘Odd’ characters in value labels – user specifies the characters
  • Maximum length of variable labels, e.g. >79 characters – user specifies the length
  • Maximum length of value labels, e.g. >39 characters – user specifies the length
  • Spelling mistakes (non-dictionary words) in variable labels, using a dictionary file – user specifies a dictionary file
  • Spelling mistakes (non-dictionary words) in value labels, using a dictionary file – user specifies a dictionary file

Data integrity checks

  • Report number of numeric and string variables – always run
  • Check for duplicate IDs – user specifies the variables; multiple variables can be added on new lines, e.g. Caseno, AnotherVariableHere
  • ‘Odd’ characters in string data – user specifies the characters
  • Spelling mistakes (non-dictionary words) in string data, using a dictionary file – user specifies a dictionary file
  • Percentage of values missing (‘Sys miss’ and undefined missing) – user sets the threshold, e.g. more than 25%

Disclosure control checks

  • Identifying disclosure risk from unique values or low thresholds (frequencies of categorical variables or minimum values) – user sets the threshold value, e.g. 5
  • Direct identifiers, using a RegEx pattern search – user runs these separately for postcodes, telephone numbers, etc.; we advise running them separately as they may be resource intensive

The tool development: technology choices

We are very fortunate to have Jon Johnson, ex database manager for the British Cohort Studies as our lead on technical work.

At the time he was leading the user side of the UK Data Service’s big data platform work (the Smart Meter Research Portal) with UCL, bringing together the dual aspects of small-scale survey work and the challenge of ingesting and quality-assessing large-scale streaming data. The tool envisaged should be able to consider a range of QA solutions for all numeric data, regardless of scale.

We were also happy to have recruited a local part-time programmer, a dynamic final year computer science undergraduate, who had previously worked for the UK Data Archive. Myles Offord proved to be an ambitious and hugely productive software engineer who undertook some thorough R&D with Jon before the final software solutions were selected.

The choice of technology underpinning the tool went through at least four months of research, experimenting with different open-source programming languages and libraries of statistical functions – R, Python and Clojure – initially on SPSS and Stata files.

During the course of the development phase of the project, we found that the open-source library ReadStat supported all the commonly used file types and had been noticed in the statistical community. As the library is being actively maintained by the community, it is a strong backbone for QAMyData and ultimately was a very good choice for the tool.

As different statistical software treats data differently, input checks needed to be software specific; for example, Stata insists on its own input conditions. Output reporting had to ensure that a standard frame was built and that errors and issues were easily locatable by the user.

A relatively new agile programming language, Rust, was discovered and selected as the best choice for the wrapper. The Rust application was developed and iterated, and the code is published on the UK Data Service GitHub along with comprehensive instructions on how to download the programme.

The software was designed to be easily downloaded to a laptop or server and quickly integrated into the data cleaning and processing pipelines of data creators, users, reviewers and publishers. QAMyData is available for Linux, Windows and Mac, and can be downloaded from the UK Data Service GitHub pages under an MIT Licence.

Running the tests

The first release of QAMyData allowed a small number of critical quality tests to run, the intention being to add the remaining desirable tests following initial external testing.

SPSS, Stata, SAS and CSV formats can all be loaded in. The tool uses a configuration file (written in YAML) that holds default settings for each test; thresholds for pass or fail (e.g. detecting value labels that are truncated, email addresses identified in a string variable, or undefined missing values) can be easily adapted.

# Checks whether any user-defined missing values do not have labels (sysmis) - SPSS only
value_defined_missing_no_label:
  setting: true
  desc: "User-defined missing values should have a label (SPSS only)"

Example of a check in the config file

The regular expression checks, which detect, for example, email addresses or telephone numbers, can be quite resource intensive to run, so these are best run separately; they can be commented out of the default configuration file and run on their own.

regex_patterns:
  setting:
    - "^[A-Za-z]{1,2}[0-9A-Za-z]{1,2}[ ]?[0-9]{0,1}[A-Za-z]{2}$"
  desc: Values matching the regex pattern fail (full UK postcodes found in the data)

Example of a regex check in the config file
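
Patterns like this can be sanity-checked outside the tool before they go into the config; for example, the same expression can be tried against a few toy values in R.

postcode_pattern <- "^[A-Za-z]{1,2}[0-9A-Za-z]{1,2}[ ]?[0-9]{0,1}[A-Za-z]{2}$"
values <- c("S1 2AB", "not a postcode", "SW1A1AA")
grepl(postcode_pattern, values)
# [1]  TRUE FALSE  TRUE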

The software creates a report as a ‘data health check’ that details errors and issues, as both a summary and providing a location of the failed test.

Tests run that are highlighted in green in the summary report have passed, meaning that there were no issues encountered according to the thresholds set.

Failed tests are shown in red, indicating that QAMyData has identified issues in particular variables or values.

To locate the problems, a user can click on a red test, which takes them to more detailed table, which shows the first 1000 cases.

In the example below, to view the results of the failed ‘Variable odd characters’ test, a click on the failed test will scroll down to the result, in this case that variables V137 and OwnTV contain “odd” characters in their label.

Example of a summary report

Example of a detailed report for particular failed checks

Data depositors and publishers can act on the results and resubmit the file until a clean bill of health is produced.

Testing, testing

The project undertook evaluation of the tool, algorithm and upload process with researchers, teachers, students and data repositories, including partner international data archives, university data repositories and journals.

Our first hands-on session with users was held as an NCRM course at the LSE in February, focused on the principles of, and tools for, assessing data quality and disclosure risk in numeric data.

Half of the 20 attendees came from outside the academic sector, from government departments and the third sector.

For the hands-on part, materials included a worksheet on data quality, using a purposefully ‘messy’ dataset, containing common errors that we asked participants to locate and deal with.

Feedback from this early workshop recognised the importance of undertaking data assessment. Given the implications of the GDPR when creating and handling data, users also appreciated opportunities for a greater understanding of how to review data for direct identifiers and welcomed the idea of a simple, free and extensible tool to help with data cleaning activities.

We received feedback that the tool would be useful in teaching on quantitative data analysis courses, suggesting that it would be useful to set up a longer-term dedicated teaching instance.

Further presentation and hands-on training sessions held from March to June unearthed more constructive feedback on accessing the software, pointing to improvements to the User Guide and suggestions for additional data checks.

By the end of the second workshop our resources were refined, ready for release. We are delighted that we experienced such interest from a variety of sectors, and expect more enquiries and opportunities to promote and showcase the tool and training aspects of the project.

The final few weeks of our project saw the team fully document the tool, annotate the config file and provide a step-by-step user guide.

Page from QAMyData User Guide


Capacity building aims and deliverables

One of the key aims was also to support interdisciplinary research and training by creating practical training materials that focus on understanding and assessing data quality for data production and analysis.

We sought to incorporate data quality assessment into training in quantitative methods. In this respect, both the UK Data Service training offerings and NCRM research methods training nodes are excellent vehicles for promoting such a topic.

A training module on what makes a clean and well-documented numeric dataset was created. This included a very messy and purposely-erroneous dataset and training exercises compiled by Cristina Magder, plus a detailed user guide. These were road tested and versions iterated during early training sessions.

The tool in use

Version 1.0 of the QAMyData tool is available for use. Since releasing earlier versions of the software in the spring, we have undertaken some work to embed the tool into core workflows in the UK Data Service.

The Data Curation team now use it to QA data as it comes in, to help with data assessment, and we are scoping the needs for integration into the UK Data Service self-archiving service, ReShare, so that depositors can check their numeric data files before they submit them for onward sharing.

We hope that the tool will be picked up and used widely, and that the simple configuration feature will enable data publishers to create and publish their own unique Data Quality Profile, setting out explicit standards they wish to promote.

We welcome suggestions for new tests, which can be added by opening a ‘New Issue’ in the Issues space of our GitHub area.

Open a new issue in our Github space

End note

Louise gained a grant from the National Centre for Research Methods (NCRM) under its Phase 2 Commissioned Research Projects fund, which enabled us to employ a project technical lead and a software engineer. The project ran from January 2018 to July 2019 and version 1.0 of the QAMyData tool is available for use.

The QAMyData project, and its resulting software and training materials, was a very satisfying project to lead. My colleagues Jon Johnson, Myles Offord, Cristina Magder, Anca Vlad and Simon Parker were a real pleasure to work with, making up a friendly, dedicated team who were open to ideas and responsive to the feedback from user testing.


Louise Corti is Service Director, Collections Development and Producer Relations for the UK Data Service. Louise leads two teams dedicated to enriching the breadth and quality of the UK Data Service data collection: Collections Development and Producer Support. She is an associate director of the UK Data Archive with special expertise in research integrity, research data management and the archiving and reuse of qualitative data.

 

Mapping census data with ArcGIS Online

Rachel Oldroyd and Luke Burns work step by step through the process of using ArcGIS Online to map census data.

Geographical Information Systems (GIS) have been described as one of the most important technological developments of the 21st century, providing powerful analytical tools which inform decision making across a number of disciplines. GIS now forms part of the Secondary Geography Curriculum in England, but it is often difficult for teachers to dedicate time to learning a new piece of software alongside other conflicting priorities.

In this tutorial we use a free, online GIS to map UK 2001 census data. ArcGIS Online, provided by ESRI, is easy to use and does not require downloading or installing, so it is well suited for use in the classroom. We provide step-by-step instructions to map and interpret census data, along with debugging tips covering some of the common problems encountered with ArcGIS Online.

Open ArcGIS Online now using your Web Browser by visiting www.arcgis.com/home/webmap/viewer.html. There is no need to make an account or sign in for this tutorial, however doing so provides access to more advanced functionality.

Click on the ‘Modify map’ link from the top right hand corner of the page to begin.

The Web Map Viewer

  1. On the homepage you will see a new map which is centred on the UK. There is a toolbar along the top of the window with a number of different tools and a side panel on the left hand side.

 

  2. At the top of the side panel, you will see three buttons, ‘About’, ‘Content’ and ‘Legend’, which provide further information about the map and the content. If you click on the legend and content buttons now, you will see that the map is currently empty aside from the base map, which is called ‘topographic’.

 

  3. To navigate the map window you can click and drag the map to pan to a different location. You can also zoom in and out, zoom to the default extent (the UK) or zoom to your current location using the buttons on the left hand side of the window.

  4. At the moment we are using a base map called ‘topographic’. You can change the base map by clicking on the ‘Basemap’ button on the left hand side of the top toolbar.

  5. Using the zoom and pan buttons, locate Leeds and zoom in such that it occupies the majority of your screen.

Finding Census Data

We will now add 2001 census data. The UK Data Service website contains a wide range of census data to work with.

  1. In a new tab in your web browser, visit the census pages of the UK Data Service website at the following link: http://casweb.mimas.ac.uk. This website contains huge amounts of census data and as such you will need to specify which data you would like and at which geography (e.g. which country, county, city etc).

 

  2. From the CasWeb homepage, click the ‘Start CasWeb’ link followed by the first link, ‘2001 Aggregate Statistics Datasets (with digital boundary data)’. This data comes in a geographical format (a shapefile) and is ready to map. [Note: unfortunately we cannot use 2011 data as it does not come in the same geographical format, but later in the exercise we compare the 2001 dataset to the 2011 dataset and discuss the changes]. Notice that you can also download data from as far back as 1971.

  3. We now need to specify the area for which we would like to download our census data – this could be anywhere, but we will focus on West Yorkshire. Use the on-screen options to locate West Yorkshire by selecting the country (England), then ‘select lower geographies’, then the region (Yorkshire and The Humber), then ‘Select Counties’, then the county (West Yorkshire). The illustrations below step you through this process:
  • In Step 1, we specify the country to select. Then choose ‘select lower geographies’ to select a region within England.

  • In Step 2, we narrow down our search for West Yorkshire by specifying the region in which it belongs.

  • In Step 3, we are able to select the West Yorkshire county having searched through the country and region to find this. Here we can click the ‘Select output Level’ button as we do not need to continue our search. If we wanted to continue and filter Leeds, Bradford etc. we could do so.

 

  4. Before we choose the variables to download we first need to specify a geography. Notice the four options presented to you at the bottom of the screen: District, ST Wards, CAS Wards and OA. The smaller the geography, the more detail you will get – you can think of this as being similar to cutting a cake; you can cut it into big or small pieces. In this example, these pieces range from Districts (5 pieces – one for each of Leeds, Bradford, Wakefield, Calderdale and Kirklees) to OAs (7,131 tiny pieces). Let’s select CAS Wards, which breaks West Yorkshire down into a manageable 126 areas – select CAS Wards followed by the ‘Select Data…’ button.

  5. Now it is over to you! Using the table towards the bottom of the page, browse the different datasets that you can download for West Yorkshire. You can highlight a row and click the ‘Display Table Layout’ button to explore the range of data available within each themed table. Select two datasets that may interest you.

Example using ‘people with poor health’

The instructions below show how to select the number of people with poor health but you should choose a dataset that interests you.

  1. To select persons with poor health, visit the ‘Health and provision of unpaid care’ row (KS008) and click the ‘Display Table Layout’ button.

  2. Browse the options available and select box 6 – people who report their general health as ‘Not Good’.

  3. To add this data to your ‘basket’ to download, click the ‘Add variables to data selection’ button above the table. Notice how this adds to the list of data to be downloaded on the right-hand side of the page.

  4. You can then continue searching for data by using the back button above the table or you can proceed to download your data by selecting the ‘Get Data’ button to the top right of your screen.

 

  5. The final step is to give your data a name and select the file format. As we want to map the data we need to check the little button next to Digital Boundary data. You can then click ‘Execute Query’, which will start saving your data to a specified location.

 

Mapping the Data

We have now sourced and downloaded some census data. You may have downloaded the health data stepped through above or you may have downloaded different data. Hopefully you have at least two datasets to explore here as we look to map this.

 

  1. Now, let’s add the data you have just downloaded from CasWeb to ArcGIS Online. In ArcGIS Online, click on the down-arrow beside the Add button to the top left of the main window and choose Add Layer from File. Browse to find the zipped CasWeb file you downloaded earlier, select this and click Import Layer. This may take a few seconds to display on screen.

 

  2. ArcGIS Online will add a default style but this is not always appropriate. Using the drop-down attribute box to the left of the map window, select one of the variables you downloaded (these will be the longer numbers, in order of selection on CasWeb). You may need to go back into CasWeb to find out which variables the numbers refer to.

  3. Select one of the variables and notice how the software tries to map this for you. Experiment with the display options to show this in a way that you are happy with (e.g. using the Counts and Amounts (colour) option). Click Done once complete to save this.

 

  4. Spend some time navigating the map and trying to understand the spatial distribution of your data (in the example provided, people who report ‘not good’ health). You may wish to add area labels to make this easier. Click on the three dots “…” beside your map layer and choose Create Labels. Choose Area Labels as opposed to the code to add more useful labels.

 

  5. By this point you may be happy to stop and re-practice the above and if so that is fine – you have downloaded data, mapped it and looked for spatial patterns. However, a nice addition to this session is to compare the currently mapped data with a second dataset to see if any patterns exist between the two. As you already have one dataset mapped (in my case, people who report ‘not good’ health), it is time to add a second. To do this, click on the three dots “…” beside your map layer again and click Copy.

 

  6. This will create a duplicate map layer. You can now repeat the steps followed previously to display a second dataset using this layer (for me, my second dataset is households without access to a car). Note that this time it would be wise to choose a different method of visualisation so one dataset can be seen ‘on top’ of the other. If you used Counts and Amounts (colour) last time, using Counts and Amounts (size) this time will enable you to see both datasets at the same time and hopefully draw some comparisons – see the example below.

 

  7. Once selected, click Done and ensure that the map legend (or key) is showing by clicking on the appropriate icon to the top left of the screen. Doing this will enable you to see which colours/symbols represent high values and which represent low values. You can then explore and compare the data layers.

Follow up questions

If you are running this tutorial with your students, you may want to ask them to think about the following questions:

Q1.  What patterns do both of your datasets show?  What parts of West Yorkshire show particularly high and low values?  Are there are reasons for this?

Q2.  Do both datasets seem to correlate in any way – for example, do they both have high and low values in the same areas or are these rather different?  Does this pattern match what you might have expected?

Q3.  Can you think of any problems with presenting data in this way?  Are the colours or symbols misleading?

Q4.  Visit the Datashine website to compare your dataset(s) from 2001 to those from 2011.  Are the patterns the same or have things changed?  http://datashine.org.uk/ [Note: You will need to pan the map to find West Yorkshire and then use the menu (top right) to locate and map the data – you may find that your datasets(s) aren’t available to select though as the website only contains a selection!]

Debugging tips

The ArcGIS online software is extremely easy to use, only occasionally should you run into problems. Here are a few common scenarios and how to fix them:

1) The ‘Add’ button isn’t visible on the top toolbar.

Simply click the ‘Modify Map’ button in the top right hand corner and it will appear.

2) The ‘table of contents’ panel has disappeared.

Click the ‘Details’ button on the top left hand side of the page and it will reappear.

3) I can’t find the option for ‘Change Style’.

In the ‘Details’ window on the left hand side of the page (see 2 if you can’t see this), ensure that the middle tab is selected – named ‘Show Contents of Map’ – then hover over the layer name and you will see the ‘Change Style’ icon.

4) I’ve changed the style and now I can’t get rid of the ‘Change Style’ window.

Ensure you have clicked ‘OK’ or ‘Done’ at the bottom of the left hand window. The default ‘Details’ window should then appear.

In the very unlikely event that you run into a problem you can’t fix, close the window down and reopen the map.



Rachel Oldroyd is one of our UK Data Service Data Impact Fellows. Rachel is a quantitative human geographer based at the Consumer Data Research Centre (CDRC) at the University of Leeds, researching how different types of data (including TripAdvisor reviews and social media) are used to detect illness caused by contaminated food or drink.

Luke Burns is a Lecturer in Quantitative Human Geography at the University of Leeds. His work focuses on the advanced application of geographical information systems to socioeconomic problems & the development of geodemographic classification systems and composite indicators.  

Thinking geographically about ethnic and socio-economic segregation

Richard Harris discusses a different approach to measuring ethnic and socio-economic segregation and fitting a multilevel index of segregation to census data in R.

The measurement of segregation has been debated in the social sciences for well over half a century.

Concerns about segregation, and the potential for it to harm society, are prevalent within recent Government reports and proposals, occasionally generating lurid (and often misleading) headlines in the media. Understandably, policy makers and other interested parties would like to know how much segregation there is and whether it is increasing.

However, their desire can be frustrated. The quest – sometimes undertaken by social scientists – for one perfect measure that could definitively answer those questions continues to elude researchers and always will.

Whilst much academic ink has been spent, for example, discussing the mathematical properties of different indices, there is more to the measurement than maths: different approaches reflect different conceptions of what segregation means as a process or as an outcome. Moreover, as new data, new computational tools and new thinking emerge, it is right to re-evaluate what is being measured, how and why.

A forthcoming issue of the journal Environment and Planning B: Urban Analytics and City Science, to be published later this year (volume 45, issue 6), notes that after decades in which studies of residential segregation have been dominated by the use of descriptive indices – such as those of dissimilarity and isolation – there has been a recent surge of interest in developing those measures to provide greater insights into the observed patterns and the processes that produce them.

Part of that interest is in multi-scale and multi-level measures, showcased at a special session at the recent ESRC Research Methods Festival, which enable the measurement of segregation simultaneously at multiple scales of analysis, aiming to generate insight into the different processes that create patterns of segregation at micro-, meso- and macro-scales.

One such method is the Multilevel Index of Dissimilarity (MLID), which builds on the commonly used Index of Dissimilarity but can capture both the numeric scale of segregation (the amount) and the spatial scale (the pattern), which the standard index cannot.
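
To make that distinction concrete, here is a minimal sketch in base R of the standard Index of Dissimilarity that the MLID builds on; the data frame, its values and its column names are hypothetical, invented purely for illustration:

# The standard (single-level) Index of Dissimilarity:
# ID = 0.5 * sum over areas of |share of group A - share of group B|
areas <- data.frame(
  area   = c("a1", "a2", "a3", "a4"),
  groupA = c(120, 30, 10, 40),       # counts of group A in each area (hypothetical)
  groupB = c(300, 280, 260, 310)     # counts of group B in each area (hypothetical)
)

id_standard <- function(a, b) {
  0.5 * sum(abs(a / sum(a) - b / sum(b)))
}

id_standard(areas$groupA, areas$groupB)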

The MLID aims to be as simple as possible, to be user-friendly and to run in open source software as a package in R, allowing for the importance of reproducibility. Because it is simple, it is also fast.

Whereas other approaches may take hours or days to run and/or are limited to small data sets and study regions, the MLID operates in the order of minutes on small area census data for the whole of England and Wales.

Even if it is used as a precursor to more advanced approaches, it offers more ‘interactivity’ with the data at the early stages of analysis, helping to judge whether more complex approaches are warranted. A description of the MLID, its theoretical derivation, its interpretation and its implementation in R is available in this open access article.

To see its usefulness, consider the following example. Standard indices of segregation treat the four patterns shown in Figure 1 as the same – more accurately, they don’t consider them at all: numerically the amount of segregation is the same in each case. But clearly the patterns, and therefore the nature of the segregation, are not the same in all four cases. The MLID captures the differences – it is a measure of clustering as well as of unevenness across the study region.

Figure 1. Standard indices of segregation cannot differentiate between these patterns but the multilevel index of dissimilarity (MLID) can.

Because it is able to differentiate between the numeric and geographic scales of segregation, the MLID can be used to quantify a measure of spatial diffusion, as in the example below.

In fact, this is the kind of process that is occurring within the UK, where the segregation of ‘minority’ groups is decreasing as they spread out from the areas in which they were previously concentrated into more mixed neighbourhoods.

Figure 2. The MLID can be used to measure a process of spatial diffusion.

To introduce potential users to the MLID, a tutorial, which also provides an overview of the MLID package in R, is available at https://cran.r-project.org/web/packages/MLID/vignettes/MLID.html.

It is a case study using census data to consider patterns of ethnic segregation at a range of scales – for example, the differences between London and the rest of the country but also the internal heterogeneity of London.
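
For orientation, here is a short sketch of how the package might be called. The id() function, the bundled ethnicities dataset and the level names below are assumptions drawn from my reading of the linked vignette, so please check the tutorial for the exact names and arguments:

# Sketch of calling the MLID package; function, dataset and level names
# are assumptions based on the linked vignette.
install.packages("MLID")   # once, from CRAN
library(MLID)

data(ethnicities)          # small-area census counts by ethnic group (assumed to ship with the package)

# Standard index of dissimilarity for two groups
id(ethnicities, vars = c("Bangladeshi", "WhiteBrit"))

# Multilevel version: decomposes the index across nested geographies
id(ethnicities, vars = c("Bangladeshi", "WhiteBrit"),
   levels = c("MSOA", "LAD", "RGN"))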

Further information (the slides from the Research Methods Festival) is available here.

Better measurement, better models and better data are not a panacea for understanding how and why segregation is created, nor its consequences.

The challenge of understanding how patterns relate to processes and to outcomes remains, but the hope is that a (relatively) easy-to-use, multiscale measure of segregation will help to enhance our understanding of segregation as a geographical outcome and as a contributor to geographical processes, which are therefore better measured geographically.

 

Richard Harris is Professor of Quantitative Social Geography at the School of Geographical Sciences at the University of Bristol and co-author (with Ron Johnston) of the forthcoming book, Ethnic Segregation between Schools: is it increasing or decreasing in England? (Bristol University Press, 2019). The research was funded under the ESRC’s Urban Big Data Centre, ES/L011921/1.

Our Q-Step interns look back over their time with us

This summer, we were lucky enough to have Manchester students Rabia Butt and Klara Valentova join us for a Q-Step internship. Q-Step was developed as a strategic response to the shortage of quantitatively-skilled social science graduates. Manchester is one of 15 centres taking part in a £19.5 million programme designed to promote a step-change in quantitative social science training.

Here they review their time with us.

The purpose of our research was to explore deprivation in the UK using Census data and the Carstairs Index of Deprivation.

To comprehend the overall transformation in the level of deprivation within the UK, Census data from 1971 through to the most recent Census in 2011 were used in the calculation. In addition, different geographical levels, such as local authority, ward, lower super output area, output area and district, were used. This has now been accomplished. Throughout this process, we learned many new techniques and skills, which we applied to achieve the final outcomes.
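
As a flavour of the calculation, here is a minimal sketch in R of a Carstairs-style score. The data frame, its values and its column names are entirely hypothetical; the logic (z-standardise each of the four census indicators across areas and sum them) follows the index's standard definition:

# Hypothetical area-level percentages for the four Carstairs indicators
census <- data.frame(
  area         = c("oa1", "oa2", "oa3"),
  overcrowding = c(5.2, 12.8, 3.1),    # % households overcrowded
  male_unemp   = c(4.0, 9.5, 2.2),     # % male unemployment
  no_car       = c(18.0, 45.0, 10.5),  # % households with no car
  low_class    = c(20.1, 38.4, 15.0)   # % in low social class
)

# z-standardise each indicator across areas, then sum to get the Carstairs score
z_scores <- scale(census[, c("overcrowding", "male_unemp", "no_car", "low_class")])
census$carstairs <- rowSums(z_scores)

census[order(-census$carstairs), ]     # most deprived areas first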

Our research has found that most of the extremely deprived areas are in cities.

To investigate why that might be the case, we examined the individual scores of each of the indicators used to produce the Carstairs scores, as these show which variable contributed most to an area's level of deprivation.

It was found that in cities ‘non-car ownership’ was the indicator that caused the final deprivation scores to be so large. Owning a car in a city, however, can be very impractical, and so this suggests that the indicators used for the calculation of the Carstairs index are outdated and may need to be revised, and possibly replaced with more relevant ones.
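
Continuing the hypothetical sketch above, one way to see which indicator drives an area's score is to pick out the largest z-score in each row:

# Which indicator contributes most to each (hypothetical) area's score?
# max.col() returns the column with the largest value in each row.
census$top_indicator <- colnames(z_scores)[max.col(z_scores)]
census[, c("area", "carstairs", "top_indicator")]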

The most deprived areas overall are located in the City of London and Glasgow City, while the least deprived areas can be found in the suburbs of London, particularly to the south-west, in towns such as Wokingham or cities such as St Albans.

That being said, our project also found that some of the least deprived areas are located in the City of London and around Glasgow.

This came as a surprise since these two areas, in particular, seem to have the highest levels of deprivation. This discovery was only possible because the Carstairs index allows for the analysis of smaller geographies such as output areas.

These results demonstrate that the Carstairs index is valuable for identifying small areas with high levels of deprivation which would not be recognised as deprived when using other deprivation measures, for example the Indices of Multiple Deprivation.

We were also able to compare deprivation across different Census years for the whole of the UK, and found that deprivation decreased significantly between 1981 and 2011.

Deprivation in Northern Ireland, Scotland and Wales decreased the most, while in England this was the case for only certain areas. There was only slight improvement in the south of England, as these areas have always been less deprived. On the other hand, the north of England used to be greatly deprived, especially in comparison to the south. Areas in the north have lowered their deprivation scores greatly; however, they still remain more deprived than the south.

Rabia

I have learnt a great deal during the process of our research, starting from not knowing much about Carstairs, the R language, 3D mapping in QGIS, virtual reality and much more, to now having completed a research report.

I am most grateful for the experience I have gained during my internship; I have not only improved the skills I already had, but also acquired a new set of skills which I will definitely benefit from in the near future.

One of the key skills I have developed is data analysis, particularly of quantitative data. This will definitely assist me in the final year of my degree, especially with my dissertation.

Likewise, this internship has enhanced my knowledge of the role of a data analyst, which was one of the reasons why I wanted to do my internship here: I wanted the experience of working for the UK Data Service to help me determine the career path I want to take after I graduate.

I have enjoyed creating this project with a team member (Klara), as we were able to assist each other throughout the process in achieving the desired outcomes. This whole journey, which has now sadly come to an end, has definitely helped me in my development.

Klara

I started this internship with basic data analysis skills, which I wanted to make use of and enhance to a professional level. At the same time, I was hoping this experience would help me to decide what I would like to do after graduating, and whether data analysis and statistics would be something I would enjoy even outside of my courses at the University. Working at the UK Data Service has fulfilled all of the above.

I have learned many new skills and have developed on a professional level while discovering a whole new range of things I can do with data (more about that in my previous blog at https://lab.ukdataservice.ac.uk/2018/08/21/klara-and-carstairs/).

Apart from developing my knowledge of software such as QGIS and Microsoft Access, the skill I developed most was how to work successfully in a team.

Having to work with another person, taking their ideas on board and compromising has been very challenging for me, yet incredibly rewarding. It was due to our excellent communication, mutual respect and understanding that we were able to calculate all the scores, carry out the analysis and produce the final report.

I have also discovered that I quite enjoyed guiding and helping Rabia throughout the internship, which helped me to gain very strong interpersonal skills that will be important in both my personal and professional life.

I am now very excited to have completed this project and to have gained so many invaluable skills and experiences. Nonetheless, I am sad to be leaving the UK Data Service as I enjoyed every task I had to complete, and I always felt very excited about working on this research. I am happy, however, that I now know I would like to do this type of job after finishing University because I enjoy it and find it very fulfilling and worthwhile.