QAMyData: A health check for numeric data

Louise Corti reports on the QAMyData tool recently developed by the UK Data Service. QAMyData is a free, easy-to-use, open-source tool that provides a ‘health check’ for numeric data. It uses automated methods to detect and report on some of the most common problems in survey or other numeric data, such as missing data, duplication, outliers and direct identifiers.

Why is a checking tool needed?

Social science research benefits from accountability and transparency, which can usefully be underpinned by high quality and trustworthy data.

It can be a challenge when curating data to locate appropriate tools for checking and cleaning data in order to make data FAIR. The tasks of checking, cleaning and documenting data by repository staff can be manual and time-consuming.

Disciplines that require high quality data to be created, prepared and shared benefit from more regularised ways to do this.

The objective of this grant was to develop a lightweight, open-source tool for the quality assessment of research data.

This can be viewed as a ‘data health check’ that identifies the most common problems in data submitted by users in disciplines that utilise quantitative methods. We believe this could be appealing to a range of disciplines beyond social science where work involves surveys, clinical trials or other numeric data types. Furthermore, a tool that could be easily slotted into general data repository workflow would be appealing.

How were quality checks selected?

Requirements were gathered around the kinds of checks that could be included through a series of engagement exercises with the UK Data Service’s own data curation team, other data publishers, managers and quantitative researchers.

A number of repositories were invited to online meetings to discuss their data assessment methods. Common examples included:

  • column and row (and case) issues
  • missing values
  • missing or incomplete labels
  • odd characters
  • obviously disclosive personal information.

A comprehensive list of ‘tests’ was produced, covering those most commonly used when quality assessing numeric data files.

Next, the team worked on appropriate methods of assessment, how to set up input controls and how to handle reporting thresholds. For example, what threshold might constitute a ‘fail’?

A critical feature of the tool was that a user should be able to specify and set thresholds to indicate what they are prepared to accept, be it that no data may be missing or that data must be fully labelled.

Issues would be identified in both a summary and detailed report, setting out where to find the error/issue (for example, the column/variable and case/row number).

This at-a-glance report aspect is appealing for data repositories, to help quickly assess data as it comes in, instead of relying on manual processes that are often a large part of the evaluation workflow. An early plan was also that the system must be extensible to add new tests.

The checks were broken down into four types:

File checks

  • File opens – checks whether the file is in an acceptable format
  • Bad filename check, using a regular expression (RegEx) pattern – the pattern requires quotes, e.g. “[a-z]”; to use a special character such as a backslash (\), a preceding backslash is required, e.g. \\

Metadata checks

  • Report on the number of cases and variables – always run
  • Count of grouping variables
  • Missing variable labels – must be set to true; if set to false the test will not run
  • No label for user-defined missing values, e.g. -9 not labelled – SPSS only
  • ‘Odd’ characters in variable names and labels – user specifies the characters
  • ‘Odd’ characters in value labels – user specifies the characters
  • Maximum length of variable labels, e.g. >79 characters – user specifies the length
  • Maximum length of value labels, e.g. >39 characters – user specifies the length
  • Spelling mistakes (non-dictionary words) in variable labels – user specifies a dictionary file
  • Spelling mistakes (non-dictionary words) in value labels – user specifies a dictionary file

Data integrity checks

  • Report the number of numeric and string variables – always run
  • Check for duplicate IDs – user specifies the variables; multiple variables can be added on new lines, e.g.
    – Caseno
    – AnotherVariableHere
  • ‘Odd’ characters in string data – user specifies the characters
  • Spelling mistakes (non-dictionary words) in string data – user specifies a dictionary file
  • Percentage of values missing (‘Sys miss’ and undefined missing) – user sets the threshold, e.g. more than 25%

Disclosure control checks

  • Identifying disclosure risk from unique values or low thresholds (frequencies of categorical variables or minimum values) – user sets the threshold value, e.g. 5
  • Direct identifiers using a RegEx pattern search – user runs separate searches for postcodes, telephone numbers etc.; these tests are best run separately as they may be resource intensive

The tool development: technology choices

We were very fortunate to have Jon Johnson, formerly database manager for the British Cohort Studies, as our lead on technical work.

At the time he was leading on the user side of the UK Data Service’s big data platform work (the Smart Meter Research Portal) with UCL, thus combining experience of small-scale survey work with the challenge of ingesting and quality-assessing large-scale streaming data. The tool we envisaged should be able to offer QA solutions for all numeric data, regardless of scale.

We were also happy to have recruited a local part-time programmer, a dynamic final year computer science undergraduate, who had previously worked for the UK Data Archive. Myles Offord proved to be an ambitious and hugely productive software engineer who undertook some thorough R&D with Jon before the final software solutions were selected.

The choice of technology underpinning the tool went through at least four months of research, experimenting with different open-source programming languages and libraries of statistical functions, including R, Python and Clojure, initially on SPSS and Stata files.

During the development phase of the project, we found that the open-source library ReadStat supported all the commonly used file types and had been noticed in the statistical community. As the library is being actively maintained by the community, it provides a strong backbone for QAMyData and ultimately was a very good choice for the tool.

As different statistical software treats data differently, input checks needed to be software-specific; for example, Stata insists on its own input conditions. Output reporting had to ensure that a standard frame was built and that errors/issues were easily locatable by the user.

A relatively new programming language, Rust, was discovered and selected as the best choice for the wrapper. The Rust application was developed and iterated, and the code is published on the UK Data Service GitHub along with comprehensive instructions on how to download the programme.

The software was designed to be easily downloaded to a laptop or server and quickly integrated into data cleaning and processing pipelines for data creators, users, reviewers and publishers.

QAMyData is available for Linux, Windows and Mac, and can be downloaded from the UK Data Service GitHub pages under an MIT Licence.

Running the tests

The first release of QAMyData allowed a small number of critical quality tests to run, the intention being to add the remaining desirable tests following initial external testing.

SPSS, Stata, SAS and CSV formats can all be loaded in. The tool uses a configuration file (written in YAML) that holds default settings for each test, such as the threshold for a pass or fail (e.g. detecting value labels that are truncated, email addresses identified in a string, or undefined missing values); these settings can be easily adapted.

# Checks whether any user-defined missing values do not have labels (sysmis) - SPSS only
value_defined_missing_no_label:
  setting: true
  desc: "User-defined missing values should have a label (SPSS only)"

Example of a check in the config file

The regular expression checks, which detect e.g. emails or telephone numbers, can be quite resource intensive to run, so they are best run separately; they can be commented out of the default configuration file and enabled for a separate run.

regex_patterns:
  setting:
    - "^[A-Za-z]{1,2}[0-9A-Za-z]{1,2}[ ]?[0-9]{0,1}[A-Za-z]{2}$"
  desc: Values matching the regex pattern fail (full UK postcodes found in the data)

Example of a regex check in the config file
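To illustrate the kind of matching this check performs, here is a short Python sketch that applies the postcode pattern above to a column of string values and reports which rows would fail. It is purely illustrative (QAMyData itself is written in Rust, and the values shown are made up):

import re

# The full UK postcode pattern from the config example above
pattern = re.compile(r"^[A-Za-z]{1,2}[0-9A-Za-z]{1,2}[ ]?[0-9]{0,1}[A-Za-z]{2}$")

# Hypothetical values from a string variable in a survey file
values = ["CO4 3SQ", "not stated", "M1 1AE", "red"]

# Report the row number of every value that looks like a full postcode
for row, value in enumerate(values, start=1):
    if pattern.match(value):
        print(f"Row {row} would fail the disclosure check: {value}")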

The software creates a ‘data health check’ report that details errors and issues, both as a summary and with the location of each failed test.

Tests run that are highlighted in green in the summary report have passed, meaning that there were no issues encountered according to the thresholds set.

Failed tests are shown in red, indicating that QAMyData has identified issues in particular variables or values.

To locate the problems, a user can click on a red test, which takes them to a more detailed table showing the first 1,000 cases.

In the example below, to view the results of the failed ‘Variable odd characters’ test, a click on the failed test scrolls down to the result, in this case showing that variables V137 and OwnTV contain “odd” characters in their labels.

Example of a summary report from the QAMyData tool

Example of a detailed report from the QAMyData tool for particular failed checks

Data depositors and publishers can act on the results and resubmit the file until a clean bill of health is produced.

Testing, testing

The project undertook evaluation of the tool, algorithm and upload process with researchers, teachers, students and data repositories, including partner international data archives, university data repositories and journals.

Our first hands-on session with users was held as an NCRM course at the LSE in February, focused on the principles of, and tools for, assessing data quality and disclosure risk in numeric data.

Half of the 20 attendees came from outside the academic sector, from government departments and the third sector.

For the hands-on part, materials included a worksheet on data quality, using a purposefully ‘messy’ dataset, containing common errors that we asked participants to locate and deal with.

Feedback from this early workshop recognised the importance of undertaking data assessment. Given the implications of the GDPR when creating and handling data, users also appreciated opportunities for a greater understanding of how to review data for direct identifiers and welcomed the idea of a simple, free and extensible tool to help with data cleaning activities.

We received feedback that the tool would be useful in teaching on quantitative data analysis courses, suggesting that it would be useful to set up a longer-term dedicated teaching instance.

Further presentation and hands-on training sessions held from March to June unearthed further constructive feedback on accessing the software, pointing to improvements to the User Guide and suggestions for additional data checks.

By the end of the second workshop our resources were refined, ready for release. We are delighted that we experienced such interest from a variety of sectors, and expect more enquiries and opportunities to promote and showcase the tool and training aspects of the project.

The final few weeks of our project saw the team fully document the tool, annotate the config file and provide a step-by-step user guide.

Page from the QAMyData User Guide

Capacity building aims and deliverables

One of the key aims was also to support interdisciplinary research and training by creating practical training materials that focus on understanding and assessing data quality for data production and analysis.

We sought to incorporate data quality assessment into training in quantitative methods. In this respect, both the UK Data Service training offerings and NCRM research methods training nodes are excellent vehicles for promoting such a topic.

A training module on what makes a clean and well-documented numeric dataset was created. This included a very messy and purposely-erroneous dataset and training exercises compiled by Cristina Magder, plus a detailed user guide. These were road tested and versions iterated during early training sessions.

The tool in use

Version 1.0 of the QAMyData tool is available for use. Since releasing earlier versions of the software in the spring, we have undertaken some work to embed the tool into core workflows in the UK Data Service.

The Data Curation team now use it to QA data as it comes in, to help with data assessment, and we are scoping the needs for integration into the UK Data Service self-archiving service, ReShare, so that depositors can check their numeric data files before they submit them for onward sharing.

We hope that the tool will be picked up and used widely, and that the simple configuration feature will enable data publishers to create and publish their own unique Data Quality Profile, setting out explicit standards they wish to promote.

We welcome suggestions for new tests, which can be added by opening a ‘New Issue’ in the Issues space of our GitHub area.

Open a new issue in our Github space

End note

Louise gained a grant from the National Centre for Research Methods (NCRM) under its Phase 2 Commissioned Research Projects fund, which enabled us to employ a project technical lead and a software engineer. The project ran from January 2018 to July 2019 and version 1.0 of the QAMyData tool is available for use.

The QAMyData project, and its resulting software and training materials, was a very satisfying project to lead. My colleagues Jon Johnson, Myles Offord, Cristina Magder, Anca Vlad and Simon Parker were a real pleasure to work with, making up a friendly, dedicated team who were open to ideas and responsive to the feedback from user testing.


Louise Corti is Service Director, Collections Development and Producer Relations for the UK Data Service. Louise leads two teams dedicated to enriching the breadth and quality of the UK Data Service data collection: Collections Development and Producer Support. She is an associate director of the UK Data Archive with special expertise in research integrity, research data management and the archiving and reuse of qualitative data.

 

Thinking geographically about ethnic and socio-economic segregation

Richard Harris discusses a different approach to measuring ethnic and socio-economic segregation and fitting a multilevel index of segregation to census data in R.

The measurement of segregation has been debated in the social sciences for well over half a century.

Concerns about segregation, and the potential for it to harm society, are prevalent within recent Government reports and proposals, occasionally generating lurid (and often misleading) headlines in the media. Understandably, policy makers and other interested parties would like to know how much segregation there is and whether it is increasing.

However, their desire can be frustrated. The quest – sometimes undertaken by social scientists – for one perfect measure that could definitively answer those questions continues to elude researchers and always will.

Whilst much academic ink has been spent, for example, discussing the mathematical properties of different indices, there is more to the measurement than maths: different approaches reflect different conceptions of what segregation means as a process or as an outcome. Moreover, as new data, new computational tools and new thinking emerge, it is right to re-evaluate what is being measured, how and why.

A forthcoming issue of the journal Environment and Planning B: Urban Analytics and City Science to be published later this year (volume 45, issue 6) notes that after decades in which studies of residential segregation have been dominated by the use of descriptive indices – such as those of dissimilarity and isolation – there has been a recent surge of interest in developing those measures to provide greater insights into the observed patterns plus the processes that produce them.

Amongst that interest is one in multi-scale and multi-level measures, showcased at a special session at the recent ESRC Research Methods Festival, which enable the measurement of segregation simultaneously at multiple scales of analysis, aiming to generate insight into the different processes that create patterns of segregation at micro-, meso- and macro-scales.

One such method is the Multilevel Index of Dissimilarity (MLID), which builds on the commonly used Index of Dissimilarity but can capture both the numeric scale of segregation (the amount) and also the spatial scale (the pattern), which the standard index cannot.
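For reference, the standard (single-level) Index of Dissimilarity that the MLID builds on is usually written as

D = \frac{1}{2} \sum_{i} \left| \frac{y_i}{Y} - \frac{x_i}{X} \right|

where y_i and x_i are the counts of the two groups in neighbourhood i, and Y and X are their totals across the study region. It ranges from 0 (the two groups are spread evenly) to 1 (complete segregation), but it says nothing about where the unevenness occurs, which is the gap the multilevel version addresses.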

The MLID aims to be as simple as possible, to be user-friendly and to run in open-source software as a package in R, recognising the importance of reproducibility. Because it is simple, it is also fast.

Whereas other approaches may take hours or days to run and/or are limited to small data sets and study regions, the MLID operates in the order of minutes on small area census data for the whole of England and Wales.

Even if it is used as a precursor to more advanced approaches, it offers more ‘interactivity’ with the data at the early stages of analysis, helping to judge whether more complex approaches are warranted. A description of the MLID, its theoretical derivation, its interpretation and its implementation in R is available in this open access article.

To consider its usefulness, consider the following example. Standard indices of segregation treat the four patterns shown in Figure 1 as the same – more accurately, they don’t consider them at all: numerically, the amount of segregation is the same in each case. But clearly the patterns, and therefore the nature of the segregation, are not the same in all four cases. The MLID captures the differences – it is a measure of clustering as well as of unevenness across the study region.

Figure 1. Standard indices of segregation cannot differentiate between these patterns but the multilevel index of dissimilarity (MLID) can.

Because it is able to differentiate between the numeric and geographic scales of segregation, the MLID can be used to quantify a measure of spatial diffusion, as in the example below.

In fact, this is the kind of process that is occurring within the UK where the segregation of ‘minority’ groups is decreasing as they spread out from the areas in which previously they were concentrated into more mixed neighbourhoods.

Figure 2. The MLID can be used to measure a process of spatial diffusion.

To introduce potential users to the MLID, a tutorial is available at https://cran.r-project.org/web/packages/MLID/vignettes/MLID.html which also provides an overview of the MLID package in R.

It is a case study using census data to consider patterns of ethnic segregation at a range of scales – for example, the differences between London and the rest of the country but also the internal heterogeneity of London.

Further information (the slides from the Research Methods Festival) is available here.

Better measurement, better models and better data are not a complete panacea for understanding how and why segregation is created nor its consequences.

The challenge of understanding how patterns relate to processes and to outcomes remains, but the hope is that a (relatively) easy-to-use, multiscale measure of segregation will help to enhance our understanding of segregation as a geographical outcome and as a contributor to geographical processes that are therefore better measured geographically.

 

Richard Harris is Professor of Quantitative Social Geography at the School of Geographical Sciences at the University of Bristol and co-author (with Ron Johnston) of the forthcoming book, Ethnic Segregation between Schools: is it increasing or decreasing in England? (Bristol University Press, 2019). The research was funded under the ESRC’s Urban Big Data Centre, ES/L011921/1.

Hacking data on nutrition and greenhouse gas emissions

Last week, I was lucky enough to be able to join a group of data geeks (I say that affectionately!) as they gathered in Manchester to explore two very different datasets.

The first was the National Diet and Nutrition Survey (NDNS – available from the UK Data Service), which collects information on the food consumption, nutrient intake and nutritional status of the general population aged 1.5 years and over living in private households in the UK, based on around one thousand representative people.

The second was a Greenhouse Gas Emissions (GHGE) dataset created by researchers Neil Chalmers, Ruth Slater and Leone Craig at the Rowett Institute (University of Aberdeen), which provides the current best approximations of emissions for each food product in the NDNS.

The evening was organised by members of the Greenhouse Gas and Dietary choices Open source Toolkit (GGDOT) project funded by N8 Agrifood: Sarah Bridle, Christian Reynolds, Joe Fennell and Ximena Schmidt.

There was a good turnout with people breaking into three groups to explore the data and see what they could come up with.

Group 1

Group 1 used R to read in the CSV files and then grouped the data by user id and day number, calculating aggregate CO2 emissions and calories per person per day.

They then plotted these, attempting to find out whether age predicted CO2 emission levels.

They reported that their linear model was not conclusive…
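The group worked in R, but as a rough illustration of the kind of aggregation described above, a pandas/statsmodels sketch might look like the following. The file and column names (person id, day number, CO2, energy, age) are hypothetical rather than the actual NDNS/GHGE variable names:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical merged NDNS food diary + GHGE emissions file
diary = pd.read_csv("diary_with_ghge.csv")

# Aggregate CO2 emissions and calories per person per day
per_day = (diary
           .groupby(["person_id", "day_number"], as_index=False)
           .agg(co2_kg=("co2_kg", "sum"),
                kcal=("energy_kcal", "sum"),
                age=("age", "first")))

# Does age predict daily CO2 emissions? Fit a simple linear model
model = smf.ols("co2_kg ~ age", data=per_day).fit()
print(model.summary())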

 

Group 2

Group 2’s question was “Who makes you kill the planet?”

Digging down into this was an exploration of whether who you eat food with affects the planet – or your waistline.

The group plotted 16 categories of who survey respondents ate with, based around a notional evening meal period of 5pm to 8pm.

The initial plot suggests that you shouldn’t eat with people you don’t know…

This applied both for calories (figure below) and  for greenhouse gas emissions.

Putting them together showed a similar pattern, although it did look like the public aren’t all bad…

A curious discovery was also made during the group’s hack.

Apparently, people eat less cheese on a Sunday…

Group 3

Group 3 reported that they had attempted data visualisation by geography, although they ran out of time to complete what they had been aiming for:

It was a lot of fun seeing people get together to explore different datasets. It will be interesting to see what future GGDOT Hacknights throw up, as well as how these datasets might be used to attempt to change consumer behaviour.

The next two GGDOT Hacknights will take place in Durham on 18th October and York on 29th November.

 

Analysing Food Hygiene Rating Scores in R: a guide

Rachel Oldroyd, one of the UK Data Service Data Impact Fellows, takes a step-by-step approach to using R and RStudio to analyse Food Hygiene Rating Scores.

Data download and Preparation

In this tutorial we will look at generating some basic statistics in R using a subset of the Food Hygiene Rating Scores dataset provided by the Food Standards Agency (FSA).

Visit http://ratings.food.gov.uk/open-data/en-GB now and download the data for an area you are interested in. I’ve downloaded City of London Corporation.

R is able to parse XML files but it’s easier to load the file into Excel (or a similar package) and save as a CSV file (visit this page if you’re unsure how to do this: https://support.office.com/en-us/article/import-xml-data-6eca3906-d6c9-4f0d-b911-c736da817fa4).

R and RStudio

R is a statistical programming language and data environment.

Unlike other statistics software packages (such as SPSS and Stata) which have point and click interfaces, R runs from the command line. The main advantage of using the command line is that scripts can be saved and quickly rerun, promoting reproducible outputs. If you’re completely new to R, you may want to follow a basic tutorial beforehand to learn R’s basic syntax.

The most commonly used Graphical User Interface for R is called RStudio (https://www.rstudio.com/products/rstudio/) and I highly recommend you use this as it has nifty functionality such as syntax highlighting and auto completion which helps ease the transition from point and click to command line programming.

Basic Syntax

Once installed, launch RStudio. You should see something similar to this setup with the ‘Console’ on the left-hand side, the ‘Environment window’ on the top right and another window with several tabs (Files, Plots, Packages, Help, Viewer) on the bottom right:

Don’t worry if your screen looks slightly different, you can visit View > Panes from the top menu to change the layout of the windows.

The console area is where code is executed. Outputs and error messages are also printed here but content within this area cannot be saved. As one of the main advantages of using R is its ability to create easily reproducible outputs, let’s create a new script which we can save and rerun later. Hit CTRL+SHIFT+N to create a new script. Save this within your working directory using the save icon.

Loading Data

Let’s get on with loading our data. Type

data = read.csv(file.choose())

into the script file and hit CTRL + Enter whilst your cursor is on the same line to run the command; you can also highlight a block of code and use CTRL + Enter to run the whole thing.

You should see a file browser window; navigate to the CSV file you saved earlier containing the FHRS data. Note the syntax of this command: it creates a variable called data on the left-hand side of the equals sign and assigns to it the file loaded in using the read.csv command. Once loaded, you should see the new variable, data, appear in the environment window on the right-hand side. To view the data you can double click on the variable name in the environment window and it will appear as a new tab in the left-hand window. Note the variables that this data contains. The object includes useful information such as the business name, rating value, last inspection date and address.

Summary statistics

Let’s do some basic analysis. To remove any records with missing values first run the complete.cases command:

data = data[complete.cases(data),]

Here we pass our data variable into complete.cases, which flags any incomplete cases; the subsetting then removes them and overwrites our original object.

To run some basic statistics we need to convert the RatingValue variable to an integer:

data$RatingValue = strtoi(data$RatingValue, base = 0L)

Note how we use the $ to access the variables of our data object.

To see the minimum and maximum rating values of food outlets in London we can use the minimum and maximum functions:

min(data$RatingValue)
max(data$RatingValue)

These commands simply give us the minimum and maximum values without any additional information. To see the full records for these particular establishments we can take a subset of our data to only include those which have been awarded a zero star rating for example:

star0 = data[which(data$RatingValue==0), ]

Creating a graph

Lastly, let’s create a bar chart to look at the distribution of star ratings for food outlets in London. We will use the ggplot2 library; to install and then load it, call:

install.packages("ggplot2")
library(ggplot2)

To create a simple barchart use the following code:

ggplot(data = data, aes(x = RatingValue)) + geom_bar(stat = "count")

Here you can see we have passed RatingValue as the x-axis variable in the ‘aesthetics’ function and passed in ‘count’ as the statistic. The output should look something like this:

To add x and y labels and a title to your graph use the labs command at the end of the previous line of code:

ggplot(data = data, aes(x = RatingValue)) + geom_bar(stat = "count") + labs(x = "Rating Value", y = 'Number of Food Outlets', title = 'Food Outlet Rating Values in London')



Rachel Oldroyd is one of our UK Data Service Data Impact Fellows. Rachel is a quantitative human geographer based at the Consumer Data Research Centre (CDRC) at the University of Leeds, researching how different types of data (including TripAdvisor reviews and social media) are used to detect illness caused by contaminated food or drink.

Experimenting with AI: My Experience Creating a Closed Domain QA Chatbot

James Brill, graduate developer, and Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service, introduce us to the world of developing an innovative chatbot for answering research data management queries.

A chatbot is a computer program which responds to user input, either textual data (typing) or audio data (speaking). The UK Data Service wanted to utilise this emerging technology to benefit its users by reducing the response time for its research data management (RDM) queries. Louise explains: “The idea was inspired by a presentation on ‘Cognitive systems redefining the library user interface’ presented by Kshama Parikh at the International Conference on Changing Landscapes of Science and Technology Libraries in Gandhinagar, in the state of Gujarat, north-west India, late last year. I envisaged a new online service: Meet Redama, our new chatbot for research data management. You can ask Redama anything about research data management, and she will be available 24/7 to provide an instant response as best she can. And, most conveniently, the night before your research application is due in to the ESRC! Go on, ask her about consent forms, data formats and encrypting data – her breadth of expertise is excellent. However, with only three months to try it out, that was always a big dream!”

We hired James Brill, a recent graduate from the University of Essex, for a summer project to develop a chatbot to try to solve a closed-domain question answering (QA) problem, using the domain of ‘research data management’. He worked closely with his supervisor, Dr Spyros Samothrakis, Research Fellow in the School of Computer Science and Electronic Engineering. James highlights the main steps and some of the challenges that arose.

James explains: “The basic premise of a QA chatbot is to map a question to a relevant answer. This can be done in different ways, from machine learning to more statistical approaches such as cosine similarity (which still has an element of machine learning, to learn the initial vectors). In other words, one can draw similarities between question answering and information retrieval. What follows is my experience of chatbot development.”

Stage 1: Sourcing and cleaning the data

As with many AI problems one first needs to have valid “real life” data which can form a useful baseline.  Here at the UK Data Service we used two sources:

  • queries related to research data management, submitted from users to the Service via our web-based Helpdesk system
  • text from our fairly substantive RDM help pages

Once the first set of data was assembled, the next step was to clean and pre-process it so that it could be fed into a mapping function which tries to predict an answer for a given question. We did this by taking existing emails and putting each individual one in its own text file. This resulted in a collection of 274 plain text files.

The same was done with all the data management web pages – each page is located in its own individual text file. This proved to be a time consuming task. Due to the initial file format of the data, both knowledge bases had to be generated by hand. Textual data often has special escape characters such as newline “\n”, or tab “\t” which need to be stripped from the text, and this was performed when separating the emails and web pages into separate files.

Once we had set up two simple knowledge bases, we then created a data management object. This object loads all the necessary scripts and acts as a simple interface between a chatbot and the data itself. The simple script I wrote is located in Datamanager.py. This way, one can change the existing knowledge base structure underneath (to a database instead of a collection of files, for example) and, because the Datamanager object acts as an application programming interface (API), the chatbot itself will not require any code changes.
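As a rough sketch of the idea (not the actual Datamanager.py code, and with hypothetical directory names), a minimal loader of this kind might look something like this in Python:

from pathlib import Path

class DataManager:
    """Load each knowledge base: one document per plain text file."""

    def __init__(self, email_dir="emails", webpage_dir="webpages"):
        self.emails = self._load(email_dir)
        self.webpages = self._load(webpage_dir)

    @staticmethod
    def _load(directory):
        # Read every .txt file and strip stray escape characters
        docs = []
        for path in sorted(Path(directory).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            docs.append(text.replace("\n", " ").replace("\t", " ").strip())
        return docs

    def documents(self):
        # The chatbot talks only to this interface, not to the files themselves
        return self.emails + self.webpages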

Step 2: Calculating a Baseline

A baseline performance has to be established to see what can be achieved using someone else’s framework, or by adopting a well-used approach. For this project, we decided to use chatterbot, which we adapted to feed in questions and answers from our dataset using the DataManager object. To see the exact script, refer to https://github.com/jamesb1082/ukda_bot/tree/master/bot. However, this particular chatbot did not work well with our dataset due to its relatively small size. The examples given in the chatterbot project use rather large conversational corpora such as the Ubuntu Dialogue Corpus. Having been trained on our corpus, it achieved a precision of 0 almost every time, and so any model going forward would be just as good as this particular framework for our dataset. This chatbot was a good learning experience to start off the project, as it highlighted some key issues around building a chatbot, such as storing and inputting data.
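For readers unfamiliar with chatterbot, training it on question/answer pairs looks roughly like the sketch below. The pair shown is invented for illustration; in the project the pairs were fed in from the DataManager object:

from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer

bot = ChatBot("RDM Bot")  # illustrative name
trainer = ListTrainer(bot)

# Each question is immediately followed by its answer in the training list
trainer.train([
    "How do I anonymise my survey data?",
    "Remove direct identifiers and consider pseudonymising indirect ones.",
])

print(bot.get_response("How should I anonymise my data?"))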

Step 3: Using Machine Learning Techniques/Statistical Techniques

Having now established our baseline, we then approached the problem of designing a mapping function from a different angle: using machine learning and information retrieval techniques to generate relevant answers. First of all, it is important to establish how similarity and relationships between words can be modelled in a computer program. The modern approach is to use vector space models, which map each individual word to a unique point in vector space; in other words, each word is represented as a series of 100 numbers. The position of each word in vector space is relative to all other words in vector space: words that have similar meanings are close to each other, and the vector produced by subtracting one word’s vector from another defines the relationship between the two words. A common example is King − Queen ≈ Man − Woman, where each word stands for its vector. The detail of how these vectors are generated is beyond the scope of this blog; the point is that they are critically important. Enabling words to be treated as numbers means that mathematical calculations can be performed on lexical items.
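As a toy illustration of the cosine-similarity route mentioned earlier, one can represent each question and each stored answer as the average of its word vectors and return the answer closest to the incoming question. The tiny three-dimensional vectors below are made up purely to keep the example short; in practice they would be 100-dimensional pre-trained embeddings:

import re
import numpy as np

# Made-up 3-dimensional word vectors (real ones would be ~100-dimensional)
vectors = {
    "consent": np.array([0.9, 0.1, 0.0]),
    "form":    np.array([0.8, 0.2, 0.1]),
    "encrypt": np.array([0.0, 0.9, 0.3]),
    "data":    np.array([0.1, 0.4, 0.8]),
}

def embed(text):
    # Average the vectors of the words we have embeddings for
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

answers = ["Use our consent form template.", "Encrypt data before transfer."]
question = "Do you have a consent form?"

# Return the stored answer whose vector is most similar to the question's
print(max(answers, key=lambda ans: cosine(embed(question), embed(ans))))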

Step 4: Siamese Neural Network

Since the previous two methods performed unsatisfactorily, we adopted a different approach, which centres on using “neural networks” to learn and generate a mapping function instead. As the dataset we are working with is rather small (only 171 correct QA pairs), we opted to use a Siamese neural network (SNN), a special type of neural network consisting of two identical neural networks which share a set of weights. The question vector is fed into one neural network and the answer vector into the other (see diagram below).

The distance between the outputs of the two neural networks is calculated, the idea being that the difference is 0 when the answer is correct and 1 when it is not. The weights are updated to adjust the network depending on whether the answer was right or wrong and by how much. Essentially, by training the network in this manner, we can calculate the distance between a question and an answer, which in turn acts as a distance function. This stage was the hardest theoretical part of the project. However, the actual coding was relatively straightforward, due to the very simple, modular API provided by Keras.
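A condensed Keras sketch of this kind of architecture is shown below. It is a simplified reconstruction rather than the project’s actual code: the layer sizes are arbitrary, and a plain mean-squared-error loss stands in for the contrastive loss more usually paired with Siamese networks:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

EMBED_DIM = 100  # assumed size of the question and answer vectors

# A single "tower" shared by both inputs, so the two sides use identical weights
shared = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(EMBED_DIM,)),
    layers.Dense(32, activation="relu"),
])

question_in = keras.Input(shape=(EMBED_DIM,))
answer_in = keras.Input(shape=(EMBED_DIM,))

def euclidean_distance(tensors):
    # Distance between the two encoded vectors
    a, b = tensors
    return tf.sqrt(tf.reduce_sum(tf.square(a - b), axis=1, keepdims=True) + 1e-9)

distance = layers.Lambda(euclidean_distance)([shared(question_in), shared(answer_in)])

model = keras.Model(inputs=[question_in, answer_in], outputs=distance)
# Target: 0 for a correct question-answer pair, 1 for an incorrect one
model.compile(optimizer="adam", loss="mse")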

Step 5: Results

After implementing and training the model on our dataset, we performed some testing to see how well it actually performed in different scenarios. The first test used the complete training set, to see how well the model “remembered” questions; it correctly identified 79% of questions. It is important to note that one does not want 100% at this stage, as that is commonly a sign that the model has simply memorised the initial dataset and has not generalised the relationships between questions and answers. When we tested it on unseen questions, our model did not perform particularly well; however, we suspect that this is due to some answers having only one relevant question, meaning that it cannot generalise well.

Step 6: Further Improvements

Neural networks learn through being shown examples, and as a result the performance of a neural network is reliant upon the quality of the dataset it is trained upon. Although we upscaled our very small dataset to contain both correct and incorrect QA pairs, it often featured only one or two correct QA pairs for certain topics. To combat this issue, one could improve the dataset by not only asking more questions but also seeking a more uniform distribution of questions. For example, our distribution (see below) is not even, with some very dominant peaks and a lot of answers which have very few questions pointing at them. We also used Stanford’s SQuAD to directly compare the model provided in this project against other chatbot models. To do this, we modified our data loading functions to read in the training data, and modified our script to output a prediction file.

James told us: “Overall, I think the project went well, with a basic interface chatbot having been created. With the possible expansion of using other datasets and some tuning of parameters, I believe the main method implemented in this project will be reasonably successful.”

James was embedded into the Big Data Team at the UK Data Service for the duration of the chatbot work. Nathan Cunningham, Service Director for Big Data, said: “It’s really important that we start to investigate new methodologies for understanding and structuring our data challenges. Over the next couple of years we aim to try and organise our collection and resources with enough context and information to make sense to anyone reading, querying or using the Service. The chatbot work was an initial pilot study in using machine and deep learning algorithms to construct a knowledge base and answer questions. We now have a better understanding of what we need to do to build a UK Data Service knowledge base and how this will improve our data services.”

James’ supervisor, Spyros, says: “I was delighted to work with the UK Data Service in bringing this exciting project to life – a good question-answering bot can greatly decrease first response times for users’ queries. A number of exciting developments in neural networks were incorporated into the bot – the bot is not just trying to assess the existence or not of certain words, but aims at inferring the exact semantic content of the questions.” Louise says: “James has moved on to the Advanced Computing – Introduction to Machine Learning, Data Mining and High Performance Computing course at the University of Bristol and we wish him well in his career. We are hopeful that we can attract another excellent Master’s student to work on implementing a chatbot for us.”

Resources:

Papers that introduce datasets:

 

Keras Tutorials:

 

What’s in there? From encoding meaning in legacy data access platforms, to frictionless data

The UK Data Service‘s Victoria Moody and Rob Dymond-Green on freeing data from legacy data access platforms.

Legacy data access platforms are beginning to offer us a window into a recent past, providing a sense of the processing and memory limitations developers had to work with – limitations now difficult to imagine.

These platforms offer a sense of how data access models were defined by technology that at the time looked like a limitless opportunity to construct new architectures of data access – designed to be ruthless in their pursuit of a universal standard of full query return, that is, identifying all possible query lines and being engineered to fetch them. The result is an unsustainable technical debt which precludes discovery and analysis by newer workflows.

Although designed for form to follow function, it is elements of the human scale which surface and make these platforms endearing, as well as brilliant, in the way they approached the problems they solved. They proliferated, but how many endure, especially from the late 1990s to early 2000s? The team here is interested in first-generation data access platforms still in use; we’d love you to tell us about any you still use.

In the late 1990s the team making UK census data available to UK academics and researchers designed Casweb, which gives access to census aggregate data from 1971 to 2001. Casweb is nearly 20 years old and survives: it is still used by many thousands doing research with census aggregate data, returning tables which users can select by searching their preferred geography and then topic.

Image: The Casweb data engine returning data in ‘steampunk’ style

InFuse, the next-stage innovation after Casweb, gives access to 2001 and 2011 census aggregate statistics. It is also used by thousands each year and offers the user the chance to select their data through the route of geography or topic.

Image: InFuse

Casweb is now a platform at risk, with the potential that it will eventually not be compatible with browsers in use; its programming language predates modern standards. Now that people can fit a ’90s supercomputer on their mobile ‘phone, the capacity for data access offers a different, more prosaic but more communal route.

Image: Legacy Casweb code – “Please install Netscape 3.0”

The data are now all open, but that doesn’t mean they are discoverable or straightforward to get. We’re taking the opportunity to retrieve the data built into Casweb’s table-based web pages and to free it from the prescribed routes into the data that both Casweb and InFuse offer, turning instead to routes that improve discoverability and usability – cost-effectively, and aligned with the latest standards for frictionless data.

Our plan is to go back to basics and deliver the minimum viable product, as discussed in John Matthews’ post How we learned to stop fearing technical debt and love frameworks, using DKAN.

But we need to do some work. Jamey Hart’s post, Making Metadata for the 1971-2001 Censuses Available, describes a project he is working on in which we are liberating metadata from Casweb and making it more usable for our users. We are going to go a step further and export the data which go with this new metadata, to create datasets to load into our shiny new DKAN instance.

Luckily, the metadata in InFuse is stored within the application in an easier-to-digest format than Casweb’s, so we are also in the process of exporting this, together with the data, into a set of lightweight data packages. Our aim will be to format the data and metadata from Casweb and InFuse to use the same structures, which means that our users won’t have to grapple with learning how the data is structured when moving between censuses.

We’ll start by exporting data from InFuse 2011, using information from our download logs to determine which are our most popular datasets. For our initial data release we’ll be delivering 50 datasets through DKAN, and will continue to release data in tranches of 50 until we’ve finished making all our data available as CSV.
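By way of illustration, a minimal Frictionless Data ‘data package’ descriptor for one of those CSV releases could be written out with plain Python as in the sketch below. The dataset name and fields are hypothetical; the real packages will follow whatever structure we settle on for the Casweb and InFuse exports:

import json

# Hypothetical descriptor for a single census aggregate table released as CSV
descriptor = {
    "name": "census-2011-usual-resident-population",
    "title": "2011 Census: usual resident population (illustrative example)",
    "resources": [{
        "name": "data",
        "path": "data.csv",
        "format": "csv",
        "schema": {
            "fields": [
                {"name": "geography_code", "type": "string"},
                {"name": "total_population", "type": "integer"},
            ]
        },
    }],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)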

But we’re interested in hearing how you’d like us to approach these dataset releases. One option would be for us to release 50 datasets from 2001 which cover the same types of topics as the ones from 2011, then equivalent datasets from 1991 and back to 1971, thus building, as it were, a time series of data. It’s just one idea; you may of course have other ideas as to the priorities we should use when releasing this data, so please get in touch and help shape how this process goes.

We’ll take care of and maintain Casweb and Infuse, but won’t add any data or develop them any further. We’ll also move away from impenetrable acronyms to name our data access platform (although ‘Cenzilla’ was mooted at one point…)

Ultimately, we aim to free the data for better discovery and, in the spirit of frictionless data, make it easy to use the data for sharing and utilisation whether that’s in CSV, Excel, R, or Hadoop.

Please let Rob know what data combinations you’re using from the 1971 census onwards, to help us structure our retrieval schedule.

Engaging more people with more data: Developing the data quiz app

Ralph Cochrane and the team at AppChallenge have been working with the UK Data Service to support new ways of engaging more people with data, mainly through mobile app development, hackathons and coding competitions. Ralph  introduces the UK Data Service’s new quiz app:

Our latest initiative is the UK Data Service Quiz App (available for free on both Android and Apple devices) which builds on ideas that we crowd-sourced from developers around the world and discussions with the team at the Service. We wanted to find a way to bring some of the open data sets to life and appeal to a wider audience, not just within the data science community.

Quiz Master – the content management system

The app is driven by a content management system (built using PHP and mySQL) which is hosted on the UK Data Service infrastructure. Similar to WordPress, it’s a web based system that allows members of staff to add a new quiz, create questions and even run their own internal games to test how well the questions will be received.

Mobile App development

Developing the apps took a little more time. We focused first on the iOS app for Apple devices, which has been developed in a language called Swift provided by Apple. We’ve been impressed with the rigour with which Apple tests apps before they are published on the app store, even if it’s been painful for us at times when they’ve found little things that they consider to be a problem. If you have an iPhone why not try the UK Data Service Quiz App and tell us what you think on Twitter @AppChallenge?

Moving on to Android, we have just issued an update to the app which brings it closer to the iOS app, in particular fixing an issue with offline use when there is little or no mobile reception. There are many more Android phones out there, but that is also a double edged sword. Designing an app that will work well on a wide range of devices with different capabilities is not easy. There’s also the fact that many Android users don’t update their ‘phone regularly, so even the underlying operating system can have many variants.

Lessons learned

Creating something intuitive and fun in an app that visualises data in an easy-to-engage-with way is not easy. From our first “finished” version of the iOS app to the version now in the App Store, we’ve issued eleven minor updates. Most fixes are to the user interface or deal with highly unlikely scenarios, e.g. you start the install and then lose Internet access, but we’ve learnt a lot along the way.

Engagement is harder than it sounds too. We’ve added functionality, such as being able to send all iOS users notifications e.g. “It’s the FA Cup Final today. Have you tried our football quiz?” because we noticed that new users would install the quiz app and then only use it once. There is real skill to both creating questions that are interesting and promoting the app online to generate more downloads. Perhaps the biggest lesson is that the app is merely one tool in the toolbox to reach new potential users of UK Data Service data. It works best when it is integrated with topical events and other outreach programmes e.g. via Twitter, Facebook and the blog.

Football data quiz

We’re continuing to promote the app online directly within app stores in the UK and to date we’ve had about 1000 downloads. If you know of anyone who likes general knowledge quizzes, why not ask them to have a go and give us some feedback? After all, it is free.

Mining data from Impact Case Studies in HEFCE’s REF2014 database

We wanted to mine data from HEFCE’s API of Impact Case Studies submitted to REF2014 to find out how many UK Data Service data collections were used in REF2014 Impact Case Studies – and which case studies they were used in. Basic analysis of the database showed high usage of data – but very low citation using persistent identifiers (data DOIs).

John Matthews, software engineer in the census support team, talks us through the process.