Exploring Kepler.gl

Image: The home page of the Kepler.gl site

Kepler.gl is an open source mapping tool that claims to handle large-scale datasets.

It was developed by Uber as an in-house solution, built on open source components, which they use to analyse their own data. Luckily for us, they decided to make it open source and available to everyone.

Kepler.gl works within your browser, which is a nice feature as it means you retain control of your data, which could be important if you want to map anything that might contain sensitive information.

To try the system out I downloaded our 2011 Census Headcounts, in particular the zip file called UK postcode data and supporting metadata for 2011 frozen postcodes.

I unzipped this, ready for me to load into Kepler.gl. I chose this dataset as I know it contains latitude and longitude information, as well as population and deprivation data.

Uploading data was pretty straightforward. There’s an option to browse for your data file or to drag and drop the file into the browser.

A slightly annoying bit for me was that the map opens focused on San Francisco, when I knew the data I had added was for the UK. But it was easy to refocus the map on the UK using the standard grab-and-pull functionality.

To map the data, I needed to add a layer and choose the type of data.

For this data I knew it was point data. I also entered a name for the layer, calling it 2011 Census Postcodes. It’s possible with Kepler.gl to add more than one layer, so giving each new layer a meaningful name is useful.

It next asked for the fields that contain the Lat (latitude) and Lng (longitude).

In our data I discovered that we had mislabelled them, so the field names were the opposite of what they should be (I’ll get this corrected).
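If you hit the same problem, one workaround is to rename the columns before uploading the file. Below is a minimal sketch using pandas; the file name and column names are hypothetical, so adjust them to match the actual census download.

```python
# Minimal sketch (pandas assumed installed): swap mislabelled latitude/longitude
# columns before dragging the CSV into Kepler.gl. The file and column names
# ("postcode_headcounts.csv", "latitude", "longitude") are hypothetical.
import pandas as pd

df = pd.read_csv("postcode_headcounts.csv")

# The values are the wrong way round, so rename each column to the other.
df = df.rename(columns={"latitude": "longitude", "longitude": "latitude"})

# UK latitudes should sit roughly between 49 and 61 degrees north; a quick
# sanity check confirms the swap worked before re-uploading.
assert df["latitude"].between(49, 61).all()

df.to_csv("postcode_headcounts_fixed.csv", index=False)
```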


You’ll notice that there is the option to add a field to represent the Altitude. For this initial visualisation, I left that blank.

This now created a map showing UK postcodes, but (to be honest) it was a bit boring.

Kepler.gl has the option to colour the postcode points based on the value of a field.

This dataset includes the UK Townsend Deprivation scores, as quintiles calculated at output area level, so I used this field to colour-code the points. I also sized the points based on the number of people living in each postcode.

The finished map of the UK shows a very mixed view, but if you zoom into a town or city you can then see the differences between postcodes.


For example, here’s a map of Belfast showing differences in deprivation between postcodes. Dark red is least deprived and yellow is most deprived.

Overall I found this web app easy to use, but it may present some issues for people unfamiliar with mapping.

However, as a free tool that maps data without sending it back to a server, it offers a way to map more personal data without the worry of having that data hosted somewhere you don’t know.


Rob Dymond-Green is a Senior Technical Co-ordinator for the UK Data Service, working with aggregate census and international data. 

Shaping the 2021 UK Census

The Office for National Statistics (ONS) has recently been working on question design and gauging the effect of asking questions on topics such as gender identity, sexual orientation and the armed forces community.

These questions are new, added following feedback from an earlier topic consultation or to address a policy need. The new armed forces question relates to the Armed Forces Covenant, a promise from the nation that those who serve or have served in the armed forces, and their families, are treated fairly. The ONS has looked at using data from the Ministry of Defence to identify people who have served in the armed forces, but as the Veterans Leavers Database only goes back to 1975, a new question is needed to identify the number of veterans who served before then.

National Records of Scotland (NRS) has also been looking at the topics for the 2021 Census and has published a report on its progress so far. NRS is running some events to talk about how the data will be disseminated; these are taking place in February and early March 2018.

Northern Ireland Statistics and Research Agency (NISRA) has also been looking at the topics for the 2021 Census and has produced a report following its consultation activity.

A change for the 2021 census will be the push to get more respondents to fill in their census forms online.

Each of the agencies is also looking at using existing administrative data collected by government or other national bodies, which could provide accurate data and remove the need to ask that question on the form.

It’s hoped that the use of administrative data could deliver a more timely census. There is also the opportunity to increase the frequency of subsequent updates of the data, possibly even annually for some topics.

The agencies are also investigating systems that would let you define your own census datasets, aggregating the data from the unit records (individual records about people) and applying anonymisation as the data is generated.

If you are interested in finding out more about the agencies’ plans for the 2021 census, please follow these links to their websites: the ONS Census Transformation Programme, NISRA’s 2021 Census page and NRS’s Scotland’s 2021 Census site.

 

Calculating Townsend Scores: Resources to learn R

Sanah Yousaf, one of our interns, talks about how she approached the task of learning R.

The current project I am working on as an intern at UK Data Service Census Support is creating Townsend deprivation scores with UK 2011 Census data. To allow UK Data Service Census Support users to produce their own Townsend deprivation scores, I used R to create a script that calculates them.

Initial experience with R

My first experience with R came about during my degree in a module concerning data analysis. R is a language and environment for statistical computing and graphics. The software is free and allows for data manipulation, calculation and graphical display.

There are many packages available to use in R and I only had experience with using “R Commander” and “Deducer”. Both R Commander and Deducer are data analysis graphical user interfaces. Knowledge of coding in R is not particularly necessary when using R Commander and Deducer as menu options are available for easy navigation to get to what you need, whether that’s obtaining summary statistics of your data or creating graphs.

Perhaps it is fair to say that the R skill set I gained from my module as part of my degree was limited for the task at hand. Having said that, I was both eager and intrigued to learn more about R’s capabilities, but I would have to do this quickly to enable me to produce an R script to create Townsend scores.

Useful resources to learn R

After a few Google searches, I found many resources that taught the basics of R. Some were great, others not so much. I particularly found R tutorials on YouTube useful, as opposed to some websites that provided code for certain functions in R but lacked explanations. In addition, I often found myself on Stack Overflow, which is an online community where developers learn and share their programming knowledge.

R Tutorials on YouTube

If you are new to R, I would recommend the “MarinStatsLectures” channel on YouTube. The channel has tutorials ranging from how to import data of different formats into R to working with data in R. There are over 50 tutorials on the channel, none longer than 10 minutes. The tutorials provided me with knowledge of different R commands and explained basic R concepts well.

R packages

The R package “Swirl” allows R users to interactively learn through the R console. This was useful because I could learn different R commands whilst practicing within R.

Google search

A simple Google search of “how to… in R?” will usually provide you with the answer you are looking for! You will most probably bump into other R users who have asked the same question on Stack Overflow.

Ask R for help in the R Console

The help() function or ? operator typed into the R console will bring up R documentation in the help window in RStudio. For example, typing ?matrix in the console loads the documentation for the matrix function.

References

More about R: https://www.r-project.org/about.html

Downloading R: https://cran.r-project.org/bin/windows/base/

Downloading R Studio: https://www.rstudio.com/products/rstudio/download/

MarinStatsLectures Channel on YouTube: https://www.youtube.com/user/marinstatlectures

More about Swirl in R: http://swirlstats.com/

Stack Overflow: https://stackoverflow.com/questions/tagged/r


Calculating Townsend scores: Replicating published results

Amy Bonsall, one of our interns, talks about how she approached the task of working out how to calculate Townsend scores and then finding others’ work to compare against as a way to quality assure the methodology.

As part of the internship project to calculate deprivation scores, after finding sources that outline how to calculate Townsend Deprivation Scores, it was important to ensure the methodology would produce scores that matched those already published.

We wanted to calculate scores and compare them to those that had already been calculated by using the same dataset to be sure we were using the same methodology. Whilst I was focused on this, Sanah Yousaf, my partner in this internship, was creating an R script to calculate the scores. Whilst this was being developed I used Excel to calculate the scores. This was not only because we did not yet have an R script but also because I was already comfortable with Excel and it made it easy to visualise the results of each step in the calculation.

Replicating scores proved more difficult than anticipated. Not only were published scores in limited supply, but we also found that many of the people who had already calculated scores had access to unadjusted census data, meaning we had different outcomes. The main problem was that there was no way of knowing whether the different data was the only reason for the contrasting scores, or whether it could also have been down to a different formula.

I went through what felt like an endless number of attempts to replicate other people’s scores. Each time I would attempt to follow the often-limited detail of the methodology, and each time I failed I’d attempt a slight variation in the calculation to see if this would work, with no success. Eventually, I found a source of results calculated for 1991 by Paul Norman. Included with the results was the data used to calculate the scores, as well as the Z scores for each of the indicators. The materials provided with these scores were very useful, as I could check that my scores matched when based on exactly the same data. It also meant that I could check the Z scores were right before checking that the Townsend Deprivation Scores were correct. Success with this dataset meant I could go on to calculate deprivation scores for 2011, knowing that the calculation would be correct.

The next step meant creating scores based on datasets for different output geographies, which was much easier than the previous task. After my partner in the internship, Sanah, had created an R script allowing us to calculate the scores, getting results didn’t take long. From here it will be interesting to see what other obstacles we come across, including mapping the results and comparing them to past censuses. Considering the process so far, however, I look forward to confronting them head on.

 

Calculating Townsend scores: An introduction

Amy Bonsall, one of our interns, talks about what deprivation is and how it can be calculated.

As a student at the University of Manchester studying criminology, I was lucky enough to get the opportunity to work on a project with the UK Data Service as an intern, calculating Townsend Deprivation Scores for the UK and, importantly, learning work environment skills that will be useful once I graduate. My fellow intern (Sanah) and I came with a thirst to learn and an ambition to make the project a success, which has made the exciting aspects more rewarding and the obstacles we need to overcome much more bearable.

Deprivation is a lack of reasonable provisions, whether social or material. Because deprivation is a construct with many indicators and cannot be measured in one objective way, many deprivation indices have been developed. Each of these indices has its benefits for measuring deprivation, as well as areas where it is lacking.

Different methods of calculation have been developed due to a long-term need to research deprivation through census data and the ever-changing indicators of deprivation. I am currently using the 2011 Census to calculate deprivation scores for the UK using the Townsend index. This is just one of many ways deprivation can be calculated; however, we decided it is appropriate here because it measures material deprivation exclusively, rather than incorporating social deprivation, meaning it can be calculated consistently over time. It is also comparable across the UK.

Before jumping into the data and calculating the deprivation scores, it was important first to understand what Townsend’s index measures and how to measure it. Information on the index was readily available and easy to find, giving the initial feeling that the resources required at each stage of the project would be easily found (they weren’t).

Research taught us that Townsend Deprivation Scores are calculated from four indicators of deprivation: non-home ownership, non-car ownership, unemployment and overcrowding.

The score is calculated by first finding the percentage of non-car ownership, non-home ownership, unemployment and overcrowding in each area.

The unemployment and overcrowding percentages then need to be transformed, because their distributions are very skewed; this is done by taking ln(percentage value + 1).

Z scores are then calculated from the percentage values for each ward under each indicator, with the logged versions used for the unemployment and overcrowding variables:

Z score = (percentage – mean of all percentages) / SD of all percentages
Z score of logged variable = (log percentage – mean of log percentages) / SD of log percentages

Townsend Deprivation Score = total of the four Z scores
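As a minimal sketch of how these steps fit together (the interns’ actual deliverable was an R script, and the column names used here are hypothetical), the calculation might look like this:

```python
# Illustrative sketch of the Townsend calculation described above. The input
# DataFrame is assumed to hold the four indicator percentages per area, under
# hypothetical column names.
import numpy as np
import pandas as pd

def townsend_scores(df: pd.DataFrame) -> pd.Series:
    # Log-transform the two skewed indicators: ln(percentage + 1).
    unemp = np.log(df["pct_unemployment"] + 1)
    overcrowd = np.log(df["pct_overcrowding"] + 1)
    no_car = df["pct_no_car"]
    no_home = df["pct_not_owner_occupied"]

    # Z score = (value - mean of all values) / SD of all values,
    # using the logged values for unemployment and overcrowding.
    def z(series: pd.Series) -> pd.Series:
        return (series - series.mean()) / series.std()

    # Townsend Deprivation Score = sum of the four Z scores.
    return z(unemp) + z(overcrowd) + z(no_car) + z(no_home)
```

The result is one score per area, with higher (positive) values indicating more deprived areas and lower (negative) values less deprived ones.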

 

From the sources we found, it wasn’t perfectly clear how to calculate Z scores from the logged variables: there was no clarification about whether to take the mean and standard deviation of the percentages before or after they are logged. Taking information from different sources gave a good idea of the correct formula; however, the important next step is to test this formula against existing scores to ensure it is correct before continuing with the project.

Creating Consistent Deprivation Measures Across the UK

We’ve been lucky enough to have two interns come and work with us over the summer. They have been working on creating a set of Townsend Deprivation scores, using the UK 2011 Census data we have available via InFuse.

The interns came to us through the University of Manchester Q-Step Centre, which coordinates with different types of workplaces to offer 2nd year students the chance to practice the data skills taught through their degree courses at the university.

Sanah Yousaf is studying Law with Criminology.

I am currently a student at the University of Manchester studying Law with Criminology. As part of my degree, I chose a module called Data Analysis for Criminologists which exposed me to the world of data. I enjoyed the course so much that I decided to apply to work as an intern at UK Data Service via the Q-Step internship programme offered at the University of Manchester. As a result, I am now an intern at UK Data Service, specifically in the Census Support team based in Manchester. The project I am working on with my fellow intern (Amy) is calculating Townsend deprivation scores for the UK 2011 Census data.

 

Amy Bonsall is studying Criminology.

As a student at the University of Manchester studying criminology, I was lucky enough to get the opportunity to work on a project with the UK Data Service as an intern, calculating Townsend Deprivation Scores for the UK and, importantly, learning work environment skills that will be useful once I graduate. My fellow intern (Sanah) and I came with a thirst to learn and an ambition to make the project a success, which has made the exciting aspects more rewarding and the obstacles we need to overcome much more bearable.

Amy and Sanah have agreed to write blogs about the project, which we’ll publish over the coming weeks, together with the resources they created, including the raw data and the scores.

Experimenting with AI: My Experience Creating a Closed Domain QA Chatbot

James Brill, graduate developer and Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service introduce us to the world of developing an innovative chatbot for answering research data management queries.

A chatbot is a computer program which responds to user input either by textual data (typing) or audio data (speaking). The UK Data Service wanted to utilise this emerging technology to benefit its users by reducing the response time for its research data management (RDM) queries. Louise explains: “The idea was inspired by a presentation on ‘Cognitive systems redefining the library user interface’ presented by Kshama Parikh at the International Conference on Changing Landscapes of Science and Technology Libraries in Gandhinagar, in the state of Gujarat, north-west India, late last year. I envisaged a new online service: Meet Redama, our new chatbot for research data management. You can ask Redama anything about research data management, and she will be available 24/7 to provide an instant response as best she can. And, most conveniently, the night before your research application is due in to the ESRC! Go on, ask her about consent forms, data formats and encrypting data – her breadth of expertise is excellent. However, with only three months to try it out, that was always a big dream!”

We hired James Brill, a recent graduate from the University of Essex, for a summer project to develop a chatbot to try to solve a closed domain question answering (QA) problem, using the domain of ‘research data management’. He worked closely with his supervisor, Dr Spyros Samothrakis, Research Fellow in the School of Computer Science and Electronic Engineering. James highlights the main steps and some of the challenges that arose.

James explains: “The basic premise of a QA chatbot is to map a question to a relevant answer. This can be done in different ways, from machine learning to more statistical approaches such as cosine similarity (which still have an element of machine learning, to learn the initial vectors). In other words, one can draw parallels between question answering and information retrieval. What follows is my experience of chatbot development.”

Stage 1: Sourcing and cleaning the data

As with many AI problems, one first needs valid “real life” data which can form a useful baseline. Here at the UK Data Service we used two sources:

  • queries related to research data management, submitted from users to the Service via our web-based Helpdesk system
  • text from our fairly substantive RDM help pages

Once the first set of data was assembled, the next step was to clean and pre-process it so that it could be fed into a mapping function which tries to predict an answer for a given question. We did this by taking existing emails and putting each individual one in its own text file. This resulted in a collection of 274 plain text files.

The same was done with all the data management web pages – each page is located in its own individual text file. This proved to be a time consuming task. Due to the initial file format of the data, both knowledge bases had to be generated by hand. Textual data often has special escape characters such as newline “\n”, or tab “\t” which need to be stripped from the text, and this was performed when separating the emails and web pages into separate files.

Once we had set up two simple knowledge bases, we then created a data management object.  This object loads all the necessary scripts and acts as a simple interface between a chatbot and the data itself. The simple script I wrote is located in Datamanager.py. This way, one can change the existing knowledge base structure underneath (to a database instead of a collection of files for example) and because the Datamanager object acts as an application programming interface (API), the chatbot itself will not require any code changes.
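The actual Datamanager.py isn’t reproduced here, but a rough sketch of the kind of interface it provides (the folder names below are hypothetical) might look like this:

```python
# A rough sketch of a DataManager-style interface; this is not the project's
# actual Datamanager.py, and the folder names ("emails", "webpages") are
# hypothetical.
from pathlib import Path

class DataManager:
    """Loads the two plain-text knowledge bases and hides their layout
    from the chatbot, so the storage can change without code changes."""

    def __init__(self, email_dir: str = "emails", webpage_dir: str = "webpages"):
        self.email_dir = Path(email_dir)
        self.webpage_dir = Path(webpage_dir)

    def _load(self, folder: Path) -> list[str]:
        docs = []
        for path in sorted(folder.glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            # Strip escape characters left over from the original export.
            docs.append(text.replace("\n", " ").replace("\t", " ").strip())
        return docs

    def emails(self) -> list[str]:
        return self._load(self.email_dir)

    def webpages(self) -> list[str]:
        return self._load(self.webpage_dir)
```

Because the chatbot only ever talks to methods like emails() and webpages(), the files could later be replaced by a database without touching the chatbot code.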

Step 2: Calculating a Baseline

A baseline performance has to be established to see what can be achieved using someone else’s framework, or by adopting a well-used approach. For this project, we decided to use ChatterBot, which we adapted to feed in questions and answers from our dataset using the DataManager object. To see the exact script, refer to https://github.com/jamesb1082/ukda_bot/tree/master/bot. However, this particular chatbot did not work well with our dataset because of the dataset’s relatively small size: the examples given in the ChatterBot project use rather large conversational corpora such as the Ubuntu Dialogue corpus. Having been trained on our corpus, it achieved a precision of 0 almost every time, so any model going forward would be at least as good as this framework on our dataset. This chatbot was a good learning experience to start off the project, as it highlighted some key issues around building a chatbot, such as storing and inputting data.
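For reference, a ChatterBot baseline of this kind is wired up roughly as follows; the exact API differs between ChatterBot versions (this follows the 1.x style), and the single question/answer pair shown is a hypothetical stand-in for the pairs supplied by the DataManager.

```python
# Sketch of a ChatterBot baseline; not the project's exact script.
from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer

bot = ChatBot("RDM Helpdesk Baseline")
trainer = ListTrainer(bot)

# Hypothetical QA pair standing in for data loaded via the DataManager.
qa_pairs = [
    ("How do I anonymise interview transcripts?",
     "Remove direct identifiers and consider pseudonymising place names."),
]

# ChatterBot treats each trained list as a conversation, so each
# question/answer pair becomes a two-turn exchange.
for question, answer in qa_pairs:
    trainer.train([question, answer])

print(bot.get_response("How do I anonymise interview transcripts?"))
```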

Step 3: Using Machine Learning Techniques/Statistical Techniques

Having now established our baseline, we then approached the problem of designing a mapping function from a different angle: using machine learning and information retrieval techniques to generate relevant answers. First of all, it is important to establish how similarity and relationships between words can be modelled in a computer program. The modern approach is to use vector space models, which map each individual word to a unique point in vector space; in other words, each word is represented as a series of 100 numbers. The position of each word in vector space is relative to all other words: words with similar meanings are close to each other, and the vector produced by subtracting one word’s vector from another defines the relationship between the two words. A common example is King – Queen ≈ Man – Woman, where each word stands for its vector. The detail of how these vectors are generated is beyond the scope of this blog; suffice it to say that they are critically important. Enabling words to be treated as numbers means that mathematical calculations can be performed on lexical items.
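As a toy illustration of the idea, using made-up four-dimensional vectors rather than real 100-dimensional embeddings:

```python
# Similar words score close to 1 under cosine similarity, and relationships
# fall out of vector arithmetic. The vectors here are invented for illustration.
import numpy as np

vectors = {
    "king":  np.array([0.8, 0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.7, 0.9, 0.1]),
    "man":   np.array([0.2, 0.1, 0.1, 0.9]),
    "woman": np.array([0.2, 0.1, 0.9, 0.1]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar meanings sit close together in vector space.
print(cosine(vectors["king"], vectors["queen"]))

# The relationship king - queen points in (roughly) the same direction
# as man - woman.
print(cosine(vectors["king"] - vectors["queen"],
             vectors["man"] - vectors["woman"]))
```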

Step 4: Siamese Neural Network

Since the previous two methods performed unsatisfactorily, we adopted a different approach, which centres on using neural networks to learn and generate a mapping function instead. As the dataset we are working with is rather small (only 171 correct QA pairs), we opted to use a Siamese Neural Network (SNN): a special type of neural network consisting of two identical neural networks which share a set of weights. The question vector is fed into one network and the answer vector into the other (see diagram below).

The distance between the outputs of the two networks is then calculated, the idea being that the distance is 0 when the answer is correct and 1 when it is not. The weights are updated to adjust the network depending on whether the answer was right or wrong and by how much. Essentially, by training the network in this manner, we can calculate the distance between a question and an answer, which in turn acts as a distance function. This was the hardest theoretical part of the project; however, the actual coding was relatively straightforward, thanks to the very simple, modular API provided by Keras.
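The post doesn’t include the network code itself, but a minimal Keras sketch of the idea might look like the following; the layer sizes, the 100-dimensional inputs and the contrastive loss settings are assumptions rather than the project’s exact choices.

```python
# Hedged sketch of a Siamese network in Keras, along the lines described above.
import tensorflow as tf
from tensorflow.keras import Input, Model, layers

EMBED_DIM = 100  # each question/answer is assumed to arrive as a 100-d vector

# One shared "tower": both inputs pass through the same weights.
shared = tf.keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
])

q_in = Input(shape=(EMBED_DIM,), name="question")
a_in = Input(shape=(EMBED_DIM,), name="answer")
q_enc, a_enc = shared(q_in), shared(a_in)

# Euclidean distance between the two encodings: ideally 0 for a correct
# question/answer pair and larger for an incorrect one.
distance = layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True) + 1e-9)
)([q_enc, a_enc])

model = Model(inputs=[q_in, a_in], outputs=distance)

def contrastive_loss(y_true, y_pred, margin=1.0):
    # y_true is 0 for matching pairs and 1 for non-matching pairs.
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean((1 - y_true) * tf.square(y_pred) +
                          y_true * tf.square(tf.maximum(margin - y_pred, 0.0)))

model.compile(optimizer="adam", loss=contrastive_loss)
# model.fit([question_vectors, answer_vectors], labels, epochs=..., batch_size=...)
```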

Step 5: Results

After implementing and training the model on our dataset, we performed some testing to see how well it actually performed in different scenarios. The first test used the complete training set, to see how well the model “remembered” questions; it correctly identified 79% of them. It is important to note that one does not want 100% at this stage, as that is a common sign that the model has simply memorised the initial dataset and has not generalised the relationships between questions and answers. When we tested it on unseen questions, our model did not perform particularly well; however, we suspect that this is because some answers have only one relevant question, meaning the model cannot generalise well.

Step 6: Further Improvements

Neural networks learn by being shown examples, and as a result the performance of a neural network relies on the quality of the dataset it is trained on. Although we were working with a very small dataset, and we did upscale it to contain both correct and incorrect QA pairs, it often featured only one or two correct QA pairs for certain topics. To combat this issue, one could improve the dataset by not only asking more questions but also seeking a more uniform distribution of questions. For example, our distribution (see below) is not even, with some very dominant peaks and a lot of answers which have very few questions pointing at them. We also used Stanford’s SQuAD to compare the model produced in this project directly against other chatbot models. To do this, we modified our data loading functions to read in the SQuAD training data and modified our script to output a prediction file.

James told us: “Overall, I think the project went well, with a basic interface chatbot having been created. With the possible expansion of using other datasets and some tuning of parameters, I believe the main method implemented in this project will be reasonably successful.”

James was embedded into the Big Data Team at the UK Data Service for the duration of the chatbot work. Nathan Cunningham, Service Director for Big Data, said: “It’s really important that we start to investigate new methodologies for understanding and structuring our data challenges. Over the next couple of years we aim to try and organise our collection and resources with enough context and information to make sense to anyone reading, querying or using the Service. The chatbot work was an initial pilot study in using machine and deep learning algorithms to construct a knowledge base and answer questions. We now have a better understanding of what we need to do to build a UK Data Service knowledge base and how this will improve our data services.”

James’ supervisor, Spyros, says: “I was delighted to work with the UK Data Service in bringing this exciting project to life – a good question-answering bot can greatly decrease first response times for user queries. A number of exciting developments in neural networks were incorporated into the bot – the bot is not just trying to assess the presence or absence of certain words, but aims at inferring the exact semantic content of the questions.” Louise says: “James has moved on to the Advanced Computing – Introduction to Machine Learning, Data Mining and High Performance Computing course at the University of Bristol and we wish him well in his career. We are hopeful that we can attract another excellent Master’s student to work on implementing a chatbot for us.”

Resources:

Papers that introduce datasets:

 

Keras Tutorials:

 

What’s in there? From encoding meaning in legacy data access platforms, to frictionless data

The UK Data Service‘s Victoria Moody and Rob Dymond-Green on freeing data from legacy data access platforms.

Legacy data access platforms are beginning to offer us a window into a recent past, one which provides a sense of the processing and memory limitations developers had to work with, limitations that are now difficult to imagine.

These platforms offer a sense of how data access models were defined by technology that, at the time, looked like a limitless opportunity to construct new architectures of data access – designed to be ruthless in their pursuit of a universal standard of full query return, that is, identifying all possible query lines and being engineered to fetch them. The result is an unsustainable technical debt which precludes discovery and analysis by newer workflows.

Although designed for form to follow function, it is the elements of human scale which surface and make these platforms endearing, as well as brilliant in the way they approached the problems they solved. They proliferated, but how many endure, especially from the late 1990s and early 2000s? The team here is interested in first-generation data access platforms still in use; we’d love you to tell us about any you still use.

In the late 1990s the team making UK census data available to UK academics and researchers designed Casweb, which gives access to census aggregate data from 1971 to 2001. Casweb is nearly 20 years old and survives – it is still used by many thousands doing research with census aggregate data, returning tables which users can select by searching their preferred geography and then topic.

Image: The Casweb data engine returning data in ‘steampunk’ style

InFuse – the next-stage innovation after Casweb – gives access to 2001 and 2011 census aggregate statistics. It is also used by thousands each year and offers the user the chance to select their data through the route of geography or topic.

Image: InFuse

Casweb is now a platform at risk, with the potential that it will eventually not be compatible with browsers in use; its programming language predates modern standards. Now that people can fit a ’90s supercomputer on their mobile ‘phone, the capacity for data access offers a different, more prosaic but more communal route.

Image: Legacy Casweb code – “Please install Netscape 3.0”

The data are now all open, but that doesn’t mean they are discoverable or straightforward to get. We’re taking the opportunity to retrieve the data built into Casweb’s table-based web pages, freeing it from the prescribed routes into the data that both Casweb and InFuse offer, and turning instead to routes that improve discoverability and usability – cost-effectively, and aligned with the latest standards for frictionless data.

Our plan is to go back to basics and deliver the minimum viable product, as discussed in John Matthews’ post How we learned to stop fearing technical debt and love frameworks, using DKAN.

But we need to do some work. Jamey Hart’s post, Making Metadata for the 1971-2001 Censuses Available, describes a project he is working on where we are liberating metadata from Casweb and making it more usable for our users. We are going to go a step further and export the data which go with this new metadata, to create datasets to load into our shiny new DKAN instance.

Luckily, the metadata in InFuse is stored within the application in an easier-to-digest format than Casweb’s, so we are also in the process of exporting it, together with the data, into a set of lightweight data packages. Our aim is to format the data and metadata from Casweb and InFuse to use the same structures, which means that our users won’t have to grapple with learning how the data is structured when moving between censuses.

We’ll start by exporting data from InFuse 2011, using information from our download logs to determine which are our most popular datasets. For our initial data release we’ll be delivering 50 datasets through DKAN, and we will continue to release data in tranches of 50 until we’ve finished making all our data available as CSV.
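As an illustration of the kind of lightweight package we have in mind (the dataset name, file path and fields below are hypothetical), a Frictionless Data descriptor can be produced with nothing more than the Python standard library:

```python
# Minimal sketch of a data package: a CSV plus a datapackage.json descriptor
# following the Frictionless Data "Data Package" spec. All names are hypothetical.
import json

descriptor = {
    "name": "census-2011-example-dataset",
    "title": "2011 Census example counts",
    "resources": [
        {
            "name": "counts",
            "path": "data/counts.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "geography_code", "type": "string"},
                    {"name": "persons", "type": "integer"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w", encoding="utf-8") as f:
    json.dump(descriptor, f, indent=2)
```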

But we’re interested in hearing how you’d like us to approach these dataset releases. One option could be for us to release 50 datasets from 2001 which cover the same type of topics as the ones in 2011, then equivalent datasets from 1991 and back to 1971, thus building a time series of data, as it were. It’s just one idea; you may of course have other ideas about the priorities we should use when releasing this data, so please get in touch and help shape how this process goes.

We’ll take care of and maintain Casweb and InFuse, but won’t add any data or develop them any further. We’ll also move away from impenetrable acronyms when naming our data access platforms (although ‘Cenzilla’ was mooted at one point…).

Ultimately, we aim to free the data for better discovery and, in the spirit of frictionless data, make it easy to share and use, whether that’s in CSV, Excel, R or Hadoop.

Please let Rob know what data combinations you’re using from censuses from 1971 onwards, to help us structure our retrieval schedule.

Engaging more people with more data: Developing the data quiz app

Ralph Cochrane and the team at AppChallenge have been working with the UK Data Service to support new ways of engaging more people with data, mainly through mobile app development, hackathons and coding competitions. Ralph introduces the UK Data Service’s new quiz app:

Our latest initiative is the UK Data Service Quiz App (available for free on both Android and Apple devices) which builds on ideas that we crowd-sourced from developers around the world and discussions with the team at the Service. We wanted to find a way to bring some of the open data sets to life and appeal to a wider audience, not just within the data science community.

Quiz Master – the content management system

The app is driven by a content management system (built using PHP and MySQL) which is hosted on the UK Data Service infrastructure. Similar to WordPress, it’s a web-based system that allows members of staff to add a new quiz, create questions and even run their own internal games to test how well the questions will be received.

Mobile App development

Developing the apps took a little more time. We focused first on the iOS app for Apple devices, which has been developed in Swift, a language provided by Apple. We’ve been impressed with the rigour with which Apple tests apps before they are published on the App Store, even if it’s been painful for us at times when they’ve found little things that they consider to be a problem. If you have an iPhone, why not try the UK Data Service Quiz App and tell us what you think on Twitter @AppChallenge?

Moving on to Android, we have just issued an update to the app which brings it closer to the iOS app, in particular fixing an issue with offline use when there is little or no mobile reception. There are many more Android phones out there, but that is also a double-edged sword: designing an app that will work well on a wide range of devices with different capabilities is not easy. There’s also the fact that many Android users don’t update their ‘phone regularly, so even the underlying operating system can have many variants.

Lessons learned

Creating something intuitive and fun in an app that visualises data in an easy-to-engage-with way is not easy. From our first “finished” version of the iOS app to the version now in the App Store, we’ve issued eleven minor updates. Most fixes are to the user interface or to deal with highly unlikely scenarios, e.g. you start the install and then lose internet access, but we’ve learnt a lot along the way.

Engagement is harder than it sounds, too. We’ve added functionality such as being able to send all iOS users notifications, e.g. “It’s the FA Cup Final today. Have you tried our football quiz?”, because we noticed that new users would install the quiz app and then only use it once. There is real skill in both creating questions that are interesting and promoting the app online to generate more downloads. Perhaps the biggest lesson is that the app is merely one tool in the toolbox to reach new potential users of UK Data Service data. It works best when it is integrated with topical events and other outreach programmes, e.g. via Twitter, Facebook and the blog.

Football data quiz

We’re continuing to promote the app online directly within app stores in the UK and to date we’ve had about 1000 downloads. If you know of anyone who likes general knowledge quizzes, why not ask them to have a go and give us some feedback? After all, it is free.

Mining data from Impact Case Studies in HEFCE’s REF2014 database

We wanted to mine some data from HEFCE’s API of Impact Case Studies submitted to REF2014, to find out how many UK Data Service data collections were used in REF2014 Impact Case Studies – and which case studies they were used in. Basic analysis of the database showed high usage of data, but very low citation using persistent identifiers (data DOIs).

John Matthews, software engineer in the census support team, talks us through the process.