Hacking data on nutrition and greenhouse gas emissions

Last week, I was lucky enough to be able to join a group of data geeks (I say that affectionately!) as they gathered in Manchester to explore two very different datasets.

The first was the National Diet and Nutrition Survey (NDNS – available from the UK Data Service), which collects information on the food consumption, nutrient intake and nutritional status of the general population aged 1.5 years and over living in private households in the UK, based on around one thousand representative people.

The second was a Greenhouse Gas Emissions (GHGE) dataset created by researchers Neil Chalmers, Ruth Slater and Leone Craig at the Rowett Institute (University of Aberdeen), which contains the current best approximations of emissions for each food product in the NDNS.

The evening was organised by members of the Greenhouse Gas and Dietary choices Open source Toolkit (GGDOT) project funded by N8 Agrifood: Sarah Bridle, Christian Reynolds, Joe Fennell and Ximena Schmidt.

There was a good turnout with people breaking into three groups to explore the data and see what they could come up with.

Group 1

Group 1 used R to read in the CSV files, then grouped the data by user id and day number and aggregated CO2 emissions and calories per person per day.

They then plotted these, attempting to find out whether age predicted CO2 emission levels.

They reported that their linear model was not conclusive…
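As a rough sketch of the kind of analysis Group 1 described, the aggregation and a simple linear model might look like this in R. The column names (person_id, day, age, ghge_kg, kcal) and all the values are invented for illustration; they are not taken from the NDNS or GHGE files.

```r
# Made-up food diary records: one row per food item eaten (not real NDNS data)
diary <- data.frame(
  person_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  day       = c(1, 1, 2, 1, 2, 1, 1, 2),
  age       = c(34, 34, 34, 61, 61, 19, 19, 19),
  ghge_kg   = c(0.8, 1.2, 2.5, 0.4, 0.9, 3.1, 0.6, 1.0),
  kcal      = c(350, 600, 900, 250, 700, 1100, 300, 500)
)

# Total emissions and calories per person per day
daily <- aggregate(cbind(ghge_kg, kcal) ~ person_id + day + age,
                   data = diary, FUN = sum)

# Average daily totals per person, then regress emissions on age
per_person <- aggregate(cbind(ghge_kg, kcal) ~ person_id + age,
                        data = daily, FUN = mean)
model <- lm(ghge_kg ~ age, data = per_person)
summary(model)
```

With only a handful of invented people the model output means nothing; the point is the shape of the steps, from item-level rows to one row per person.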

 

Group 2

Group 2’s question was “Who makes you kill the planet?”

To dig into this, the group explored whether who you eat with affects the planet – or your waistline.

The group plotted 16 categories of who survey respondents ate with, based around a notional evening meal period of 5pm to 8pm.

The initial plot suggests that you shouldn’t eat with people you don’t know…

This applied both for calories (figure below) and for greenhouse gas emissions.

Putting them together showed a similar pattern, although it did look like the public aren’t all bad…

A curious discovery was also made during the group’s hack.

Apparently, people eat less cheese on a Sunday…

Group 3

Group 3 reported that they had attempted data visualisation by geography, although they ran out of time to complete what they had been aiming for.

It was a lot of fun seeing people get together to explore different datasets. It will be interesting to see what future GGDOT Hacknights throw up, as well as how these datasets might be used to attempt to change consumer behaviour.

The next two GGDOT Hacknights will take place in Durham on 18th October and York on 29th November.

 

Rabia and Carstairs

Rabia Butt, one of our summer Q-Step interns, explains the process she went through to calculate and map the Carstairs deprivation index using UK Census data.

After downloading the data needed to measure deprivation, we identified some issues and needed to make some changes.

Matching our variables with those used in published research papers was quite a challenge. The papers used the Carstairs method, in which, by definition, the census variables refer to households. However, the social class variable had to be changed from ‘head of household in a lower social class’ to a per-person measure, because the data from Scotland did not include household references for social class.

We used the per-person variable so that we could make comparisons across the UK and compare our calculations with those in the research papers. No other research paper calculated scores for the whole of the UK: most used the Scottish census only, and some covered England and Wales together, but none included the other countries of the UK.

For our investigation, the Carstairs scores were calculated in R, which required importing a dataset with the relevant variables. We chose R because of the large quantity of data we were handling and the number of calculations we needed to make. R can produce results for large quantities of data, which made it invaluable for this sort of research.

The variables required for the Carstairs score had different names in the dataset for each census year. For example, for the overcrowding variable, total households was named ‘all household’ in 2011, whereas in 2001 it was given the code name ‘cs0520001’. As a result, we decided to change the names so that they were consistent throughout our calculations.

The geographical variables also had different headings and codes. For example, what 2011 called ‘GEO_CODE’, 2001 called ‘Zone. Code’. For these reasons, the script was altered to adapt to these changes. Although these are minor adjustments, they were necessary for the R scripts to run without any difficulties.
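One way to handle this kind of renaming in R is with a small lookup of old names to new names for each census year. The sketch below is an illustration only: the data values, the common names (area_code, total_households), the helper function harmonise and the exact spelling of the original column names as R would read them are all assumptions.

```r
# Toy extracts standing in for two census downloads (values are invented)
census_2011 <- data.frame(GEO_CODE = c("A01", "A02"),
                          all_households = c(1200, 950))
census_2001 <- data.frame(Zone.Code = c("A01", "A02"),
                          cs0520001 = c(1100, 900))

# Lookup of original name -> common name, one per census year
rename_2011 <- c(GEO_CODE = "area_code", all_households = "total_households")
rename_2001 <- c(Zone.Code = "area_code", cs0520001 = "total_households")

# Hypothetical helper that renames only the columns listed in the lookup
harmonise <- function(df, lookup) {
  hit <- names(df) %in% names(lookup)
  names(df)[hit] <- unname(lookup[names(df)[hit]])
  df
}

census_2011 <- harmonise(census_2011, rename_2011)
census_2001 <- harmonise(census_2001, rename_2001)

# Both years now share the same column names and can be stacked
census_2011$year <- 2011
census_2001$year <- 2001
combined <- rbind(census_2011, census_2001)
combined
```

Keeping the lookups in one place means the rest of the script can refer to a single set of names, whichever census year the data came from.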

I have learnt to use R to calculate the proportions, means, standard deviations and Z-scores for my project, and to perform many other functions as well.
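To give a flavour of these calculations, here is a minimal sketch using invented counts for four areas and only two indicators. The column names are hypothetical, the Z-scores are unweighted, and summing the indicator Z-scores into one score follows the usual description of the Carstairs index rather than the project's actual script.

```r
# Invented counts for four areas (not real census figures)
areas <- data.frame(
  area_code         = c("A01", "A02", "A03", "A04"),
  persons           = c(1000, 800, 1200, 500),
  persons_low_class = c(150, 240, 120, 200),
  households        = c(400, 300, 500, 200),
  households_no_car = c(80, 150, 60, 120)
)

# Proportions: each count divided by its own denominator
areas$p_low_class <- areas$persons_low_class / areas$persons
areas$p_no_car    <- areas$households_no_car / areas$households

# Z-score: distance from the mean, measured in standard deviations
z_score <- function(x) (x - mean(x)) / sd(x)
areas$z_low_class <- z_score(areas$p_low_class)
areas$z_no_car    <- z_score(areas$p_no_car)

# Combined score: sum of the indicator Z-scores (two shown here for brevity)
areas$score <- areas$z_low_class + areas$z_no_car
areas[, c("area_code", "score")]
```

A higher score indicates a more deprived area relative to the others in the same data set.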

Once the calculations for the project were completed, the next step was to match our Z-scores with published results, to check that the variables used in calculating the scores were correct and to support our research project. We found that the variables and geographical areas we used for the different censuses were correct, but some research papers’ Z-scores differed from ours, as the papers did not include information on how they weighted the population count. After reading many research papers and trying to match their results with ours, we finally found one paper whose scores matched ours very closely.

The Carstairs Z-scores were split into quintiles, ranging from

  • 1 least deprived to
  • 5 most deprived

A quintile is one of five equal groups into which a data set is divided, each containing 20% of the observations: the first quintile is the lowest fifth of the data, the second quintile the second fifth, and so on. The quintiles in this project are based on areas, meaning that 20% of all areas fall into each quintile. We calculated the quintiles in R, and the formula divided the fifth quintile’s sum by the data-set sum. We created quintiles for map visualisation, which allowed us to observe which areas are most and least deprived across the selected geographical areas of the UK. Quintiles were very convenient for plotting maps, as they offer a geographical perspective on the spread of deprivation across the UK.
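As an illustration of area-based quintiles, the sketch below assigns ten invented scores to five groups of equal size using quantile() and cut(). This is one common way to do it in R and is not necessarily the formula used in the project.

```r
# Invented scores for ten areas
scores <- c(-3.1, -2.0, -1.4, -0.6, -0.1, 0.3, 0.9, 1.8, 2.7, 4.2)

# Cut points at the 0%, 20%, ..., 100% quantiles of the scores
breaks <- quantile(scores, probs = seq(0, 1, by = 0.2))

# Label each area from 1 (least deprived) to 5 (most deprived)
quintile <- cut(scores, breaks = breaks, labels = 1:5, include.lowest = TRUE)

# Number of areas in each quintile
table(quintile)
```

If many areas share exactly the same score, the cut points may not be unique and cut() will stop with an error, so real data can need extra handling.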

QGIS is software that allows users to examine and edit spatial information, and to compose and export graphical maps. For our research, this software was used to generate 3D maps of the Carstairs scores for data visualisation.

At first, I practised on QGIS using census data that interested me, so I chose the Pakistan census data of 2017 and 1998 to look at the difference in the gender population of Pakistan in 3D maps. From the UK census of 2011 I explored how proficiency in English could influence general health. The 3D maps were able to show me which places had the highest peaks and which had the lowest.

For the main project, we created 3D maps of the whole of the UK at ward, Output Area and local authority level from 1991 to 2011. 3D maps of Great Britain (i.e. England, Wales and Scotland) were also created at District level.

Carstairs scores covering every census from 1971 to the most recent one in 2011 could not include Northern Ireland, as there is no census data available for Northern Ireland for 1971 and 1981. As a result, Northern Ireland was excluded from this analysis.

For a detailed examination of deprivation, 3D maps at Output Area level were also created for the capital cities of the UK and for Greater Manchester. The boundary data for the maps were downloaded from Casweb and from borders at the UK Data Service. The boundary data were simplified in mapshaper and the quintiles were dissolved into 5 layers, so that it was easier to produce the 3D maps and for people to understand them.

The main findings of this research are presented in Tables 1 and 2. Table 1 shows the most deprived areas in the UK, while Table 2 shows the least deprived areas, based on each area’s total score. The results are listed for all the years and output levels for which data were available. The least deprived areas, as can be seen in Table 2, are mostly located around London, in the South of England. This finding has been consistent from 1981 to 2011; in 1971, however, the least deprived area was Bearsden, which is in Scotland.

To understand the overall change in the level of deprivation across Great Britain, maps for 1981, 1991, 2001 and 2011 were created using quintiles. The lighter colours represent less deprived areas and vice versa.

Based on these scores, deprivation in Great Britain decreased greatly between 1981 and 2011. The largest change has occurred in Scotland, compared with the rest of Great Britain. However, Scotland is still more deprived than England and Wales.

There has been a positive change for England as well; however, it has not been to the same extent as for the other countries. The north of England and Cornwall have improved their deprivation scores the most.

Nevertheless, cities in the North, and areas in Birmingham and in London, continue to score highly on deprivation. A trend emerges across the four censuses of GB: the South has generally been less deprived than the North.

Great Britain 1981

Great Britain 1991

Great Britain 2001

Great Britain 2011

 

Below are maps of Manchester and Greater Manchester showing deprivation at Output Area level.

Darker colours represent more deprived areas (higher quintile). Most of Greater Manchester, and especially Manchester itself, is highly deprived, with some exceptions. There is also a pattern: the inner parts of all the towns in Greater Manchester are deprived, while the outskirts are lighter, which means less deprived.

Manchester by Output Area 2011

Greater Manchester by OA 2011

 

It needs to be taken into consideration that one of the indicators of deprivation for the Carstairs score is ‘Car ownership’. Considering that owning a car in the city centre might not be convenient, it is to be expected that city centre areas may score higher on this variable resulting in the overall score being higher.

My 3D maps above also correspond with the results of the Index of Multiple Deprivation scores, which have stated that “Manchester is one of the local authority districts which has the largest proportions of highly deprived neighbourhoods in England” (The English Indices of Deprivation 2015, Baljit Gill).

The next stage is to present the 3D maps in VR. Virtual Reality is defined as “the use of computer technology to create a simulated environment”. Although VR is still in its developing stages, it is being used across various platforms, and presenting data is one of its uses.

Klara and Carstairs

It has been seven weeks since I started working at the UK Data Service/Jisc.

Although it seems like yesterday since I first entered the office, I have learned so many new things and skills over the past weeks, and I have finally applied what I have been learning at the University for two years.

My time here at Jisc and the UK Data Service has been one of the most valuable experiences in my life and has convinced me that work with data and data analysis is the right path for me once I graduate.

My fellow intern Rabia and I at the start of our internship

For the first few weeks, I was getting familiar with the Carstairs index (more about Carstairs can be read in my previous blog post), and was accessing all the data needed for the calculations. I encountered several problems during this process, which in turn helped me to gain even more experience and new skills.

One of the issues I had was that I needed data across five different Census years and for the whole of the UK, and not all of it was available.

For example, one of the indicators for Carstairs score is ‘low social class of household reference person’. This information was not available for Scotland in the 2001 Census, and so the definition of the indicator had to be changed to ‘low social class of all persons’ in order to proceed with the research. This subsequently resulted in us needing to redownload all the data for the other years, and very “messy” folders with lots and lots of data we did not need anymore.

At that point I realised it was essential for me to organise all my folders and documents, and to name them with sensible names so I was sure I knew what each document contained and where to find it. Even though I have always considered myself as an organised person, working with such large amounts of data showed me I had to be even more organised to be able to progress to the analysis per se.

Another issue I faced was that different research papers used different Census variables to calculate the Carstairs score.

This was because the papers focused either on one country rather than the whole of the UK, or on a specific Census year. Therefore, they all used the variables that were available for their particular projects. This, however, meant that Rabia (my fellow intern) and I had to adjust the definitions of the variables so that they fitted our own needs. We were lucky enough to find a few researchers who used indicators that were available for all the Census years we were analysing and for all the countries in the UK, so we were able to support our decisions on the definition changes.

I managed to solve all those problems and moved onto calculating the scores and analysing the data. This was done in R, which saved me a lot of time and enabled easy replicability. This was extremely useful as I had to recalculate my scores multiple times due to the problems outlined above. I have always been keen on working with R as I find it very intuitive and yet challenging, which I enjoy. I feel I have become proficient in R and I am very confident using it in a work environment.

I have always enjoyed working with data, but regardless, the analysis gets a bit more exciting when you can finally see a story the data tells.

To do this, I uploaded all my results into QGIS, an open source Geographic Information System, which enables the creation of maps. All the data I have worked with concerns geographical areas in the UK, so mapping deprivation, as well as providing tables with specific scores, felt like the right (and very visually appealing) choice.

I have learned how to use a lot of functions QGIS offers, including creating 3D maps and analysing geospatial data.

Working with the software has woken up a very creative side of me I never knew I even had. I discovered that data analysis can be very artistic and original, especially when it comes to presenting the findings.

Deprivation in the UK from 2011 to 1991, by local authority

One of the things I have found out by just looking at my maps was that deprivation decreased massively between 1991 and 2011 according to the Carstairs index.

The darker the area, the more deprived the local authority is and vice versa.

When looking at the raw numbers, one would have no idea about this without further analysis, so plotting the numbers on a map is a great way of finding out whether there is something going on before moving on to more specific analysis.

As I saw that the older maps are much darker than the ones from more recent years, I decided to explore whether this trend is significant.

I created confidence intervals and boxplots, presented below, to support my initial hypothesis about deprivation getting lower. The confidence intervals do not overlap and so I could be confident that deprivation decreased significantly between 1991 and 2011 in the UK.

Boxplots and 95% confidence intervals for deprivation levels in the UK from 2011 to 1991
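A comparison of this kind could be sketched in R as follows. The scores are invented and there are only eight per year, and the helper ci_mean is hypothetical: it computes a standard t-based 95% confidence interval for a mean, which is an assumption about the method rather than a record of the original analysis.

```r
# Invented area scores for two census years (not the project's results)
scores_1991 <- c(2.1, 3.4, 1.8, 4.0, 2.9, 3.6, 2.5, 3.1)
scores_2011 <- c(0.4, 1.1, -0.2, 0.9, 0.3, 1.4, 0.0, 0.7)

# 95% confidence interval for a mean, based on the t distribution
ci_mean <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

ci_1991 <- ci_mean(scores_1991)
ci_2011 <- ci_mean(scores_2011)

# The intervals do not overlap if the later one ends before the earlier begins
no_overlap <- ci_2011["upper"] < ci_1991["lower"]
no_overlap

# Side-by-side boxplots of the two years
boxplot(list(`1991` = scores_1991, `2011` = scores_2011),
        ylab = "Deprivation score")
```

The built-in t.test() function reports the same interval for a single sample, which is a handy way to check a hand-written helper like this one.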

Currently, I am working closely with Matt Ramirez, Futures senior innovation developer here at Jisc, who is turning my 3D maps into a virtual reality environment.

I have been given the opportunity to think about interesting and unique ways of using VR to display my results. I was particularly excited about one of my ideas, which consisted of a lift that would take users up to different Census years. They could then get out of the lift and move around the map, which would result in great interaction between the data and the users, and thus enhanced learning.

Unfortunately, given the short time frame, this was not possible to complete, so we had to simplify the visualisations and not use the lift. Nevertheless, the idea may be implemented in the future: it requires more time than we have at the moment, but would be worth trying out.

Analysing Food Hygiene Rating Scores in R: a guide

Rachel Oldroyd, one of the UK Data Service Data Impact Fellows, takes a step-by-step approach to using R and RStudio to analyse Food Hygiene Rating Scores.

Data download and Preparation

In this tutorial we will look at generating some basic statistics in R using a subset of the Food Hygiene Rating Scores dataset provided by the Food Standards Agency (FSA).

Visit http://ratings.food.gov.uk/open-data/en-GB now and download the data for an area you are interested in. I’ve downloaded City of London Corporation.

R is able to parse XML files but it’s easier to load the file into Excel (or a similar package) and save as a CSV file (visit this page if you’re unsure how to do this: https://support.office.com/en-us/article/import-xml-data-6eca3906-d6c9-4f0d-b911-c736da817fa4).

R and RStudio

R is a statistical programming language and data environment.

Unlike other statistics software packages (such as SPSS and Stata) which have point and click interfaces, R runs from the command line. The main advantage of using the command line is that scripts can be saved and quickly rerun, promoting reproducible outputs. If you’re completely new to R, you may want to follow a basic tutorial beforehand to learn R’s basic syntax.

The most commonly used Graphical User Interface for R is called RStudio (https://www.rstudio.com/products/rstudio/) and I highly recommend you use this as it has nifty functionality such as syntax highlighting and auto completion which helps ease the transition from point and click to command line programming.

Basic Syntax

Once installed, launch RStudio. You should see something similar to this setup with the ‘Console’ on the left-hand side, the ‘Environment window’ on the top right and another window with several tabs (Files, Plots, Packages, Help, Viewer) on the bottom right:

Don’t worry if your screen looks slightly different; you can visit View > Panes from the top menu to change the layout of the windows.

The console area is where code is executed. Outputs and error messages are also printed here but content within this area cannot be saved. As one of the main advantages of using R is its ability to create easily reproducible outputs, let’s create a new script which we can save and rerun later. Hit CTRL+SHIFT+N to create a new script. Save this within your working directory using the save icon.

Loading Data

Let’s get on with loading our data. Type

data = read.csv(file.choose())

into the script file and hit CTRL + Enter whilst your cursor is on the same line to run the command. You can also highlight a block of code and use CTRL + Enter to run the whole thing.

You should see a file browser window; navigate to the CSV file you saved earlier containing the FHRS data. Note the syntax of this command: it creates a variable called data on the left-hand side of the equals sign and assigns to it the contents of the file loaded by the read.csv command. Once loaded, you should see the new variable, data, appear in the environment window on the right-hand side. To view the data you can double click on the variable name in the environment window and it will appear as a new tab in the left-hand window. Note the variables that this data contains. The object includes useful information such as the business name, rating value, last inspection date and address.
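You can also inspect the data from the console rather than by clicking. The sketch below builds a tiny data set from text so that it runs without a file; the three rows are invented and the column names are only assumed to resemble those in the FHRS export.

```r
# A few invented rows in the same spirit as the FHRS export (not real outlets)
csv_text = "BusinessName,RatingValue,PostCode
Cafe One,5,EC1A 1AA
Corner Deli,3,EC2B 2BB
Night Kitchen,AwaitingInspection,EC3C 3CC"

# read.csv() accepts text directly; with a real file, pass its path instead,
# e.g. read.csv("my_fhrs_file.csv") (hypothetical file name)
data = read.csv(text = csv_text, stringsAsFactors = FALSE)

names(data)  # column names
str(data)    # type of each column
head(data)   # first few rows
```

Note that RatingValue is read as text here because one of the values is not a number, which is why a conversion step is needed before calculating statistics.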

Summary statistics

Let’s do some basic analysis. To remove any records with missing values first run the complete.cases command:

data = data[complete.cases(data),]

Here we pass our data variable into complete.cases, which flags the rows that have no missing values; we keep only those rows and overwrite our original object.
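To see what complete.cases does on its own, here is a small sketch with an invented three-row data set in which one rating is missing:

```r
# Invented records with one missing value
outlets = data.frame(
  BusinessName = c("Cafe One", "Corner Deli", "Night Kitchen"),
  RatingValue  = c(5, NA, 3)
)

# TRUE for rows with no missing values, FALSE otherwise
keep = complete.cases(outlets)
keep

# Keep only the complete rows (the comma selects rows, all columns)
outlets = outlets[keep, ]
nrow(outlets)
```

Bear in mind that complete.cases only looks for NA values. By default, read.csv leaves blank cells in a text column as empty strings rather than NA, so those rows would not be removed.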

To run some basic statistics we need to convert the RatingValue variable to an integer:

data$RatingValue = strtoi(data$RatingValue,base =0L)

Note how we use the $ to access the variables of our data object.
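It is worth checking what the conversion does to values that are not numbers. In the sketch below the text labels are assumed for illustration; your own file may contain different ones, or none at all.

```r
# Ratings as they might arrive from the CSV: text, some not numeric (invented)
rating_text = c("5", "3", "0", "Exempt", "AwaitingInspection")

# strtoi() returns NA for anything it cannot read as a whole number
rating = strtoi(rating_text, base = 10L)
rating

# Count how many values could not be converted
sum(is.na(rating))
```

Any NA values created at this step appear after the earlier complete.cases call, so they will still be in the data when you calculate statistics.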

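To see the minimum and maximum rating values of food outlets in London we can use the min and max functions. Any ratings that could not be converted to an integer are stored as NA, so we add na.rm = TRUE to ignore them:

min(data$RatingValue, na.rm = TRUE)
max(data$RatingValue, na.rm = TRUE)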

These commands simply give us the minimum and maximum values without any additional information. To see the full records for these particular establishments we can take a subset of our data to only include those which have been awarded a zero star rating for example:

star0 = data[which(data$RatingValue==0), ]
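The effect of missing values on these commands can be seen in a small sketch with four invented outlets, one of which has no rating:

```r
# Invented outlets, one with a missing rating
outlets = data.frame(
  BusinessName = c("Cafe One", "Corner Deli", "Night Kitchen", "Tea Room"),
  RatingValue  = c(5L, 0L, NA, 3L)
)

# With an NA present, min() and max() return NA unless told to skip it
min(outlets$RatingValue)
min(outlets$RatingValue, na.rm = TRUE)
max(outlets$RatingValue, na.rm = TRUE)

# which() drops NA comparisons, so only true matches are kept
star0 = outlets[which(outlets$RatingValue == 0), ]
star0
```

Wrapping the comparison in which() is what keeps the unrated outlet out of the subset; without it, the NA comparison would add a row of NA values.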

Creating a graph

Lastly, let’s create a barchart to look at the distribution of star ratings for food outlets in London. We will use the ggplot2 library. To install and then load this library, call:

install.packages("ggplot2")
library(ggplot2)

To create a simple barchart use the following code:

ggplot(data = data, aes(x = RatingValue)) + geom_bar(stat = "count")

Here you can see we have passed RatingValue as the X axis variable in the ‘aesthetics’ function and passed in ‘count’ as the statistic. The output should look something like this:

To add x and y labels and a title to your graph use the labs command at the end of the previous line of code:

ggplot(data = data, aes(x = RatingValue)) + geom_bar(stat = "count") + labs(x = "Rating Value", y = 'Number of Food Outlets', title = 'Food Outlet Rating Values in London')
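If you cannot install extra packages, a similar chart can be drawn with the functions that come with R. This is a sketch of an alternative, not a replacement for the ggplot2 code, and it uses ten invented ratings rather than the FHRS data:

```r
# Invented ratings for a handful of outlets
rating = c(5, 5, 4, 5, 3, 0, 4, 5, 2, 5)

# Count how many outlets received each rating
counts = table(rating)
counts

# Draw the bar chart with axis labels and a title
barplot(counts,
        xlab = "Rating Value",
        ylab = "Number of Food Outlets",
        main = "Food Outlet Rating Values (invented data)")
```

The table function does the counting that ggplot2 performs for you with stat = "count", which makes it a useful way to check the numbers behind a chart.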


Rachel Oldroyd is one of our UK Data Service Data Impact Fellows. Rachel is a quantitative human geographer based at the Consumer Data Research Centre (CDRC) at the University of Leeds, researching how different types of data (including TripAdvisor reviews and social media) are used to detect illness caused by contaminated food or drink.

Mapping divorce and religion in the Czech Republic

Divorces, religion and education in the regions of the Czech Republic, 2011 (data about divorces from 2018)

Klara Valentova explores mapping data from her home country.

Note: In the following maps, darker colour and higher layer signify higher proportions of whichever variable is being portrayed.

Map 1 presents the divorce rate in different regions of the Czech Republic in 2018.

Divorces are most prevalent in the Central Bohemian Region, which surrounds the capital city, Prague. Prague itself has a lower percentage of divorces; one could argue that this is because young people move to Prague, where they find a partner, get married and start a family, then move out to the Central Bohemian Region, where they eventually get divorced. We can also see that there are quite high divorce rates in the North and South East of the Czech Republic.

Map 1: Divorce rates in the regions of the Czech Republic, 2018

 

Map 2 shows the proportion of religious population in regions of the Czech Republic in 2011.

The most religious regions are in Moravia, the eastern part of the country, while there is little religious population in the North West. A surprising finding is that some of the regions with very religious people have quite high rates of divorce, as seen in Map 1, while in the non-religious regions of West Bohemia divorce rates are relatively low. So the Czech Republic does not necessarily follow the commonly believed pattern of religious people getting divorced less than non-religious people.

Map 2: Rates of religious population in the regions of the Czech Republic, 2011

 

Map 3 shows the distribution of people with a university degree across the regions in the Czech Republic in 2011.

In general, the North and the division between Bohemia and Moravia have the smallest number of people with degrees, while in the capital city there is an enormous peak, with nearly half of the population having a university degree. The South Moravian Region has the second highest proportion of people with university education, which can be explained by the second biggest city in Czechia, Brno, being situated there. However, there seems to be no correlation between education and religion or divorce rates in Czechia.

Map 3: The distribution of people with a university degree in the Czech Republic, 2011

 

The data are available at: https://www.czso.cz/csu/czso/home, and the boundary data for Czech regions at: http://www.diva-gis.org/datadown. Both files were then uploaded to QGIS, joined, coloured by the proportions, and subsequently turned into 3D maps with higher areas corresponding to higher proportions to enhance the differences even more.

You can play with the 3D maps by following the links below. Please note that the maps can take some time to load.

Mapping the census – connections between language ability and health

Rabia Butt uses mapping to explore possible connections between health conditions and fluency in English.

Using the 2011 UK census, I decided to compare people whose first language isn’t English but who can speak it very well with those who cannot speak it at all. I was trying to discover how their proficiency in English might influence their general health.

I got my data from the UK Data Service Infuse website and compared the results at ward level in England.

My first 3D map showed the results for people who said that their health is good.

The results were what I had expected: considerably more people who can speak English reported that their health is good than people who cannot speak English.

However, when I compared the results for people who said their health is not good, more people who can speak English reported poor health than people who cannot speak English.

I had expected the results to be the other way around; however, there may be many other reasons or factors that influenced the results. The 3D maps were able to show me which places had the highest peaks and which had the lowest.

This 3D map shows people in England whose health isn’t good, comparing those who can speak English with those who cannot. Orange represents people whose health is not good but who can speak English, and green represents people whose health isn’t good and who can’t speak English. The lighter the colour, the fewer people there are whose health isn’t good; the darker the colour, the more there are.

 

This 3D map is of people who have good health and can speak English.

 

This map is of people whose health is good and cannot speak English.

You can play with the 3D map by following the link below. Please note that the map can take some time to load.

Mapping gender in Pakistan

Rabia Butt explores mapping data from her home country.

I created my first 3D maps using census data that interested me: I chose the Pakistani census data of 2017 and 1998 to look at the difference in the gender population of Pakistan.

I got my data from the Pakistani census website. I created a map showing the population of Pakistan in 2017 by gender: male, female and transgender. The transgender population was extremely low, and there was also a difference between the male and female populations, with the male population higher than the female.

Therefore, I decided to compare the 2017 results with the previous census of Pakistan, taken 19 years earlier. The previous census did not include transgender people, and there was still a gap between the male and female populations, with the male population again higher than the female.

This map shows the population of Pakistan from the census of 1998. The blue represents male and pink female.

 

This map shows the population of Pakistan from the census of 2017. The blue represents male and pink female.

 

You can play with the 3D maps by following the links below. Please note that the maps can take some time to load.

Mapping annual net income in the UK

Annual net income in the UK in 2016 for Middle Super Output Areas (MSOA) – Before and after housing costs

Klara Valentova has been exploring mapping of data.

Note: In the following maps, darker colour and higher layer signify higher income for the specific area.

Map 1 shows the annual net income before housing costs in the UK in 2016. The highest income is distributed in the South East, notably around London with peaks in central London such as Westminster or Chelsea. However, income in nearly all areas in Wales is lower than in most areas in England.

Map 1: Annual Net Income Before Housing Costs in the UK for MSOA, 2016

 

Nonetheless, when looking at Map 2, displaying annual net income after housing costs, suddenly the huge differences between the areas have vanished.

The highest incomes are still distributed in the South East, but we can see that in big cities in the North of England the incomes are almost as high as down south. The peaks in the London area persist, but there are more of them now, and they are mostly around London rather than in the city centre, as they were before accounting for housing costs. This can be explained by the incredibly expensive living costs inside London.

Map 2: Annual Net Income After Housing Costs in the UK for MSOA, 2016

 

The data for both of the maps are available at: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/smallareaincomeestimatesformiddlelayersuperoutputareasenglandandwales.

The files were uploaded to QGIS, together with boundary data for MSOA, available from: https://borders.ukdataservice.ac.uk/. These two layers were then joined, and the map coloured by the income level.

The map was subsequently turned into 3D with the height of the areas corresponding to the income level to enhance the differences even more.

You can play with the 3D maps by following the links below. Please note that the maps can take some time to load.

Meet our interns: Klara

Rabia Butt and Klara Valentova are our Q-Step interns from the University of Manchester. Q-Step is a £19.5 million programme designed to promote a step-change in quantitative social science training, funded by the Nuffield Foundation and the ESRC. We asked Rabia and Klara to tell us a bit about themselves and their journey to this internship.

Klara

I am one of a small cohort of students taking a degree pathway ‘Sociology and Quantitative Methods’ at the University of Manchester.

I have been very enthusiastic about data analysis since 2014, when I completed a study exchange programme in the USA. I attended a local high school in Georgia and took a module called AP Statistics. I really enjoyed it, and decided I would like to study statistics even further. This, together with my interest in sociology and especially social inequalities, has led me to study Sociology and Quantitative Methods at the University of Manchester.

In my second year of University, I chose a module about data modelling. I have learned how to use R and developed critical thinking and problem-solving skills, but wanted to enhance these to a professional level. I have been very interested in working with the Census data, and in learning more about deprivation while improving my data skills. Thus, I am very lucky to have been given the opportunity to work at the UK Data Service on calculating Carstairs Deprivation Scores for the UK.

The Carstairs index is a summary measure of relative material deprivation that was developed in the 1980s. It comprises four indicators from the Census, which relate to material deprivation (overcrowding, male unemployment, low social class and lack of car ownership).

Some of these variables, however, are a bit outdated, and so for our project, we have decided to include other indicators, which we propose are more up to date.

For instance, we will include total unemployment (female and male combined) in our calculations, as there are many more women in the labour force than there were nearly 40 years ago when the Carstairs index was created. We also take into consideration that lack of car ownership does not automatically imply deprivation: in urban areas, not having a car might just be more convenient, while in rural areas a car is a necessity rather than an indication of wealth.
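To make the difference between the two unemployment indicators concrete, here is a small Python sketch. It assumes the rate is calculated as unemployed people as a percentage of economically active people. That denominator, the field names and the counts are all assumptions made for the illustration and are not taken from our project data.

```python
def unemployment_rate(unemployed, economically_active):
    """Return unemployed people as a percentage of the economically active."""
    if economically_active <= 0:
        raise ValueError("economically_active must be positive")
    return 100.0 * unemployed / economically_active


# Invented counts for a single area.
area = {
    "male_unemployed": 120,
    "male_active": 1500,
    "female_unemployed": 60,
    "female_active": 1200,
}

# Original Carstairs-style indicator: men only.
male_rate = unemployment_rate(area["male_unemployed"], area["male_active"])

# Updated indicator: men and women combined.
total_rate = unemployment_rate(
    area["male_unemployed"] + area["female_unemployed"],
    area["male_active"] + area["female_active"],
)

print(f"male only: {male_rate:.1f}%, combined: {total_rate:.1f}%")
```

In this invented area the two indicators differ noticeably, which is why the choice of indicator can change how deprived an area appears.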

Over the last two weeks I have learned a lot about the Census, Carstairs scores and deprivation in the UK. Nonetheless, the greatest lesson I have learned is the importance of long, proper research.

We spent about two days on research, and found that all the papers did their analyses in the same way. Satisfied with our findings, we moved on to getting the data we thought we needed for the project and started the analysis.

New questions then arose and we had to do some more research. Suddenly, we discovered many new research papers, some of which did their analysis differently from us. We began wondering whether our work so far was correct. I started checking all the data we had downloaded, did more and more research, and realised that with this additional research I could do a lot of things in a more efficient way. I therefore regretted not spending more time on the initial research, as in the end it would have saved me a fair amount of time.

On the other hand, practice is the best way of learning, and so I learned a great lesson, which will be very valuable to me in the future.

I am now very motivated and excited to learn other new things during the next eight weeks. I am particularly thrilled that I will have the opportunity to use 3D printing and VR to visualise the findings of the project. These new technologies have incredible potential and are beginning to be widely used in many job sectors, so learning how to use them, and how to use them for presenting statistics, is an enticing prospect for me.

Read Klara’s previous blog for the University of Manchester.

Meet our interns: Rabia

Rabia Butt and Klara Valentova are our Q-Step interns from the University of Manchester. Q-Step is a £19.5 million programme designed to promote a step-change in quantitative social science training, funded by the Nuffield Foundation and the ESRC. We asked Rabia and Klara to tell us a bit about themselves and their journey to this internship.

Rabia

I am a second-year undergraduate student, currently studying Sociology at the University of Manchester. Research, data collection and statistics are a large part of my course, which has given me a new-found appreciation for quantitative methods.

The survey method in social research module introduced me to social statistics and the basic statistical concepts required for working with numeric survey data. This module enabled me to participate in the amazing Q-Step programme, which provides placement opportunities for students from the University of Manchester.

I stumbled across the UK Data Service when I was exploring internship options for the summer. Through this programme I was lucky enough to be selected for a summer internship at the UK Data Service.

I was very interested in, and curious about, where the data originates from and how it is produced for researchers, which is why I chose to do my summer internship here. I am delighted to work for the UK Data Service as I will be learning many new skills, allowing me to gain invaluable experience of a working environment, as well as helping me determine what type of job I would like to go into after I graduate.

My fellow intern, Klara, and I are working together on a project that we will create ourselves and present at the end of our internship. The project requires us to calculate deprivation measures for England, Scotland, Wales and Northern Ireland using data from the Censuses of 2011, 2001, 1991, 1981 and 1971.

The methodology is the Carstairs index, an index of deprivation used in spatial epidemiology to identify socio-economic confounding. Deprivation can be defined as the damaging lack of material benefits considered to be necessities in a society.

The Carstairs index is based on four Census variables:

  • low social class
  • lack of car ownership
  • overcrowding
  • male unemployment

The overall index reflects the material deprivation of an area.

We will be using the data to calculate the population-weighted mean percentage and standard deviation (SD) for each component variable. To ensure that all components have an equal influence on the final score, each variable will be standardised to have a population-weighted mean of zero and a variance of one. Standardising involves subtracting the population-weighted mean from each value and dividing the result by the SD (the z-score method).
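The standardisation step can be sketched in Python as follows. This is a minimal illustration rather than our project code (our analysis will be done in R): the function names, populations and percentages are invented, and the final step, which adds the standardised components together into a score, reflects how Carstairs scores are conventionally built and is shown here with only two components.

```python
import math


def weighted_mean(values, weights):
    """Population-weighted mean of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_sd(values, weights):
    """Population-weighted standard deviation of the values."""
    mean = weighted_mean(values, weights)
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / sum(weights)
    return math.sqrt(variance)


def standardise(values, weights):
    """Z-scores: subtract the weighted mean, then divide by the weighted SD."""
    mean = weighted_mean(values, weights)
    sd = weighted_sd(values, weights)
    return [(v - mean) / sd for v in values]


# Invented data for three areas.
populations = [1000, 3000, 6000]
unemployment_pct = [12.0, 8.0, 4.0]
overcrowding_pct = [9.0, 3.0, 6.0]

z_unemployment = standardise(unemployment_pct, populations)
z_overcrowding = standardise(overcrowding_pct, populations)

# Illustrative score per area: the sum of the standardised components.
scores = [sum(parts) for parts in zip(z_unemployment, z_overcrowding)]
print(scores)
```

Because each component is weighted by population before standardising, a small area with an extreme percentage does not shift the mean as much as a large one would.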

The variables Carstairs uses are outdated, since the index was originally developed in the 1980s. We discovered an article that introduced some new variables; for example, it suggested replacing male unemployment in the Carstairs score with overall unemployment, in order to take account of women's participation in the labour force. We decided to do this in our research as well, so we will be using the old methodology while also including overall unemployment and qualification levels.

After we have collected all the data we need, the next step will be learning the R language, as we will be using this for our analysis.

At first, I thought finding and downloading the data would be quite easy. However, this was not the case, because each census differs from the others, especially the earlier ones.

For example, in the 1991, 1981 and 1971 censuses the social class variable was based on a 10% sample for all the countries apart from Northern Ireland. This causes difficulties when comparing censuses across different geographical areas. This was just one of the problems we encountered with the census data.

I will be using R to calculate the mean, standard deviation and z-score for each variable, for each year and area. This is a task I am quite excited about, because I have never learned a programming language before and it will be a valuable skill to have on my CV.

Read Rabia’s previous blog for the University of Manchester.