We’ve asked our #DataImpactFellows to share their ideas about ‘change’.
David Kingman tells us about how his experiences of learning to programme in R have changed his research career.
Over the past two years, a very big change has occurred in my working life, and it seemed like a natural subject for me to write about: learning to programme using R. Prior to this I had virtually no experience of programming, whereas now I use it practically every day when I’m at work. Apologies in advance if any of what I write below is really obvious!
What is R?
R is an open-source programming language which is mainly designed to be used for data analysis and data visualisation. Being open-source, R is free for anyone to use, and much of its functionality has been developed over the past couple of decades by an army of thousands of volunteer developers who have created “packages” (collections of functions which extend R’s basic set of capabilities).
More background information about R can be obtained from the R Foundation website, or from R’s Wikipedia page. R is generally regarded as one of the “big two” open-source programming languages which are currently being widely used for data science, alongside Python, which is its main competitor.
Although R itself is a language, most R users interact with it through RStudio, an open-source integrated development environment (IDE) which enables you to write and execute code and view the datasets you’re working on in one place. Because both are open-source, getting hold of R is as simple as downloading R and RStudio from their respective websites, and then learning how to use them.
My initial introduction to R was an example of necessity being the mother of invention. By about mid-2018 I had undertaken a number of successful research projects during my time at the Intergenerational Foundation (IF), but the fact that we were a small charity meant that we were limited in the tools we had at our disposal, and therefore the kinds of analysis that I could undertake.
Up to that point, my two main tools as a researcher had been Excel and the open-source GIS application QGIS, which I had used to visualise spatial data (for example, I used both to produce Generations Apart, my 2016 report which looked at the growth of age segregation in England and Wales over the previous 25 years).
I was keen to undertake more ambitious research projects, but (like many small charities) IF didn’t have the budget to afford the licences for expensive proprietary statistical software such as SPSS or Stata. However, at about this time I happened to have a conversation with a friend who is a web developer, who encouraged me to learn a programming language. I also heard about R during a presentation at an academic conference that I was attending on behalf of IF, which convinced me to give learning R a try.
How did I learn R?
The short answer is that I learnt R by practising using it, as would be the case with learning any other new skill. Although I had taken several statistics modules at university, these were restricted to looking at data using SPSS, so analysing data using code was completely new to me.
Although there seems to be a general perception among R users that R has quite a steep initial learning curve, I think that once you’ve got the hang of a set of basic concepts it actually becomes much easier to analyse data using code than it is via GUI (graphical user interface) software.
Initially, I did a Lynda.com course on R programming, and since then I’ve read a number of books on R, of which I think the most useful was R for Data Science by Garrett Grolemund and Hadley Wickham. There are also a lot of very good online courses which teach you how to use R, particularly those provided by DataCamp.
However, I found that the way I learned fastest was just to try to do things with R, and to persevere with the obstacles I encountered until I got the desired result. There is a vibrant online community of R users who are an endless source of useful advice whenever you get stuck on something; a great place to find the answers to many R-related questions is Stack Overflow, an online forum where people post programming queries which the user community provides answers to.
What do I use R for?
As there are now over 10,000 packages which have been made for R, it’s possible to do almost anything that you want to do with data using R, from data wrangling and visualising data to building models and working with machine learning and AI.
Whenever I embark upon a research project, I now use R to manage my entire workflow from beginning to end (which now often includes writing up and presenting findings from my analysis using R Markdown instead of Word). This has a number of advantages over how I was doing things previously, some of which I’ll outline briefly:
- Reproducibility – I can copy code which I’ve already written from one project to another; this means that once I’ve worked out how to do something once, I can do it again almost instantly, or with only minor alterations, which saves a lot of time and work. It’s a good idea to learn how to use a version control system such as Git, together with a hosting service such as GitHub, so that you build up an online repository of code that you can reuse for future projects.
- Efficiency – I can now also work with much larger amounts of data, or with multiple datasets (for example, multiple years of data from a long-running social survey) more efficiently, because using loops or apply functions in R makes it easy to apply the same set of functions over multiple datasets rather than doing so one at a time.
- Error detection – Working with code makes it much easier to spot where I’ve made mistakes during my project workflow because I have a record of what I’ve done at each step of the process I’ve followed, which means I can go back and change things without having to laboriously repeat all of the subsequent steps in the analysis.
- Encouraging experimentation – One of the biggest benefits of working with R is that I’ve found it encourages me to be much more experimental when it comes to interrogating my data for insights, because it’s so straightforward to go back a step in my analysis and re-write my code if I attempt something which doesn’t work or doesn’t produce the desired result. This is a huge improvement compared with having to laboriously repeat many steps in the way you would do with a GUI software programme, and greatly reduces the risk that new mistakes will creep into my work while I’m retracing my footsteps.
- Visualisation – R is especially renowned for its data visualisation tools (particularly the famous ggplot2 package), which are so powerful that a number of major organisations now produce all of their data visualisations using them, including the BBC and the New York Times.
On that last point, one of the first big things I did with R was to create a custom template using ggplot2 for making data visualisations which carry IF branding (such as this one from our recent research into how young adults’ spending has shifted over time), which has enhanced our brand awareness and made it easier for us to publicise our research on social media.
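To make the efficiency point from the list above a little more concrete, here is a minimal sketch of applying one function across several years of survey data with lapply(). It uses base R only. The datasets, column names and the summarise_year() helper are all invented for illustration and are not taken from any real IF project.

```r
# Hypothetical example: three yearly survey extracts held in a named list.
# The data frames, column names and values are invented for illustration.
surveys <- list(
  "2016" = data.frame(age = c(23, 35, 61), income = c(18000, 32000, 27000)),
  "2017" = data.frame(age = c(24, 36, 62), income = c(19000, 33500, 27500)),
  "2018" = data.frame(age = c(25, 37, 63), income = c(20500, 35000, 28000))
)

# One function describing the analysis for a single year of data
# (summarise_year is a made-up helper name).
summarise_year <- function(df) {
  data.frame(
    n = nrow(df),
    mean_age = mean(df$age),
    median_income = median(df$income)
  )
}

# lapply() runs the same function over every dataset in the list.
yearly <- lapply(surveys, summarise_year)

# Stack the per-year results into one table, keeping the year as a column.
results <- do.call(rbind, yearly)
results$year <- names(surveys)
rownames(results) <- NULL

print(results)
```

Adding another year of data only means adding another element to the list; the analysis code itself does not need to change.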
Although I still have much more to learn about R (and I’m now also learning SQL, another programming language), I’ve found the experience of learning how to programme absolutely fascinating, and I’ve already felt the benefits from it within my research career. I’ve recently started working part-time as a Senior Research and Statistical Analyst in the Demography Team at the Greater London Authority (alongside my role at IF), where the entire team uses R for the vast majority of its data analysis and visualisation projects.
Overall, I would thoroughly recommend learning R to anybody who works with data, as it’s now widely used within academia, the public sector and the commercial world, and because it is both freely accessible and tremendously powerful once you’ve got past the initial learning curve.
David Kingman is one of the UK Data Service Data Impact Fellows 2019. He is the Senior Researcher at the Intergenerational Foundation and a Senior Research and Statistical Analyst in the Demography Team within the Greater London Authority Intelligence Unit.
David is a quantitative senior researcher and data analyst with a wide range of interests including population demography, economics, inequality, housing, pensions, higher education, political representation and wellbeing in his current role as Senior Researcher at the Intergenerational Foundation (IF). The IF is a non-party-political think tank which researches intergenerational fairness.
David is frequently invited to present IF’s work at conferences, seminars and roundtable discussions, and has appeared as an expert witness before select committees in both Houses of Parliament, at All-Party Parliamentary Groups and before the Low Pay Commission on two occasions. In June 2018, he addressed a large audience at the European Parliament as part of the 2018 European Youth Event.