We’re always keen to find alternative ways for our users to access and use our data.
While evaluating DKAN as our new data delivery service, we discovered that it has an API, which allows other systems to find out about the datasets and resources DKAN holds.
We came across a package maintained by Tony Fujs, Karthik Ramanathan and Meera Seladore. The project will ultimately move across to the World Bank GitHub repository, as it’s part of their work.
The package is called dkanr, which is described as a “General purpose R client to the DKAN Open Data platform”.
I’m a novice R user, so my script may not be as polished as one written by someone more experienced, but I found the package easy to use. With it I was able to find a dataset of interest and download some data into a data frame, ready for me to manipulate.
The GitHub page includes a readme, which explains how to set up the package and covers its basic use.
Trying out the package
I decided I wanted to find the dataset for sex from the 2011 Census.
I’m going to walk you through the process I followed. You can download the code from our GitHub.
To tell the package to use our version of DKAN, I needed to enter the web address, which is https://www.statistics.digitalresources.jisc.ac.uk/:
Now I loaded the settings:
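These two steps can be sketched as follows, assuming the `dkanr_setup()` and `dkanr_settings()` functions described in the package readme. The variable name `settings` is just for illustration, and no login details are passed on the assumption that the site can be read anonymously:

```r
library(dkanr)

# Point dkanr at the DKAN instance. No username or password is supplied,
# assuming anonymous read access is enough for browsing datasets.
dkanr_setup(url = "https://www.statistics.digitalresources.jisc.ac.uk/")

# Read back the connection settings that dkanr has stored for this session.
settings <- dkanr_settings()
print(settings)
```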
This is what the output looked like:
I then got a list of nodes that are of type dataset:
nodes <- list_nodes_all(filters = c(type="dataset"))
I looked at the data frame returned:
This is the output:
The next job was to find datasets that have sex 2011 in the title:
dfFilter <- nodes %>% select(nid, title) %>% filter(str_detect(title, fixed("sex 2011", ignore_case = TRUE)))
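To see what this filter does in isolation, here is a small sketch on a made-up `nodes` data frame. The titles and the extra node IDs are hypothetical; it assumes the dplyr and stringr packages, which provide `%>%`, `select()`, `filter()`, `str_detect()` and `fixed()`:

```r
library(dplyr)
library(stringr)

# Made-up stand-in for the data frame returned by list_nodes_all();
# the real one has many more rows and columns.
nodes <- data.frame(
  nid   = c("101", "195", "310"),
  title = c("Age 2011", "Sex 2011", "Sex 2001"),
  type  = "dataset",
  stringsAsFactors = FALSE
)

# Keep only the ID and title, then keep rows whose title contains the
# literal text "sex 2011", ignoring upper and lower case.
dfFilter <- nodes %>%
  select(nid, title) %>%
  filter(str_detect(title, fixed("sex 2011", ignore_case = TRUE)))

print(dfFilter)
```

Using `fixed()` treats the search text as a plain string rather than a regular expression, which is all that is needed for a simple title search like this.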
I looked at the results:
I was only interested in the dataset for sex 2011, so I asked for the metadata for node 195:
metadata <- retrieve_node(nid = 195, as = "list")
I then wanted to know which resource node IDs dataset 195 has:
For simplicity, I looked at the first resource:
resource_metadata <- retrieve_node("196", as = "list")
As I wanted the csv file, I asked for the url to that file:
Unfortunately, this command failed in the version of the dkanr package I was using, but by looking through the package’s source code I worked out how to request the url myself. The developers have since fixed the issue, so hopefully you won’t run into it, but I’ll leave the code here just in case:
Now I could read the data into R and view it:
I was only interested in the name of the area, the 2 data columns and rows 6 to 16, so I subsetted the data:
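A subset like this can be sketched in base R as follows. The data frame here is made up: only `GEO_LABEL` and the name `xy` appear elsewhere in this walkthrough, while `census`, `GEO_CODE`, `MALES` and `FEMALES` are hypothetical stand-ins for the real table and its columns:

```r
# Made-up stand-in for the downloaded census table.
census <- data.frame(
  GEO_CODE  = sprintf("E%02d", 1:20),
  GEO_LABEL = paste("Area", 1:20),
  MALES     = seq(100, by = 10, length.out = 20),
  FEMALES   = seq(105, by = 10, length.out = 20),
  stringsAsFactors = FALSE
)

# Keep rows 6 to 16, and only the area name plus the two data columns.
xy <- census[6:16, c("GEO_LABEL", "MALES", "FEMALES")]

print(xy)
```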
When I first tried to plot the data it didn’t look how I wanted, so after some googling I used a package called reshape2 to reshape it:
xymelt <- melt(xy, id.vars = "GEO_LABEL")
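To show what `melt()` does here, this is a small sketch on made-up data. The column names `MALES` and `FEMALES` and the area names are hypothetical; only `GEO_LABEL` comes from the real table:

```r
library(reshape2)

# Made-up "wide" table: one row per area, one column per measure.
xy <- data.frame(
  GEO_LABEL = c("Area A", "Area B"),
  MALES     = c(10, 20),
  FEMALES   = c(11, 21),
  stringsAsFactors = FALSE
)

# Reshape to "long" form: one row per area and measure, with the measure
# name in a column called variable and its number in a column called value.
xymelt <- melt(xy, id.vars = "GEO_LABEL")

print(xymelt)
```

The long form is what makes it possible to map `variable` to colour in ggplot2, so each measure can be drawn as its own series.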
Then I graphed the results:
ggplot(xymelt, aes(x = GEO_LABEL, y = value, group =1, color=variable)) +