Experimenting with AI: My Experience Creating a Closed Domain QA Chatbot

James Brill, graduate developer, and Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service, introduce us to the development of an innovative chatbot for answering research data management queries.

A chatbot is a computer program which responds to user input, either textual (typing) or audio (speaking). The UK Data Service wanted to utilise this emerging technology to benefit its users by reducing the response time for its research data management (RDM) queries. Louise explains: “The idea was inspired by a presentation on ‘Cognitive systems redefining the library user interface’ given by Kshama Parikh at the International Conference on Changing Landscapes of Science and Technology Libraries in Gandhinagar, in the state of Gujarat, north-west India, late last year. I envisaged a new online service: Meet Redama, our new chatbot for research data management. You can ask Redama anything about research data management, and she will be available 24/7 to provide an instant response as best she can. And, most conveniently, the night before your research application is due in to the ESRC! Go on, ask her about consent forms, data formats and encrypting data – her breadth of expertise is excellent. However, with only three months to try it out, that was always a big dream!”

We hired James Brill, a recent graduate from the University of Essex, for a summer project to develop a chatbot tackling a closed domain question answering (QA) problem in the domain of ‘research data management’. He worked closely with his supervisor, Dr Spyros Samothrakis, Research Fellow in the School of Computer Science and Electronic Engineering. James highlights the main steps and some of the challenges that arose.

James explains: “The basic premise of a QA chatbot is to map a question to a relevant answer. This can be done in different ways, from machine learning to more statistical approaches such as cosine similarity (which still involve an element of machine learning to learn the initial word vectors). In other words, one can draw parallels between question answering and information retrieval. What follows is my experience of chatbot development.”

Step 1: Sourcing and cleaning the data

As with many AI problems, one first needs valid “real life” data which can form a useful baseline. Here at the UK Data Service we used two sources:

  • queries related to research data management, submitted from users to the Service via our web-based Helpdesk system
  • text from our fairly substantive RDM help pages

Once the first set of data was assembled, the next step was to clean and pre-process it so that it could be fed into a mapping function which tries to predict an answer for a given question. We did this by taking existing emails and putting each individual one in its own text file. This resulted in a collection of 274 plain text files.

The same was done with all the data management web pages – each page was placed in its own individual text file. This proved to be a time-consuming task: due to the initial file format of the data, both knowledge bases had to be generated by hand. Textual data often contains special escape characters such as newline (“\n”) or tab (“\t”) which need to be stripped from the text; this was done when separating the emails and web pages into individual files.

Once we had set up two simple knowledge bases, we then created a data management object. This object loads all the necessary scripts and acts as a simple interface between a chatbot and the data itself. The simple script I wrote is located in Datamanager.py. This way, one can change the underlying knowledge base structure (to a database instead of a collection of files, for example) and, because the Datamanager object acts as an application programming interface (API), the chatbot itself will not require any code changes.

Step 2: Calculating a Baseline

A baseline performance has to be established to see what can be achieved using someone else’s framework or by adopting a well-used approach. For this project we decided to use chatterbot, which we adapted to feed in questions and answers from our dataset using the DataManager object. To see the exact script, refer to https://github.com/jamesb1082/ukda_bot/tree/master/bot. However, this particular chatbot did not work well with our dataset due to its relatively small size. The examples given in the chatterbot project use rather large conversational corpora, such as the Ubuntu Dialogue Corpus. Having been trained on our corpus, it achieved a precision of 0 almost every time, so any model going forward would be at least as good as this particular framework on our dataset. This chatbot was a good learning experience to start off the project, as it highlighted some key issues around building a chatbot, such as storing and inputting data.

Step 3: Using Machine Learning/Statistical Techniques

Having now established our baseline, we approached the problem of designing a mapping function from a different angle: using machine learning and information retrieval techniques to generate relevant answers. First of all, it is important to establish how similarity and relationships between words can be modelled in a computer program. The modern approach is to use vector space models, which map each individual word to a unique point in vector space; in other words, each word is represented as a series of 100 numbers. The position of each word is relative to all other words in the space: words with similar meanings are close to each other, and the vector produced by subtracting one word’s vector from another defines the relationship between the two words. A common example is king − queen ≈ man − woman, where each word stands for its vector. How these vectors are generated is beyond the scope of this blog; suffice it to say they are critically important. Treating words as numbers means that mathematical calculations can be performed on lexical items.
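
The king/queen example can be demonstrated with tiny hand-picked toy vectors (real embeddings are ~100-dimensional and learned from a corpus, e.g. with word2vec or GloVe; everything below is invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 for
    similar directions, close to 0.0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "word vectors", hand-picked for illustration.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

# The king/queen and man/woman differences point the same way,
# capturing the shared relationship as a direction in vector space.
royal_diff = vectors["king"] - vectors["queen"]
plain_diff = vectors["man"] - vectors["woman"]
print(cosine_similarity(royal_diff, plain_diff))  # ≈ 1.0 for these toy vectors
```

The same cosine measure, applied to whole-question vectors, is what makes the “statistical” retrieval approach mentioned earlier work.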

Step 4: Siamese Neural Network

Since the previous two methods performed unsatisfactorily, we adopted a different approach, which centres on using “neural networks” to learn and generate a mapping function instead. As the dataset we are working with is rather small (only 171 correct QA pairs), we opted to use a Siamese neural network (SNN): a special type of neural network consisting of two identical networks which share a set of weights. The question vector is fed into one network and the answer vector into the other (see diagram below).

The distance between the outputs of the two networks is calculated, the idea being that the distance should be 0 when the answer is correct and 1 when it is not. The weights are updated depending on whether the answer was right or wrong and by how much. Essentially, by training the network in this manner, we learn a distance function between a question and an answer. This was the hardest theoretical part of the project; however, the actual coding was relatively straightforward, thanks to the very simple, modular API provided by Keras.
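
The training objective can be sketched in plain numpy (the real model was built with Keras; this sketch only illustrates the contrastive idea, with toy embeddings standing in for the two towers’ outputs):

```python
import numpy as np

def euclidean_distance(u, v):
    """Distance between the outputs of the two identical towers."""
    return float(np.linalg.norm(u - v))

def contrastive_loss(distance, label, margin=1.0):
    """label 0 = correct QA pair: pull the distance towards 0.
    label 1 = incorrect pair: push the distance out past the margin."""
    if label == 0:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2

# Toy tower outputs: after training, a question embedding should sit
# close to a correct answer's embedding and far from an incorrect one.
question = np.array([0.2, 0.7])
good_answer = np.array([0.25, 0.68])
bad_answer = np.array([0.9, 0.1])

print(euclidean_distance(question, good_answer))  # small
print(euclidean_distance(question, bad_answer))   # large
```

Minimising this loss over many correct and incorrect pairs is what shapes the shared weights into a useful distance function.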

Step 5: Results

After implementing and training the model on our dataset, we performed some testing to see how well it actually performed in different scenarios. The first test used the complete training set, to see how well the model “remembered” questions: it correctly identified 79% of them. It is important to note that one does not want 100% at this stage, as that is a common sign that the model has simply memorised the initial dataset rather than generalised the relationships between questions and answers. When we tested it on unseen questions, our model did not perform particularly well; however, we suspect that this is because some answers have only one relevant question, meaning that the model cannot generalise well from them.
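
The “remembered questions” test can be thought of as top-1 retrieval accuracy under the learned distance. A minimal sketch of such a scorer (function and variable names hypothetical):

```python
import numpy as np

def top1_accuracy(question_vecs, answer_vecs, correct_idx, distance_fn):
    """Fraction of questions whose nearest answer, under the learned
    distance function, is the correct one."""
    hits = 0
    for q, correct in zip(question_vecs, correct_idx):
        distances = [distance_fn(q, a) for a in answer_vecs]
        if int(np.argmin(distances)) == correct:
            hits += 1
    return hits / len(question_vecs)
```

Running this over the training questions with the Siamese network supplying `distance_fn` is one way a figure like 79% can be measured.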

Step 6: Further Improvements

Neural networks learn by being shown examples, and as a result the performance of a neural network depends on the quality of the dataset it is trained on. Although we upscaled our very small dataset to contain both correct and incorrect QA pairs, it often featured only one or two correct QA pairs for certain topics. To combat this issue, one could improve the dataset by not only asking more questions but also seeking a more uniform distribution of questions. For example, our distribution (see below) is uneven, with some very dominant peaks and a lot of answers which have very few questions pointing at them. We also used Stanford’s SQuAD to compare the model produced in this project directly against other chatbot models. To do this, we modified our data loading functions to read in the SQuAD training data and modified our script to output a prediction file.

James told us: “Overall, I think the project went well, with a basic interface chatbot having been created. With possible expansion to other datasets and some tuning of parameters, I believe the main method implemented in this project could be reasonably successful.”

James was embedded into the Big Data Team at the UK Data Service for the duration of the chatbot work. Nathan Cunningham, Service Director for Big Data, said: “It’s really important that we start to investigate new methodologies for understanding and structuring our data challenges. Over the next couple of years we aim to organise our collection and resources with enough context and information to make sense to anyone reading, querying or using the Service. The chatbot work was an initial pilot study in using machine and deep learning algorithms to construct a knowledge base and answer questions. We now have a better understanding of what we need to do to build a UK Data Service knowledge base and how this will improve our data services.”

James’ supervisor, Spyros, says: “I was delighted to work with the UK Data Service in bringing this exciting project to life – a good question-answering bot can greatly decrease first response times for users’ queries. A number of exciting developments in neural networks were incorporated into the bot – it is not just trying to assess the presence or absence of certain words, but aims to infer the exact semantic content of the questions.” Louise says: “James has moved on to the Advanced Computing – Introduction to Machine Learning, Data Mining and High Performance Computing course at the University of Bristol and we wish him well in his career. We are hopeful that we can attract another excellent Master’s student to work on implementing a chatbot for us.”


What’s in there? From encoding meaning in legacy data access platforms, to frictionless data

The UK Data Service’s Victoria Moody and Rob Dymond-Green on freeing data from legacy data access platforms.

Legacy data access platforms are beginning to offer us a window into a recent past, giving a sense of the processing and memory limitations developers had to work with – limitations now difficult to imagine.

These platforms offer a sense of how data access models were defined by technology that at the time looked like a limitless opportunity to construct new architectures of data access – designed to be ruthless in their pursuit of a universal standard of full query return, that is, identifying all possible query lines and being engineered to fetch them. The result is an unsustainable technical debt which precludes discovery and analysis by newer workflows.

Although designed for form to follow function, it is their human-scale elements which surface and make these platforms endearing, as well as brilliant in the way they approached the problems they solved. They proliferated, but how many endure, especially from the late 1990s to early 2000s? The team here is interested in first-generation data access platforms still in use – we’d love you to tell us about any you still use.

In the late 1990s the team making UK census data available to UK academics and researchers designed Casweb, which gives access to census aggregate data from 1971 to 2001. Casweb is nearly 20 years old and survives – it is still used by many thousands doing research using census aggregate data, returning tables which users can select by searching their preferred geography and then topic.

Image: The Casweb data engine returning data in ‘steampunk’ style

InFuse – the next-stage innovation after Casweb – gives access to 2001 and 2011 census aggregate statistics. It is also used by thousands each year and offers the user the chance to select their data by geography or by topic.

Image: InFuse

Casweb is now a platform at risk: eventually it may no longer be compatible with browsers in use, and its programming language predates modern standards. Now that people can fit a ’90s supercomputer in their mobile ’phone, the capacity for data access offers a different, more prosaic but more communal route.

Image: Legacy Casweb code – “Please install Netscape 3.0”

The data are now all open, but that doesn’t mean they are discoverable or straightforward to get. We’re taking the opportunity to retrieve the data built into Casweb’s table-based web pages and to free the data from the prescribed routes that both Casweb and InFuse offer, turning instead to routes that improve discoverability and usability – cost-effectively, and aligned with the latest standards for frictionless data.

Our plan is to go back to basics and deliver the minimum viable product, as discussed in John Matthews’ post How we learned to stop fearing technical debt and love frameworks using DKAN.

But we need to do some work. Jamey Hart’s post, Making Metadata for the 1971–2001 Censuses Available, describes a project he is working on in which we are liberating metadata from Casweb and making it more usable for our users. We are going to go a step further and export the data which go with this new metadata, to create datasets to load into our shiny new DKAN instance.

Luckily, the metadata in InFuse is stored within the application in an easier-to-digest format than Casweb’s, so we are also in the process of exporting it, together with the data, into a set of lightweight data packages. Our aim is to format the data and metadata from Casweb and InFuse to use the same structures, which means that our users won’t have to grapple with learning how the data are structured when moving between censuses.
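
Under the Frictionless Data approach, each lightweight data package is just the CSV files plus a small JSON descriptor; a minimal illustrative datapackage.json (the dataset name and fields below are invented, not one of our real tables) looks like:

```json
{
  "name": "census-2011-population-by-age",
  "title": "2011 Census: population by age (illustrative example)",
  "resources": [
    {
      "name": "population-by-age",
      "path": "data/population-by-age.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {"name": "geography_code", "type": "string"},
          {"name": "age_band", "type": "string"},
          {"name": "count", "type": "integer"}
        ]
      }
    }
  ]
}
```

Because the descriptor travels with the data, any tool that understands the Data Package format can read the schema without platform-specific code.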

We’ll start by exporting data from InFuse 2011, using information from our download logs to determine which are our most popular datasets. For our initial data release we’ll be delivering 50 datasets through DKAN, and we will continue to release data in tranches of 50 until we’ve made all our data available as CSV.

But we’re interested in hearing how you’d like us to approach these dataset releases. One option could be for us to release 50 datasets from 2001 which cover the same type of topics as the ones in 2011, then equivalent datasets from 1991 and back to 1971, thus building up a time series of data. That’s just one idea; you may of course have other ideas about the priorities we should use when releasing this data – please get in touch and help shape how this process goes.

We’ll take care of and maintain Casweb and InFuse, but won’t add any data or develop them any further. We’ll also move away from impenetrable acronyms when naming our data access platforms (although ‘Cenzilla’ was mooted at one point…).

Ultimately, we aim to free the data for better discovery and, in the spirit of frictionless data, make it easy to share and use, whether that’s in CSV, Excel, R or Hadoop.

Please let Rob know what data combinations you’re using from the censuses from 1971 onwards, to help us structure our retrieval schedule.

Engaging more people with more data: Developing the data quiz app

Ralph Cochrane and the team at AppChallenge have been working with the UK Data Service to support new ways of engaging more people with data, mainly through mobile app development, hackathons and coding competitions. Ralph introduces the UK Data Service’s new quiz app:

Our latest initiative is the UK Data Service Quiz App (available for free on both Android and Apple devices), which builds on ideas that we crowd-sourced from developers around the world and on discussions with the team at the Service. We wanted to find a way to bring some of the open datasets to life and appeal to a wider audience, not just the data science community.

Quiz Master – the content management system

The app is driven by a content management system (built using PHP and MySQL) which is hosted on the UK Data Service infrastructure. Similar to WordPress, it’s a web-based system that allows members of staff to add a new quiz, create questions and even run their own internal games to test how well the questions will be received.

Mobile App development

Developing the apps took a little more time. We focused first on the iOS app for Apple devices, which has been developed in Swift, a language provided by Apple. We’ve been impressed with the rigour with which Apple tests apps before they are published on the App Store, even if it’s been painful for us at times when they’ve found little things that they consider to be a problem. If you have an iPhone, why not try the UK Data Service Quiz App and tell us what you think on Twitter @AppChallenge?

Moving on to Android, we have just issued an update to the app which brings it closer to the iOS app, in particular fixing an issue with offline use when there is little or no mobile reception. There are many more Android phones out there, but that is also a double-edged sword: designing an app that will work well on a wide range of devices with different capabilities is not easy. There’s also the fact that many Android users don’t update their ’phone regularly, so even the underlying operating system can have many variants.

Lessons learned

Creating something intuitive and fun in an app that visualises data in an easy-to-engage-with way is not easy. Since our first “finished” version of the iOS app reached the App Store, we’ve issued eleven minor updates. Most fixes are to the user interface or deal with highly unlikely scenarios (e.g. starting the install and then losing internet access), but we’ve learnt a lot along the way.

Engagement is harder than it sounds, too. We’ve added functionality such as being able to send all iOS users notifications (e.g. “It’s the FA Cup Final today. Have you tried our football quiz?”) because we noticed that new users would install the quiz app and then only use it once. There is real skill in both creating questions that are interesting and promoting the app online to generate more downloads. Perhaps the biggest lesson is that the app is merely one tool in the toolbox for reaching new potential users of UK Data Service data. It works best when it is integrated with topical events and other outreach programmes, e.g. via Twitter, Facebook and the blog.

Football data quiz

We’re continuing to promote the app online and directly within the app stores in the UK, and to date we’ve had about 1,000 downloads. If you know anyone who likes general knowledge quizzes, why not ask them to have a go and give us some feedback? After all, it is free.

Mining data from Impact Case Studies in HEFCE’s REF2014 database

We wanted to mine data from HEFCE’s API of Impact Case Studies submitted to REF2014, to find out how many UK Data Service data collections were used in REF2014 Impact Case Studies – and which case studies used them. Basic analysis of the database showed high usage of data, but very low citation using persistent identifiers (data DOIs).

John Matthews, software engineer in the census support team, talks us through the process.