Previous MSc Theses
"Extracting Data from Insurance Documents with Natural Language Processing and Machine Learning." J. MacKenzie. W. Wallace. Department of Computer and Information Sciences, University of Strathclyde. 2019.
Abstract:
The purpose of this study is to develop software that automatically extracts information from insurance documents, saving time and reducing errors. The insurance industry is ripe for disruption: it relies on legacy systems and heterogeneous data sources, and still operates much as it did almost 100 years ago. Major innovation is needed to protect companies’ market position from outside forces, and this can be achieved by embracing modern technology.
Until recently, extracting specific information from PDFs and other semi-structured or unstructured documents has mostly been done by humans, because the task is complex and requires an understanding of the content. Past attempts to automate the process involved developing rule-based systems to extract the necessary information. This approach works to an extent, but it struggles when a document is updated, something new is added, or a new structure is adopted. This is where machine learning comes in. Machine learning lets us feed a computer annotated examples of the types of data we are trying to extract; the computer then tries to make predictions when shown data that is not annotated. During this project I explored multiple avenues for solving the problem and, in the end, came back to the idea of training a custom model for this specific task. Because of the extraordinary amount of training data that would need to be created and annotated, and the limited timescale, I decided to solve the problem on a smaller scale by focusing on one type of document from a specific insurer.
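The weakness of rule-based extraction can be illustrated with a minimal sketch. It is not taken from the thesis: the field names, the regular expressions, and both document layouts are hypothetical, chosen only to show how hand-written rules tied to one layout return nothing once the wording of the document changes.

```python
import re

# Hypothetical rules written for one imagined document layout.
RULES = {
    "policy_number": re.compile(r"Policy Number:\s*([A-Z]{2}\d{6})"),
    "premium": re.compile(r"Total Premium:\s*GBP\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Apply each rule to the text and return {field: value or None}."""
    result = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


# Layout the rules were written for (invented example text).
original_layout = "Policy Number: AB123456\nTotal Premium: GBP 1,250.00\n"

# Same information after an imagined redesign of the document.
revised_layout = "Policy No. AB123456\nPremium payable (total): GBP 1,250.00\n"

original_fields = extract_fields(original_layout)
revised_fields = extract_fields(revised_layout)

print(original_fields)
print(revised_fields)
```

In this sketch the rules recover both fields from the layout they were written for and neither field from the revised layout, even though the information is still present. A model trained on annotated examples is intended to generalise across such variations, although how well it does so depends on the training data.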
During this dissertation I surveyed the current market for extracting information from PDF documents and concluded that the best-suited method for this project was machine learning and natural language processing. By working closely with the insurance company, I was able to identify the information that needed to be extracted and build a piece of software around it. The software was written in Python, with the natural language processing and machine learning handled by the spaCy NLP library. The first iteration of the software achieved 82% accuracy when extracting information from a test document. This first build has served as a successful proof of concept for a larger piece of work. I will now be working with the insurance company for the foreseeable future to further develop the data extraction software and build a web application component to integrate with their online broker system.
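The abstract does not say how extraction accuracy was calculated. One plausible definition, shown here purely as an assumption, is the fraction of expected fields whose extracted value exactly matches a hand-checked reference. The function name, field names, and values below are hypothetical and are not results from the thesis.

```python
def field_accuracy(predicted, expected):
    """Return the fraction of expected fields whose predicted value matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    correct = sum(
        1 for field, value in expected.items() if predicted.get(field) == value
    )
    return correct / len(expected)


# Hand-checked reference values for one imagined test document.
expected = {
    "policy_number": "AB123456",
    "insured_name": "Jane Smith",
    "start_date": "2019-01-01",
    "end_date": "2019-12-31",
    "premium": "1,250.00",
}

# Imagined extractor output: one field (end_date) is wrong.
predicted = {
    "policy_number": "AB123456",
    "insured_name": "Jane Smith",
    "start_date": "2019-01-01",
    "end_date": "2019-12-13",
    "premium": "1,250.00",
}

accuracy = field_accuracy(predicted, expected)
print(f"{accuracy:.0%}")
```

Exact matching is a strict choice: a value that is nearly right, such as a date with two digits transposed, counts as entirely wrong. Other definitions, such as token-level precision and recall, would give different figures for the same output.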