@MASTERSTHESIS{pgi2018001,
  author     = "A. Oberacker",
  supervisor = "G. Weir",
  title      = "Textual Analysis for Document Forensics",
  school     = "Department of Computer and Information Sciences, University of Strathclyde",
  year       = "2017",
  abstract   = "In this dissertation, text categorisation is applied to three datasets originating from the surface and dark web: two deal with extremism-related and drug-related texts, whereas the third contains texts from four distinct contexts. Machine learning algorithms are applied to a numerical representation of the texts generated by a tool called Posit. This tool produces statistical data such as average sentence length and the number of occurrences of part-of-speech types and tokens. In addition to those values, bi-gram ratios are calculated from their frequency in the texts compared to their prevalence in natural language. The goal of this dissertation is to assess how effective this limited set of 30 features is for multiple classification algorithms. After conducting several experiments covering a wide range of settings, the research showed that classifications based on the Posit and bi-gram features were highly accurate on two of the three corpora. The third dataset demonstrated that classifying a text corpus with multiple contexts is difficult with the given feature sets. This research showed that reducing a text corpus to the numerical information given by Posit is a powerful way of efficiently classifying large datasets.",
}