Analysing hate crime in Lancashire – Application of natural language processing for identification of online hate on Twitter

Understanding hate crime is a priority for police forces across England and Wales. Since the EU referendum in June 2016, there has been a renewed emphasis on the importance of preventing hate crime and providing support for victims.

The Home Office reported a 29% increase in the number of hate crimes recorded between 2015-16 and 2016-17, the largest increase since it began recording figures in 2011-12. In addition, the Crime Survey for England and Wales (CSEW) has indicated that up to six times as many hate crime incidents occur every year as are reported in police figures.

Our researchers collaborated with Lancashire Constabulary on two projects to improve understanding of hate crime in the local area:

Analysis of police-recorded hate crime
Application of natural language processing for identification of online hate on Twitter

Data and methods

The research was based on the spatial analysis of tweets sent by people in Lancashire during the study period (December 2015 – February 2017). These tweets were identified based on both the hometown displayed on the profile of Twitter users and the precise geotags of where the tweets were sent from. In total, 1,246,918 tweets with a hometown location and 389,410 tweets with a geotag were collected within the boundaries of Lancashire.

A classifier was built using Natural Language Processing (NLP) techniques to determine whether tweets contained hateful speech. This first involved training a machine learning algorithm with tweets that had been manually classified. The accuracy of the trained algorithm was then tested on new tweets for which the classification outcome was already known.
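The train-then-test workflow described above can be sketched in a few lines of Python. This is a minimal illustration only: the project does not state which algorithm or features it used, so the multinomial Naive Bayes model, the whitespace tokeniser, the function names and the tiny invented dataset below are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(labelled_tweets):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_tweets:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for one tweet."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labelled_tweets):
    """Share of held-out tweets whose predicted label matches the known label."""
    correct = sum(predict(model, text) == label for text, label in labelled_tweets)
    return correct / len(labelled_tweets)


# Invented placeholder data; real training tweets would be manually classified.
training_set = [
    ("i hate those people they should leave", "hateful"),
    ("those people are vermin", "hateful"),
    ("lovely walk along the coast today", "not_hateful"),
    ("great match at the weekend", "not_hateful"),
]
# Held-out tweets whose classification is already known, used only for testing.
test_set = [
    ("they are vermin and should leave", "hateful"),
    ("lovely weekend at the coast", "not_hateful"),
]

model = train(training_set)
held_out_accuracy = accuracy(model, test_set)
```

Keeping the test tweets separate from the training tweets is what makes the accuracy figure meaningful: it measures how the model handles text it has not seen before. A real evaluation would need far more labelled data and would normally report precision and recall alongside accuracy, since hateful tweets are likely to be a small minority.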

Once the classifier was confirmed to have a reliable level of accuracy, it was applied to the Lancashire tweets. This identified hateful tweets which were displayed on density maps at county (see Figure 1) and town level.

*Figure 1: Density map of geotagged hateful tweets in Lancashire for the study period*
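As a rough illustration of how geotagged tweets can be turned into a density surface, the sketch below counts points per cell of a regular grid. It is a simplified stand-in, not necessarily the method used to produce the project's maps; the bounding box is an approximate rectangle chosen for the example rather than the official county boundary, and the coordinates and the `density_grid` function are invented.

```python
from collections import Counter


def density_grid(points, lat_min, lat_max, lon_min, lon_max, n_rows, n_cols):
    """Count points falling in each cell of a regular latitude/longitude grid."""
    counts = Counter()
    for lat, lon in points:
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue  # skip tweets outside the bounding box
        # Scale each coordinate to a 0..n range and truncate to a cell index.
        row = int((lat - lat_min) / (lat_max - lat_min) * n_rows)
        col = int((lon - lon_min) / (lon_max - lon_min) * n_cols)
        counts[(row, col)] += 1
    return counts


# Invented (latitude, longitude) pairs standing in for tweets flagged as hateful.
flagged_tweets = [
    (53.76, -2.70),
    (53.77, -2.71),
    (54.05, -2.80),
    (55.00, -2.50),  # outside the box, so it is ignored
]

grid = density_grid(
    flagged_tweets,
    lat_min=53.5, lat_max=54.3,   # assumed, approximate bounds
    lon_min=-3.2, lon_max=-2.0,
    n_rows=4, n_cols=4,
)
```

The grid resolution controls the trade-off between detail and reliability: finer cells show more local structure but leave fewer tweets in each cell.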

Key findings

The analysis showed that it is possible to build an English-language classifier that accurately identifies online hate speech on Twitter.

However, because the classifier identified only a limited number of geotagged tweets as hateful, geographical interpretation was less meaningful at the level of individual towns than at county level.

Impact

This project highlighted how valuable social media data can be when dealing with under-reported crimes. Twitter produces real-time data, which helps to generate a spatial and temporal ‘temperature check’ of different localities.

The algorithm developed in this project could be used by Lancashire Constabulary to monitor levels of hate, helping to ensure that resources are allocated effectively in response to emerging community tensions.

Researchers

Natacha Chevenoy, Leeds Institute for Data Analytics and School of Law, University of Leeds
Carly Lightowlers, University of Liverpool
Nick Malleson, Consumer Data Research Centre and School of Geography, University of Leeds

Partners

Lancashire Constabulary

This project was completed as part of the Leeds Institute for Data Analytics internship scheme.