In my second year of university, I was tasked with gathering and cleaning government data in order to work out the best towns in the West Midlands and Warwickshire to relocate to. The data covered average broadband speed, average house prices, and crime rates.
All the data was processed and visualised using the R language to create a clean database and several graphs. There were many steps involved in acquiring, cleaning and presenting the data, but to reduce the length of this page, steps of lesser significance have been omitted.
Cleaning the Datasets
The data was collected from different government bodies, such as Ofcom for broadband speed and the Office for National Statistics for house prices. The format of each dataset varied. The population data (needed to calculate crime rate) had to be cleaned manually in Excel first, because it was spread across multiple sheets, each containing buttons, which would have made processing it in R more difficult.
For crime, a full year of data was downloaded rather than a single month, to avoid bias from crime rates varying throughout the year (e.g. from temperature, weather, length of day, etc.).
The group_by and filter functions focus on selecting only towns in the two previously mentioned areas, as the crime dataset contained towns from all across the UK. The following lines join the population dataset to the crime dataset, allowing for crime rate to be calculated.
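As a rough illustration of the filtering and joining described above, here is a minimal sketch using dplyr. The data frames, column names (`town`, `county`, `population`), town names and values are all hypothetical stand-ins rather than the real datasets. Expressing the crime rate per 1,000 residents is also an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the crime dataset: one row per recorded crime.
crime <- data.frame(
  town   = c("Town A", "Town A", "Town B", "Town C"),
  county = c("West Midlands", "West Midlands", "Warwickshire", "Elsewhere")
)

# Hypothetical stand-in for the manually cleaned population dataset.
population <- data.frame(
  town       = c("Town A", "Town B"),
  population = c(1000, 2000)
)

crime_rate <- crime %>%
  # Keep only towns in the two areas of interest.
  filter(county %in% c("West Midlands", "Warwickshire")) %>%
  # Count the recorded crimes for each town.
  group_by(town) %>%
  summarise(crimes = n(), .groups = "drop") %>%
  # Attach each town's population so a rate can be calculated.
  inner_join(population, by = "town") %>%
  # Crimes per 1,000 residents (the unit is an assumption).
  mutate(crime_rate = crimes / population * 1000)

print(crime_rate)
```

Using `inner_join` drops any town without a population figure, which seems the safer default here because a crime rate cannot be calculated without one.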
Broadband speed was very simple to clean, and mean broadband speed was chosen over other types of average.
The house price data was similarly straightforward to clean.
To calculate the score of each town, a normalising equation was used to take each data point and give it a score between 0 and 10 depending on where that data point falls in the overall range. For example, consider the following average download speeds.
The maximum download speed of 77.6 would score 10 and the minimum speed of 46.1 would score 0. All other download speeds receive a score between 0 and 10 according to where they fall between the minimum and maximum.
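The scoring described above is a min-max rescaling onto a 0 to 10 range, which could be sketched in base R as follows. The function body is an assumption about how `normalize` works, and `normalize_reverse` is assumed to simply flip the scale, so that lower values (such as house prices or crime rates) score higher. The middle speed of 61.85 is a made-up value chosen to sit halfway between the two figures quoted.

```r
# Rescale a numeric vector so that its minimum maps to 0 and its maximum to 10.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x)) * 10
}

# Assumed counterpart for "lower is better" measures: the minimum maps to 10
# and the maximum to 0.
normalize_reverse <- function(x) {
  10 - normalize(x)
}

# 46.1 and 77.6 are the minimum and maximum quoted above;
# 61.85 is a hypothetical midpoint added for illustration.
speeds <- c(46.1, 61.85, 77.6)

print(normalize(speeds))          # 0, 5, 10 (up to floating-point rounding)
print(normalize_reverse(speeds))  # 10, 5, 0 (up to floating-point rounding)
```

This sketch would divide by zero if every value in a category were identical, so real data with no spread would need handling separately.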
For some people, one category may be of more importance than others. For example, someone who is more concerned about the crime rate can apply a modifier to the rating for crime rate in the results, which can cause results to shift.
As you can see, prioritising towns with lower crime rates changes the order slightly.
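One way such a modifier could work is as a weight multiplied into the crime score before the category scores are summed. The sketch below assumes that approach. The towns, the scores and the `total_score` helper are all hypothetical and only show how a weight can reorder the ranking.

```r
# Hypothetical 0-10 category scores for three placeholder towns.
scores <- data.frame(
  town        = c("Town A", "Town B", "Town C"),
  broadband   = c(9, 5, 4),
  house_price = c(7, 6, 4),
  crime       = c(3, 7, 5)
)

# Sum the category scores, multiplying the crime score by a modifier.
# A crime_weight of 1 means no modifier is applied.
total_score <- function(df, crime_weight = 1) {
  df$total <- df$broadband + df$house_price + crime_weight * df$crime
  df[order(-df$total), c("town", "total")]
}

print(total_score(scores))                    # unweighted ranking
print(total_score(scores, crime_weight = 2))  # crime counts double
```

With these made-up numbers, Town A ranks first when no modifier is applied, while doubling the weight on crime moves Town B to the top.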
When applying no modifiers, the results suggest that Dudley is the best town to consider moving to based on the measured factors.
When comparing the overall score with population, there appears to be a positive correlation, especially when Birmingham is excluded as an obvious outlier: it is the second most populous city in the entire UK, whereas all the other towns are far more comparable in size.
This trend makes sense, as there is also a positive correlation between download speed and population, but no correlation between population and crime rate, and only a very weak correlation between population and house price (whose score is calculated using normalize_reverse, so the score rises as population increases).
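The effect of a single large outlier on a correlation like this can be checked with base R's `cor` function. The figures below are invented purely for illustration and are not the project's results. "Big City" stands in for an outlier such as Birmingham.

```r
# Hypothetical populations (in thousands) and overall scores.
towns <- data.frame(
  town       = c("Town A", "Town B", "Town C", "Town D", "Big City"),
  population = c(50, 80, 120, 150, 1100),
  score      = c(12, 14, 17, 19, 18)
)

# Pearson correlation across every town, outlier included.
r_all <- cor(towns$population, towns$score)

# The same correlation with the outlier removed.
no_outlier <- subset(towns, town != "Big City")
r_no_outlier <- cor(no_outlier$population, no_outlier$score)

print(r_all)
print(r_no_outlier)
```

In this made-up data, the correlation is noticeably stronger once the outlier is removed, which is the pattern described above.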