The Curse of Dimensionality
Visualisation is one of the primary ways we gain insight from data. One of its challenges is that it tends to work best when the data has just one or two dimensions. However, the datasets we are interested in often have many more dimensions than this.
One way to make progress is to use analytic techniques, collectively known as ‘dimensionality reduction’, to transform our data into a lower-dimensional form that we can visualise.
Case Study
In the following example, we’ll be applying ‘dimensionality reduction’ to the results of the 2010 and 2017 UK General Elections.
We would like to understand the voting characteristics of every constituency in each of these elections. Typical visualisations for this type of data present a map where each constituency is coloured according to the winning party. While this is an effective way of understanding voting patterns where one party is dominant, it can give a misleading view of voting behaviour in constituencies where the vote is more evenly divided.
We can gain further insight into these voting patterns by using a dimensionality reduction algorithm called UMAP.
UMAP
UMAP (Uniform Manifold Approximation and Projection) is a recent dimensionality reduction algorithm developed by the mathematician Leland McInnes. Using UMAP, we can take a large number of high-dimensional data points and plot them on a two-dimensional chart. UMAP tries to position each data point close to the data points it has most in common with, while keeping it away from those it has least in common with.
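To make "most in common" a little more concrete: each data point can be treated as a list of numbers, and two points have a lot in common when the distance between those lists is small. The sketch below is not UMAP itself. It only illustrates the neighbour-finding idea, using the Python standard library. The constituency labels and vote shares are invented for illustration, and straight-line (Euclidean) distance is an assumed choice of measure.

```python
import math

# Hypothetical vote shares (fractions of the vote) for four made-up
# constituencies. Columns: Conservative, Labour, Liberal Democrat, Other.
vote_shares = {
    "A": [0.52, 0.28, 0.15, 0.05],
    "B": [0.50, 0.30, 0.14, 0.06],
    "C": [0.22, 0.58, 0.12, 0.08],
    "D": [0.25, 0.55, 0.13, 0.07],
}


def nearest_neighbour(name, points):
    """Return the other point whose vote-share list is closest to `name`."""
    others = (other for other in points if other != name)
    return min(others, key=lambda other: math.dist(points[name], points[other]))


for constituency in vote_shares:
    print(constituency, "is most similar to", nearest_neighbour(constituency, vote_shares))
```

In this made-up data, A and B pair up, as do C and D. A layout that respects these neighbours would place each pair close together and the two pairs further apart.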
The UMAP process is a bit like trying to sort a crowd of people by their interests. You ask each person to find out the interests of the other people in the crowd. Each person then tries to position themselves closest to the people they have most in common with, while distancing themselves from the people they have least in common with. In the end, you may find that clusters begin to emerge. For example, individuals interested in football group together while distancing themselves from a group of outdoors enthusiasts.
Crucially, the important information is the distance between the people. Their absolute position in the space, and the axes of the chart, do not matter.
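A small sketch of why the axes carry no meaning: if a whole layout is rotated and shifted, every coordinate changes, yet the distances between points stay the same. The three positions below are hypothetical stand-ins for a finished layout, not real UMAP output, and the rotation angle and shift are arbitrary.

```python
import math

# Hypothetical 2D positions, standing in for a finished layout.
layout = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]


def pairwise_distances(points):
    """Distances between every pair of points, in a fixed order."""
    return [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]


def rotate_and_shift(points, angle, dx, dy):
    """Rotate every point about the origin by `angle` radians, then shift it."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        (x * cos_a - y * sin_a + dx, x * sin_a + y * cos_a + dy)
        for x, y in points
    ]


moved = rotate_and_shift(layout, angle=math.radians(40), dx=10.0, dy=-7.0)

before = pairwise_distances(layout)
after = pairwise_distances(moved)

# The coordinates have all changed, but the distances agree
# (up to floating-point rounding).
print(all(math.isclose(b, a) for b, a in zip(before, after)))
```

Two charts that differ only by a rotation or a shift therefore tell the same story about which points sit close together.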
Please note that both of these graphs are interactive.
Reading the chart
Each bubble on the chart represents a constituency. The position of each bubble is set by UMAP according to the percentage of the vote that each party received in that constituency in that election. The colour of each bubble shows the party that won the constituency. The size of each bubble reflects the size of the winning party's majority.
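As a rough sketch of how those three properties could be derived from raw results, the Python below computes vote percentages, the winner and the majority for a single constituency. The vote counts are invented, the function name is hypothetical, and the majority is assumed to be the winner's votes minus the runner-up's votes. This is not necessarily how the charts here were prepared.

```python
# Hypothetical vote counts for one made-up constituency.
votes = {"Conservative": 21000, "Labour": 18500, "Liberal Democrat": 9000, "Other": 1500}


def bubble_attributes(votes):
    """Derive the inputs for one bubble from raw vote counts."""
    total = sum(votes.values())
    # Percentage of the vote per party: the values UMAP would position.
    shares = {party: 100 * count / total for party, count in votes.items()}
    # Rank parties by votes to find the winner and the runner-up.
    ranked = sorted(votes, key=votes.get, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    return {
        "shares": shares,  # drives the bubble's position
        "winner": winner,  # drives the bubble's colour
        "majority": votes[winner] - votes[runner_up],  # drives the bubble's size
    }


bubble = bubble_attributes(votes)
print(bubble["winner"], bubble["majority"], bubble["shares"])
```

In this made-up case the winning party takes well under half of the vote, which is exactly the kind of detail that colouring a map by winner alone would hide.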
Data Stories
Each of these graphs presents us with a number of interesting insights. Most notably:
1. In both the 2010 and the 2017 elections, we see that there are in fact three separate election races taking place – one in England and Wales, one in Scotland and one in Northern Ireland.
2. In England and Wales, between 2010 and 2017, we see a movement away from a three-way race between Labour, the Conservatives and the Liberal Democrats, towards a more distinct, two-way race between Labour and the Conservatives. The change in shape between 2010 and 2017 highlights greater polarisation between constituencies.
3. Between 2010 and 2017, we see voting behaviour in Scotland shift significantly, with Labour losing several seats to the SNP. Scotland as a whole becomes much more politically diverse, with significant growth being made by both the Conservatives and the Liberal Democrats.
Conclusion
Using data to enable meaningful action first requires that the data can be understood.
At Analytics Engines, we deliver data analytics solutions that reduce complexity, provide insight and enable action.
UMAP is one of the many tools that we use to provide organisations with the insight needed to support their operations.
To find out more about how our solutions can support your organisation, speak with us using the form below.