Grouping Prime Ministers by Wikipedia links

The chart displays each Prime Minister, plotted against the reduced dimensions which display the most variance in their attributes.

Grouping politicians can be a tricky problem for any kind of numerical analysis. Assessing their impact, background and personality lends itself more naturally to qualitative than quantitative methods. Yet numerical analysis is needed to answer important questions, from how popular a leader is to whether a party can win a popular vote. Effectively grouping politicians can be a good first step, and a useful component of an ensemble learning approach.

Here the Wikipedia page of every Prime Minister since Robert Walpole has been scraped, counting the number of times each page links to other Wikipedia pages. This gave a data set with 13,248 topic counts for each of the 54 Prime Ministers considered, which was reduced using Principal Component Analysis and grouped using K-Means Clustering. This enabled me to interpret each Prime Minister as a coordinate in multidimensional space, grouping them based upon their distance from one another.
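To make the idea of "a Prime Minister as a coordinate" concrete, here is a minimal sketch of the counting and distance steps. It is not the code behind the chart: it uses Python's standard-library HTMLParser rather than Beautiful Soup so that it runs without extra packages, the HTML fragments and the names pm_a, pm_b and pm_c are invented, and the rule for skipping namespaced links such as "File:" is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt


class WikiLinkCounter(HTMLParser):
    """Counts how often a page links to each internal /wiki/ article."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: keep article links only, skipping namespaced
        # pages such as "File:" or "Help:".
        if href.startswith("/wiki/") and ":" not in href:
            self.counts[href[len("/wiki/"):]] += 1


def link_counts(html):
    """Return a Counter mapping linked topic -> number of links."""
    parser = WikiLinkCounter()
    parser.feed(html)
    return parser.counts


def distance(a, b):
    """Euclidean distance between two link-count vectors."""
    topics = set(a) | set(b)
    # A Counter returns 0 for topics a page never links to.
    return sqrt(sum((a[t] - b[t]) ** 2 for t in topics))


# Invented fragments standing in for downloaded articles.
pages = {
    "pm_a": '<a href="/wiki/Napoleon">x</a> <a href="/wiki/Napoleon">y</a> '
            '<a href="/wiki/Tory">z</a>',
    "pm_b": '<a href="/wiki/Napoleon">x</a> <a href="/wiki/Tory">z</a> '
            '<a href="/wiki/File:Portrait.jpg">img</a>',
    "pm_c": '<a href="/wiki/Labour_Party_(UK)">x</a> '
            '<a href="/wiki/Labour_Party_(UK)">y</a>',
}
vectors = {name: link_counts(html) for name, html in pages.items()}

print(distance(vectors["pm_a"], vectors["pm_b"]))  # pages sharing most links
print(distance(vectors["pm_a"], vectors["pm_c"]))  # pages sharing no links
```

Pages that link to the same topics a similar number of times end up close together, which is the property the clustering step relies on.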

Using Wikipedia links as attributes has two key benefits. The first is consistency: as every link to a topic needs to point at the same page, identifying shared attributes is easy. The second is that using links as attributes avoids a costly search for key words using text analysis.

Wikipedia links are undoubtedly an imperfect measure of a Prime Minister's attributes. For instance, most key topics are only linked to once in an article. Consider Napoleon’s Wikipedia page, which is linked from eight of the Prime Ministers' Wikipedia pages: Lord John Russell, The Earl of Aberdeen, Henry Addington, Spencer Perceval, The Earl Grey and Stanley Baldwin all have one link, William Pitt the Younger has two and The Duke of Wellington has four. Henry Addington was Prime Minister during the Napoleonic Wars, which dominated his premiership. Yet Stanley Baldwin, born almost half a century after Napoleon’s death, gets the same weighting as Henry Addington. This is down to a single quote, “Can’t we turn Hitler East? Napoleon broke himself against the Russians. Hitler might do the same”, where the term Napoleon is linked. The measure is imperfect, but the results are nonetheless effective by most measures.

The clusters appeared to capture time period and party very effectively. Time period makes sense: as in the previous example, Napoleon would not be expected to be mentioned for politicians of the 20th century. Differentiation by party is aided by the number of Wikipedia links to associated topics. For instance, a member of the Conservative party is likely to have a link not only to the Conservative party, but also to other Wikipedia pages associated with the party such as the Conservatives Women’s Association, LGBT Conservatives and Chairman of the Conservative party.

What is most interesting are the Prime Ministers who have each been assigned a cluster of their own, notably Margaret Thatcher, Winston Churchill, Henry Asquith and the Duke of Wellington. Three of them either fought or led in major global conflicts such as the Napoleonic Wars or the two World Wars. Thatcher, Asquith and Churchill each served in Number 10 for a total of at least eight years.

Some of the smaller groups make very clear sense, such as the grouping of Modern Labour (Cluster 10) and Modern Conservative leaders (Cluster 2). Going further back in time, the clusters become larger and start to fail to distinguish between party lines. This may be due to shorter Wikipedia articles and sparser content. There are also a few abnormalities; for instance, Benjamin Disraeli has been grouped with politicians 50 years ahead of his time. As I didn’t assign starting points for the centre of each cluster there is some randomisation in the results; however, when reproduced the clusters remain broadly similar.
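The effect of random starting centres can be seen in a small, dependency-free sketch of K-Means (Lloyd's algorithm). This is an illustration under assumptions rather than the analysis itself, which was run in R: the two-dimensional points are invented, and the function names kmeans and grouping are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed, iterations=50):
    """Lloyd's algorithm, with centres initialised from k random points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


def grouping(labels):
    """Cluster membership as a set of frozensets, ignoring label numbers."""
    groups = {}
    for index, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(index)
    return {frozenset(g) for g in groups.values()}


# Invented data: three loose groups of points in two dimensions.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

runs = [grouping(kmeans(points, 3, seed=s)) for s in range(20)]
distinct = {frozenset(r) for r in runs}
print(len(distinct), "distinct groupings over 20 seeds")
```

Fixing the seed makes a single run repeatable, while comparing groupings across seeds gives a rough sense of how stable the clusters are.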

The results aren’t perfect, but have shown themselves to be effective at finding groups without supervision. There is no reason why this process could not be used in a broader, ensemble learning approach to answer more involved questions in political science.

How it was done

The Wikipedia pages were scraped using Python’s Beautiful Soup, converting the data into a CSV which I then passed to R to conduct the PCA and K-Means analysis. I chose the number of clusters by using the knee point for the within-cluster sum of squares, before converting the results to a JSON file format to create the HighCharts graph above. All of the code is available on my GitHub page.
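For readers unfamiliar with the knee point, the sketch below shows one common way to locate it automatically: pick the number of clusters whose within-cluster sum of squares lies furthest below the straight line joining the first and last values. The post does not say how the knee was identified, so this heuristic is an assumption, the function name knee_point is hypothetical, and the numbers are invented for illustration.

```python
def knee_point(wcss):
    """Return the cluster count k (starting at 1) at the knee of the curve.

    wcss[i] is the within-cluster sum of squares for k = i + 1 clusters.
    Needs at least two values.
    """
    n = len(wcss)
    first, last = wcss[0], wcss[-1]
    best_k, best_gap = 1, 0.0
    for i, value in enumerate(wcss):
        # Height of the straight line from first to last at position i.
        chord = first + (last - first) * i / (n - 1)
        gap = chord - value  # how far the curve dips below that line
        if gap > best_gap:
            best_k, best_gap = i + 1, gap
    return best_k


# Invented values: large gains up to k = 3, small gains afterwards.
wcss = [1000, 600, 300, 250, 220, 200]
print(knee_point(wcss))  # prints 3
```

In practice the curve is often also inspected by eye, since the knee is not always sharp.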
