Principal component analysis (PCA) is a statistical technique used to reduce the number of variables in a data set while helping to identify patterns in it. The “principal components” are linear combinations of the original variables, sorted by variance, so that the first principal component has the largest variance. It has been called one of the most useful methods to have come out of linear algebra. If you are interested in a statistical overview of this method, you can read more here.
Below is an interactive visualisation built using a 112-dimensional data set containing 40 points. This high-dimensional data describes the outline of a hand. As mentioned previously, PCA is often used to reduce the number of variables, which makes it possible to visualise the data on a 2D screen. Using the hand data set, PCA was performed with the Python library scikit-learn to reduce the dimensionality and allow for analysis. As the first principal component (PC) captures the most variance, it was plotted against the second PC, which captures the second-highest variance. The PCA generated 40 components (the number of components cannot exceed the number of points), with the 40th containing the least variance of them all. You can try plotting different principal components against each other below.
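To make the step above concrete, here is a minimal sketch of how such a decomposition might be run with scikit-learn. The hand data itself is not reproduced here, so the array `hands` is random placeholder data with the same shape (40 points, 112 values each); the variable names are illustrative and are not taken from the project's code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data (assumption): 40 hands, 112 values per hand.
# The real hand outlines are not available in this sketch.
rng = np.random.default_rng(0)
hands = rng.normal(size=(40, 112))

# Keep every component that PCA can return.
pca = PCA()
scores = pca.fit_transform(hands)

# With 40 points and 112 variables, at most min(40, 112) = 40
# components exist, so each hand is described by 40 scores.
n_components = scores.shape[1]

# Fraction of the total variance captured by each component,
# ordered from largest to smallest.
explained = pca.explained_variance_ratio_

# The two columns plotted against each other by default.
pc1, pc2 = scores[:, 0], scores[:, 1]
```

Plotting any other pair of components amounts to picking two different columns of `scores`.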
It is also interesting to see which points have similar values and can be naturally grouped together. This can be done by analysing the graph by eye, or by using statistical clustering methods. In this case, K-means clustering was used to identify which points, or hands, are similar to each other. K-means clustering partitions n observations into k clusters, aiming to minimise the variance within each cluster. The mathematical techniques this employs can be read about in more detail here.
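As a rough illustration of the clustering step, the sketch below runs scikit-learn's `KMeans` on placeholder 2D points standing in for a pair of principal component scores. Whether the project clusters the plotted pair of components or the full data is not stated, so treating the input as 2D scores is an assumption, as are the value of `k` and the variable names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data (assumption): 40 points in the plane, standing in
# for the scores of two principal components.
rng = np.random.default_rng(0)
points = rng.normal(size=(40, 2))

k = 4  # number of clusters, chosen arbitrarily for this sketch
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# K-means minimises the within-cluster sum of squared distances to
# each cluster centre; scikit-learn reports this value as `inertia_`.
manual = sum(
    np.sum((points[labels == j] - kmeans.cluster_centers_[j]) ** 2)
    for j in range(k)
)

# Number of points assigned to each cluster.
sizes = np.bincount(labels, minlength=k)
```

Here `manual` recomputes the quantity being minimised directly from the labels and centres, and it should agree with `kmeans.inertia_` up to rounding.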
Use the visualisation below to gain a better understanding of PCA. The graph shows the first two principal components plotted against one another. Using the input on the left, you can change which components are displayed to visualise the variance between different components. Once you have chosen the components you would like to visualise, click a plotted point and the hand on the right will update to show the corresponding hand. To make it easier to analyse the points and identify clusters, you can use K-means clustering by selecting a k value (the number of clusters) and clicking “add clusters”. Please note that clusters of size 2 and below will not be visualised, so try generating the clusters again if any are missing from the plot. Hovering over a hand in the grid at the bottom of the visualisation highlights the point in the graph that corresponds to that hand. If you find text highlighted in red anywhere on the webpage, hover over it and the graph will update to display some related information.
To understand PCA better, it is useful to experiment with the different principal components in the visualisation; this gives a better idea of how to interpret the data.
For instance, when comparing PC1 against PC2, we can get an overview of which feature is linked to each one. If we take a look at one of the outliers, such as hand 39, and compare it with the others, it can be seen that PC1 contains information about the distance between the thumb and the little finger (compare hand 30 against hand 35). Moreover, PC2 seems to contain information about the distance between the little finger and the ring finger (compare hand 39 against hand 37).
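Another way to check what a component encodes, besides comparing individual hands, is to start from the mean shape and move along a single component. The sketch below is one possible way to do this with scikit-learn, again using random placeholder data in place of the hand outlines; the helper `shape_along_component` is hypothetical, and how the 112 values map onto outline coordinates is not specified here, so each shape is left as a flat vector.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data (assumption): 40 hands, 112 values per hand.
rng = np.random.default_rng(0)
hands = rng.normal(size=(40, 112))

pca = PCA().fit(hands)


def shape_along_component(pca, index, n_std):
    """Return the mean shape shifted by n_std standard deviations
    along the principal component with the given index."""
    step = n_std * np.sqrt(pca.explained_variance_[index])
    return pca.mean_ + step * pca.components_[index]


# Two synthetic shapes at opposite ends of PC1 (index 0).
low = shape_along_component(pca, 0, -2.0)
high = shape_along_component(pca, 0, +2.0)

# Projecting a synthetic shape back gives a non-zero score on PC1 only.
high_scores = pca.transform(high.reshape(1, -1))[0]
```

With the real data, drawing `low` and `high` side by side would show which part of the outline changes along PC1, for example whether the thumb and little finger move apart.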
Using K-means clustering, it is possible to identify points with little variance in the components. In this particular data set, as the variance is limited by the movement of the hand, no clearly defined clusters can be found. Using a high k value makes it easier to identify similar points.
The contributors of this project were: