Of late, my favorite technique for outlier analysis and clustering has been Principal Component Analysis (PCA). It is an easily understood concept, applicable to almost any type of numeric data, and easy to visualize. However, when I divided my data into a training set and a testing set, something weird happened. The variations in the data changed, i.e. the conclusion that a variable is an outlier does not necessarily hold once the data is split. I was a bit worried, so I set out to find a way of determining whether a conclusion is likely to change.
Luckily enough, someone had a similar problem back in 1966. Nathan Mantel, a biostatistician at the National Cancer Institute in the US, wanted to know whether reported cases of leukemia were related. He devised a method, Mantel’s test, to evaluate whether the characteristics of a reported disease remain the same. That gave me the idea of applying the same test to Principal Component Analysis.
Here is how it worked. I chunked the data into several parts, ran a PCA on each subset, then created a distance matrix holding the variations for each one. Mantel’s test is then applied to any two subsets to check whether the conclusions are the same. The diagram below visualizes the p-value for the null hypothesis (the two distance matrices are unrelated, i.e. the conclusions differ) across subsets of the data.
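A minimal sketch of this procedure, not the original code. It rests on several assumptions: the distance matrix is taken between variables in PCA loading space (so every chunk yields a matrix of the same size), PCA is done with a plain NumPy SVD, and Mantel’s test is a hand-rolled one-sided permutation test. The names `pca_loadings`, `distance_matrix` and `mantel_test`, and the synthetic data, are illustrative only.

```python
import numpy as np

def pca_loadings(X, n_components=2):
    """PCA via SVD on standardized columns; one row of loadings per variable."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # Scale each component by its standard deviation so that distances
    # between variables reflect how much variance each component carries.
    return Vt[:n_components].T * (s[:n_components] / np.sqrt(len(X) - 1))

def distance_matrix(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mantel_test(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test: Pearson r between the upper triangles of two
    distance matrices, with a p-value from permuting the labels of d1."""
    rng = np.random.default_rng(seed)
    upper = np.triu_indices_from(d1, k=1)
    observed = np.corrcoef(d1[upper], d2[upper])[0, 1]
    hits = 0
    for _ in range(permutations):
        order = rng.permutation(d1.shape[0])
        shuffled = d1[np.ix_(order, order)]  # permute rows and columns together
        if np.corrcoef(shuffled[upper], d2[upper])[0, 1] >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Synthetic stand-in data (an assumption): 8 variables driven by 2 latent factors.
rng = np.random.default_rng(42)
angles = np.linspace(0.0, np.pi, 8, endpoint=False)
weights = np.vstack([1.5 * np.cos(angles), np.sin(angles)])  # shape (2, 8)
factors = rng.normal(size=(900, 2))
data = factors @ weights + 0.5 * rng.normal(size=(900, 8))

chunks = np.array_split(data, 3)  # three consecutive chunks of rows
reference = distance_matrix(pca_loadings(chunks[0]))
for i, chunk in enumerate(chunks[1:], start=1):
    r, p = mantel_test(reference, distance_matrix(pca_loadings(chunk)))
    print(f"chunk 0 vs chunk {i}: r = {r:.3f}, p = {p:.3f}")
```

Distances between loadings are unaffected by the arbitrary sign of each principal component, which is one reason to compare distance matrices rather than the loadings themselves. On real data, the chunks would be consecutive slices of the dataset rather than draws from one fixed process.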
When the subsets contain data points that are close together, the p-value stands at 0.001, which means we reject the null hypothesis and adopt the alternative hypothesis: the conclusions are the same. However, when the subsets are far apart the p-value rises exponentially, indicating that the conclusions no longer agree. At this point, I was happy that I could at least pinpoint whether the conclusion is bound to change, and at which point. Next, I applied the same code and technique to currency trading data and produced the graph below.
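A note on the 0.001 figure, assuming the p-value was estimated by permutation, which is the usual way Mantel’s test is run. With $N$ random relabelings of one distance matrix, a common convention is

$$
p = \frac{1 + \#\{\,k : r_k \ge r_{\mathrm{obs}}\,\}}{N + 1}
$$

where $r_{\mathrm{obs}}$ is the correlation between the off-diagonal entries of the two distance matrices and $r_k$ is the same correlation after the $k$-th relabeling. If $N = 999$ (a frequent default, and an assumption here), the smallest attainable value is $1/1000 = 0.001$. Under that assumption, a p-value of 0.001 means that none of the permuted matrices matched the observed correlation.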
As observed, the p-value does not change. It means the conclusion drawn from the PCA of this data set does not vary as more data points are added. This is particularly interesting: it helps to understand deeper patterns in data. It is a sort of measure of how a variation varies (inception).
Addendum (03-03-2016): Presentation on Finding Deep Structures Utilizing Principal Component Analysis and Mantel’s Test.