Data Analysis

Deep Learning

Of late, my favorite technique for outlier analysis and clustering has been Principal Component Analysis (PCA). It is easy to understand, applies to most numeric data, and is easy to visualize. However, when I divided my data into a training set and a testing set, something weird happened. The variations in the data changed, i.e. the conclusion that a variable is an outlier did not hold once the data was split. I was a bit worried, so I set out to find a way of determining whether a conclusion is likely to change.

Luckily enough, someone had a similar problem back in 1966. Nathan Mantel, a biostatistician at the National Cancer Institute in the US, wanted to know whether reported cases of leukemia were related. He devised a method, the Mantel test, to evaluate whether the characteristics of a reported disease remain the same. That gave me the idea of applying the same test to Principal Component Analysis.

Here is how it worked. I chunked the data into several parts, ran a PCA on each subset, then created a distance matrix holding the variations for each one. The Mantel test is then applied to any two subsets to check whether the conclusions are the same. The diagram below shows the p-value across subsets of the data, where the null hypothesis is that the two distance matrices are unrelated, i.e. the two conclusions are not the same.

[Figure: Mantel test p-values for pairs of data subsets]

When subsets with close data points are used, the p-value stands at 0.001, which means we reject the null hypothesis and adopt the alternative hypothesis: the conclusions are the same. However, when the subsets are far apart, the p-value rises sharply, which suggests the conclusions are no longer the same. At this point, I was happy that I could at least pinpoint whether the conclusion is bound to change, and at which point. Next, I applied the same code/technique to currency trading data and produced the graph below.

[Figure: forex_mantel]

As observed, the p-value does not change. This means the conclusion drawn from the PCA of this data set does not vary as more data points are added. This is particularly interesting because it helps to uncover deeper patterns in the data. It is a sort of measure of how a variation varies (inception).
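For readers who want to try the comparison step themselves, here is a minimal sketch of a Mantel permutation test in plain Python. It is not the R code linked under Resources, only an illustration under a few assumptions: the PCA has already been run on each subset elsewhere, each distance matrix holds the Euclidean distances between the variables' loadings on the first two components, and the loadings below are made-up numbers. The names `distance_matrix`, `upper_triangle`, `pearson` and `mantel_test` are hypothetical helpers, not part of any library.

```python
import math
import random


def distance_matrix(points):
    """Euclidean distance between every pair of rows in `points`."""
    return [[math.dist(a, b) for b in points] for a in points]


def upper_triangle(matrix, order):
    """Entries above the diagonal after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel_test(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test. Returns (correlation, p_value).

    Null hypothesis: the two distance matrices are unrelated.
    """
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    x = upper_triangle(d1, identity)
    observed = pearson(x, upper_triangle(d2, identity))
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel the rows/columns of the second matrix
        if pearson(x, upper_triangle(d2, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical loadings of six variables on the first two principal
# components, from two different subsets of the same data.
loadings_a = [(0.9, 0.1), (0.7, 0.4), (0.1, 0.9), (-0.3, 0.6), (-0.8, 0.2), (-0.5, -0.7)]
loadings_b = [(0.88, 0.13), (0.72, 0.37), (0.12, 0.86), (-0.27, 0.63), (-0.79, 0.24), (-0.52, -0.66)]

r, p = mantel_test(distance_matrix(loadings_a), distance_matrix(loadings_b))
print(f"Mantel r = {r:.3f}, p-value = {p:.3f}")
```

A small p-value here means the two distance matrices are correlated, so the two subsets tell the same story. Note that with 999 permutations the smallest p-value this sketch can return is 1/1000 = 0.001.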

Resources
Nathan Mantel’s Paper
R code for this analysis

Addendum (03-03-2016): Presentation on Finding Deep Structures Utilizing Principal Component Analysis and Mantel’s Test.

 
