If you were a fan of the cartoon series Dexter’s Laboratory, you must be familiar with the phrase ‘omelette du fromage’. Yes, I found myself in an omelette du fromage situation in my own data science lab. In the episode ‘The Big Cheese’, Dexter prepares for French exams by playing a recording all night that repeats the phrase omelette du fromage (cheese omelet). The following morning, it’s all he can remember. Luckily, it turns out to be the answer to all the important questions in the world. He comes back home a celebrity but blows up his lab when the voice assistant can’t recognize omelette du fromage as the password.
Since I discovered Principal Component Analysis (PCA), I’ve been on a roll. I used the technique to showcase how to build a trading portfolio and hedge using uncorrelated stocks. I even went ahead and explained how it can be used to implement a mean reversion strategy and find arbitrage opportunities. Doesn’t it feel nice to do so!
Having fully grasped how the analysis worked, I set out to conquer another type of data: text data. I collected Twitter data for the keyword #EastleighBlast, broke the tweets into individual words and applied Principal Component Analysis. To my astonishment, the results were pretty good. I could visualize the crux of the conversation and even pick out words that were not part of the main conversations (outlier detection). In tandem with sentiment analysis, I could visualize which words/topics introduced negativity to a conversation.
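For anyone curious what "break the tweets into words and apply PCA" can look like, here is a minimal sketch. It is not the code from my original analysis: the toy tweets are made up, the choice to treat each word as an observation described by its counts across tweets is an assumption, and PCA is done by hand with NumPy's SVD.

```python
import numpy as np

# Hypothetical toy tweets, NOT the data collected for #EastleighBlast.
tweets = [
    "blast reported in eastleigh tonight",
    "police respond to blast in eastleigh",
    "traffic jam on thika road tonight",
    "police clear traffic on thika road",
]

# Build a tweet-by-word count matrix.
vocab = sorted({word for tweet in tweets for word in tweet.split()})
counts = np.array(
    [[tweet.split().count(word) for word in vocab] for tweet in tweets],
    dtype=float,
)

# Assumption: each word is an observation, each tweet is a variable.
X = counts.T
X_centered = X - X.mean(axis=0)

# PCA via SVD: U * S gives the coordinates of each word on the components.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
word_coords = U * S
explained = S**2 / np.sum(S**2)  # share of variance per component

for word, (pc1, pc2) in zip(vocab, word_coords[:, :2]):
    print(f"{word:10s} PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

Plotting the first two columns of `word_coords` gives the kind of picture I described: words that travel together sit close to each other, and odd ones drift away from the crowd.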
The roller coaster ride continued. I used the same technique to analyze the hidden reload number for Airtel Scratch Card. I was able to find a pattern in how the variables related. It was at this moment that I knew I had found a silver bullet to annihilate all my data problems.
Upon further inquiry, I discovered Mantel’s permutation test, which enabled me to add an extra layer on top of the PCA to define deep structures in data. This became the topic of my presentation a few weeks ago at a data science meetup: Finding Deep Structures in Data using PCA and Mantel’s Test
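In case Mantel's test is new to you, the idea is to correlate two distance matrices and then shuffle the rows and columns of one of them many times to see how often chance alone produces a correlation that strong. Below is a minimal sketch of that idea, not the code from my presentation. The function names `mantel_test` and `pairwise_distances`, the one-sided p-value and the toy point clouds are all my own assumptions for illustration.

```python
import numpy as np

def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (r, p_value) for two square, symmetric distance matrices."""
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    n = dist_a.shape[0]
    upper = np.triu_indices(n, k=1)  # count each pair of items once
    a_flat = dist_a[upper]
    observed = np.corrcoef(a_flat, dist_b[upper])[0, 1]

    rng = np.random.default_rng(seed)
    at_least_as_large = 0
    for _ in range(permutations):
        perm = rng.permutation(n)
        # Shuffle rows and columns of one matrix with the same permutation.
        permuted = dist_b[np.ix_(perm, perm)]
        r = np.corrcoef(a_flat, permuted[upper])[0, 1]
        if r >= observed:
            at_least_as_large += 1
    # One-sided p-value; the observed statistic counts as one permutation.
    p_value = (at_least_as_large + 1) / (permutations + 1)
    return observed, p_value

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy data (an assumption): a point cloud and a slightly noisy copy of it.
rng = np.random.default_rng(42)
points = rng.normal(size=(12, 3))
noisy = points + rng.normal(scale=0.05, size=points.shape)

r, p = mantel_test(pairwise_distances(points), pairwise_distances(noisy))
print(f"Mantel r = {r:.3f}, p = {p:.3f}")
```

A high `r` with a small `p` says the two distance matrices share the same structure. In my workflow the distances would come from PCA scores rather than raw points, but the permutation logic is the same.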
However, tonight I attempted yet another leap: using PCA and Mantel’s test to check whether two images are the same. It was an epic fail! Hours spent rewriting the code did not help. If you understand PCA, you know it measures variation using standard deviation. I realized that image data (in black & white) has a very narrow set of values (between 0 and 255), and hence most images would have almost the same standard deviation. PCA + Mantel’s test indicated that 5 different images were the same 🙁
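To illustrate the explanation above, here is a small sketch that uses synthetic noise images instead of my actual pictures (an assumption, since the originals are not part of this post). It only shows that images can differ in almost every pixel and still have nearly the same standard deviation. Real photographs may vary more than this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five different synthetic 64x64 8-bit images (uniform noise, an assumption).
images = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(5)]

stds = [float(img.std()) for img in images]
for i, s in enumerate(stds, start=1):
    print(f"image {i}: std = {s:.2f}")

# The pixels differ almost everywhere, yet the spread is nearly identical.
different = float(np.mean(images[0] != images[1]))
print(f"fraction of differing pixels (image 1 vs 2): {different:.3f}")
```

If the only thing a method looks at is this kind of spread, five clearly different images end up looking alike, which matches what I ran into.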
Some lessons to be learned here:
- Image data in matrix form doesn’t have enough distinctive features (outliers) to uniquely identify it using PCA + Mantel’s test.
- Leave image manipulation to OpenCV – it has awesome algorithms for image synthesis.
- Always use Python: R has limited tricks for managing raster files.