Data Analysis

Omelette du Fromage

If you were a fan of the cartoon series Dexter’s Laboratory, you must be familiar with the phrase ‘omelette du fromage’. Yes, I found myself in an omelette du fromage situation in my own data science lab. In the episode ‘The Big Cheese’, Dexter prepares for a French exam by playing a recording all night that repeats the phrase omelette du fromage (cheese omelette). The following morning, it’s all he can remember. Luckily, it turns out to be the answer to all the important questions in the world. He comes back home a celebrity but blows up his lab when the voice assistant can’t recognize omelette du fromage as the password.

Since I discovered Principal Component Analysis (PCA), I’ve been on a roll. I used the technique to showcase how to build a trading portfolio and hedge using uncorrelated stocks. I even went ahead and explained how it can be used to implement a mean reversion strategy and find arbitrage opportunities. Doesn’t it feel nice to do so!

Principal Component Analysis on Currencies Data

Having fully grasped how the analysis worked, I set out to conquer another type of data: text data. I collected Twitter data for the keyword #EastleighBlast, broke the tweets into individual words and applied Principal Component Analysis. To my astonishment, the results were pretty good. I could visualize the crux of the conversation and even pick out words that were not part of the main conversation (outlier detection). In tandem with sentiment analysis, I could visualize which words/topics introduced negativity into a conversation.
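To make the idea concrete, here is a minimal sketch of PCA on a word-by-tweet count matrix. It is not the code behind the plot: the toy counts, the word list, the rows-are-words layout and the helper name `pca_scores` are all assumptions made up for illustration, and the PCA is done with a plain NumPy SVD.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X onto the top principal components (via SVD)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)        # share of variance per component
    return scores, explained[:n_components]

# Hypothetical toy data: rows are words, columns are tweets, values are counts.
words = ["blast", "police", "injured", "football"]
counts = np.array([
    [3, 2, 4, 0],
    [2, 3, 3, 0],
    [3, 2, 3, 1],
    [0, 0, 1, 5],
])

scores, explained = pca_scores(counts)
for word, (pc1, pc2) in zip(words, scores):
    print(f"{word:10s} PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

In this toy matrix the word that only appears in the off-topic tweet ends up furthest from the others along the first component, which is the kind of outlier the plot made visible.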

Analysis of Words Variations using PCA on #EastleighBlast Data

The roller coaster ride continued. I used the same technique to analyze the hidden reload numbers on Airtel scratch cards. I was able to find a pattern in how the variables related. It was at this moment that I knew I had found a silver bullet to annihilate all my data problems.

Upon further inquiry, I discovered Mantel’s permutation test, which enabled me to add an extra layer on top of the PCA to define deep structures in data. This became the topic of my presentation a few weeks ago at a data science meetup: Finding Deep Structures in Data using PCA and Mantel’s Test
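For readers who have not met it, a Mantel test correlates two distance matrices and judges significance by shuffling the rows and columns of one of them. The sketch below is a bare-bones NumPy version under my own assumptions (Euclidean distances, Pearson correlation, a one-sided p-value, made-up random data); the names `pairwise_distances` and `mantel_test` are hypothetical helpers, not the code from the talk.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mantel_test(D1, D2, n_permutations=999, seed=0):
    """One-sided Mantel permutation test between two distance matrices."""
    D1 = np.asarray(D1, dtype=float)
    D2 = np.asarray(D2, dtype=float)
    n = D1.shape[0]
    iu = np.triu_indices(n, k=1)                 # upper triangle, no diagonal
    r_obs = np.corrcoef(D1[iu], D2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_permutations):
        p = rng.permutation(n)
        D2_perm = D2[p][:, p]                    # shuffle rows and columns together
        r_perm = np.corrcoef(D1[iu], D2_perm[iu])[0, 1]
        if r_perm >= r_obs:
            count += 1
    p_value = (count + 1) / (n_permutations + 1)
    return r_obs, p_value

# Hypothetical data: B is A plus a little noise, so the two should look alike.
rng = np.random.default_rng(42)
A = rng.normal(size=(12, 3))
B = A + rng.normal(scale=0.05, size=A.shape)
r, p = mantel_test(pairwise_distances(A), pairwise_distances(B))
print(f"Mantel r={r:.3f}, p={p:.3f}")
```

To layer this on top of PCA, one option is to feed the PCA scores of each dataset into `pairwise_distances` instead of the raw rows.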

However, tonight I attempted yet another leap: using PCA and Mantel’s test to check whether two images are the same. It was an epic fail! Hours spent rewriting the code did not help. If you understand PCA, you know it looks for variation, measured as variance (the square of the standard deviation). I realized that image data (in black & white) has a very narrow set of values (between 0 and 255), and hence most images would have almost the same standard deviation. PCA + Mantel’s test indicated that 5 different images were the same 😦

Some lessons to be learned here:

  • Image data in matrix form doesn’t have enough distinctive features (outliers) to be uniquely identified using PCA + Mantel’s test.
  • Leave image manipulation to OpenCV; it has awesome algorithms for image synthesis.
  • Always use Python; R has limited tricks for managing raster files.


