Data Analysis

Airtel Scratch Card Analysis

One of my blog readers, Johnson Kamotho, sent me data from his Airtel scratch card collection and asked if I could perform an analysis similar to the one in the Safaricom blog post. I was up to the task and opted to apply a couple of techniques I’ve been trying out to reveal the underlying structure of data sets. The provided data had each digit separated as a variable, which made me breathe a sigh of relief: no data preparation required!

Let the jargon begin. I opted to use PCA (Principal Component Analysis) because it is good at capturing how data points vary. It is actually used in facial recognition to ignore varying aspects of an image such as lighting, color, etc. and represent a human face with the fewest points, which makes comparison easier.

Principal Component Analysis
Principal Component Analysis (PCA) is a statistical technique used to summarize variation in interdependent variables. This technique utilizes four major computations. First, calculate the variance of each variable (the square of its standard deviation), which gives how far a set of numbers is spread out. Second, compute the covariance, which measures how two random variables change together, in essence comparing how two fields (in our case the positions) vary jointly. Third, and of more importance, build an N x N matrix holding all the covariances (here a 14 by 14 matrix). Fourth, extract the eigenvectors of that matrix: the directions along which the data varies the most, ordered by how much variance each one captures.
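The four steps above can be sketched by hand and compared against prcomp(). This is only an illustration: the data is randomly generated, and the column names (one to fourteen) are an assumption about how the real file is laid out.

```r
# Made-up stand-in for the scratch card data: 200 cards, 14 digit positions.
# Column names and values are assumptions, not the real airtel.csv.
set.seed(1)
positions = c('one','two','three','four','five','six','seven',
              'eight','nine','ten','eleven','twelve','thirteen','fourteen')
cards = as.data.frame(matrix(sample(0:9, 200 * 14, replace = TRUE), ncol = 14))
colnames(cards) = positions

# steps 1 to 3: the 14 x 14 covariance matrix (variances sit on its diagonal)
cv = cov(cards)

# step 4: eigenvectors (directions) and eigenvalues (variance along each one)
eg = eigen(cv)

# prcomp() wraps the same idea in one call
p = prcomp(cards)

# variance per component compared with the eigenvalues
same_variance = all.equal(eg$values, p$sdev^2)

# directions compared with the eigenvectors, ignoring the sign of each column
same_direction = all.equal(abs(eg$vectors), abs(unname(p$rotation)))

print(c(same_variance, same_direction))
```

The sign is ignored in the last comparison because an eigenvector pointing the opposite way describes the same direction.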

If you are lost at this point, we are simply getting a compact representation of the data that captures most of its variation, sort of getting the bone structure of our data. The good thing is that you don’t have to perform all these steps by hand; you can do this in one line of code in the R programming language.

#load data
a = read.csv('airtel.csv')

#perform PCA
c = prcomp(a)

So, I performed PCA on the data and visualized it in 2 dimensions (the first two principal components) as shown below.

[Figure: artel_pca]

From the graph above, the variables ten, six and thirteen are more widely separated from the rest, in addition to having little variation among themselves. Now, these three variables became a point of interest and I probed further.
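A plot of this kind can be drawn from the loadings that prcomp() returns. The sketch below is one possible way to do it, not necessarily how the figure above was produced, and it uses randomly generated stand-in data with assumed column names.

```r
# Made-up stand-in data (assumption): 200 cards, 14 digit positions
set.seed(1)
cards = as.data.frame(matrix(sample(0:9, 200 * 14, replace = TRUE), ncol = 14))
colnames(cards) = c('one','two','three','four','five','six','seven',
                    'eight','nine','ten','eleven','twelve','thirteen','fourteen')

p = prcomp(cards)

# loading of each digit position on the first two principal components
load2 = as.data.frame(p$rotation[, c('PC1', 'PC2')])

# empty axes first, then one text label per position
plot(load2$PC1, load2$PC2, type = 'n', xlab = 'PC1', ylab = 'PC2')
text(load2$PC1, load2$PC2, labels = rownames(load2))
```

Positions that sit far from the main cluster of labels are the ones that stand out in the first two components.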

Network Analysis
Network analysis is a technique for inferring interconnections between data points. It stems from network theory, a concept in Computer Science that borrows ideas on connections, pairs and sets from the mathematical field of graph theory. In order to perform network analysis on the transformed data, I re-imagine the eigenvectors calculated by the PCA as points in a Euclidean space. Simply put, I calculate the distance between the points shown in the diagram above.

An N x N matrix is created to hold the Euclidean distance between the 14 variables and passed to a plotting function that can visualize the matrix as a graph (network). Each distance is rounded to the nearest whole number, and a pair whose distance rounds to zero gets no link, in order to only bring out the strong patterns. Again, the R programming language is quite handy at this type of work. Continuing from the above code, the following code snippet does the magic.


#load library
library(igraph)

#load data
a = read.csv('airtel.csv')

#pca analysis
c = prcomp(a)
d = as.data.frame(c$rotation)

#use 1st and 2nd PC only
e = d[c('PC1','PC2')]

#define euclidean function
euclidean = function(x2,y2,x1,y1){
  a = (x2-x1)*(x2-x1)
  b = (y2-y1)*(y2-y1)
  return(sqrt(a+b))
}

#create empty matrix
m = matrix(nrow = nrow(e),ncol = nrow(e))
colnames(m) = rownames(e)
rownames(m) = rownames(e)

#populate matrix with distance
for (i in 1:nrow(e)){
  for (j in 1:nrow(e)){
    if(i!=j & i-j>0){
      m[i,j] = 0
    } else{
      m[i,j] = round(euclidean(e$PC2[i],e$PC1[i],e$PC2[j],e$PC1[j]),digits = 0)
    }
  }
}

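#create network graph of matrix (zero entries produce no edge)
#graph.adjacency() is deprecated in recent igraph releases
g = graph_from_adjacency_matrix(adjmatrix = m,mode = 'directed',weighted = TRUE)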

#plot the graph
plot(g)
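For comparison, the same rounded distance matrix can be built without the nested loop by using base R's dist(). This is a sketch on randomly generated stand-in data with assumed column names, so the link counts it prints say nothing about the real cards.

```r
# Made-up stand-in data (assumption): 200 cards, 14 digit positions
set.seed(1)
cards = as.data.frame(matrix(sample(0:9, 200 * 14, replace = TRUE), ncol = 14))
colnames(cards) = c('one','two','three','four','five','six','seven',
                    'eight','nine','ten','eleven','twelve','thirteen','fourteen')

# loadings on the first two principal components, one row per position
e = as.data.frame(prcomp(cards)$rotation)[c('PC1', 'PC2')]

# all pairwise Euclidean distances, rounded to whole numbers as in the loop
w = round(as.matrix(dist(e)), digits = 0)

# keep only the upper triangle so each pair is counted once
w[lower.tri(w)] = 0

# number of surviving links (non-zero rounded distances) per position
links = rowSums(w > 0) + colSums(w > 0)
print(sort(links, decreasing = TRUE))
```

Counting links straight from the matrix gives a quick check on which positions end up with the most connections before drawing anything. The matrix w has the same shape as m, so it could be handed to the same igraph call.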

 

So what do we get?

[Figure: airtel_network]

The network diagram above shows six, ten, and thirteen at the center of the graph with the most connections, hence of great importance. These attributes (six, ten, thirteen) may hold the seed value. In random number generation algorithms, a seed is a starting point for a sequence, and it guarantees that if you start from the same seed you will get the same sequence of numbers.
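The seed behaviour is easy to see with R's own random number generator. The snippet below is only an illustration of the concept; the seed values are arbitrary and it says nothing about how Airtel actually generates its card numbers.

```r
# same seed, same 14 digits
set.seed(2015)
first = sample(0:9, 14, replace = TRUE)

set.seed(2015)
second = sample(0:9, 14, replace = TRUE)

print(identical(first, second))

# a different seed starts a different sequence
set.seed(2016)
third = sample(0:9, 14, replace = TRUE)
print(rbind(first, third))
```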

Now, off to calculating the formula joining the three attributes and I’ll have my seed.

You can access the dataset here.
