One of my blog readers, Johnson Kamotho, sent me data from his Airtel scratch card collection and asked if I could perform an analysis similar to the one in the Safaricom blog post. I was up to the task and opted to apply a couple of techniques I’ve been trying out to reveal the underlying structure of data sets. The provided data had each digit separated as a variable, which made me breathe a sigh of relief: no data preparation required!

Let the jargon begin. I opted to use PCA (Principal Component Analysis) because it is good at capturing how data points vary. It is actually used in facial recognition to look past varying aspects of an image such as lighting and color, and to represent a human face with a small number of components, which makes faces easier to compare.

**Principal Component Analysis**

Principal Component Analysis (PCA) is a statistical technique for summarizing the variation in a set of interdependent variables. It rests on four computations. First, calculate the variance of each variable *(the square of its standard deviation)*, which tells you how far its values are spread out. Second, compute the covariance, which measures how two variables change together *(in our case, two digit positions)*. Third, and more importantly, build an N x N matrix holding all the covariances *(here a 14 by 14 matrix)*. Fourth, extract the eigenvectors of that matrix: these are the directions along which the data varies the most, and the matrix only stretches them rather than rotating them.
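To make the steps concrete, here is a minimal sketch that walks through them by hand and compares the result with `prcomp`. It runs on made-up digits rather than the Airtel data, so the variable names, the sample size and the seed are all assumptions for illustration.

```r
# Sketch: PCA "by hand" on synthetic digits (not the Airtel data set)
set.seed(1)  # arbitrary seed so the example is repeatable
x <- matrix(sample(0:9, 200 * 4, replace = TRUE), ncol = 4,
            dimnames = list(NULL, c("one", "two", "three", "four")))

# steps 1-3: variances and covariances in one N x N matrix (here 4 x 4)
cov_mat <- cov(x)

# step 4: eigenvectors of the covariance matrix
eig <- eigen(cov_mat)

# the one-liner that does the same work
pca <- prcomp(x)

# eigenvalues should equal the variances of the principal components
same_variances <- isTRUE(all.equal(eig$values, pca$sdev^2))

# eigenvectors should match the loadings, up to a sign flip per column
same_directions <- isTRUE(all.equal(abs(eig$vectors),
                                    abs(unname(pca$rotation))))
```

The sign comparison uses absolute values because an eigenvector and its negative describe the same direction, so the two routes are free to disagree on sign.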

If you are lost at this point, we are simply getting a compact representation of the data that keeps its main patterns of variation, sort of like getting the bone structure of our data. The good thing is that you don’t have to perform all these steps by hand; you can do it in one line of code in the R programming language.

[sourcecode language="r"]
# load data
a = read.csv('airtel.csv')

# perform PCA
c = prcomp(a)
[/sourcecode]

So, I performed a PCA on the data and visualized it in 2 dimensions (the first two principal components) as shown below.

From the graph above, the variables *ten*, *six* and *thirteen* are separated further from the rest, in addition to having little variation among themselves. These three variables became a point of interest and I probed further.
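A plot of this kind can be produced by drawing each variable at its loadings on the first two principal components. The sketch below is an assumption about how such a graph might be drawn, not necessarily how the original was made, and it uses made-up digits with hypothetical variable names.

```r
# Sketch: plot variables by their loadings on PC1 and PC2 (synthetic data)
set.seed(1)  # arbitrary seed so the example is repeatable
x <- matrix(sample(0:9, 200 * 4, replace = TRUE), ncol = 4,
            dimnames = list(NULL, c("one", "two", "three", "four")))

pca <- prcomp(x)

# one row per variable, one column per principal component
loadings <- pca$rotation[, c("PC1", "PC2")]

# empty plot with the right axes, then label each variable at its position
plot(loadings, type = "n", xlab = "PC1", ylab = "PC2")
text(loadings, labels = rownames(loadings))
```

Variables that land close together load similarly on both components, while variables far from the pack are the ones worth a closer look.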

**Network Analysis**

Network analysis is a technique for inferring interconnections between data points. It stems from network theory, a concept in computer science that borrows ideas on connections, pairs and sets from the mathematical field of graph theory. To perform network analysis on the transformed data, I treat the first two *eigenvector* loadings calculated by the PCA as coordinates in a Euclidean space. Simply put, I calculate the distance between the points shown in the diagram above.

An N x N matrix is created to hold the Euclidean distance between each pair of the 14 variables and is passed to a plotting function that visualizes the matrix as a graph *(network)*. Distances that round to zero are omitted in order to bring out only the strong patterns. Again, the R programming language is quite handy for this type of work. Continuing from the above code, the following snippet does the magic.

[sourcecode language="r"]
# load library
library(igraph)

# load data
a = read.csv('airtel.csv')

# pca analysis
c = prcomp(a)
d = as.data.frame(c$rotation)

# use 1st and 2nd PC only
e = d[c('PC1', 'PC2')]

# define euclidean function: distance between (x1, y1) and (x2, y2)
euclidean = function(x2, y2, x1, y1) {
  a = (x2 - x1) * (x2 - x1)
  b = (y2 - y1) * (y2 - y1)
  return(sqrt(a + b))
}

# create empty matrix
m = matrix(nrow = nrow(e), ncol = nrow(e))
colnames(m) = rownames(e)
rownames(m) = rownames(e)

# populate matrix with distance (upper triangle only, lower triangle is zero)
for (i in 1:nrow(e)) {
  for (j in 1:nrow(e)) {
    if (i != j && i - j > 0) {
      m[i, j] = 0
    } else {
      # rounding to whole numbers turns small distances into 0, i.e. no edge
      m[i, j] = round(euclidean(e$PC1[i], e$PC2[i], e$PC1[j], e$PC2[j]), digits = 0)
    }
  }
}

# create network graph of matrix
# (graph_from_adjacency_matrix is the current name for graph.adjacency)
g = graph_from_adjacency_matrix(adjmatrix = m, mode = 'directed', weighted = TRUE)

# plot the graph
plot(g)
[/sourcecode]
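As an aside, the nested loop above can probably be replaced with base R's `dist()`, which computes all pairwise Euclidean distances in one call. This is a sketch of that alternative, not the code used for the graph in this post, and the loadings in it are hypothetical values chosen for illustration.

```r
# Sketch: pairwise distances with dist() instead of a nested loop
# hypothetical loadings for four variables on PC1 and PC2
e <- data.frame(PC1 = c(0.1, 0.8, -0.6, 0.1),
                PC2 = c(0.2, -0.5, 0.7, 0.3),
                row.names = c("one", "two", "three", "four"))

# all pairwise Euclidean distances, rounded to whole numbers as in the loop
m <- round(as.matrix(dist(e, method = "euclidean")), digits = 0)

# keep the upper triangle only, as the loop does
m[lower.tri(m)] <- 0
```

The resulting `m` has the same shape and row and column names as the loop's matrix, so it can be passed to the same graph function.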

So what do we get?

The network diagram above shows *six, ten,* and *thirteen* at the center of the graph with the most connections, hence of great importance. These attributes *(six, ten, thirteen)* may hold the seed value. In random number generation algorithms, a seed is the starting point for a sequence, and it guarantees that if you start from the same seed you will get the same sequence of numbers.
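The seed behaviour is easy to see with R's own random number generator. This is only an illustration of the idea: the seed value 42 is arbitrary, and nothing here says how Airtel actually generates its card numbers.

```r
# Sketch: the same seed reproduces the same sequence of digits
set.seed(42)                               # arbitrary seed
first <- sample(0:9, 14, replace = TRUE)   # 14 random digits

set.seed(42)                               # reset to the same seed
second <- sample(0:9, 14, replace = TRUE)  # 14 digits again

same_sequence <- identical(first, second)
```

Because the generator is reset to the same starting point, `second` repeats `first` digit for digit.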

Now, off to calculating the formula joining the three attributes and I’ll have my seed.

You can access the dataset here.
