# The Peculiar Habits of Imara Daima Train Commuters

All aboard! The train bellows in the early morning at the new Imara Daima train station, ready to haul hordes of men and women to the capital. Among the daily commuters you’ll find Dennis Kioko on his favored 7:10 am train bound for Nairobi CBD. For three years he has been riding the iron snake between the two stations, and along the way he has amassed a considerable collection of train tickets. Two months ago, I honored his call to partake in the data analysis with my usual proclivity to unravel the peculiarities of human behavior. On this premise, permit me then, in lieu of the more commonplace commentary, to suggest the character of this dramatis persona.

Voila!
Each ticket contains four pieces of information: date and time of purchase, origin and destination, price, and a serial number, as shown in the image below. From observation, it seems the ‘juice’ of the information is contained in the serial number, since it differs across travel destinations and times. Therefore, the quest to understand the commuting behavior begins with deciphering the serial number.

Data from Train Tickets

In data mining, the logical next step is feature engineering – a technique used to extract variables (features) to pass to a machine learning algorithm. In this case, I decided to treat each digit in the serial number as a variable and also to split the date into day, month, and year as independent variables. This is in the anticipation that each digit in the serial number correlates with the date or the destination/origin. The algorithm of choice is a Bayesian Network – a model that captures conditional dependencies among variables (i.e. how the value of one variable depends on the values of others). The diagram below shows the resultant Bayesian Network of the extracted features.
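As a rough illustration of the feature-splitting step, here is a minimal Python sketch. The function name `ticket_features`, the timestamp format, and the sample serial number are all invented for the example; the real ticket layout may differ, and the sketch assumes the serial contains digits only.

```python
from datetime import datetime

# Position names for the serial-number digits, matching labels such as
# 'first', 'second' and 'fifth' used in the text.
ORDINALS = ["first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth", "tenth"]


def ticket_features(serial, purchased_at, origin, destination):
    """Turn one ticket into a flat dict of candidate features.

    Hypothetical helper: assumes `serial` is a string of at most ten
    digits and `purchased_at` looks like '2016-08-15 06:55'.
    """
    stamp = datetime.strptime(purchased_at, "%Y-%m-%d %H:%M")
    features = {
        "origin": origin,
        "destination": destination,
        "day": stamp.day,
        "month": stamp.month,
        "year": stamp.year,
    }
    # One variable per serial-number digit, named by its position.
    for name, digit in zip(ORDINALS, serial):
        features[name] = int(digit)
    return features


# Made-up ticket, purely to show the shape of the output.
example = ticket_features("4070123", "2016-08-15 06:55",
                          "Imara Daima", "Nairobi CBD")
```

Each ticket becomes one row of such features, which is the kind of table a structure-learning algorithm for Bayesian networks would take as input.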

Bayesian Network of Extracted Features

Now to decipher the network – we have a mapping of variables showing their relative dependence on other variables, with each arrow pointing to the variable that depends on the prior one. If we imagine the network as containing information propagation paths, then we may seek to find which node processes the most information, which leads us to the variable ‘first’. It has the highest number of connections, making it the most important. It is also the only node that depends on the variable origin and does not have another variable depending on it – this suggests that the first digit indicates the origin point of a journey. Phweks!

Mapping Serial Number Digits to their Meaning

The variable ‘second’ also has a dependency, on destination, but in turn it has other variables depending on it, thereby indicating that a combination of these variables and the second variable indicates the destination of the journey. Now that we can figure out the origin and destination points, we need to know from which digit to start counting the number of passengers on board on each trip. We turn to a concept in network analysis known as closeness – it tells us which variable is a connection point to all others, i.e. this would be the variable where the counting sequence begins. In our data the fifth variable has the highest closeness measure. Let the fun begin!
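For readers unfamiliar with closeness, the sketch below computes one common definition of it: the number of other nodes divided by the sum of shortest-path distances to them. It runs on a toy five-node chain with invented node names, not on the ticket network, and it treats edges as undirected, which is an assumption; the article does not say how direction was handled.

```python
from collections import deque


def closeness(graph, source):
    """Closeness of `source` in a connected, undirected graph.

    `graph` maps each node to a list of its neighbours. Distances are
    counted in hops using a breadth-first search.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    # (reachable nodes - 1) / total distance; 0.0 for an isolated node.
    return (len(distance) - 1) / total if total else 0.0


# Toy chain a - b - c - d - e (hypothetical, for illustration only).
toy = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
scores = {node: closeness(toy, node) for node in toy}
most_central = max(scores, key=scores.get)
```

In the toy chain the middle node `c` scores highest because it is the fewest hops away from everything else, which is the sense in which a high-closeness variable acts as a connection point to all others.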

Verdict
On picking the two most common route times for the Imara Daima – CBD train, we observe that about 300,000 people (200,000 on the first train and 100,000 on the other) have been transported by rail since the opening of the new station on Dec 12, 2013 – that’s about 100,000 people every year and 8,300 every month. However, these roughly 8,000 monthly passengers are not evenly distributed across the months – certain months have a great spike in the number of passengers. These are tied to school holidays, around March, August and November. The cause of this phenomenon is not clear, but it is suspected that schools tend to take students to enjoy a train ride at the end of each month.

The spikes during school holidays, however, are not the strangest part of this tale. When the reverse journey is plotted (Nairobi CBD to Imara Daima), it tells a peculiar tale. The passenger numbers drop to less than a half for the same period, at about 90,000. This shows that only about a third of the passengers who take the morning train to the capital return home with it. This is probably tied to the train schedule, in which the last train on that route leaves CBD at 19:04 – a time when most people are still working. In the wake of the Nairobi Commuter Train making losses due to low passenger numbers, here is an untapped market.
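The headline figures can be rechecked with a few lines of arithmetic. The sketch below reads the morning total as 300,000 and the evening total as 90,000, and assumes the tickets span roughly three years, in line with the yearly figure quoted; the variable names are invented for the example.

```python
# Rounded figures quoted in the article.
first_train = 200_000
other_train = 100_000
morning_total = first_train + other_train   # Imara Daima to CBD

YEARS = 3  # assumption: roughly three years of tickets
per_year = morning_total / YEARS
per_month = per_year / 12

evening_total = 90_000                       # CBD to Imara Daima
return_share = evening_total / morning_total

print(f"{per_year:,.0f} per year, {per_month:,.0f} per month")
print(f"{return_share:.0%} of morning passengers ride back")
```

On these rounded inputs the return share comes out at 30%, which is where the "about a third" comes from.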

The low numbers notwithstanding, the evening commuters are quite ‘loyal’. There are only slight surges in the number of passengers every month, meaning the commuters who use the 19:04 train are consistent in their usage of the train – no event, not even the Christmas holiday, affects their travel schedule. What type of man is this? The previous graph shows a slowdown during Christmas for the morning commute.

Nairobi to Imara Daima

In the spirit of saving the best for last, I analyzed an interesting phenomenon using the time of ticket purchase. Permit me to first provide an analogy. Think of a workplace in which employees have to report by 8:00 am. On an ordinary day, it will be observed that a few people show up early and, as the time nears 8:00 am, more people show up, with the last 15 minutes having the largest number of people reporting to work.

This pattern, when mapped, produces an exponential growth graph. In my analysis I decided to map the time people purchase tickets in the hope of catching a specific train from the departure schedule. I focussed on the first train at 7:10 am from Imara Daima to Nairobi CBD since this is the most popular ride for folks hoping to make it to the office by the standard 8:00 am reporting time.
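One way to check for exponential growth is to count purchases in fixed-width bins before departure and fit a straight line to the logarithm of the counts; a roughly constant slope means the counts multiply by a fixed factor per bin. The sketch below does this on synthetic purchase times. The helper names, the 15-minute bin width, and the sample data are all assumptions made for the example, not the method used in the analysis.

```python
import math
from collections import Counter

DEPARTURE_MIN = 7 * 60 + 10  # the 7:10 am train, in minutes after midnight


def minutes_before_departure(hhmm):
    """Minutes between a purchase time like '06:50' and departure."""
    hours, minutes = map(int, hhmm.split(":"))
    return DEPARTURE_MIN - (hours * 60 + minutes)


def purchases_per_bin(times, bin_width=15):
    """Count purchases per bin; bin 0 is the last `bin_width` minutes."""
    return Counter(minutes_before_departure(t) // bin_width for t in times)


def growth_rate(counts):
    """Least-squares slope of log(count) as time moves towards departure.

    Needs at least two non-empty bins. The result is per bin, so
    exp(slope) is the factor by which purchases grow from one bin to
    the next.
    """
    xs = [-b for b in counts]                 # later bins get larger x
    ys = [math.log(c) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic data: purchases double every 15 minutes towards departure.
sample = ["07:00"] * 8 + ["06:50"] * 4 + ["06:35"] * 2 + ["06:20"]
counts = purchases_per_bin(sample)
rate = growth_rate(counts)
doubling = math.exp(rate)  # growth factor per bin
```

With real tickets the fit would be noisy, and the handful of purchases made hours before departure would sit far from the line.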

Distribution of Arrival Time

As expected, the plot forms a close-to-exponential curve. But that’s not the weird part – the earliest ticket purchases in anticipation of the 7:10 am ride are made at midnight. Then a couple of others are bought between 1:00 am and 4:00 am, with ticket purchases picking up at 5:30 am. The greatest question is, who is buying tickets in the dead of the night? I’m not even sure if the ticketing booths are open at those times. These aren’t erroneous data points, since a sample of physical tickets does show purchases after midnight.

I was hoping for a visit to the Imara Daima train station at midnight to prove the facts, but it never happened. So I put myself to thought on how this is possible – then I thought it might be the guards who look after the station that get these tickets, or it might just be someone who takes the evening train from the CBD to Imara to work an evening shift in the area, then buys a ticket after they are done with their work. All in all, this makes a good field trip to confirm the analysis – I hope to see people camping at the station in jumpers and hoodies waiting for the morning train 😉

1. What other variables would make for a data rich ticket? As always, thank you for an insightful piece.

1. Thank you! Accompanying road traffic data would make for a compelling variable to explain whether people opt for the train when there’s heavy road traffic.

2. Really nice observations.

3. Those tickets are pre-printed by the attendants, hence they are just issued when you approach the booths. In your analysis, this should not count. Secondly, it would be interesting to map the time the tickets are swiped before entry (if possible).

1. What’s the time lag for when the tickets are pre-printed? That would impact the interpretation of the analysis. Swiping time would be great, if KAPS would agree.
