In the past two years, iHub Research has been running Project Umati, an online dangerous speech monitoring initiative. As part of the Data Lab team, we were tasked with translating Susan Benesch's framework for hate speech into a machine learning system. This entailed first collecting data via the Twitter Streaming API and storing it in a database. That was the easy part. Then came the analysis.
We first realised there were too many duplicates (RTs), which were likely to skew the dataset. Removing native RTs is easy in any programming language. But on Twitter there is a monster known as the manual retweet: additional text prepended to a retweet, which gives otherwise identical tweets non-uniform byte sizes. We needed to remove those too in order to have truly unique entries. So we came up with an idea: use Levenshtein distance to calculate string similarity, then remove tweets with a high degree of similarity. Voila! Problem solved… but an issue arose. The computational time grows quadratically with the amount of data, since each tweet is compared with every other tweet to measure similarity. This gives a Big O of n*n, i.e. if you have 10,000 tweets, the number of comparisons to be done is 10,000 * 10,000 = 100,000,000 (100 million comparisons).
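In Python, the naive approach looks roughly like this. The `levenshtein` helper and the 0.8 threshold are illustrative sketches, not the actual Umati implementation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, two rows at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def naive_dedup(tweets, threshold=0.8):
    """Compare every tweet against every other: n * n comparisons."""
    duplicates = set()
    for i in range(len(tweets)):
        for j in range(len(tweets)):
            if similarity(tweets[i], tweets[j]) >= threshold:
                duplicates.add((i, j))
    return duplicates
```

Note that the double loop runs over every ordered pair, including each tweet against itself, which is exactly where the waste described below comes from.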
The Combinatoric Explosion
Assume every comparison takes a tenth of a second; the code would then take roughly 10 million seconds, about 115 days (3.8 months), to complete. BOOM!!! The combinatoric explosion happens: execution time scales quadratically with the data until the growth in required resources forces the program to be terminated.
Unfortunately we didn't have three months to compare tweet similarity. Two ways out of this conundrum presented themselves: use GPUs for parallel processing, or write a more efficient algorithm. The former required significant time learning how to write low-level C code, while the latter just needed more reasoning, so we opted for the latter.
We began by drawing the matrix created during the computation and looking for places to optimize. First, we recognized that each tweet gets compared with itself; this wasn't useful since we already knew the outcome. So we sought to avoid self-comparison. In code, this was done by executing the comparison only when i != j. For n tweets, the number of comparisons at this point is n*n - n. Progress, but not good enough.
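The effect of skipping the diagonal is easy to verify with a brute-force count (a throwaway check, not production code):

```python
def count_comparisons(n):
    """Count ordered pairs (i, j) with i != j: exactly n*n - n."""
    return sum(1 for i in range(n) for j in range(n) if i != j)
```

For n = 10,000 this removes only 10,000 of the 100 million comparisons, which is why it was progress but not nearly enough.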
After looking at the index pairs we concluded that every comparison has a complement, i.e. comparing indices 1 and 5 is the same as comparing 5 and 1. Sigh of relief: if we could remove these complementary comparisons we could halve the execution time. We knew where to optimize but scratched our heads over how to implement it in code. So we went back to studying the matrix. Subtracting the indices (i - j) for entries above the diagonal always produced a negative number, while entries below it produced a positive one. The light bulb went on and the apple dropped: we had solved it.
We figured out that by requiring i - j > 0 we could restrict computation to the half of the matrix below the diagonal and skip everything above it. Fist bumps made rounds in the office; we now had i - j > 0 as our guard condition (which also implies i != j). Computationally speaking, the lower triangle holds half of the off-diagonal values, hence (n*n - n)/2 comparisons in total, a reduction of just over 50 percent in computation time. Then we simplified the formula to n(n - 1)/2, and someone said he had seen it before.
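With that guard in place, the inner loop in Python only needs to run below the diagonal, which is simply `for j in range(i)`. This sketch (the function name, threshold, and pluggable `similarity` are ours, not the Umati code) keeps the first tweet of each near-duplicate cluster:

```python
def dedup_optimised(tweets, similarity, threshold=0.8):
    """Drop near-duplicates, comparing only pairs with i - j > 0.

    Only the lower triangle of the comparison matrix is visited:
    (n*n - n) / 2 comparisons instead of n*n.
    """
    duplicates = set()
    for i in range(len(tweets)):
        for j in range(i):  # i - j > 0, so i != j is implied
            if similarity(tweets[i], tweets[j]) >= threshold:
                duplicates.add(i)  # tweet j, seen first, is kept
                break
    return [t for i, t in enumerate(tweets) if i not in duplicates]
```

Writing the loop as `range(i)` avoids even evaluating the skipped guard, so the upper triangle costs nothing at all.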
It then hit me: it's the Handshake Problem, which states that if there are n people at a party, the total number of mutual handshakes is n(n - 1)/2, the same formula. I felt jovial and dumb at the same time. Looking at it in retrospect, each tweet represents a person, a person cannot shake their own hand, and shaking another person's hand is the same as that person shaking yours.
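The analogy checks out directly: counting unordered pairs with the standard library gives exactly n(n - 1)/2 (a quick verification sketch, not part of the pipeline):

```python
from itertools import combinations

def handshakes(n):
    """Mutual handshakes among n people = unordered pairs = n*(n-1)//2."""
    return len(list(combinations(range(n), 2)))
```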
We are glad to have figured out this optimisation, and it is now part of our slicer code, which we use to return highly relevant tweets during an event on Twitter.
This code is part of the Umati codebase, which is available on GitHub under the Apache License 2.0.