Data Analysis

The Handshake Problem

In the past two years, iHub Research has been running Project Umati — an online dangerous speech monitoring initiative. As part of the Data Lab team, we were tasked with translating Susan Benesch framework for hate speech into a machine learning system. This entailed first collecting data via Twitter Streaming API and storing it in a database. That was the easy part. Then came the analysis.

We first realised there are too many duplicates (RTs) which are likely to skew the dataset. Removing the native RTs are easy in any programming language. But on Twitter there’s a monster known as manual retweeting — additional text before a retweet which makes them have non-uniform byte size. We needed to remove those too in-order to have truly unique entries. So, we came up with an idea. To use levenshtein distance and calculate string similarity then remove tweets that have a high degree of similarity. Voila! Problem solved…but an issue arose. The computational time for executing the code is linear to the amount of data since each tweet is compared with the rest to measure similarity. This gives a Big O notation of n*n, i.e if you have 10,000 tweets, the number of comparisions to be done are 10,000 *10,000 = 100,000,000 (10 million comparisons).

The Combinatoric Explosion

Assume every comparison takes a second, then it will take 115 days (3.8 months) to complete code execution. BOOM!!! and the combinatoric explosion happens. Time of execution scale with data until it an exponential growth in resources required forces a program termination.

Unfortunately we didn’t have three months to compare tweet similarity. Two ways presented themsleves in this conundrum — use GPUs for parallel processing or write a more efficient algorithm. The former required a significant time in learning how to do low-level C code code while the later just needed more reasoning, so we opted for the later.

The Dissection

We began by drawing the matrix created during the manipulation and finding where we could optimize the process. First we recognized each tweet get’s compared by itself, this wasn’t useful since we already knew the outcome. So we sort to avoid self-comparison. In code implementation, this was done by executing when i!=j. For N tweets, the number of comparisons to be done at this point is n*n-n. Progress, but not good enough.

After looking at the indices comparison we came to the conclusion that there’s a complement comparison i.e comparing index 1 and 5 is the same as comparing 5 and 1. Sigh of relief, if we could remove this complementary comparision we can half the time for execution. We knew where to do the optimization but scratched our head on how to implement it on code. So we went back to study the matrix. Adding indices below the diagonal always produced a negative number while indices above produced a positive integer. The light bulb when on and the apple dropped — we had solved it.

The Implementation

We figured out setting i-j>0 we could ignore computation below the diagonal. Fist bumps made rounds in the office, we now had if (i!=j & i-j >0) as oursentinel valueComputationally speaking the lower half of the matrix represented half the values hence n/2 comparisons. Total number of comparisons at this juncture is (n*n-n)-n/2 — an almost 55 percent reduction in computation time. Then we simplified the formula and someone said he had seen the formula before.

It then hit me, it’s the Handshake Problem — which states, if there are N people in a party, the total number of mutual handshakes in the party will be represented by the formula above. I felt jovial and dumb at the same time, looking at it in retrospect each tweet represent a person, and a person can not shake their own hand, also shaking another person’s hand is the same as that person shaking your hand.

The Slicer

We are glad to have figured out the optimisation, and now its part of ourslicer code which we use to returns highly relevant tweets during an event on Twitter.

This code is part of Umati Codebase which is available on GitHub under Apache License 2.0



  1. a good one there , i see you are on track to be like …. zuba what is his name the face book guy. i admire your black guy


  2. Hi there, so nice to see an interesting use of stringdist!

    Also, nice complexity analyses, it really helps to have insight in running time _before_ starting the computer :-).

    You can make the code even faster by using ‘recycling’. That way we can avoid the inner R-loop. This is good because R-loops are typically much slower than recycling which is a loop done in the underlying C code. For example, if you do

    stringdist(‘foo’, c(‘fu’,’bar’))

    you get the distances between ‘foo’ and ‘fu’ and between ‘foo’ and ‘bar’. The shortest argument, namely ‘foo’, is repeated by the underlying C-library.

    So we can use this to avoid the inner loop, but we need to be a bit clever in how to use recycling. This means computing with indices, which is always a bit of a chore. Below is a function that I think does the same as your’s but should be faster.

    Now the problem will still grow with n*(n-1)/2, but by using recycling we decreased it by a constant factor, which is at least something.

    Best wishes,

    filter_duplicates <- function(text_data){
    # test if stringdist is installed

    # reserve memory for the index once, to gain speed. This creates
    # a vector of FALSE's of length(text_data) <- logical(length(text_data))

    # for the outer loop, we loop from 2…length(text_data)
    outer.loop <- seq_len(length(text_data))[-1]

    # we use this index for the inner loop, selecting the indices we need each time
    inner.loop = seq_len(length(text_data))

    for ( i in outer.loop ){
    # select only those indices we want to compare with
    j <- inner.loop[inner.loop < i]

    # compute distance of text_data[i] with all relevant text_data[j]'s
    lv.dist = stringdist(text_data[i],text_data[j],method='lv')

    # now check if any distance between text_data[i] and all the text_data[j]
    # is too small, and if so: add i to
    if ( any( lv.dist <= 140-nchar(RemoveLink(text_data[i]) )) ){[i] <- TRUE
    # I return the logical index here, you can do 'which(' to
    # get the numbers


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s