My random Ramblings

Category: python

Near duplicate detection in streaming twitter data

The Situation:

Imagine working with a real-time streaming API where you get thousands of posts per second and you need to filter out duplicate posts on the fly. Dealing with duplicate data is bad both for computational operations and for the overall quality of data. We had this problem while working on our real time Twitter streaming data analytics engine. For our purpose, the more unique tweets we had, the better the quality of our analytics. Some instances of near-duplicates/duplicates in a twitter stream which we had encountered were

1) Marketing Campaigns : Marketing companies generally create multiple accounts and the same tweet is tweeted multiple times with very little changes.

Eg: @amandakclayton Want to bet on the #KYDerby freely, legally, and securely? Visit TwinSpires! http://t.co/Qz1BkmCb

      @Dizzale7 Want to bet on the #KYDerby freely, legally, and securely? Visit TwinSpires! http://t.co/NXGyz5es

2) Sweepstakes campaigns : This was actually a pretty interesting problem which we encountered. With companies understanding the raw viral marketing power posed through social media, sweep-stakes campaigns through twitter have gained a lot of popularity. The tweets below were for the LG  TV sweepstakes competition.

Eg :  I want the new LG CINEMA 3D TV because I love LG 3D TV! http://t.co/o4iN7wDB #LGCINEMA3D

         I want the new LG CINEMA 3D TV because I want to be one of the cool kids! http://t.co/JnzrrDyv #LGCINEMA3D

Tweets like these were’nt exactly useful for our purpose they were essentially duplicates/near duplicates

3) Retweets:  Retweets can be considered as duplicates depending on the context/purpose of our application. For applications where only the textual content of a tweet is needed, retweets are well and truly duplicates.But lets say, we want to profile the users based on their tweets, we probably should not classify retweets as duplicates

Our Requirements:

There are so many challenges while trying to solve this problem in a streaming scenario. Approaches to duplicate detection that work pretty well on a static dataset simply are not feasible to implement in a streaming setting.

Our requirements were pretty simple. We wanted an approach that was

1)  Scalable

2) Had near Linear time effeciency

3) Used less In-memory storage

The Challenges:

Clearly we do not want to check the similarity of a tweet with all the other tweets in our data store. It is highly inefficient and simply not feasible to do so for streaming data.

The other option is the classic inverted index based stores for retrieval of similar tweets. But what happens when new tweets arrive continuously ? The index needs to be updated continuously for the recent tweets. Otherwise the index would get soon outdated and re-indexing is again very expensive.

There always has to be a trade-off between precision and computational complexity. We were okay with getting marking some wrong as long as the approach was scalable. Probabilistic techniques were much better suited to our needs.

Our Approach:

Hashing based techniques are probably the most efficient for lookups. We wanted to do something like this

“Hash a tweet in such a manner that similar tweets have a high probability of having the same hash value”.

This is where the MinHash technique comes to the picture. The idea is borrowed from a really great book from Stanford called “Mining Massive Datasets” authored by Anand Rajaraman and Jeffrey D Ullman.


Prior to using minhash, we first generate shingles for a document (shingling). Shingling is essentially similar to generating ‘n grams’ for a document. For example ,for the text below, if we choose a shingle size of 2 words, the shingled output is shown adjacent to it

“The quick brown fox” <=>  set( “the quick”, “quick brown”, “brown fox”)

Shingles are a crude way of representing a document. If care is ensured to do proper pre-processing prior to shingling, then shingles of related documents are highly similar.

Some initial pre-processing techniques such as removing bit.ly urls, smilies, emoticons, punctuations etc. need to be done to improve precision while searching for similar matches.

We then use Minhash technique using multiple hashes to get a vector representation of the tweet. The idea behind Minhash is simple. If we represent a tweets (T1..Tn) as a set of shingles (t1…tn),  then the Minhash of t1 and t2 are related  to their jaccard similarity as follows.

and the Jaccard Similarity of two sets A and B is given by

So, we can infer the following from the Minhashes of t1 and t2

1)  Very similar pieces of text will have a high probability of having the same MinHash value

2)  If we choose N hash functions, then we would get a nice N dimensional vector representation for a document and  from the property of minhashes, highly similar documents will have highly similar vectors.

Calculation of Minhash:

MinHash is essentially the minimal hash value for a set of shingles. Let N be the number of dimensions for  the generated Minhash Vector for a tweet t having a shingle set S. The pseudocode to calculate the  minhash Vector is

In a matrix form , this can be represented by where each column is a hash function and each row is a shingle.

So our minhash operation would result in a vector V as shown below for a certain tweet.

Now we can see that similar documents will have a high degree of similarity (column-wise). So lets assume we have two very similar tweets that have the following MinHash vectors. We can see that if we split the vector row-wise into bins of size b, there would be several bins that have the same hash values

For ease of understanding we shall assume hash values to be simple numbers. The picture below shows the binning of our min-hash vectors.

From henceforth, I will use these terms frequently and the meaning of these terms are given below to avoid confusion

bins – A part of the Vector generated. Bins are generated by splitting the min-hash vector into equal sized chunks

hash-bucket- The value obtained after hashing the vector in a bin

Here we choose bin-size b=3 where each min-hash vector is of size 6. We then hash each vector in a bin to a hash-bucket. Vector (123,234,989) was hashed to hash-bucket 1. We can see that the two tweets have the same values across bin b1. But they differ across bin b2. So what happens is the tweets are hashed to the same hash bucket for bin b1 and different hash-buckets for b2.

If this scenario occurs, where many bin-vectors are equal for two tweets, then there is a very high probability that the two tweets could be duplicates or near duplicates.

For tweets to share same minhashes across multiple hash functions for multiple bins, it can only be because they have high degree of similarity between them.

A Recap of the whose approach in a streaming scenario

  1. For a new tweet arriving from a stream, generate a shingled representation based on chosen parameters
  2. Generate a ‘N’ dimensional representation of a tweet by min-hashing with ‘N’ hash functions as described above
  3. Split the N dimensional vector into bins and hash each vector in a bin to a hash-bucket.
  4. Check if all the hash-buckets that the tweet was hashed to are unique. If, some previous tweet was also hashed to the same hash_bucket in either one of its b bins, then these tweets can be chosen as candidate pairs for duplicates.

Obviously, the bin size b which we choose has a great influence on our accuracy.

Greater the size of b —> Higher FP rate and Higher Recall

Lesser the size  of b —-> Lesser FP rate but Lesser recall

Advantages of our Approach

  • We do not need to store the whole text in our key-value store. Since we only store the hash-representations of a tweet, the amount of in-memory storage is considerably lesser than storing whole text
  • Can easily scale in a distributed setting, if we used a distributed key-value store for storing the hashes
  • Easily fits into an elegant  hash table based storage setting.  Since a hash lookup is essentially an O(1) operation, we can check for candidate tweets just by ‘b’ number of lookups for the hash values generated for all the bins. Even though we have the latency in generating N functions for a single tweet, this can be easily solved by using map-reduce where hashing can be done in parallel.

I will post the code soon in github after making some optimizations :).

Random Twitter stuff – Get recent tweets of a user

So, recently, we have been working on a project to profile a twitter user. We needed to get the tweets of a particular set of users to test our profiling approach. The best thing about working with twitter is its pretty awesome API that is so comprehensive. I wrote a simple script to extract the recent tweets for a list of users. The code is written in Python and easy to follow.

## Retrieve a user's tweets
## Sample usage -- python get_tweets.py user_name_list.txt output.csv

import urllib2
import sys
import csv
import urllib
import simplejson
import traceback


params = {'include_entities':'true',

def check_unicode(text):
    if isinstance(text,unicode):
        text = text.encode("UTF-8","ignore")
        return text
    elif isinstance(text,str):
        text= unicode(text,"UTF-8").encode("UTF-8","ignore")
    return text    

csv_writer = csv.writer(open(sys.argv[2],"wb"))
user_list = open(sys.argv[1],"rb")
cnt =0
for each_user in user_list:
    print cnt
    encoded_url = urllib.urlencode(params)
    API_CALL = URL_LOCATION + encoded_url
    print API_CALL
        data = urllib2.urlopen(API_CALL).read()
        json_data = simplejson.loads(data)
        for each_timeline in json_data:
        # Some Crap Failed !!!
        print traceback.format_exc()

The Input file should be a file with user name in each line.

Sample I/P file:


Thanks to the awesome developers at Twitter, This hack was done in less than 10 minutes 🙂

Hacking Twitter Data – Fun with visualizing twitter data

So then, after playing around with NetworkX Graph library and UbiGraph Visualization, I couldn’t help but feel excited about wanting to visualize twitter data. Twitter has a pretty neat API that lets you have access to a wide variety of Information. Suppose you want to search the twitter universe for a list of posts that have the word ‘Metallica’, no problem , you can easily programmatically query and retrieve the relevant results.

Tweets can be retweeted and this can be a measure of how much impact a tweet has. Also, we can identify tweeters who are higly influential in the twitter sphere. So pretty recently, this song “Why this Kolaveri di” has gone viral beyond imaginable proportions and I just wanted to visualize tweeters who have tweeted about this song (Not that I am a fan of this song :)). So, I used the twitter API to search for some 10000 tweets that had the phrase “kolaveri” and extracted all tweets that had been retweeted. In Twitter a tweet which has been retweeted can be easily identified and retweet relationship can be easily extracted by a simple regex. (Eg:   RT@David). I then modelled the retweet relationship using the NetworkX graph library in python.

So then , lets visualize what we have extracted and see if we can find something interesting from it. I had been blogging about how awesome Ubigraph is for visualizing complex networks and how easy it is for integration with python based models. I wrote a simple script for integrating the Ubigraph server (It runs as an XML-RPC server) with Networkx Library. So the Visualized Tweeter-Retweeter relationship looked something like this :)))

Twitter Data Visualization

So looks pretty cool huh :)). Now if we focus our attention to the top most section of the graph, we can actually see a pretty dense node, having a lot of edges to it. The Tweet actually originates from a user called (@fakingnews). It is a famous satire website, and we can see that many users have actually retweeted fakingnews’s original tweet. In our case, however the extracted number of nodes are pretty less when compared to Twitter’s huge microblogosphere. However, we can see how much fun it was visualizing complex networks as it gives a great insight into the underlying interacttions among continually evolving networks. Hopefully, in the future , I would post about some statistical measures for measuring influence among users in twitter networks.

God Bless Python :)))