**The Situation**:

Imagine working with a real-time streaming API where you get thousands of posts per second and you need to filter out duplicate posts on the fly. Dealing with duplicate data is bad both for computational operations and for the overall quality of data. We had this problem while working on our real time Twitter streaming data analytics engine. For our purpose, the more unique tweets we had, the better the quality of our analytics. Some instances of near-duplicates/duplicates in a twitter stream which we had encountered were

1) **Marketing Campaigns** : Marketing companies generally create multiple accounts and the same tweet is tweeted multiple times with very little changes.

*Eg:*** @amandakclayton Want to bet on the #KYDerby freely, legally, and securely? Visit TwinSpires! http://t.co/Qz1BkmCb**

* @Dizzale7 Want to bet on the #KYDerby freely, legally, and securely? Visit TwinSpires! http://t.co/NXGyz5es*

2) **Sweepstakes campaigns** : This was actually a pretty interesting problem which we encountered. With companies understanding the raw viral marketing power posed through social media, sweep-stakes campaigns through twitter have gained a lot of popularity. The tweets below were for the LG TV sweepstakes competition.

Eg :* *** I want the new LG CINEMA 3D TV because I love LG 3D TV! http://t.co/o4iN7wDB #LGCINEMA3D**

* I want the new LG CINEMA 3D TV because I want to be one of the cool kids! http://t.co/JnzrrDyv #LGCINEMA3D*

Tweets like these were’nt exactly useful for our purpose they were essentially duplicates/near duplicates

3) **Retweets**: Retweets can be considered as duplicates depending on the context/purpose of our application. For applications where only the textual content of a tweet is needed, retweets are well and truly duplicates.But lets say, we want to profile the users based on their tweets, we probably should not classify retweets as duplicates

**Our Requirements:**

There are so many challenges while trying to solve this problem in a streaming scenario. Approaches to duplicate detection that work pretty well on a static dataset simply are not feasible to implement in a streaming setting.

Our requirements were pretty simple. We wanted an approach that was

1) Scalable

2) Had near Linear time effeciency

3) Used less In-memory storage

**The Challenges:**

Clearly we do not want to check the similarity of a tweet with all the other tweets in our data store. It is highly inefficient and simply not feasible to do so for streaming data.

The other option is the classic inverted index based stores for retrieval of similar tweets. But what happens when new tweets arrive continuously ? The index needs to be updated continuously for the recent tweets. Otherwise the index would get soon outdated and re-indexing is again very expensive.

There always has to be a trade-off between precision and computational complexity. We were okay with getting marking some wrong as long as the approach was scalable. Probabilistic techniques were much better suited to our needs.

**Our Approach:**

Hashing based techniques are probably the most efficient for lookups. We wanted to do something like this

*“Hash a tweet in such a manner that similar tweets have a high probability of having the same hash value”*.

This is where the MinHash technique comes to the picture. The idea is borrowed from a really great book from Stanford called **“Mining Massive Datasets”** authored by Anand Rajaraman and Jeffrey D Ullman.

**Minhash: **

Prior to using minhash, we first generate shingles for a document (shingling). Shingling is essentially similar to generating ‘n grams’ for a document. For example ,for the text below, if we choose a shingle size of 2 words, the shingled output is shown adjacent to it

“The quick brown fox” <=> set( “the quick”, “quick brown”, “brown fox”)

Shingles are a crude way of representing a document. If care is ensured to do proper pre-processing prior to shingling, then shingles of related documents are highly similar.

Some initial pre-processing techniques such as removing bit.ly urls, smilies, emoticons, punctuations etc. need to be done to improve precision while searching for similar matches.

We then use Minhash technique using multiple hashes to get a vector representation of the tweet. The idea behind Minhash is simple. If we represent a tweets (*T1..Tn*) as a set of shingles (*t1…tn*), then the Minhash of *t1* and *t2* are related to their jaccard similarity as follows.

and the Jaccard Similarity of two sets *A* and B is given by

So, we can infer the following from the Minhashes of *t1* and *t2*

1) Very similar pieces of text will have a high probability of having the same MinHash value

2) If we choose *N* hash functions, then we would get a nice *N *dimensional vector representation for a document and from the property of minhashes, highly similar documents will have highly similar vectors.

**Calculation of Minhash:**

MinHash is essentially the minimal hash value for a set of shingles. Let *N* be the number of dimensions for the generated Minhash Vector for a tweet *t *having a shingle set *S*. The pseudocode to calculate the minhash Vector is

In a matrix form , this can be represented by where each column is a hash function and each row is a shingle.

So our minhash operation would result in a vector V as shown below for a certain tweet.

Now we can see that similar documents will have a high degree of similarity (column-wise). So lets assume we have two very similar tweets that have the following MinHash vectors. We can see that if we split the vector row-wise into bins of size b, there would be several bins that have the same hash values

For ease of understanding we shall assume hash values to be simple numbers. The picture below shows the binning of our min-hash vectors.

From henceforth, I will use these terms frequently and the meaning of these terms are given below to avoid confusion

*bins* – A part of the Vector generated. Bins are generated by splitting the min-hash vector into equal sized chunks

*hash-bucket-* The value obtained after hashing the vector in a bin

Here we choose bin-size *b*=3 where each min-hash vector is of size 6. We then hash each vector in a bin to a hash-bucket. Vector (123,234,989) was hashed to hash-bucket 1. We can see that the two tweets have the same values across bin* b1*. But they differ across bin *b2.* So what happens is the tweets are hashed to the same hash bucket for bin *b1* and different hash-buckets for *b2. *

If this scenario occurs, where many bin-vectors are equal for two tweets, then there is a very high probability that the two tweets could be duplicates or near duplicates.

For tweets to share same minhashes across multiple hash functions for multiple bins, it can only be because they have high degree of similarity between them.

**A Recap of the whose approach in a streaming scenario**

- For a new tweet arriving from a stream, generate a shingled representation based on chosen parameters
- Generate a
*‘N’* dimensional representation of a tweet by min-hashing with *‘N’* hash functions as described above
- Split the
*N *dimensional vector into *b *bins and hash each vector in a bin to a hash-bucket.
- Check if all the hash-buckets that the tweet was hashed to are unique. If, some previous tweet was also hashed to the same hash_bucket in either one of its
*b *bins, then these tweets can be chosen as candidate pairs for duplicates.

Obviously, the bin size *b *which we choose has a great influence on our accuracy.

Greater the size of *b —> *Higher FP rate and Higher Recall

Lesser the size of *b —-> *Lesser FP rate but Lesser recall

**Advantages of our Approach**

- We do not need to store the whole text in our key-value store. Since we only store the hash-representations of a tweet, the amount of in-memory storage is considerably lesser than storing whole text
- Can easily scale in a distributed setting, if we used a distributed key-value store for storing the hashes
- Easily fits into an elegant hash table based storage setting. Since a hash lookup is essentially an O(1) operation, we can check for candidate tweets just by
*‘b’* number of lookups for the hash values generated for all the bins. Even though we have the latency in generating *N *functions for a single tweet, this can be easily solved by using map-reduce where hashing can be done in parallel.

I will post the code soon in github after making some optimizations :).