My random Ramblings


Useful Bash Hacks

So, listed below are some of my favorite bash commands that I use very frequently.

1) Move all files of a particular type in a directory

The command below searches for all the zip files in a particular directory hierarchy and moves them to a new directory. The depth of the search can be adjusted with the -maxdepth option of the find command. I often use this to clean up my Desktop.

find ./ -name "*.zip" -exec mv {} ./zipped_files \;

For all files in the current directory only, ignoring sub-directories –

find ./ -maxdepth 1 -name "*.zip" -exec mv {} ./zipped_files \;

2) Modify each line in a text file

The sed command is pretty useful for all kinds of stream editing. Let's say I want to enclose each line of a text file in an XML tag; sed makes this pretty easy.

Input file (sample.txt) -- for example:

1001
1002
1003

The command which I am going to use is

sed 's:.*:<id>&</id>:' sample.txt > new_file.txt

And the output file would be

<id>1001</id>
<id>1002</id>
<id>1003</id>
3) Cut command Example

Again, this is a pretty common hack: list all users on the system along with their home directories.

cut -d : -f 1,6 /etc/passwd

Near duplicate detection in streaming twitter data

The Situation:

Imagine working with a real-time streaming API where you get thousands of posts per second and need to filter out duplicate posts on the fly. Duplicate data is bad both for computational cost and for the overall quality of the data. We faced this problem while working on our real-time Twitter streaming analytics engine. For our purpose, the more unique tweets we had, the better the quality of our analytics. Some instances of near-duplicates/duplicates which we encountered in a Twitter stream were:

1) Marketing campaigns: Marketing companies generally create multiple accounts, and the same tweet is tweeted multiple times with very minor changes.

Eg: @amandakclayton Want to bet on the #KYDerby freely, legally, and securely? Visit TwinSpires! http://t.co/Qz1BkmCb

      @Dizzale7 Want to bet on the #KYDerby freely, legally, and securely? Visit TwinSpires! http://t.co/NXGyz5es

2) Sweepstakes campaigns: This was actually a pretty interesting problem which we encountered. With companies understanding the raw viral marketing power of social media, sweepstakes campaigns through Twitter have gained a lot of popularity. The tweets below were for the LG TV sweepstakes competition.

Eg :  I want the new LG CINEMA 3D TV because I love LG 3D TV! http://t.co/o4iN7wDB #LGCINEMA3D

         I want the new LG CINEMA 3D TV because I want to be one of the cool kids! http://t.co/JnzrrDyv #LGCINEMA3D

Tweets like these weren't exactly useful for our purpose; they were essentially duplicates/near-duplicates.

3) Retweets: Retweets can be considered duplicates depending on the context/purpose of the application. For applications where only the textual content of a tweet is needed, retweets are well and truly duplicates. But let's say we want to profile users based on their tweets; then we probably should not classify retweets as duplicates.

Our Requirements:

There are many challenges in solving this problem in a streaming scenario. Approaches to duplicate detection that work pretty well on a static dataset are simply not feasible in a streaming setting.

Our requirements were pretty simple. We wanted an approach that was

1) Scalable

2) Near-linear time efficiency

3) Low in-memory storage

The Challenges:

Clearly we do not want to check the similarity of a tweet with all the other tweets in our data store. It is highly inefficient and simply not feasible to do so for streaming data.

The other option is the classic inverted-index-based store for retrieval of similar tweets. But what happens when new tweets arrive continuously? The index needs to be updated continuously with the recent tweets; otherwise it would soon get outdated, and re-indexing is very expensive.

There always has to be a trade-off between precision and computational complexity. We were okay with marking some tweets wrongly as long as the approach was scalable. Probabilistic techniques were much better suited to our needs.

Our Approach:

Hashing based techniques are probably the most efficient for lookups. We wanted to do something like this

“Hash a tweet in such a manner that similar tweets have a high probability of having the same hash value”.

This is where the MinHash technique comes into the picture. The idea is borrowed from a really great book from Stanford called “Mining Massive Datasets”, authored by Anand Rajaraman and Jeffrey D. Ullman.


Prior to using MinHash, we first generate shingles for a document (shingling). Shingling is essentially similar to generating n-grams for a document. For example, for the text below, if we choose a shingle size of 2 words, the shingled output is shown adjacent to it:

“The quick brown fox” <=>  set( “the quick”, “quick brown”, “brown fox”)
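A word-level shingling helper can be sketched in a couple of lines of Python (the lowercasing here stands in for fuller pre-processing):

```python
def shingles(text, k=2):
    # Word-level k-shingles: every run of k consecutive (lowercased) words
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

print(shingles("The quick brown fox"))
# {'the quick', 'quick brown', 'brown fox'}
```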

Shingles are a crude way of representing a document. If proper pre-processing is done prior to shingling, the shingles of related documents will be highly similar.

Some initial pre-processing, such as removing bit.ly URLs, smileys, emoticons, punctuation etc., needs to be done to improve precision while searching for similar matches.

We then use the MinHash technique with multiple hash functions to get a vector representation of the tweet. The idea behind MinHash is simple. If we represent two tweets T1 and T2 as shingle sets t1 and t2, then the MinHashes of t1 and t2 are related to their Jaccard similarity as follows:

P( MinHash(t1) = MinHash(t2) ) = JaccardSim(t1, t2)

and the Jaccard similarity of two sets A and B is given by

JaccardSim(A, B) = |A ∩ B| / |A ∪ B|

So, we can infer the following from the MinHashes of t1 and t2:
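The Jaccard similarity of two shingle sets is straightforward to compute directly, which is handy for verifying candidate pairs; a minimal sketch:

```python
def jaccard(a, b):
    # |A intersect B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two near-duplicate tweets share most of their shingles
s1 = {"want to", "to bet", "bet on", "on the"}
s2 = {"want to", "to bet", "bet on", "on that"}
print(jaccard(s1, s2))  # 0.6
```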

1)  Very similar pieces of text will have a high probability of having the same MinHash value

2)  If we choose N hash functions, then we get a nice N-dimensional vector representation for a document, and from the property of MinHashes, highly similar documents will have highly similar vectors.

Calculation of Minhash:

MinHash is essentially the minimal hash value over a set of shingles. Let N be the number of dimensions of the generated MinHash vector for a tweet t with shingle set S. For each of the N hash functions, we hash every shingle in S and keep the minimum value as one component of the vector.

In matrix form, this can be represented as a matrix where each column is a hash function and each row is a shingle; taking the column-wise minimum gives the MinHash vector.

So our MinHash operation results in a vector V for each tweet, whose i-th component is the minimum of hash function i over all the tweet's shingles.
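A minimal Python sketch of this calculation (simulating the N hash functions by salting MD5 with an index is just an illustrative choice, not the only option):

```python
import hashlib

def minhash_vector(shingle_set, n=6):
    # Hash function i is simulated by salting MD5 with the index i.
    # Component i of the vector is the minimum of hash_i over all shingles.
    return [
        min(int(hashlib.md5(("%d:%s" % (i, s)).encode("utf-8")).hexdigest(), 16)
            for s in shingle_set)
        for i in range(n)
    ]
```

Similar shingle sets will then agree on many components of this vector.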

Now we can see that similar documents will have similar MinHash vectors (component-wise). So let's assume we have two very similar tweets with the following MinHash vectors (for ease of understanding, we shall assume the hash values are simple numbers):

T1 -> (123, 234, 989, 456, 876, 678)
T2 -> (123, 234, 989, 124, 876, 333)

If we split each vector row-wise into bins of size b, several bins will have the same hash values.

From henceforth, I will use these terms frequently and the meaning of these terms are given below to avoid confusion

bins – A part of the Vector generated. Bins are generated by splitting the min-hash vector into equal sized chunks

hash-bucket- The value obtained after hashing the vector in a bin

Here we choose bin size b = 3, where each MinHash vector is of size 6, giving two bins b1 and b2. We then hash each bin's sub-vector to a hash-bucket; the sub-vector (123, 234, 989) in bin b1, say, is hashed to hash-bucket 1. The two tweets have the same values across bin b1 but differ across bin b2, so they are hashed to the same hash-bucket for b1 and to different hash-buckets for b2.

If this scenario occurs, where many bin-vectors are equal for two tweets, then there is a very high probability that the two tweets are duplicates or near-duplicates. For tweets to share the same MinHashes across multiple hash functions for multiple bins, it is almost certainly because they have a high degree of similarity between them.

A Recap of the whole approach in a streaming scenario

  1. For a new tweet arriving from a stream, generate a shingled representation based on chosen parameters
  2. Generate a ‘N’ dimensional representation of a tweet by min-hashing with ‘N’ hash functions as described above
  3. Split the N dimensional vector into bins and hash each vector in a bin to a hash-bucket.
  4. Check if all the hash-buckets that the tweet was hashed to are unique. If some previous tweet was also hashed to the same hash-bucket in any one of its bins, then these two tweets can be chosen as a candidate duplicate pair.
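The steps above can be put together in a minimal, self-contained sketch (the class name, the salted-MD5 hash functions and the in-memory dict standing in for a key-value store are all illustrative assumptions, not our production code):

```python
import hashlib
from collections import defaultdict

def minhash(shingle_set, n):
    # Hash function i is simulated by salting MD5 with the index i
    return tuple(
        min(int(hashlib.md5(("%d:%s" % (i, s)).encode("utf-8")).hexdigest(), 16)
            for s in shingle_set)
        for i in range(n)
    )

class NearDuplicateFilter(object):
    """Keeps only hash-buckets in memory, never the tweet text itself."""

    def __init__(self, n=6, bin_size=3, shingle_size=2):
        self.n, self.b, self.k = n, bin_size, shingle_size
        # One hash-bucket table per bin: hashed bin-vector -> list of tweet ids
        self.buckets = [defaultdict(list) for _ in range(n // bin_size)]

    def _shingles(self, text):
        words = text.lower().split()
        return {" ".join(words[i:i + self.k])
                for i in range(len(words) - self.k + 1)}

    def add(self, tweet_id, text):
        """Return ids of candidate near-duplicates seen so far, then index the tweet."""
        vec = minhash(self._shingles(text), self.n)
        candidates = set()
        for bin_idx, table in enumerate(self.buckets):
            key = hash(vec[bin_idx * self.b:(bin_idx + 1) * self.b])
            candidates.update(table[key])  # earlier tweets in the same bucket
            table[key].append(tweet_id)
        return candidates
```

An exact duplicate collides in every bin, so feeding the same text twice flags the second copy; a near-duplicate typically collides in at least one bin.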

Obviously, the bin size b which we choose has a great influence on our accuracy (with the vector length N fixed):

Smaller bins (smaller b, hence more bins) —> higher recall, but a higher FP rate

Larger bins (larger b, hence fewer bins) —> lower FP rate, but lower recall

Advantages of our Approach

  • We do not need to store the whole text in our key-value store. Since we only store the hash representations of a tweet, the amount of in-memory storage is considerably less than storing whole text
  • Can easily scale in a distributed setting if we use a distributed key-value store for storing the hashes
  • Easily fits into an elegant hash-table-based storage setting. Since a hash lookup is essentially an O(1) operation, we can check for candidate tweets with just one lookup per bin. Even though there is some latency in computing the N hash values for a single tweet, this can easily be parallelized, for example with map-reduce.

I will post the code soon in github after making some optimizations :).

Random Twitter stuff – Get recent tweets of a user

So, recently, we have been working on a project to profile a Twitter user. We needed the tweets of a particular set of users to test our profiling approach. The best thing about working with Twitter is its pretty awesome and comprehensive API. I wrote a simple script to extract the recent tweets for a list of users. The code is written in Python and is easy to follow.

## Retrieve a user's tweets
## Sample usage -- python get_tweets.py user_name_list.txt output.csv

import urllib2
import sys
import csv
import urllib
import simplejson
import traceback


# The v1 user_timeline endpoint, as it existed at the time of writing
URL_LOCATION = "http://api.twitter.com/1/statuses/user_timeline.json?"
params = {'include_entities': 'true', 'count': '200'}

def check_unicode(text):
    if isinstance(text, unicode):
        text = text.encode("UTF-8", "ignore")
        return text
    elif isinstance(text, str):
        text = unicode(text, "UTF-8").encode("UTF-8", "ignore")
    return text

csv_writer = csv.writer(open(sys.argv[2], "wb"))
user_list = open(sys.argv[1], "rb")
cnt = 0
for each_user in user_list:
    cnt += 1
    print cnt
    params['screen_name'] = each_user.strip()
    encoded_url = urllib.urlencode(params)
    API_CALL = URL_LOCATION + encoded_url
    print API_CALL
    try:
        data = urllib2.urlopen(API_CALL).read()
        json_data = simplejson.loads(data)
        for each_timeline in json_data:
            csv_writer.writerow([each_user.strip(),
                                 check_unicode(each_timeline['text'])])
    except:
        # Some Crap Failed !!!
        print traceback.format_exc()

The input file should be a plain text file with one user name per line.

Sample I/P file:

username_1
username_2
Thanks to the awesome developers at Twitter, this hack was done in less than 10 minutes 🙂

What it means to work in a startup


Stumbled across this article on Hacker News. Was a pretty fun read. It’s been a year at Serendio and I can truly empathize with the writer’s views. It pretty much neatly sums up the joys and challenges of working in a startup.

Fascination with Gentoo Linux

So for quite some time, I have been obsessed with trying to understand the marvel that is Linux. I had used Ubuntu for quite a while and decided to try my hand at some other Linux distros. I had been reading about Gentoo and I must say I was pretty impressed 🙂

For example, setting up Ubuntu is not that challenging. Any monkey with a laptop can easily set it up because its installation is pretty user-friendly. Somehow I wanted to use a distro whose kernel could be customized according to a system’s specific needs and which exposed me to the inner workings of Linux. Hence, I have arrived at Gentoo 🙂

Setting up Gentoo should be a pretty fun exercise. It would be a nice chance for me to compile the Gentoo kernel specific to my system’s hardware requirements. Also, Portage, the package management system in Gentoo, is Python-based and is pretty awesome. So let me try this exercise and blog about my experience in my next post.

Hacking Twitter Data – Fun with visualizing twitter data

So then, after playing around with NetworkX Graph library and UbiGraph Visualization, I couldn’t help but feel excited about wanting to visualize twitter data. Twitter has a pretty neat API that lets you have access to a wide variety of Information. Suppose you want to search the twitter universe for a list of posts that have the word ‘Metallica’, no problem , you can easily programmatically query and retrieve the relevant results.

Tweets can be retweeted, and this can be a measure of how much impact a tweet has. Also, we can identify tweeters who are highly influential in the Twittersphere. Pretty recently, the song “Why this Kolaveri di” has gone viral beyond imaginable proportions, and I just wanted to visualize the tweeters who have tweeted about it (not that I am a fan of the song :)). So, I used the Twitter API to search for some 10000 tweets that had the phrase “kolaveri” and extracted all tweets that had been retweeted. In Twitter, a tweet which has been retweeted can be easily identified, and the retweet relationship can be extracted with a simple regex (e.g. RT @David). I then modelled the retweet relationship using the NetworkX graph library in Python.
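The regex extraction mentioned above can be sketched like this (the pattern is a simplification; real retweet conventions vary a bit):

```python
import re

# "RT @username" marks a retweet; capture the retweeted user's screen name
RT_PATTERN = re.compile(r'RT\s*@(\w+)')

def retweeted_user(tweet):
    """Return the screen name being retweeted, or None if not a retweet."""
    match = RT_PATTERN.search(tweet)
    return match.group(1) if match else None

print(retweeted_user("RT @David: Why this Kolaveri di"))  # David
```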

So then, let’s visualize what we have extracted and see if we can find something interesting in it. I had been blogging about how awesome UbiGraph is for visualizing complex networks and how easy it is to integrate with Python-based models. I wrote a simple script for integrating the UbiGraph server (it runs as an XML-RPC server) with the NetworkX library. The visualized tweeter-retweeter relationship looked something like this :)))

Twitter Data Visualization

So, looks pretty cool huh :)). Now if we focus our attention on the topmost section of the graph, we can actually see a pretty dense node with a lot of edges to it. The tweet actually originates from a user called @fakingnews. It is a famous satire website, and we can see that many users have retweeted fakingnews’s original tweet. In our case, the extracted number of nodes is pretty small compared to Twitter’s huge microblogosphere. Still, we can see how much fun visualizing complex networks is, as it gives great insight into the underlying interactions among continually evolving networks. Hopefully, in the future, I will post about some statistical measures of influence among users in Twitter networks.

God Bless Python :)))

Playing around with Ubigraph-NetworkX Library in Python

I have been interested in social media visualization for quite some time and was looking for some great visualization tools for Python-based graphs. I was extremely impressed by UbiGraph’s visualization tools (http://ubietylab.net/ubigraph/). It has a pretty awesome graph visualization UI. The coolest thing, which Python users will love, is its integration with another awesome graph library in Python, NetworkX (http://networkx.lanl.gov/), a great tool for modeling complex networks. These two tools are gonna make hacking social media data so much fun 🙂

In my next post we’ll use a simple script to visualize twitter data as interactive graphs.

Liverpool Vs Chelsea :)

I just finished watching Liverpool kicking Chelsea’s ass, and as a Reds fan, I couldn’t have asked for anything better 🙂 .!! And what a match it was ..!!! Chelsea having bought two of the best ex-Liverpool players in Torres and Meireles, and still getting a goal scored against them by Glen Johnson, was one of the biggest ironies 🙂 ..!! Kenny Dalglish made an awesome first-team lineup in choosing Rodriguez over Downing, and how well did it pay off 🙂 . Well, a great day to be a Reds fan.

YNWA ..!!!!
