I was able to obtain a list of the top 100,000 URLs saved on del.icio.us, a social bookmarking service that was owned by Yahoo, circa 2006. The URLs are ranked by the number of users who saved them. Users assign keywords (tags) to organise the URLs, and each URL is listed with the top 10 tags assigned to it. There are 29,575 unique tags in total; 18,298 of them are used fewer than 3 times, and only 1,020 are used more than 95 times.
I then clustered the tags into seven classes. First, tags used fewer than 95 times were excluded. Then I constructed a co-occurrence matrix, with tags on both its rows and columns; the number in each cell is the number of times the row and column tags are assigned together to the same URLs. The diagonal of the matrix represents the co-occurrence of each tag with itself, i.e. the total number of times it is used to tag URLs. K-Means clustering was then applied to group similar tags together, yielding seven clusters. The following menu contains the names given to the clusters, and the tag cloud displays the tags within each cluster.
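A minimal sketch of building such a co-occurrence matrix, using a hypothetical toy dataset (the URL-to-tag mapping and the tag names are made up for illustration; the real data has thousands of tags):

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each URL maps to its list of frequent tags.
url_tags = {
    "url1": ["python", "programming", "web"],
    "url2": ["python", "programming"],
    "url3": ["web", "design"],
}

# Fixed tag order for the matrix rows/columns.
tags = sorted({t for ts in url_tags.values() for t in ts})
index = {t: i for i, t in enumerate(tags)}

# Build the symmetric co-occurrence matrix; the diagonal counts
# how many URLs each tag is assigned to.
n = len(tags)
cooc = [[0] * n for _ in range(n)]
for ts in url_tags.values():
    for t in ts:
        cooc[index[t]][index[t]] += 1
    for a, b in combinations(ts, 2):
        cooc[index[a]][index[b]] += 1
        cooc[index[b]][index[a]] += 1
```

The rows of this matrix can then be fed to any K-Means implementation (e.g. scikit-learn's `KMeans`) to group tags with similar co-occurrence profiles.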
The above clusters were later used as class labels for the URLs. For each URL, the tags assigned to it were replaced with the clusters they belong to; the infrequent tags excluded earlier were excluded here as well. A majority vote was then applied to decide a single class label for each URL, leaving 43,223 URLs with confident class labels. As you can see, del.icio.us users were mainly geeks, at least back in 2006: two of the top three classes of URLs are computer related. The class that comes right after is less homogeneous, with politics, news and business related URLs in it.
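The tag-to-cluster replacement and majority vote can be sketched as follows (the cluster names and the tag-to-cluster mapping here are stand-ins, not the actual seven clusters):

```python
from collections import Counter

# Hypothetical tag-to-cluster mapping; infrequent tags are simply
# absent from it, so they drop out of the vote.
tag_cluster = {
    "python": "programming",
    "linux": "programming",
    "news": "media",
    "politics": "media",
}

def label_url(url_tags):
    """Map each frequent tag to its cluster, then take a majority vote.

    Returns None when no tag survives the frequency filter, i.e. the
    URL gets no confident class label."""
    votes = Counter(tag_cluster[t] for t in url_tags if t in tag_cluster)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, a URL tagged `["python", "linux", "news"]` would be labelled `"programming"` (two votes against one), while a URL with only unmapped tags would be left unlabelled.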
I then extracted all possible n-grams of a fixed length from the URLs in each class. For example, given the URL www.yahoo.com, I first stripped the dots to get wwwyahoocom; all possible 3-grams are then www, wwy, wya, yah, aho, hoo, ooc, oco and com. For each n between 1 and 20, I counted the number of unique n-grams per class. In the following graph, the x-axis represents the variation in n, and the y-axis shows the number of unique n-grams for each of the seven classes.
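The n-gram extraction itself is a simple sliding window over the dot-stripped URL, something like:

```python
def url_ngrams(url, n):
    """Strip the dots, then slide a window of length n over the URL."""
    s = url.replace(".", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def unique_ngrams(urls, n):
    """The set of distinct n-grams seen across a class's URLs."""
    return {g for url in urls for g in url_ngrams(url, n)}
```

Running `url_ngrams("www.yahoo.com", 3)` reproduces the nine 3-grams listed above, and `len(unique_ngrams(class_urls, n))` gives the per-class counts plotted in the graph.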
As seen above, different n-gram lengths result in different numbers of unique n-grams per class. Similarly, when constructing a classifier using those n-grams, the percentage of unseen n-grams in the testing phase should also vary according to the value of n.
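One way to measure that effect is to compute, for a given n, the fraction of a test set's n-grams that never appear in the training set (a sketch under the same dot-stripping assumption as before):

```python
def url_ngrams(url, n):
    """Strip the dots, then slide a window of length n over the URL."""
    s = url.replace(".", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def unseen_fraction(train_urls, test_urls, n):
    """Fraction of the test URLs' n-grams not seen in training."""
    seen = {g for url in train_urls for g in url_ngrams(url, n)}
    test_grams = [g for url in test_urls for g in url_ngrams(url, n)]
    if not test_grams:
        return 0.0
    return sum(g not in seen for g in test_grams) / len(test_grams)
```

As n grows, `seen` covers a vanishing share of the possible n-grams, so this fraction tends towards 1, which is exactly why the choice of n matters for the classifier.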
The above is part of the data exploration phase of my MSc project, which is about classifying web pages using their URLs only. You can visit the project's slides here; feel free to randomly click on parts of each slide, as there are lots of animations done using d3.js. The slides themselves are built with reveal.js.