URL-based Web-Page Classification

Using n-gram Language Models

Tarek Amr Abdallah   &  Beatriz de La Iglesia




Why URL-based classification?

- Millions of Links in Emails and Social Media

- Spam and Web Filters

Issues with URL-based classification

- Very Short and Concise documents to Classify

- Can be Composed of Concatenated Words


Build a scalable and accurate model for classifying web-pages using their URLs only.


- Study the existing approaches.

- Present the n-gram Language Model (LM) for classification.

- Highlight the different parameters for the n-gram LM.

- Validate the n-gram LM using different datasets.

- Compare the n-gram LM to existing approaches.


WebKB      1,040 URLs   Function Classification   Cross-Validation
DMOZ      15,000 URLs   Topic Classification      50/50 Train/Test
Delicious 43,000 URLs   Topic Classification      50/50 Train/Test
GVO          550 URLs   Language Classification   50/50 Train/Test

Only the first two datasets were included in the paper submitted to KDIR.

Previous Work

  • Feature Extraction
  • Feature Generation
  • Conventional Classifiers

Feature Extraction

Punctuation-based (terms)

http://www .carsales .com /honda -civic .php

Feature-set = {www, carsales, com, honda, civic, php}
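A minimal sketch of punctuation-based term extraction: split the URL on any non-alphanumeric character and keep the non-empty pieces. The function name `url_terms` is ours, not from the paper; note this simple split also keeps the "http" scheme, which the slide's feature set omits.

```python
import re

def url_terms(url):
    """Split a URL into terms on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-zA-Z0-9]+", url) if t]

print(url_terms("http://www.carsales.com/honda-civic.php"))
# → ['http', 'www', 'carsales', 'com', 'honda', 'civic', 'php']
```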

Feature Extraction

Information Content (IC)

www wheredoesmymoneygo org

where does my money go

X = {x1, x2, ... , xi, ... , xn}

IC(X) = ∑i - log(Pr(xi))
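The IC formula above can be sketched directly: a candidate segmentation with lower total information content (i.e. higher joint probability under a unigram word model) is preferred. The word probabilities below are toy values chosen for illustration, not taken from any real corpus.

```python
import math

def information_content(words, prob):
    """IC(X) = sum_i -log Pr(x_i); lower IC means a more probable segmentation."""
    return sum(-math.log(prob[w]) for w in words)

# Toy unigram probabilities (assumed for illustration only).
p = {"where": 0.01, "does": 0.01, "my": 0.02, "money": 0.005, "go": 0.01,
     "wheredoesmymoneygo": 1e-12}

seg = ["where", "does", "my", "money", "go"]
assert information_content(seg, p) < information_content(["wheredoesmymoneygo"], p)
```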

Feature Extraction


http:// sportingclub . com

2-grams: sp, po, or, ... , om
3-grams: spo, por, ort, rti, ... , com
4-grams: spor, port, orti, ..., club, ... , bcom
5-grams: sport, porti, ortin, ..., gclub, ... , ubcom
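Extracting all contiguous character n-grams from the concatenated URL tokens ("sportingclub" + "com" above) can be sketched in one line; the function name `char_ngrams` is ours.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("sportingclubcom", 3)[:4])  # → ['spo', 'por', 'ort', 'rti']
print(char_ngrams("sportingclubcom", 3)[-1])  # → 'com'
```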

All-Grams Storage

Delicious Dataset


Feature Generation

  • URL Components
  • Length
  • Sequential
  • etc.



Conventional Classifiers

  • Naive Bayes
  • SVM
  • Maximum Entropy
  • Online Learners

The LM Classifier

A URL, Ui, composed of a string of characters, belongs to class Cj if that class's model, Mj, is the most likely to have produced that string of characters.

The LM Classifier

Pr(Mj / URL)

Using Bayes Rule:

Pr(Mj / URL) = Pr(URL / Mj) * Pr(Mj) / Pr(URL)

Training LM

Say we have two classes, C1 and C2, each is composed of one document with the following string of characters:

C1: "X Y X Z X"

C2: "Y X Z Z Z"

M1: X=3, Y=1, Z=1, Total=5
M2: X=1, Y=1, Z=3, Total=5

Training LM

Now we want to classify URL="X Z Z".

M1: X=3, Y=1, Z=1, Total=5

Pr(M1/"X Z Z") ∝ Pr(M1) * Pr("X Z Z"/M1)
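The toy example above can be sketched end to end: count symbol frequencies per class, then (with equal priors) pick the class whose model assigns the higher likelihood to "X Z Z". Variable names are ours; no smoothing is applied, matching the slide.

```python
import math
from collections import Counter

# Toy training data from the slides: one "document" per class.
docs = {"C1": "XYXZX", "C2": "YXZZZ"}
models = {c: Counter(d) for c, d in docs.items()}         # M1: X=3, Y=1, Z=1; M2: X=1, Y=1, Z=3
totals = {c: sum(m.values()) for c, m in models.items()}  # both total 5

def log_likelihood(s, c):
    """log Pr(s | Mc) under a unigram character model (no smoothing)."""
    return sum(math.log(models[c][ch] / totals[c]) for ch in s)

# Classify "XZZ"; with equal priors, pick the class maximising the likelihood.
best = max(models, key=lambda c: log_likelihood("XZZ", c))
print(best)  # → C2, since (1/5)(3/5)(3/5) = 9/125 beats (3/5)(1/5)(1/5) = 3/125
```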

The order of n

Listen Silent

The map of New York The new map of York

uni-gram LM(Listen) = uni-gram LM(Silent)

2-gram: {"li", "is", "st", "te", "en"}

3-gram: {"lis", "ist", "ste", "ten"}

4-gram: {"list", "iste", "sten"}


The order of n

For an alphabet of only 2 characters

The possible vocabulary grows exponentially with n: 2^n possible n-grams

Best value for n

[Plot: Macro-average F1-measure (%) vs. n, for the Delicious and GVO datasets]

Unseen n-Grams?

Laplace (add one) Smoothing

Pr(x) = (count(x) + 1) / (Total + V)

Pr(x) = (count(x) + γ) / (Total + γ * V)

γ controls the probability mass assigned to unseen sequences.
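A minimal sketch of the additive (Lidstone) smoothing formula above, with γ = 1 giving Laplace smoothing. The function name and example numbers are assumptions for illustration.

```python
def smoothed_prob(count, total, vocab_size, gamma=1.0):
    """Additive smoothing: Pr(x) = (count(x) + γ) / (Total + γ·V)."""
    return (count + gamma) / (total + gamma * vocab_size)

# An unseen n-gram (count = 0) still receives a small non-zero probability:
p_unseen = smoothed_prob(0, 100, 26)   # γ = 1 → 1/126
p_seen = smoothed_prob(10, 100, 26)    # → 11/126
assert 0 < p_unseen < p_seen
```

A smaller γ shifts less probability mass away from seen n-grams toward unseen ones.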



Log-likelihood Odds

Pr(Mj/Ui) ∝ Pr(Ui/Mj) * Pr(Mj)


Pr(Mj/Ui) / Pr(¬Mj/Ui) = [Pr(Ui/Mj) / Pr(Ui/¬Mj)] * [Pr(Mj) / Pr(¬Mj)]

LLO = log( Pr(Mj/Ui) / Pr(¬Mj/Ui) )
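The log-likelihood odds can be sketched as a running sum of per-n-gram log-ratios between a class model and its complement, plus the log-ratio of the priors; a positive score favours Mj. The function name and the toy probabilities are assumptions.

```python
import math

def llo_score(ngrams, p_class, p_rest, prior_class, prior_rest):
    """Log-likelihood odds of class Mj vs. its complement ¬Mj for one URL.
    p_class / p_rest map each n-gram to its (smoothed) probability."""
    score = math.log(prior_class) - math.log(prior_rest)
    for g in ngrams:
        score += math.log(p_class[g]) - math.log(p_rest[g])
    return score  # > 0 favours Mj

# Toy per-n-gram probabilities (assumed for illustration).
pc = {"spo": 0.02, "por": 0.01}
pr = {"spo": 0.001, "por": 0.001}
assert llo_score(["spo", "por"], pc, pr, 0.5, 0.5) > 0
```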

WebKB Results

Macro Average F1-measure

WebKB: Function classification task

DMOZ Results

Macro Average F1-measure

DMOZ: Topic classification task


- For some cases the proposed method required less than 0.001% of the storage and processing power needed by the previous methods.

- The performance of the proposed method is better than the all-grams approach for 3 out of the 4 datasets.

- For the DMOZ dataset, the performance of the proposed method is comparable to that of the all-grams approach.

- The proposed method does not require feature selection or feature extraction.


Backup Slides

GVO Results

Macro Average F1-measure

GVO: Language classification task

Delicious Results

The 5-gram LM is better than the all-grams approach.

The True Positive rate is significantly better (p-value < 0.05).

The TP rates for the 3-gram and 8-gram LMs are marginally better, with p-values of 0.132.

False Positive rates are not significantly different for the four classifiers.

Learning Rate

The number of misclassified URLs per each 1000 training/test instances (Delicious, γ = 1).

[Plot legend: n-gram LM vs. n-gram LM with LLO]

Published under Creative Commons license