URL-based Web-Page Classification


Using n-gram Language Models


Tarek Amr Abdallah & Beatriz de La Iglesia


http://tarekamr.appspot.com/msc/presentation

tarekamr.com

Introduction

Why URL-based classification?

- Millions of Links in Emails and Social Media

- Spam and Web Filters

Issues with URL-based classification

- Very Short and Concise documents to Classify

- Can be Composed of Concatenated Words

Aim

Build a scalable and accurate model for classifying web-pages using their URLs only.

Objectives

- Study the existing approaches.

- Present the n-gram Language Model (LM) for classification.

- Highlight the different parameters for the n-gram LM.

- Validate the n-gram LM using different datasets.

- Compare the n-gram LM to existing approaches.

Datasets


Dataset     Size          Task                       Evaluation
WebKB       1,040 URLs    Function classification    Cross-validation
DMOZ        15,000 URLs   Topic classification       50/50 train/test
Delicious   43,000 URLs   Topic classification       50/50 train/test
GVO         550 URLs      Language classification    50/50 train/test

Only the first two datasets were included in the paper submitted to KDIR.

Previous Work


  • Feature Extraction
  • Feature Generation
  • Conventional Classifiers

Feature Extraction


Punctuation-based (terms)

http://www .carsales .com /honda -civic .php


Feature-set = {www, carsales, com, honda, civic, php}
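
A minimal sketch of punctuation-based term extraction, splitting on any non-alphanumeric character (the exact delimiter set, the lower-casing, and whether the scheme token "http" is kept are assumptions, not necessarily the preprocessing used in the cited work):

```python
import re

def url_terms(url):
    """Split a URL into terms on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-zA-Z0-9]+", url.lower()) if t]

print(url_terms("http://www.carsales.com/honda-civic.php"))
# ['http', 'www', 'carsales', 'com', 'honda', 'civic', 'php']
# The slide's feature-set additionally drops the scheme token 'http'.
```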


Feature Extraction


Information Content (IC)

www wheredoesmymoneygo org

where does my money go



X = {x1, x2, ... , xi, ... , xn}

IC(X) = ∑i - log(Pr(xi))
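
A small sketch of computing IC for a candidate segmentation. The word probabilities below are made up for illustration; how the IC values are then used to accept or reject a split follows the original segmentation method rather than this snippet:

```python
import math

def information_content(tokens, word_prob):
    """IC(X) = sum_i -log Pr(x_i) for a candidate token sequence X."""
    return sum(-math.log(word_prob[t]) for t in tokens)

# Hypothetical word probabilities, for illustration only
probs = {"where": 0.010, "does": 0.012, "my": 0.020, "money": 0.005, "go": 0.015}
print(information_content(["where", "does", "my", "money", "go"], probs))
```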

Feature Extraction


All-grams

http:// sportingclub . com


2-grams: sp, po, or, ... , om
3-grams: spo, por, ort, rti, ... , com
4-grams: spor, port, orti, ..., club, ... , bcom
5-grams: sport, porti, ortin, ..., gclub, ... , ubcom
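
A minimal sketch of all-grams extraction; the 2-5 range and the removal of punctuation mirror the slide, but the exact normalisation used in the previous work may differ:

```python
def char_ngrams(text, n_min, n_max):
    """All character n-grams of lengths n_min..n_max (the 'all-grams')."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

grams = char_ngrams("sportingclubcom", 2, 5)   # punctuation removed, as on the slide
print(grams[:5])    # ['sp', 'po', 'or', 'rt', 'ti']
print(grams[-1])    # 'ubcom'
```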

All-Grams Storage



Delicious Dataset

Feature Generation


  • URL Components
  • Length
  • Sequential
  • etc.

[http][://][www.uea.ac.uk][/][~kzch.php]
[http][://][www.uea.ac.uk][/][news][/][2013][/][exams.php]
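
A small illustrative sketch of hand-crafted feature generation from URL components; the specific features below are examples, not the exact feature set used in the previous work:

```python
from urllib.parse import urlparse

def url_features(url):
    """Example hand-crafted URL features: components, lengths, token counts."""
    parsed = urlparse(url)
    path_tokens = [t for t in parsed.path.split("/") if t]
    return {
        "host": parsed.netloc,
        "num_path_tokens": len(path_tokens),
        "url_length": len(url),
        "has_tilde": "~" in parsed.path,
    }

print(url_features("http://www.uea.ac.uk/news/2013/exams.php"))
# {'host': 'www.uea.ac.uk', 'num_path_tokens': 3, 'url_length': 40, 'has_tilde': False}
```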

Classifiers


  • Naive Bayes
  • SVM
  • Maximum Entropy
  • Online Learners

The LM Classifier


A URL, Ui, is a string of characters. It is assigned to class Cj if the language model of that class, Mj, is the most likely one to have produced that string of characters.

The LM Classifier


Pr(Mj | URL)

Using Bayes' rule:

Pr(Mj | URL) = Pr(URL | Mj) · Pr(Mj) / Pr(URL)
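
Since Pr(URL) is the same for every class, classification reduces to picking the class whose model maximises Pr(URL | Mj) · Pr(Mj). A minimal sketch, assuming a per-class function that returns smoothed n-gram log-probabilities; the names and toy numbers below are illustrative, not from the paper:

```python
import math

def score(url_ngrams, ngram_log_prob, log_prior):
    # log Pr(Mj | URL) up to the constant log Pr(URL):
    # log Pr(Mj) + sum_i log Pr(ngram_i | Mj)
    return log_prior + sum(ngram_log_prob(g) for g in url_ngrams)

def classify(url_ngrams, models):
    # models: {class_name: (ngram_log_prob, log_prior)} -- assumed structure
    return max(models, key=lambda c: score(url_ngrams, *models[c]))

# Toy usage with made-up per-class log-probability functions:
toy_models = {
    "news":  (lambda g: math.log(0.02 if "new" in g else 0.01), math.log(0.5)),
    "sport": (lambda g: math.log(0.03 if "spo" in g else 0.01), math.log(0.5)),
}
print(classify(["spo", "por", "ort"], toy_models))   # 'sport'
```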


Training LM


Say we have two classes, C1 and C2, each composed of one document with the following string of characters:


C1: "X Y X Z X"

C2: "Y X Z Z Z"


M1: X=3, Y=1, Z=1, Total=5
M2: X=1, Y=1, Z=3, Total=5

Training LM


Now we want to classify URL="X Z Z".


M1: X=3, Y=1, Z=1, Total=5


Pr(M1/"X Z Z")P(M1) * Pr("X Z Z"/M1)

The order of n


Listen vs. Silent

The map of New York vs. The new map of York


uni-gram LM(Listen) = uni-gram LM(Silent)


2-gram: {"li", "is", "st", "te", "en"}

3-gram: {"lis", "ist", "ste", "ten"}

4-gram: {"list", "iste", "sten"}



The order of n


For an alphabet of only 2 characters,

the possible vocabulary grows exponentially with n: 2^n possible n-grams.

Best value for n


[Figure: macro-average F1-measure (%) against the n-gram order, for the Delicious and GVO datasets]

Unseen n-Grams?


Laplace (add one) Smoothing



Pr(x) = (count(x) + 1) / (Total + V)

Pr(x) = (count(x) + γ) / (Total + γ·V)


γ controls the probability mass assigned to unseen sequences.
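
A minimal sketch of the smoothed estimate above; the function name and the toy counts are illustrative:

```python
def smoothed_prob(count, total, vocab_size, gamma=1.0):
    """Additive smoothing; gamma = 1 gives Laplace add-one smoothing."""
    return (count + gamma) / (total + gamma * vocab_size)

# An unseen n-gram (count 0) still receives a small, non-zero probability:
print(smoothed_prob(0, total=5, vocab_size=3, gamma=1.0))   # 0.125
print(smoothed_prob(3, total=5, vocab_size=3, gamma=0.1))   # ~0.585
```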

Smoothing


γ: 0.00
Pr(x) = (count(x) + γ) / (Total + γ·V)

Relevance

Log-likelihood Odds


Pr(Mj | Ui) ∝ Pr(Ui | Mj) · Pr(Mj)


Pr(Mj | Ui) / Pr(¬Mj | Ui)


= [ Pr(Ui | Mj) / Pr(Ui | ¬Mj) ] · [ Pr(Mj) / Pr(¬Mj) ]


LLO = log( Pr(Mj | Ui) / Pr(¬Mj | Ui) )
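
A small sketch of scoring with log-likelihood odds, assuming the class and complement-class log-likelihoods and log-priors have already been computed; the numbers below are made up:

```python
import math

def llo_score(log_p_url_given_class, log_p_url_given_rest,
              log_prior_class, log_prior_rest):
    """Log-likelihood odds: log[ Pr(Mj | Ui) / Pr(not-Mj | Ui) ],
    expanded via Bayes' rule into a likelihood ratio plus a prior ratio."""
    return (log_p_url_given_class - log_p_url_given_rest) + \
           (log_prior_class - log_prior_rest)

# A positive score favours class Mj over its complement.
print(llo_score(-12.0, -15.0, math.log(0.3), math.log(0.7)))   # ~2.15
```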

WebKB Results


[Figure: macro-average F1-measure on WebKB (function classification task)]

DMOZ Results


[Figure: macro-average F1-measure on DMOZ (topic classification task)]

Conclusion


- In some cases, the proposed method required less than 0.001% of the storage and processing power needed by the previous methods.

- The performance of the proposed method is better than that of the all-grams approach for 3 out of the 4 datasets.

- For the DMOZ dataset, the performance of the proposed method is comparable to that of the all-grams approach.

- The proposed method doesn't require feature selection or feature extraction.

Questions


Backup Slides


GVO Results


[Figure: macro-average F1-measure on GVO (language classification task)]

Delicious Results


The 5-gram LM is better than the all-grams approach.

Its true positive rate is better, with a p-value < 0.05.

The TP rates for the 3-gram and 8-gram LMs are marginally better, with p-values of 0.132.

False positive rates are not significantly different for the four classifiers.


Learning Rate


The number of misclassified URLs per 1,000 training/test instances (Delicious, γ = 1).

[Chart series: n-gram LM, n-gram LM/LLM]

Published under Creative Commons license