Tarek Amr's Homepage

Data Mining

URL-Based Web Page Classification Using n-Gram Language Models (Published Paper / KDIR14)

This paper is concerned with the classification of web pages using their Uniform Resource Locators (URLs) only. There are a number of contexts these days in which it is important to have an efficient and reliable way to classify a web page from its URL alone, without the need to visit the page itself.

For example, emails or messages sent on social media may contain URLs that require automatic classification. A URL is very concise and may be composed of concatenated words, so classification based on this information alone is a challenging task.

Much of the current research on URL-based classification has achieved reasonable accuracy, but existing methods do not scale well to large datasets. In this paper, we propose a new solution based on an n-gram language model. Our solution shows good classification performance and scales to larger datasets. It also allows us to tackle the problem of classifying new URLs containing unseen sub-sequences.
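
To give a flavour of the approach, below is a minimal sketch of per-class character n-gram language models scoring a URL; the trigram order, add-one smoothing, and example data are illustrative choices rather than the exact setup used in the paper.

```python
# Minimal sketch of URL classification with per-class character n-gram
# language models. The class names, the trigram order, and the add-one
# smoothing are illustrative choices, not necessarily those of the paper.
import math
from collections import defaultdict

N = 3  # character trigrams

def ngrams(url, n=N):
    padded = "^" * (n - 1) + url.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class CharNgramLM:
    def __init__(self):
        self.counts = defaultdict(int)   # counts of full n-grams
        self.context = defaultdict(int)  # counts of (n-1)-gram contexts
        self.vocab = set()

    def train(self, urls):
        for url in urls:
            for g in ngrams(url):
                self.counts[g] += 1
                self.context[g[:-1]] += 1
                self.vocab.add(g[-1])

    def log_prob(self, url):
        # Add-one (Laplace) smoothing keeps unseen sub-sequences from
        # receiving zero probability.
        v = len(self.vocab) or 1
        return sum(
            math.log((self.counts[g] + 1) / (self.context[g[:-1]] + v))
            for g in ngrams(url)
        )

def classify(url, models):
    # Pick the class whose language model scores the URL highest.
    return max(models, key=lambda label: models[label].log_prob(url))

# Hypothetical training data.
training = {
    "news":     ["bbc.co.uk/news/world", "cnn.com/politics"],
    "shopping": ["amazon.com/dp/b00x", "ebay.com/itm/12345"],
}
models = {label: CharNgramLM() for label in training}
for label, urls in training.items():
    models[label].train(urls)

print(classify("reuters.com/world/europe", models))
```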

Paper published in KDIR 2014


URL-Based Web Page Classification Using n-Gram Language Models (MSc. Dissertation / University of East Anglia)

In today’s world, millions of web links are shared every day in emails and on social media websites. Thus, there are a number of contexts in which it is important to have an efficient and reliable way to classify a web page by its Uniform Resource Locator (URL), without the need to visit the page itself. For example, a social media website may need to quickly identify status updates linking to malicious websites in order to block them. Additionally, the classification results can be used in market research to predict users’ preferences and interests. The target of this research is therefore to classify web pages using their URLs only.

The n-gram language model (LM) is shown here to be more scalable to large datasets than existing approaches. As opposed to some of the existing approaches, no feature extraction is required. The results presented here show classification performance equivalent to previous successful approaches, if not better. The model also allows for better estimation of unseen sub-sequences in the URLs.

Aug. 2013 / Dissertation, Dissertation Slides (Interactive), Dissertation Slides (PDF)


Mining Caravan Insurance Database

Caravan Insurance collected a set of 5000 records for their customers (the dataset can be downloaded from this link). The dataset is composed of 85 attributes, plus an extra label stating whether or not a customer purchased their mobile home insurance policy. In this technical report we try to infer a classification model from this data, so that Caravan can tailor their email campaign to contact the customers who are most likely to purchase this insurance policy. Artificial Neural Networks (MLP/Weka) and Decision Trees (J48/Weka) were used. I wrote this report as part of my MSc. degree in data mining at the University of East Anglia.
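
The report itself uses Weka's MLP and J48 implementations. Purely as an illustration, an analogous pipeline in Python with scikit-learn might look like the sketch below; the file name and the assumption that the label sits in the last column are hypothetical.

```python
# Rough scikit-learn analogue of the Weka pipeline described above (the
# report used Weka's MLP and J48). The CSV file name and the assumption
# that the last column is the purchase label are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("caravan.csv")             # 85 attributes + label
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

for model in (MLPClassifier(max_iter=500), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```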

Mar. 2013 / Unpublished


Survey on Feature Selection

Feature selection plays an important role in the data mining process. It is needed to deal with an excessive number of features, which can become a computational burden on the learning algorithms. It is also useful even when computational resources are not scarce, since it improves the accuracy of machine learning tasks, as shown in the report. In this review, we discuss the different feature selection approaches, the relation between them and the various machine learning algorithms, and compare the existing approaches with one another. I wrote this report as part of my MSc. degree in data mining at the University of East Anglia.
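
As a small illustration of one family of methods discussed in the report, here is a filter-style selection step (ranking features by mutual information) sketched with scikit-learn; the dataset and the choice of ten features are arbitrary.

```python
# A minimal example of a filter-style feature selection method (mutual
# information ranking), shown with scikit-learn purely for illustration;
# the dataset and k=10 are arbitrary choices.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)   # keep the 10 highest-scoring features

print(X.shape, "->", X_reduced.shape)      # (569, 30) -> (569, 10)
```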

Feb. 2013 / Unpublished


Survey on Time-Series Data Classification

Time-series (or sequential) data are everywhere. They are important in stock market analysis, economics, sales forecasting, and the study of natural phenomena such as temperature and wind speed. The growing size of such data, as well as its variable statistical nature, makes prediction and classification a challenging problem for data mining algorithms. In this report, I focus on time-series data classification, shedding light on the research done in this area. I wrote this report as part of my MSc. degree in data mining at the University of East Anglia.
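
One common baseline in this literature is one-nearest-neighbour classification with a simple distance measure such as the Euclidean distance; the sketch below uses synthetic series purely for illustration.

```python
# 1-nearest-neighbour time-series classification with Euclidean distance,
# a common baseline in the surveyed literature. The synthetic series below
# are purely illustrative.
import numpy as np

def nn_classify(query, train_series, train_labels):
    # Assign the label of the closest training series (Euclidean distance).
    dists = [np.linalg.norm(query - s) for s in train_series]
    return train_labels[int(np.argmin(dists))]

t = np.linspace(0, 2 * np.pi, 50)
train_series = [np.sin(t), np.sin(t) + 0.1, np.cos(t), np.cos(t) - 0.1]
train_labels = ["sine", "sine", "cosine", "cosine"]

query = np.sin(t) + 0.05 * np.random.randn(50)
print(nn_classify(query, train_series, train_labels))   # expected: "sine"
```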

Jan. 2013 / Unpublished


Information Retrieval

Language Identification: Do You Speak London?

Do You Speak London (dysl) is a command-line tool and Python library for natural language identification, also known as LangID, using a character-based n-gram language model. It currently comes pre-packaged with training data for four languages: English, Arabic, Spanish, and Portuguese. However, you can easily re-train it on your own dataset. This paper explains the basic architecture of the library as well as its theoretical background.

Mar. 2014 / Unpublished DYSL


Opinion Spam: Issues and Techniques

With the rise of online business, consumers nowadays are not only able to do their shopping online, but can also leave reviews on the products they purchase for other potential buyers to see. This creates an incentive for vendors to try to influence consumers' decisions by injecting deceptive product reviews online. Much effort has recently gone into developing algorithms to detect such deceptive opinion spam. Throughout this report, we shed light on some of the work done in this area. I wrote this report as part of my MSc. degree in data mining at the University of East Anglia.
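
To give a concrete sense of what such detection algorithms can look like, here is a toy supervised text-classification baseline (bag-of-words features with a Naive Bayes classifier); the reviews and labels are invented, and the systems surveyed in the report rely on richer features and models.

```python
# A toy supervised baseline for deceptive-review detection: bag-of-words
# features with a Naive Bayes classifier. The reviews and labels below are
# invented; the systems surveyed in the report use richer features and models.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = [
    "Absolutely amazing, best product ever, buy it now!!!",
    "The battery lasts two days and the screen is sharp.",
    "Five stars, unbelievable quality, everyone must own this!",
    "Decent camera, but the speaker is a bit quiet.",
]
labels = ["deceptive", "truthful", "deceptive", "truthful"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(reviews, labels)

print(clf.predict(["Incredible, life changing, tell all your friends!"]))
```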

Oct. 2012 / Unpublished IRLib


Human-Computer Interaction

Usability Evaluation of Travel Websites

Usability is defined by ISO/IEC 9241 as the extent to which software products satisfy users' needs in an effective and efficient manner. In this study we introduce the various sets of usability evaluation and design guidelines available today. We then apply a subset of those evaluation guidelines to three accommodation-booking websites, and attempt to offer an alternative design that addresses the deficiencies found in our evaluation. I wrote this report as part of my MSc. degree at the University of East Anglia.

Nov. 2012 / Unpublished


Mobile Observations

This report sheds light on ongoing trends in mobile phone usage. It summarizes how people are using their phones, whether offline or online; what content they access online and how they access it; and whether there is a relationship between mobile usage and demographic differences. Finally, it reports on how businesses are responding to these trends by adapting their online presence. I wrote this report as part of my MSc. degree at the University of East Anglia.

Oct. 2012 / Unpublished Online Slides


Tutorials

How to build an Interactive Dictionary using ElasticSearch

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Beyond plain search, there are endless other uses for Elasticsearch, and here is one of them.
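
As a taste of the RESTful interface, the sketch below indexes a dictionary entry and runs a full-text query using plain HTTP requests from Python; the node address, index name, and document fields are assumptions, and the exact URL layout depends on the Elasticsearch version. The full walk-through is in the tutorial linked below.

```python
# Indexing and searching a dictionary entry through Elasticsearch's REST
# API using the plain `requests` library. The node address, index name, and
# document fields are assumptions for illustration; the `_doc` endpoint
# assumes a reasonably recent Elasticsearch version.
import requests

ES = "http://localhost:9200"

# Index a schema-free JSON document.
doc = {"word": "serendipity",
       "definition": "the occurrence of happy events by chance"}
requests.put(f"{ES}/dictionary/_doc/1", json=doc)

# Full-text search with a simple match query.
query = {"query": {"match": {"definition": "happy chance"}}}
hits = requests.post(f"{ES}/dictionary/_search", json=query).json()
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["word"])
```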

How to build an Interactive Dictionary using ElasticSearch


Scripting in Google Spreadsheet

You can do a lot of things with Google Drive via scripts; in fact, scripts can give your spreadsheets superpowers. Here, however, I will focus on a very simple scripting scenario.

Scripting in Google Spreadsheet


Pie and Donut Charts using D3.js

D3.js is a JavaScript library that is widely used in data visualisation and animation. The power and flexibility of D3.js come at the expense of a steep learning curve. Some libraries built on top of it provide numerous off-the-shelf charts to make users' lives easier; however, learning to work with D3.js itself is sometimes essential, especially when you need to create sophisticated, custom visualisations.

Pie and Donut Charts using D3.js


A Quick Intro. to NumPy

A very quick introduction to Python's NumPy library. NumPy is like a Python list on steroids: you can use it to create and manipulate multidimensional arrays and matrices.
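
For instance, the short snippet below shows the kind of array creation and element-wise arithmetic the introduction covers.

```python
# A taste of what the tutorial covers: creating and manipulating
# multidimensional arrays with NumPy.
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # a 2 x 3 matrix

print(a.shape)                 # (2, 3)
print(a * 2)                   # element-wise arithmetic, no loops needed
print(a.T @ a)                 # matrix multiplication (3 x 3 result)
```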

A Quick Intro. to NumPy


A Quick Intro. to jQuery

A very quick introduction to the jQuery framework. jQuery is a cross-platform JavaScript library designed to simplify the client-side scripting of HTML.

A Quick Intro. to jQuery


Git for Dummies Like Myself

Well, I have had an account on GitHub since 2009, but I only started using it a few years later. On the one hand, that was because I was a very lazy programmer; on the other hand, it was because Git used to confuse me. I wrote this tutorial for myself, to keep notes on all of Git's basic commands.

Git for Dummies like Myself


More Tutorials

For a full list of tutorials and tech. posts, click here.