This paper is concerned with the classification of web pages using their Uniform Resource Locators (URLs) only. There are a number of contexts today in which it is important to classify a web page efficiently and reliably from its URL alone, without the need to visit the page itself.
For example, emails or messages sent on social media may contain URLs that require automatic classification. A URL is very concise and may be composed of concatenated words, so classification using this information alone is a challenging task.
Much of the current research on URL-based classification has achieved reasonable accuracy, but existing methods do not scale well to large datasets. In this paper, we propose a new solution based on an n-gram language model. Our solution shows good classification performance and scales to larger datasets. It also allows us to tackle the problem of classifying new URLs containing unseen sub-sequences.
In today’s world, millions of web links are shared every day in emails and on social media sites. There are therefore a number of contexts in which it is important to have an efficient and reliable way to classify a web page by its Uniform Resource Locator (URL), without the need to visit the page itself. For example, a social media website may need to quickly identify status updates linking to malicious websites in order to block them. It can also use the classification results in marketing research to predict users’ preferences and interests. The goal of this research is therefore to classify web pages using their URLs only.
The n-gram language model (LM) is shown here to be more scalable to large datasets than existing approaches. Unlike some of the existing approaches, no feature extraction is required. The results presented here show classification performance equivalent to previous successful approaches, if not better. The model also provides better probability estimates for unseen sub-sequences in URLs.
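To illustrate the general technique (a sketch, not the paper's actual implementation), the snippet below builds one character n-gram language model per class and scores a URL under each; add-one (Laplace) smoothing is what gives non-zero probability to unseen sub-sequences. All class labels and URLs here are invented for the example.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramUrlClassifier:
    """One character n-gram LM per class, with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)   # class -> n-gram counts
        self.totals = Counter()              # class -> total n-grams seen
        self.vocab = set()                   # all n-grams ever seen

    def train(self, url, label):
        grams = char_ngrams(url.lower(), self.n)
        self.counts[label].update(grams)
        self.totals[label] += len(grams)
        self.vocab.update(grams)

    def log_prob(self, url, label):
        # Add-one smoothing: unseen n-grams still get a small probability.
        v = len(self.vocab) + 1
        return sum(
            math.log((self.counts[label][g] + 1) / (self.totals[label] + v))
            for g in char_ngrams(url.lower(), self.n)
        )

    def classify(self, url):
        return max(self.counts, key=lambda c: self.log_prob(url, c))

# Toy example with made-up URLs and labels.
clf = NgramUrlClassifier(n=3)
clf.train("bbc.com/sport/football/results", "sports")
clf.train("espn.com/nba/scores", "sports")
clf.train("reuters.com/markets/stocks", "finance")
clf.train("bloomberg.com/news/economy", "finance")

print(clf.classify("skysports.com/football"))   # → sports
```

Note that no hand-crafted features are extracted: the raw character sequence of the URL is the only input, which is what makes the approach cheap to apply at scale.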
Caravan Insurance collected a set of 5000 records for their customers. The dataset is composed of 85 attributes, plus a label stating whether a customer purchased their mobile home insurance policy or not. In this technical report we try to infer a classification model from this data, so Caravan can tailor their email campaign to contact the customers who are most likely to purchase this insurance policy. Artificial Neural Networks (MLP/Weka) and Decision Trees (J48/Weka) were used. I wrote this report as part of my MSc in Data Mining at the University of East Anglia.
Feature selection plays an important role in the data mining process. It is needed to deal with an excessive number of features, which can become a computational burden on learning algorithms. Even when computational resources are not scarce, it is still valuable, since it improves the accuracy of machine learning tasks, as we will see in the upcoming sections. In this review, we discuss and compare the existing feature selection approaches, and the relationship between them and the various machine learning algorithms. I wrote this report as part of my MSc in Data Mining at the University of East Anglia.
Time-series (or sequential) data are everywhere. They are important in stock market analysis, economics, sales forecasting, and the study of natural phenomena such as temperature and wind speed. The growing size of such data, as well as its variable statistical nature, makes prediction and classification challenging for data mining algorithms. In this report, I focus on time-series classification, shedding light on the research done in this area. I wrote this report as part of my MSc in Data Mining at the University of East Anglia.
Do You Speak London (dysl) is a command line tool and Python library for natural language identification, also known as LangID, using a character-based n-gram language model. It currently ships with training data for four languages: English, Arabic, Spanish and Portuguese; however, you can easily re-train it on your own dataset. This paper explains the basic architecture of the library as well as its theoretical background.
With the rise of online business, consumers nowadays are not only able to do their shopping online, but they can also leave reviews on the products they purchase for other potential buyers to see. This creates the potential for vendors to influence consumers' decisions by injecting deceptive product reviews online. Much effort has recently gone into developing algorithms to detect such deceptive opinion spam. Throughout this report, we shed light on some of the work done in this area. I wrote this report as part of my MSc in Data Mining at the University of East Anglia.
Usability is defined by ISO 9241 as the extent to which software products satisfy users' needs in an effective and efficient manner. In this study we introduce the various sets of usability evaluation and design guidelines available today. We then apply a subset of those evaluation guidelines to three accommodation booking websites, and attempt to offer an alternative design that addresses the deficiencies found in our evaluation. I wrote this report as part of my MSc at the University of East Anglia.
This report sheds light on ongoing trends in mobile phone usage. It summarizes how people are using their phones, both offline and online: what content they access, how they access it, and whether there is a relationship between mobile usage and demographic differences. Finally, it reports on how businesses are responding to these trends by adapting their online presence. I wrote this report as part of my MSc at the University of East Anglia.
Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Beyond that, however, there are endless uses of Elasticsearch, and here is one of them.
You can do a lot of things with Google Drive via scripts. In fact, scripts can give your spreadsheets superpowers. Here, however, I will focus on a very simple scripting scenario.
A very quick introduction to Python's NumPy library. NumPy is a Python list on steroids: you can use it to create multidimensional arrays and matrices.
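A small taste of what the introduction covers: NumPy arrays support elementwise arithmetic and matrix operations that plain Python lists do not.

```python
import numpy as np

# A 1-D array behaves like a list with elementwise math built in.
a = np.array([1, 2, 3, 4])
print(a * 2)          # [2 4 6 8]
print(a.sum())        # 10

# A 2-D array is a matrix; shapes and matrix products come for free.
m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]
print(m.shape)        # (2, 3)
print(m @ m.T)        # [[ 5 14], [14 50]]
```

With a plain list, `a * 2` would concatenate two copies of the list rather than doubling each element.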
Well, I have had a GitHub account since 2009, but I only started using it a few years later. On the one hand, that was because I was a very lazy programmer; on the other, it was because git used to confuse me. I wrote this tutorial for myself, to keep notes on all of git's basic commands.
For a full list of tutorials and tech posts, click here.