Data Science

Tarek Amr


Data Science

The Sexiest Job of the 21st Century

Harvard Business Review

Data Science

No Hacking: Always waiting for someone to get them data.

No Stats: Misinterpret noise as signal.

No Domain Knowledge: Overlook business drivers.

Data Science


Machine Learning

Data Science Process




Knew a Girl was Pregnant before her Father*.

Other Retailers

Sainsbury's: Stock products for loyal customers.

Tesco: Targeted online ads based on customers' profiles.

Target: Toy catalogs for customers with children.

Brands: Offers to customers of rival brands.

Nate Silver


Weighting: Recency, Sample size and Pollster rating.

Adjustment: Trendline, Party affiliation and Demographics.

Non-poll factors: Representatives and Party identification.

Regression: Election Day projection.
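The weighting step listed above can be sketched in a few lines. This is only an illustration: the half-life decay, the square-root scaling of sample size, the rating multiplier and the example polls are all assumptions made here, not the actual formulas behind the forecasts.

```python
import math

def poll_weight(days_old, sample_size, pollster_rating, half_life_days=14.0):
    """Combine recency, sample size and pollster rating into one weight.

    All three formulas are illustrative assumptions.
    """
    recency = 0.5 ** (days_old / half_life_days)  # weight halves every half_life_days
    size = math.sqrt(sample_size)                 # diminishing returns on sample size
    return recency * size * pollster_rating

def weighted_average(polls):
    """polls: list of dicts with keys share, days_old, sample_size, rating."""
    weights = [poll_weight(p["days_old"], p["sample_size"], p["rating"]) for p in polls]
    total = sum(weights)
    return sum(w * p["share"] for w, p in zip(weights, polls)) / total

# Hypothetical polls: same size and rating, one is two weeks older.
polls = [
    {"share": 52.0, "days_old": 0, "sample_size": 1600, "rating": 1.0},
    {"share": 48.0, "days_old": 14, "sample_size": 1600, "rating": 1.0},
]
estimate = weighted_average(polls)
print(round(estimate, 2))
```

The newer poll counts twice as much as the older one, so the estimate lands closer to 52 than to 48.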

Social Media

Andranik Tumasjan: Predicting Elections with Twitter*.

Panagiotis Metaxas: How (Not) To Predict Elections*.

Panagiotis Metaxas: Sentiment Analysis*.


Predict Crimes*


Open-source Intelligence (OSINT)*


Unstructured to Structured Data*

Data Science

In Financial Sector

Fraud Detection

Credit card fraud detection

Data mining: Classify and Cluster

Churn Forecasting

It can cost up to 5 times as much to make a sale to a new customer as to an existing one.

Classification* using cardholder details, account details, transactions, etc.

Cross Selling

Financial or other products.

Customers who bought this also buy that.

Target customers for this new offer.
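The "bought this, buy that" idea can be approximated by counting which products show up together for the same customer. The sketch below is one simple way to do it; the product names and baskets are made up, and a real system would likely use association rules or a recommendation engine instead.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts

def also_bought(product, baskets, top_n=3):
    """Products most often held together with `product`, most frequent first."""
    related = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == product:
            related[b] += n
        elif b == product:
            related[a] += n
    return [item for item, _ in related.most_common(top_n)]

# Hypothetical product holdings, one list per customer.
baskets = [
    ["credit card", "travel insurance"],
    ["credit card", "travel insurance", "savings account"],
    ["credit card", "savings account"],
    ["mortgage", "home insurance"],
    ["credit card", "travel insurance"],
]
suggestions = also_bought("credit card", baskets)
print(suggestions)
```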

Risk Management

Capital Risk.

Market Risk.

Operational Risk.

Data Sources


Market feeds.

Customers database.

Customer Service records.

Social Media

Machine Learning

Technical Slides Ahead


Dataset, Features and Target Class Label.

Learning Phase

Need class labels, learn from training dataset.

Classification Phase

Classify new unlabeled instances using inferred model.
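The two phases map naturally onto a `fit` step and a `predict` step. The sketch below uses a nearest-centroid classifier only because it is short; the class name, the features and the labels are assumptions for illustration.

```python
class NearestCentroid:
    """Toy classifier: remember the mean of each class, predict the closest mean."""

    def fit(self, X, y):
        # Learning phase: needs the class labels of the training dataset.
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, features):
        # Classification phase: a new instance arrives without a label.
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(self.centroids, key=lambda label: squared_distance(self.centroids[label]))

# Hypothetical training dataset: two numeric features and a class label.
X_train = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
y_train = ["low", "low", "high", "high"]

model = NearestCentroid().fit(X_train, y_train)
print(model.predict([2.0, 1.0]), model.predict([7.0, 9.0]))
```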

Decision Trees

ID3, C4.5 (J48), CHAID, etc.

Decision Trees

Building the tree from data.
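ID3-style algorithms build the tree by repeatedly picking the feature with the highest information gain. A minimal sketch of that selection step follows; the feature names and the tiny dataset are hypothetical, and the recursion that grows the full tree is left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels, features):
    """The feature a greedy tree builder would split on first."""
    return max(features, key=lambda f: information_gain(rows, labels, f))

# Hypothetical transactions: "foreign" separates the classes, "night" does not.
rows = [
    {"foreign": "yes", "night": "yes"},
    {"foreign": "yes", "night": "no"},
    {"foreign": "no", "night": "yes"},
    {"foreign": "no", "night": "no"},
]
labels = ["fraud", "fraud", "ok", "ok"]
root = best_feature(rows, labels, ["foreign", "night"])
print(root)
```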

Decision Trees

Using the tree to classify new instances.
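Once a tree exists, classifying an instance is just a walk from the root to a leaf. The nested-dict representation and the feature names below are assumptions chosen for brevity.

```python
def classify(tree, instance):
    """Follow the branch matching each tested feature until a leaf is reached.

    A leaf is any value that is not a dict.
    """
    while isinstance(tree, dict):
        feature = tree["feature"]
        tree = tree["branches"][instance[feature]]
    return tree

# Hypothetical hand-written tree: test "foreign" first, then "amount".
tree = {
    "feature": "foreign",
    "branches": {
        "no": "ok",
        "yes": {
            "feature": "amount",
            "branches": {"low": "ok", "high": "fraud"},
        },
    },
}

print(classify(tree, {"foreign": "yes", "amount": "high"}))
print(classify(tree, {"foreign": "no", "amount": "high"}))
```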

Linear Classifiers

Logistic regression, SVM (linear only when used with a linear kernel)
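A linear classifier learns one weight per feature plus a bias and splits the space with a straight boundary. Below is a rough sketch of logistic regression trained with plain batch gradient descent; the learning rate, the number of epochs and the toy AND-style dataset are assumptions, and SVM is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, features):
    """Signed distance-like score: the linear part w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, features)) + b

def train_logistic(X, y, lr=0.1, epochs=5000):
    """Batch gradient descent on the logistic (log) loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for features, target in zip(X, y):
            error = sigmoid(score(w, b, features)) - target
            for i, xi in enumerate(features):
                grad_w[i] += error * xi
            grad_b += error
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, features):
    """Class 1 on one side of the linear boundary, class 0 on the other."""
    return 1 if score(w, b, features) >= 0 else 0

# Hypothetical, linearly separable data: positive only when both features are 1.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 0, 1]
w, b = train_logistic(X, y)
predictions = [predict(w, b, features) for features in X]
print(predictions)
```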


Local Decision Boundaries.
Instance-based learning, lazy learner.


More immune to noise and outliers.
Discretization and Distance measures.
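A k-nearest-neighbours classifier is one common instance-based, lazy learner: nothing is trained up front, and the work happens at query time. The sketch below assumes Euclidean distance and a tiny made-up dataset with one mislabelled point, to show how a larger k softens the effect of noise.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3, distance=euclidean):
    """train: list of (features, label) pairs. Majority vote among the k nearest."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical data: the last point is a noisy "b" sitting inside the "a" group.
train = [
    ((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"),
    ((8.0, 8.0), "b"), ((9.0, 8.0), "b"),
    ((1.5, 1.5), "b"),
]
query = (1.4, 1.4)
print(knn_predict(train, query, k=1))  # follows the single noisy neighbour
print(knn_predict(train, query, k=3))  # the vote of three neighbours overrides it
```

Swapping the `distance` argument is where a different distance measure would plug in.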

Text Classification

Naive Bayes
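For text, Naive Bayes scores each class by its prior plus the log-probability of every word, treating words as independent. A minimal multinomial sketch with add-one smoothing is shown below; the whitespace tokenizer, the labels and the four training messages are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns the counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled messages.
docs = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda today", "ham"),
    ("project meeting notes", "ham"),
]
model = train_nb(docs)
print(predict_nb(model, "cheap money"), predict_nb(model, "meeting today"))
```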



No labels.


Explore the data.

Clustering Algorithms

Centroid-based, K-Means

Hierarchical clustering (Top-down or Bottom-up)
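The centroid-based approach can be sketched as the usual K-Means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. Seeding with the first k points and running a fixed number of iterations are simplifying assumptions; the data is made up.

```python
def kmeans(points, k, iterations=10):
    """Return (centroids, clusters) after a fixed number of K-Means iterations."""
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters

# Hypothetical unlabeled points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```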


More Machine Learning

Regression analysis.

Association rules.

Semi-supervised learning.

Recommendation Engines.


Linear Regression, to predict values.
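For a single input variable, the least-squares line has a closed form, so predicting a value takes only a few lines. The numbers below are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the value for an unseen x
print(slope, intercept, prediction)
```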

Data Mining Process

Business Understanding

You solve problems.

They can be Business Problems or Academic Research Problems.

Data Acquisition

Extract, Transform and Load.

Data Cleaning

Garbage in, garbage out.

Tools: OpenRefine, Spreadsheets, Python, etc.

Feature Extraction

Edge detection (Images)

Latent semantic analysis (Text)

Principal component analysis (de-correlation)

Domain Knowledge

Deep Learning (FE+Model)

Feature Selection

Bias–variance tradeoff

The curse of dimensionality


Not this!

Machine Learning*


Science = Hypothesis testing!

Training Dataset

Test Dataset

Holdout Dataset
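Keeping the three datasets separate is what makes the hypothesis test honest: the model is fitted on one part, tuned against another, and checked once on data it has never seen. A possible split helper is sketched below; the 60/20/20 proportions and the fixed seed are assumptions.

```python
import random

def three_way_split(rows, train=0.6, test=0.2, seed=0):
    """Shuffle once, then cut into training, test and holdout parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # whatever is left becomes the holdout
    )

rows = list(range(100))  # stand-in for 100 labelled instances
train_set, test_set, holdout_set = three_way_split(rows)
print(len(train_set), len(test_set), len(holdout_set))
```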


Cross Industry Standard Process for Data Mining.


Descriptive Statistics

Predictive Statistics

Breiman's 2 Cultures

Given x, find y.

Breiman's 2 Cultures*

Data generated by a given stochastic data model.

Model parameters estimated from the data.

E.g. Linear regression*, Logistic regression, etc.

Breiman's 2 Cultures*

Considers the box complex and unknown.

Algorithm predicts y from x.

E.g. K-NN*, Decision trees and Neural Networks.


Feel free to contact me if you have any comments about the ideas mentioned here, or if you would like help applying them in any of your projects.

Contact me!

Published under Creative Commons license