Data Science


Tarek Amr


@gr33ndata

Data Science


The Sexiest Job of the 21st Century


Harvard Business Review

Data Science


No Hacking: Always waiting for someone to get them data.

No Stats: Misinterpret noise as signal.

No Domain Knowledge: Overlook business drivers.

Data Science


Index


Machine Learning

Data Science Process

Netflix


Zite



Target


Knew a girl was pregnant before her father did*.

Other Retailers


Sainsbury's: Stock products for loyal customers.

Tesco: Targeted online ads based on customers' profiles.

Target: Toy catalogs for customers with children.

Brands: Offers to customers of rival brands.

Nate Silver


Methodology*


Weighting: Recency, Sample size and Pollster rating.

Adjustment: Trendline, Party affiliation and Demographics.

Non-poll factors: Representatives and Party identification.

Regression: Election Day projection.

Social Media


Andranik Tumasjan: Predicting Elections with Twitter*.

Panagiotis Metaxas: How (Not) To Predict Elections*.

Panagiotis Metaxas: Sentiment Analysis*.

Police



Predict Crimes*

CIA



Open-source Intelligence (OSINT)*

NLP



Unstructured to Structured Data*

Data Science


In Financial Sector


Fraud Detection


Credit card fraud detection

Data mining: Classify and Cluster

Churn Forecasting


It costs up to 5 times as much to make a sale to a new customer as it does to an existing one.

Classification* using cardholder details, account details, transactions, etc.

Cross Selling


Financial or other products.

Customers who bought this buy that.

Target customers for this new offer.
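
The "customers who bought this buy that" idea can be sketched by counting how often products appear together. This is only an illustration: the baskets and product names below are made up, and `confidence` is a hypothetical helper, not part of any library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical product holdings, one set per customer.
baskets = [
    {"current account", "credit card"},
    {"current account", "credit card", "car loan"},
    {"current account", "savings account"},
    {"credit card", "car loan"},
]

# How many customers hold each product, and each pair of products.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def confidence(bought, suggested):
    """Share of customers holding `bought` who also hold `suggested`."""
    pair = tuple(sorted((bought, suggested)))
    return pair_counts[pair] / item_counts[bought]

loan_to_card = confidence("car loan", "credit card")
card_to_loan = confidence("credit card", "car loan")
```

In this toy data every car-loan customer also has a credit card, while only two of the three credit-card customers have a car loan, so the rule is stronger in one direction than in the other.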

Risk Management


Capital Risk.

Market Risk.

Operational Risk.

Data Sources


Transactions.

Market feeds.

Customers database.

Customer Service records.

Social Media

Machine Learning


Technical Slides Ahead


Classification



Dataset, Features and Target Class Label.

Learning Phase



Need class labels, learn from training dataset.

Classification Phase



Classify new unlabeled instances using inferred model.

Decision Trees



ID3, C4.5 (J48), CHAID, etc.

Decision Trees



Building the tree from data.
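
A minimal sketch of how an ID3-style algorithm picks the feature to split on: it measures the entropy of the class labels and chooses the feature with the largest information gain. The toy rows, the feature names and the function names are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy dataset.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

# The root of the tree tests the feature with the highest gain.
best = max(["outlook", "windy"], key=lambda f: information_gain(rows, labels, f))
```

Here `outlook` separates the two classes perfectly, so it is chosen as the root; a full tree builder would repeat the same choice recursively on each branch.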

Decision Trees



Using the tree to classify new instances.
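
Classifying with a finished tree is just a walk from the root to a leaf. The sketch below assumes a hypothetical representation in which internal nodes are dicts and leaves are class labels; real libraries store trees differently.

```python
# Hypothetical tree: internal nodes are dicts, leaves are class labels.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": "play",
        "rainy": {"feature": "windy", "branches": {"yes": "stay", "no": "play"}},
    },
}

def classify(node, instance):
    """Follow the branch matching each tested feature until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["feature"]]]
    return node

label = classify(tree, {"outlook": "rainy", "windy": "yes"})
```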

Linear Classifiers



Logistic regression, SVM (linear only when used with a linear kernel).

1-NN



Local Decision Boundaries.
Instance-based learning, lazy learner.

k-NN



More immune to noise and outliers.
Discretization and distance measures.
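
A rough sketch of k-NN with Euclidean distance, showing why a larger k is less sensitive to a single noisy point than 1-NN. The points, the labels and the function name `knn_predict` are invented for the example.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        zip(train_points, train_labels), key=lambda pair: dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: the last point sits in the "a" region but is labeled "b".
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (1.1, 1.0)]
labels = ["a", "a", "a", "b", "b", "b"]

query = (1.1, 1.02)
one_nn = knn_predict(points, labels, query, k=1)    # follows the noisy neighbor
three_nn = knn_predict(points, labels, query, k=3)  # outvoted by its neighbors
```

Nothing is learned ahead of time: all the work happens when a query arrives, which is what "lazy learner" refers to.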

Text Classification


Naive Bayes

SVM
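
As an illustration of Naive Bayes on text, here is a small multinomial version with add-one smoothing. The four documents, the spam/ham labels and the function names are assumptions for the example, and words never seen in training are simply ignored.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(docs, labels):
    """Count words per class; return the counts needed for prediction."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_naive_bayes(model, text):
    """Return the class with the highest log-probability for `text`."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = log(class_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # skip words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training documents.
docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(docs, labels)
first = predict_naive_bayes(model, "cheap watches now")
second = predict_naive_bayes(model, "monday meeting")
```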

Clustering



No labels.

Clustering



Explore the data.

Clustering Algorithms


Centroid-based, K-Means

Hierarchical clustering (Top-down or Bottom-up)

Density-based.
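
For the centroid-based case, a bare-bones k-means loop might look like the sketch below. To keep it deterministic it starts the centroids at the first k points, which is an assumption of this example; practical implementations usually pick starting points more carefully.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return centroids, clusters

# Hypothetical unlabeled points forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.2), (5.2, 4.8), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
```

No labels are involved: the two groups come out of the distances alone.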

More Machine Learning


Regression analysis.

Association rules.

Semi-supervised learning.

Recommendation Engines.

Regression



Linear Regression, to predict values.
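
A small sketch of simple linear regression with one input variable, fitted by ordinary least squares. The data points are made up so that they lie exactly on y = 2x + 1, and `fit_line` is a name chosen for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predicted value for an unseen x
```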

Data Mining Process

Business Understanding


You solve problems.

They can be Business Problems or Academic Research Problems.

Data Acquisition



Extract, Transform and Load.

Data Cleaning



Garbage in, garbage out.


Tools: OpenRefine, Spreadsheets, Python, etc.

Feature Extraction


Edge detection (Images)

Latent semantic analysis (Text)

Principal component analysis (de-correlation)

Domain Knowledge

Deep Learning (FE+Model)

Feature Selection


Bias–variance tradeoff


The curse of dimensionality

Modelling



Not this!

Machine Learning*

Evaluation


Science = Hypothesis testing!



Training Dataset

Test Dataset

Holdout Dataset
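
One possible way to cut a dataset into the three parts listed above. The 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions of this sketch, not a recommendation from the slides.

```python
import random

def split_dataset(rows, train=0.6, test=0.2, seed=42):
    """Shuffle once, then cut into training, test and holdout parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    train_end = round(len(shuffled) * train)
    test_end = train_end + round(len(shuffled) * test)
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]

# Hypothetical dataset of 100 row ids.
training, testing, holdout = split_dataset(range(100))
```

The three parts never overlap, so a model fitted on the training rows is always evaluated on rows it has not seen.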

CRISP-DM



Cross Industry Standard Process for Data Mining.

Statistics


Descriptive Statistics

Predictive Statistics

Breiman's 2 Cultures



Given x, find y.

Breiman's 2 Cultures*


Data generated by a given stochastic data model.

Model parameters estimated from the data

E.g. Linear regression*, Logistic regression, etc.

Breiman's 2 Cultures*


Considers the box complex and unknown.

Algorithm predicts y from x.

E.g. K-NN*, Decision trees and Neural Networks.

Conclusion


Feel free to contact me if you have any comments about the ideas mentioned here, or if you would like help applying them in any of your projects.


Contact me!

Published under Creative Commons license