The Sexiest Job of the 21st Century
No Hacking: Always waiting for someone to get them data.
No Stats: Misinterpret noise as signal.
No Domain Knowledge: Overlook business drivers.
Target knew a girl was pregnant before her father did*.
Sainsbury's: Stock products for loyal customers.
Tesco: Targeted online ads based on the customer's profile.
Target: Toy catalogs for customers with children.
Brands: Offers to customers of rival brands.
Weighting: Recency, Sample size and Pollster rating.
Adjustment: Trendline, Party affiliation and Demographics.
Non-poll factors: Representatives and Party identification.
Regression: Election Day projection.
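The weighting step can be sketched as a weighted average of polls. This is only an illustration: the half-life decay, the square-root sample-size term, the 0-to-1 pollster rating and the two polls themselves are all assumptions made for the example, not the formula or data of any published forecasting model.

```python
import math

def poll_weight(days_old, sample_size, pollster_rating, half_life_days=14.0):
    """Combine the three weighting factors into one number (illustrative only)."""
    recency = 0.5 ** (days_old / half_life_days)  # weight halves every half_life_days
    size = math.sqrt(sample_size)                 # sampling error shrinks with sqrt(n)
    return recency * size * pollster_rating       # rating assumed to lie in [0, 1]

def weighted_average(polls):
    """Weighted mean of the candidate's share across all polls."""
    weights = [poll_weight(p["days_old"], p["sample_size"], p["rating"]) for p in polls]
    return sum(w * p["share"] for w, p in zip(weights, polls)) / sum(weights)

# Hypothetical polls: a fresh, large, well-rated one and an old, small, poorly rated one.
polls = [
    {"share": 52.0, "days_old": 1, "sample_size": 1000, "rating": 1.0},
    {"share": 48.0, "days_old": 28, "sample_size": 400, "rating": 0.5},
]
estimate = weighted_average(polls)
print(round(estimate, 2))
```

The estimate lands close to the recent poll because that poll dominates on all three factors.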
Andranik Tumasjan: Predicting Elections with Twitter*.
Panagiotis Metaxas: How (Not) To Predict Elections*.
Panagiotis Metaxas: Sentiment Analysis*.
Open-source Intelligence (OSINT)*
Unstructured to Structured Data*
Credit card fraud detection
Data mining: Classify and Cluster
It costs up to five times as much to make a sale to a new customer as to an existing one.
Classification* using cardholder details, account details, transactions, etc.
Financial or other products.
Customers who bought this buy that.
Target customers for this new offer.
Customer Service records.
Need class labels, learn from training dataset.
Classify new unlabeled instances using inferred model.
ID3, C4.5 (J48), CHAID, etc.
Building the tree from data.
Using the tree to classify new instances.
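Building the tree comes down to repeatedly choosing the attribute to split on. A minimal sketch of the ID3 criterion (information gain) follows; C4.5 and CHAID use different criteria (gain ratio and chi-square tests respectively). The four transactions and the attribute names `amount` and `abroad` are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical training data: two attributes, two classes.
rows = [
    {"amount": "high", "abroad": "yes"},
    {"amount": "high", "abroad": "no"},
    {"amount": "low", "abroad": "yes"},
    {"amount": "low", "abroad": "no"},
]
labels = ["fraud", "fraud", "ok", "ok"]

# ID3 puts the attribute with the highest gain at the root of the tree.
best = max(["amount", "abroad"], key=lambda a: information_gain(rows, labels, a))
print(best)
```

Here `amount` separates the classes perfectly while `abroad` tells us nothing, so `amount` becomes the root.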
Logistic regression, SVM (linear only with a linear kernel).
Local Decision Boundaries.
Instance-based learning, lazy learner.
More immune to noise and outliers.
Discretization and Distance measures.
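The points above describe k-nearest neighbours (K-NN). A minimal sketch, assuming Euclidean distance and a simple majority vote, is shown here; the toy points, including one deliberately mislabeled point, are invented to show why a larger k is less sensitive to noise.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, query, k=3, distance=euclidean):
    """Lazy learner: no model is built, the training set is scanned per query."""
    neighbours = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical data: class A near (1, 1), class B near (5, 5),
# plus one noisy "B" sitting inside the A region.
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"),
    ((1.1, 1.0), "B"),  # noise
]

query = (1.1, 0.95)
print(knn_classify(training, query, k=1))  # follows the single noisy neighbour
print(knn_classify(training, query, k=3))  # majority vote outvotes the noise
```

The `distance` parameter is where a different distance measure would be plugged in, for example one suited to discretized attributes.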
Explore the data.
Hierarchical clustering (Top-down or Bottom-up)
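The bottom-up variant can be sketched in a few lines: start with every point in its own cluster and keep merging the two closest clusters. The sketch assumes one-dimensional points and single linkage (distance between the closest members), both chosen only to keep the example short.

```python
def single_linkage(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, n_clusters):
    """Bottom-up clustering: merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]

result = agglomerative([1.0, 1.5, 2.0, 10.0, 10.5, 20.0], 3)
print(result)
```

Recording the order of the merges instead of stopping at a fixed count would give the full hierarchy (the dendrogram).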
You solve problems.
They can be Business Problems or Academic Research Problems.
Extract, Transform and Load.
Garbage in, garbage out.
Tools: OpenRefine, Spreadsheets, Python, etc.
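A minimal Extract, Transform and Load sketch in plain Python might look like the following. The CSV content, the column names and the `sales` table are all hypothetical; the cleaning rules (trim whitespace, normalise capitalisation, drop rows with a missing amount) are just examples of keeping garbage out.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with stray spaces, mixed case and a missing value.
RAW = """name,city,amount
 Alice ,cairo,10.5
Bob,CAIRO,
carol,Giza,7
"""

def extract(text):
    """Read CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Clean each row; rows without an amount are dropped."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        clean.append((row["name"].strip().title(), row["city"].strip().title(), float(amount)))
    return clean

def load(rows):
    """Insert the cleaned rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn

conn = load(transform(extract(RAW)))
totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```

Without the transform step, `cairo` and `CAIRO` would be counted as two different cities.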
Edge detection (Images)
Latent semantic analysis (Text)
Principal component analysis (de-correlation)
Deep Learning (FE+Model)
The curse of dimensionality
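One way to get a feel for the curse of dimensionality: if data are spread uniformly over a unit hypercube, how long must each side of a sub-cube be to capture a given fraction of the data? The uniform-data assumption and the function name `edge_length` are choices made for this illustration.

```python
def edge_length(fraction, dimensions):
    """Side of a sub-cube holding `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dimensions)

# To capture 10% of uniformly spread data:
for d in (1, 2, 10, 100):
    print(d, round(edge_length(0.1, d), 3))
```

In one dimension, 10% of the range is enough. In ten dimensions the sub-cube must span about 79% of every axis, so a "local" neighbourhood is no longer local. This is one reason to reduce the number of features before modelling.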
Science = Hypothesis testing!
Given x, find y.
Data generated by a given stochastic data model.
Model parameters estimated from the data.
E.g. Linear regression*, Logistic regression, etc.
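As an example of estimating model parameters from the data, here is a minimal sketch of simple linear regression by ordinary least squares. It assumes the data model y = slope * x + intercept + noise; the four data points are made up and lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Given a new x, the fitted model gives y.
predicted = slope * 4.0 + intercept
```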
Considers the box complex and unknown.
Algorithm predicts y from x.
E.g. K-NN*, Decision trees and Neural Networks.
Feel free to contact me if you have any comments about the ideas mentioned here, or if you would like me to help you use them in any of your projects.
Published under a Creative Commons license.