Data Wrangling


INTRODUCTION TO MACHINE LEARNING
Data Wrangling
LEARNING OBJECTIVES / LECTURE OVERVIEW
What is machine learning
Supervised learning
Unsupervised learning
Applications
Decision Trees
MATHEMATICAL NOTATION
I have tried to keep mathematical notation to a minimum, in order to focus on the theory and understanding of the algorithms.
Also, the provided code abstracts away most of the mathematical notation.
However, you should know what the following are: vectors, matrices, probabilities.
Let me know if you would like more information on these!
Intro to Supervised Learning

WHAT IS MACHINE LEARNING
Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to “learn” from data, without being explicitly programmed.
Example – classification: What class do the examples fall into?
Potential classes: animal, plant, building
MACHINE LEARNING OVERVIEW
Typically the algorithm has a (large) number of parameters (or weights) whose values are learnt from the data.
It can be applied in situations where it is very challenging (even impossible) to define rules by hand, e.g.:
Face detection
Speech recognition
Stock market price prediction

EXAMPLE 1: HAND-WRITTEN DIGIT RECOGNITION
Images are 28×28 pixels.
Each input image is represented as a vector x of length 784 (28 × 28).
Learn a classifier f(x) such that f : x → {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
HOW TO ADDRESS HAND-WRITTEN DIGIT RECOGNITION
This can be framed as a supervised classification problem.
Start with training data, e.g. 6000 examples of each digit.
Currently used for recognition of credit card numbers and names.
Example: https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/
Image from macrumors.com
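A minimal sketch of this supervised classification setup, assuming scikit-learn is available (its bundled digits dataset uses 8×8 images, i.e. 64-length vectors, rather than the 28×28 MNIST images above, but the framing is identical):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Each row of X is one image flattened into a vector; y holds the digits 0-9.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)          # training stage
print(clf.predict(X_test[:5]))     # inference stage: predicted digits
print(clf.score(X_test, y_test))   # accuracy on unseen examples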
EXAMPLE 2: IMAGE FACE CLASSIFICATION IN THREE CLASSES
Assume that you need to classify an image into three classes:
non-face
frontal-face
profile-face
We need training data with images that fall into these three categories.
Again, this is a supervised learning problem.
SUPERVISED LEARNING
Supervised learning is the machine learning task of learning a function that maps inputs to outputs based on example input-output pairs.
Essentially, we provide an algorithm with a lot of examples, so that the next time it encounters a new example, it can infer its class.
At a very high level, a supervised learning framework has two stages: training and inference.

AT TRAINING
We provide the algorithm with a dataset/corpus: i.e. lots (thousands) of examples (input-output pairs) for our task.
Example inputs (images) are each mapped to an example output/label, e.g. cat, lion, tiger.
At training, the algorithm learns to map inputs to outputs based on the examples in the dataset.

AT INFERENCE/PREDICTION STAGE
The algorithm uses what it has learnt to predict which class a new example belongs to.
Is the new image a cat, a lion, or a tiger?
INPUTS/OUTPUTS NOTATION
A classifier will require inputs (e.g. images) and outputs (e.g. classes or labels).
In machine learning, inputs are denoted by X.
The set of N inputs is then written as X = {x1, x2, …, xN}.
The corresponding set of N outputs is denoted by Y = {y1, y2, …, yN}.

EXAMPLE 3: SPAM DETECTION
This is a classification problem.
The task is to classify email into spam/non-spam.
Input data X can be variables such as word counts, e.g. of “invest”, “banker”, …, or even short text phrases such as “business transaction”.
It requires a learning system, as fraudulent approaches keep innovating (very important nowadays, see Ransomware).
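A toy sketch of this, assuming scikit-learn (the emails and labels below are invented for illustration): word counts can be extracted automatically and fed to a classifier.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

emails = ["invest now business transaction", "meeting notes attached",
          "banker needs your account", "lunch tomorrow?"]
labels = ["spam", "non-spam", "spam", "non-spam"]

vectorizer = CountVectorizer()         # turns each email into a word-count vector
X = vectorizer.fit_transform(emails)

clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict(vectorizer.transform(["business transaction from a banker"])))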

EXAMPLE 4: STOCK MARKET PREDICTION
The task is to predict a stock's price at a future date.
This is a regression task, as the output is a continuous number.
EXAMPLE 5: MACHINE TRANSLATION
EXAMPLE 6: TEXT/OPINION/REVIEWS CLASSIFICATION (SENTIMENT ANALYSIS)
EXAMPLE 7: RECOMMENDATION SYSTEMS
TWO SUPERVISED LEARNING APPROACHES
Classification: predict classes, e.g. face recognition
Regression: predict continuous values, e.g. prices.
SUPERVISED LEARNING
(Summary slide from Zisserman.)
Unsupervised Learning
UNSUPERVISED LEARNING
Unsupervised learning is the task of learning from data that is
not associated with outputs/labels.

EXAMPLE CLUSTERING: GROUPING SIMILAR THINGS TOGETHER
TYPES OF UNSUPERVISED LEARNING
Clustering, i.e. the grouping of similar objects (see the sketch after this list)
Anomaly Detection, i.e. the identification of rare items or
events, e.g. fraud detection, medical problems etc.
Latent Variable models, such as topic modelling
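A minimal clustering sketch, assuming scikit-learn: k-means groups unlabelled 2-D points (invented here) into two clusters without ever seeing a label.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],   # one natural group
                   [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])  # another group
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)   # cluster id per point; no outputs/labels were provided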
Reinforcement Learning
Reinforcement Learning Framework
Applications include interactive games, such as chess,
dialogue systems, robotics, etc.

Supervised learning example: decision trees

DECISION TREES
Decision Trees are tree-like structures of if/else relationships that can be used for decision making.
Binary Classification, i.e. two potential classes, for instance, yes or no.
There is also multi-class classification, for instance, classification of ten objects.
DECISION TREE FOR PLAY GOLF
The example tree is built from an input data table; its root node splits on Outlook, and its leaves are the Yes/No decisions.
VARIABLES
Outlook, Temp, Humidity, and Windy are called predictors or variables or features, i.e. their values help the algorithm make decisions. They are the inputs to a trained classifier, and they are denoted by X.
Play Golf is the predicted variable (target), or class/label, i.e. what we try to learn (denoted by Y). It is the output of a trained classifier.

DECISION TREES IN A NUTSHELL
Decision trees build classification models in the form of a tree structure.
They break down a dataset into smaller subsets while, at the same time, an associated decision tree is incrementally developed.
The final result is a tree with decision nodes and leaf nodes.
Decision trees can handle both categorical and numerical data.

ID3 ALGORITHM
The core algorithm for building decision trees called ID3 by Quinlan
(
original paper: http://hunch.net/~coms-4771/quinlan.pdf) which
employs a top-down, greedy search through the space of possible
branches with no backtracking.
ID3 uses Entropy and Information Gain to construct a decision tree.
There are other types of decision trees that create nodes and leaves in a
different way, such as ZeroR, OneR and naive Bayes.

ENTROPY
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous).
The ID3 algorithm uses entropy to calculate the homogeneity of the data.
If the data is completely homogeneous the entropy is zero, and if the data is equally divided between two classes, the entropy is equal to one.
To build a decision tree we need two types of entropy (next slides).
1. ENTROPY USING THE FREQUENCY TABLE OF ONE ATTRIBUTE
E(S) = −Σ p_i log2(p_i), summing over the classes i
where p_i is the proportion of each class, e.g. (p_yes, p_no), and S stands for the predicted variable.
For the Play Golf data: p_no = 5/14 ≈ 0.36, p_yes = 9/14 ≈ 0.64.
2. ENTROPY USING THE FREQUENCY TABLE OF TWO ATTRIBUTES
E(T, X) = Σ P(c) E(c), summing over the values c of X
where T stands for the predicted variable (PlayGolf), X stands for a predictor (e.g. Outlook), and c ranges over that predictor's observed values (e.g. Sunny, etc.).
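Both entropy computations can be written directly from the formulas above; a minimal sketch using only the standard library:

from math import log2
from collections import Counter

def entropy(labels):
    # E(S) = -sum over classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def entropy_two(values, labels):
    # E(T, X) = sum over values c of P(c) * E(subset where X = c)
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        total += (len(subset) / n) * entropy(subset)
    return total

play = ["Yes"] * 9 + ["No"] * 5   # the Play Golf target: 9 yes, 5 no
print(entropy(play))              # ~0.94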

INFORMATION GAIN
The information gain of an attribute is calculated by deducting the
entropy for one attribute from the overall dataset entropy.
e.g. IG(Outlook) = E(Play Golf) – E(Outlook)
Constructing a decision tree is all about finding attribute that
returns the highest information gain (i.e., the most homogeneous
branches).
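Reusing the entropy helpers sketched above, information gain is a one-liner; the Outlook column below follows the standard Play Golf dataset:

def information_gain(values, labels):
    # IG(T, X) = E(T) - E(T, X)
    return entropy(labels) - entropy_two(values, labels)

outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5
play = (["No", "No", "No", "Yes", "Yes"] + ["Yes"] * 4
        + ["Yes", "Yes", "Yes", "No", "No"])
print(information_gain(outlook, play))   # ~0.25: Outlook is a good split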

STEP 1: CALCULATE ENTROPY OF THE TARGET
E(PlayGolf) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94
STEP 2:
The dataset is then split on the different attributes.
The entropy for each branch is calculated.
The branch entropies are then added proportionally (weighted by branch size) to get the total entropy of the split.
The resulting entropy is subtracted from the entropy before the split.
The result is the Information Gain, or decrease in entropy.
STEP 3
In the next step, the algorithm chooses the attribute with the largest information
gain as the decision node, dividing the dataset by its branches and repeating the
same process on every branch.

STEP 4A
A branch with entropy of 0 is a leaf node.
STEP 4B
A branch with entropy more than 0 needs further splitting.
STEP 5
The ID3 algorithm is run recursively on the non-leaf branches, until
all data is classified.

DECISION TREES TO DECISION RULES
A decision tree can easily be transformed into a set of rules by tracing the paths from the root node to each leaf node, one by one.
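A sketch of building such a tree and printing it as rules, assuming scikit-learn and pandas (note that sklearn implements an optimised CART algorithm rather than ID3, but criterion="entropy" makes it split on information gain as described above):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Outlook": ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5,
    "Windy": [False, True] * 7,
    "Play": (["No", "No", "No", "Yes", "Yes"] + ["Yes"] * 4
             + ["Yes", "Yes", "Yes", "No", "No"]),
})
X = pd.get_dummies(df[["Outlook", "Windy"]])   # one-hot encode the categoricals
clf = DecisionTreeClassifier(criterion="entropy").fit(X, df["Play"])
print(export_text(clf, feature_names=list(X.columns)))  # tree as if/else rules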

In practice
VARIABLES
Categorical (Rainy, Sunny etc.)
Numeric (either continuous or discrete)
Binary
When they are none of the above, for instance when we have to deal with text (such as sentences), images, or other long sequences, we require more advanced approaches (e.g. natural language processing).

DISCRETISATION
In many real-world problems, a variable's values might span a wide range, such as age.
It makes sense to turn these values into more useful information, e.g. age ranges: 18-30, 31-40, etc.
Depending on the problem, you might even want to turn them into binary values.
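A discretisation sketch, assuming pandas (the bin edges and ages are illustrative):

import pandas as pd

ages = pd.Series([19, 25, 34, 52, 41, 67])
ranges = pd.cut(ages, bins=[18, 30, 40, 50, 100],
                labels=["18-30", "31-40", "41-50", "51+"])
print(ranges.tolist())   # ['18-30', '18-30', '31-40', '51+', '41-50', '51+']
senior = ages > 50       # or turn the variable into a binary one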

Evaluation (of classification models)

TRAINING-TEST SPLIT
In practice, you will want to know whether your decision tree performs well.
In order to do so, you will need to split your dataset into a training set and a test set.
You then use the training set to create the tree (more generally, to train your classifier),
and the test set to evaluate it (i.e. does it make the predictions it is supposed to make?).
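A minimal sketch of that workflow, assuming scikit-learn (its bundled iris dataset serves as stand-in data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # hold out 30% for evaluation

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # train
print(clf.score(X_test, y_test))   # evaluate on examples the tree never saw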

EVALUATION OF MACHINE LEARNING ALGORITHMS
Accuracy
Confusion Matrix
Precision
Recall
F-score
EVALUATION METRICS EXPLAINED: ACCURACY
Example: Consider a classifier that predicts whether to offer someone a loan (positive prediction) or not (negative prediction).
Accuracy: the percentage of correctly classified instances.
For example, if your test set had 50 examples of customers, of which 40 were classified correctly, then the accuracy would be 40/50 = 0.8, or 80%.

EVALUATION METRICS EXPLAINED: CONFUSION MATRIX
true positives (TP): cases in which the classifier predicted that a customer would get a loan, and they indeed did.
true negatives (TN): cases where the classifier predicted that a customer would not get a loan, and they indeed did not.
false positives (FP): cases where the classifier predicted that the customer would get a loan, but they did not (Type I error).
false negatives (FN): cases where the classifier predicted that the customer would not get a loan, but they actually did (Type II error).

                         Actually given loan    Actually not given loan    Total
Predicted: Given Loan    TP: 20                 FP: 10                     30
Predicted: Not Given     FN: 5                  TN: 15                     20
Total                    25                     25                         50

EVALUATION METRICS EXPLAINED: PRECISION
Precision: the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0.

Using the confusion matrix above: Precision = 20 / (20 + 10) ≈ 0.67

EVALUATION METRICS EXPLAINED: RECALL
Recall: the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. Recall is intuitively the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.

Using the confusion matrix above: Recall = 20 / (20 + 5) = 0.8

EVALUATION METRICS EXPLAINED: F1-SCORE
F1-score or F-score: the harmonic mean of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)

Using the confusion matrix above: F1 = 2 * (0.67 * 0.8) / (0.67 + 0.8) ≈ 0.73
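All four metrics can be recomputed from the confusion matrix counts above; a small sketch:

tp, fp, fn, tn = 20, 10, 5, 15

accuracy = (tp + tn) / (tp + fp + fn + tn)            # 35/50 = 0.70
precision = tp / (tp + fp)                            # 20/30 ≈ 0.67
recall = tp / (tp + fn)                               # 20/25 = 0.80
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.73
print(accuracy, precision, recall, f1)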

MORE ON EVALUATION (OF TRADITIONAL ML ALGORITHMS)
Two ways of evaluating:
Splitting the data into a training and a test set (e.g. 70%/30%) and reporting results on these metrics.
Using k-fold cross-validation and reporting all metrics by averaging them across the folds (a sketch follows below). The most common version is 10-fold.
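A k-fold sketch, assuming scikit-learn (iris again as stand-in data): the score is computed once per fold and the folds are then averaged.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())   # average accuracy across the 10 folds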
Which one do you think is best?
Questions?
