Predictive Analysis in Python
(Regression)
Dr Yanchao Yu
Lecturer, Edinburgh Napier University
Email: [email protected]
Homepage: https://yanchao-yu.netlify.app/
Date: Jan 2022
Data Wrangling
SET11121-02
What shall we learn today?
1. What is Predictive Analysis?
2. The Predictive Analysis Workflow
3. Regression Methods
a. Linear vs non-Linear Methods
4. Data Visualisation (Code Level)
What is Predictive Analysis?
Predictive Analysis is the process of using data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data. The goal is to go beyond knowing what has happened, to providing the best assessment of what will happen in the future.
What is Predictive Analysis? (cont.)
Who’s using Predictive Analysis?
The Predictive Analysis Workflow
How does Predictive Analysis work?
1. Define the problem to solve.
● What do you want to know about the future based on the past?
● What do you want to understand and predict?
● What decisions will be driven by the insights?
● What actions will be taken?
2. Data Preparation & Interpretation.
● Cleanse and prep the data for analysis.
● Understand which information is available in the dataset.
3. Prediction Making.
● Build the prediction model based on the data and the problem to solve.
4. How did we do?
● Test/evaluate the built prediction model using automatic evaluation metrics.
● Examine the effects of each variable.
Step 1: Define the problem to solve
● Real-world applications have domain-specific problems, for example:
○ A chief financial officer, using marketing costs to predict sales revenue.
○ A doctor, wanting to know the probability of a diagnosis based on symptoms.
○ Lecturers, wanting to know how much study time is required for students to ace an exam.
○ A house buyer, who wants to establish how the house size and number of rooms will affect the price the house would sell for.
Step 2: Data Preparation & Interpretation
● Before diving into creating any model, we need to understand the nature and characteristics of the data being analyzed:
○ Import datasets (using Pandas)
○ Perform some exploratory analysis
○ Detect patterns
○ Check the data structure
○ Plot simple statistics on the features (the columns)
See example in Code!
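A minimal sketch of these exploratory steps with Pandas. The data here is typed in directly (it is the house-price sample used later in the lecture); in practice you would load a file with pd.read_csv:

```python
import pandas as pd

# Build a small DataFrame (normally: df = pd.read_csv("houses.csv"))
df = pd.DataFrame({
    "price": [245, 312, 279, 308, 199, 219, 405, 324, 319, 255],        # $1000s
    "square_feet": [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
})

print(df.head())      # check the data structure: first few rows
print(df.dtypes)      # column types
print(df.describe())  # simple statistics on the features (the columns)
```

head, dtypes, and describe are usually enough to spot missing values, wrong types, and obvious outliers before any modelling starts.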
Step 3: Prediction Making – Regression Analysis
● A scatter plot can be used to show the relationship between two variables.
● Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
○ Correlation is only concerned with the strength of the relationship.
○ No causal effect is implied by correlation.
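A quick sketch of correlation analysis with Pandas, using the house-price sample from later in the lecture. The Pearson coefficient lies between -1 and 1 and measures only the strength of the linear association, not causation:

```python
import pandas as pd

df = pd.DataFrame({
    "square_feet": [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    "price": [245, 312, 279, 308, 199, 219, 405, 324, 319, 255],
})

# Pearson correlation coefficient between the two variables
r = df["square_feet"].corr(df["price"])
print(r)

# A scatter plot of the same two columns shows the relationship visually:
# df.plot.scatter(x="square_feet", y="price")
```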
Regression Analysis – Types of Relationships
[Scatter-plot sketches of Y against X illustrating: linear vs curvilinear relationships; strong vs weak relationships; no relationship.]
Simple Linear Regression Model
▪ Only one independent variable, X.
▪ The relationship between X and Y is described by a linear function.
▪ Changes in Y are assumed to be related to changes in X.
Simple Linear Regression Model
The population regression model:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where Yᵢ is the dependent variable, Xᵢ is the independent variable, β₀ is the population Y intercept, β₁ is the population slope coefficient, and εᵢ is the random error term. β₀ + β₁Xᵢ is the linear component; εᵢ is the random error component.
Simple Linear Regression Model
[Plot of Y against X: the observed value of Y for Xᵢ, the predicted value of Y for Xᵢ on the regression line, the random error εᵢ for this Xᵢ value, slope β₁, and intercept β₀.]
Simple Linear Regression Model
The simple linear regression equation provides an estimate of the population regression line:

Ŷᵢ = b₀ + b₁Xᵢ

where Ŷᵢ is the estimated (or predicted) Y value for observation i, b₀ is the estimate of the regression intercept, b₁ is the estimate of the regression slope, and Xᵢ is the value of X for observation i.
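The estimates b₀ and b₁ are obtained from the data by ordinary least squares: b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)², and b₀ = Ȳ − b₁X̄. A minimal NumPy sketch (the sample values here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x  # predicted Y values on the estimated regression line
```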
Example I
● A real estate agent wishes to examine the relationship
between the selling price of a home and its size (measured
in square feet)
▪ A random sample of 10 houses is selected
▪ Dependent variable (Y) = house price in $1000s
▪ Independent variable (X) = square feet
Example I

House Price in $1000s (Y) | Square Feet (X)
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700
Example I
House price model: Scatter Plot and Prediction Line
Slope = 0.10977, Intercept = 98.248
Estimated regression equation: house price = 98.248 + 0.10977 × (square feet)
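The slide's slope and intercept can be reproduced with scikit-learn's LinearRegression as a quick sanity check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The 10-house sample from the slides
X = np.array([1400, 1600, 1700, 1875, 1100, 1550,
              2350, 2450, 1425, 1700]).reshape(-1, 1)  # square feet
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])  # $1000s

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # ≈ 0.10977 and ≈ 98.248, as on the slide
```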
Polynomial Linear Regression
What happens if we know that our data is correlated, but the relationship doesn't look linear?
● Depending on what the data looks like, we can do a polynomial regression on the data to fit a polynomial equation to it.
Polynomial regression tries to fit a polynomial line so that we achieve a minimum error, or minimum cost function.
The general equation of a polynomial regression:

Y = θ₀ + θ₁X + θ₂X² + … + θₘXᵐ + residual error
Polynomial Linear Regression
Advantages:
● A polynomial can provide a good approximation of the relationship between the dependent and independent variables.
● A broad range of functions can be fit under it.
● Polynomials can fit a wide range of curvature.
Disadvantages:
● The presence of one or two outliers in the data can seriously affect the results of the nonlinear analysis.
● Polynomial fits are very sensitive to outliers.
● In addition, there are unfortunately fewer model validation tools for the detection of outliers in nonlinear regression than there are for linear regression.
How does it work with Sklearn?
See Example in Code
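One way to do this in scikit-learn is to expand X into polynomial features and then run an ordinary linear regression on them. A minimal sketch, using noiseless data generated from a known quadratic so the fit can be checked:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Data generated from Y = 1 + 2X + 0.5X² (no noise, for illustration)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2

# Expand X into [X, X²], then fit a linear model to the expanded features
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)

print(model.intercept_, model.coef_)  # recovers θ₀, θ₁, θ₂
```

The model is still "linear" in the sense that it is linear in the parameters θ; only the features are nonlinear in X.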
Evaluating the Models
There are two common metrics applied to Regression
Methods Evaluation:
● Mean Squared Error (MSE)
● R-Squared (R2)
Evaluating the Models — MSE
Mean Squared Error (MSE): the average of the squared differences between the actual values and the predicted (estimated) values:

MSE = (1/n) Σᵢ (Yᵢ − Ŷᵢ)²

Also called the mean squared deviation (MSD).
Evaluating the Models — MSE
The value of MSE is always non-negative. A value close to zero indicates a better-quality estimator/predictor (regression model).
Evaluating the Models — R2
R-Squared: the ratio of the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST): R² = SSR / SST.
● The Sum of Squares Regression (SSR) is the amount of variance explained by the regression line.
● The Sum of Squares Total (SST) is the sum of squared deviations of the points Yᵢ from the mean of Y.
Evaluating the Models — R2
The R-squared value is used to measure goodness of fit: the greater the R-squared value, the better the regression model fits.
MSE vs R2
Similarity:
● Both are metrics for evaluating the performance of regression models,
○ especially statistical models such as the linear regression model.
Difference:
● MSE depends on the scale of the target variable, which makes it hard to compare across datasets. This is where R-Squared comes to the rescue.
● R-squared represents the fraction of the variance of the response variable captured by the regression model, whereas MSE captures the residual error.
Calculating MSE & R2 using Sklearn
Here is an example of calculating MSE and R2:
● Built-in functions in the Sklearn package.
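A minimal sketch with scikit-learn's built-in metric functions. The true and predicted values below are illustrative only (the y_pred numbers are invented for the example):

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [245, 312, 279, 308, 199]   # actual values
y_pred = [251.9, 273.9, 284.9, 304.1, 219.0]  # model predictions (illustrative)

mse = mean_squared_error(y_true, y_pred)  # average squared residual
r2 = r2_score(y_true, y_pred)             # fraction of variance explained

print(mse, r2)
```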
Data Visualisation in Python
Python Packages
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
● Created by John D. Hunter.
● Open source and free to use.
● Written in Python.
https://github.com/matplotlib/matplotlib
Seaborn is a Python data visualization library based on matplotlib.
▪ High-level interface for drawing attractive and informative statistical graphics.
https://seaborn.pydata.org/
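A minimal matplotlib sketch, reproducing the house-price scatter plot from the earlier example and saving it to a file (the filename is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

fig, ax = plt.subplots()
ax.scatter(square_feet, price)
ax.set_xlabel("Square Feet (X)")
ax.set_ylabel("House Price in $1000s (Y)")
ax.set_title("House price model: Scatter Plot")
fig.savefig("house_prices.png")
```

Seaborn offers the same plot at a higher level, e.g. sns.scatterplot(x=..., y=...), with nicer default styling.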