Predictive Analytics


BUS5PA Predictive Analytics – 2023
BUS5PA Predictive Analytics – Semester 1, 2023
Assignment 2: Building and Evaluating Predictive Models using SAS Enterprise Miner
Release Date: 14th April 2023
Due Date: Before Sunday 7th May, 11.55 pm – Individual Report
Weight: 40%
Format of Submission: A report in Word or PDF format, no longer than 20 pages, submitted electronically
before the deadline (font: Times New Roman, size 12).
Objective:
a) Demonstrate knowledge of building different types of predictive models using SAS Enterprise
Miner
b) Demonstrate skill and knowledge in applying predictive models to a real-life predictive analytics task
c) Relate theoretical knowledge of predictive models and best practices to application scenarios.
Business Case – Predictive Model for Property Price Prediction
A real estate company in Melbourne is updating its property (housing) price assessment method. The
management of the company wants to build a property price estimation system to help sellers sell their
properties at the best price.
The company management is very keen to trial predictive modeling for this task and has gathered a
historical property sales dataset. The dataset contains 18 variables describing previously sold properties. The
attributes include the selling price of the property, the year it was built, the year it was sold, the number
of bedrooms, the number of bathrooms, the number of car spots, etc. The list of attributes and their
descriptions is given below (a more detailed description can be found in data_description.txt).

Variable        Description
id              Unique ID of the record
Type            Property type
Price           Selling price of the property
Method          Way of selling
MonthSold       Month sold
YearSold        Year sold
Distance        Distance from CBD in kilometres
SuburbSafety    Safety rating of the suburb based on crime rate
Bedroom         Number of bedrooms
Bathroom        Number of bathrooms
Car             Number of car spots
Landsize        Land size in metres
BuildingArea    Building size in metres
YearBuilt       Year the house was built
CouncilRating   Rating of the governing council
Lattitude       Self-explanatory
Longtitude      Self-explanatory
Regionname      General region (West, North West, North, North East, etc.)
Propertycount   Number of properties that exist in the suburb

The management of the real estate company is considering outsourcing this task to you, as an external
consulting group: develop a reliable predictive model to predict the selling price of properties, using the
aforementioned historical dataset. You are required to build different predictive models, then compare and
contrast them to determine which is the best model for the selected dataset. You are also provided with a
dataset of new properties about to be listed, for which you have to predict the house prices (scoring dataset).
Q1. Setting up the project and exploratory analysis (10%)
Provide a screenshot as evidence for each subsection of Q1.
a. Create a new project and create a data source based on the given datasets. Set the role of Price to
Target and make sure the Role and Level assigned to each variable are correct.
b. Carry out a data exploration by using a StatExplore Node. Explain your findings with regard to
your property dataset.
c. Create a Data Partition with 70% of the data for training and 30% for validation.
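The 70/30 split in part (c) is carried out with the Data Partition node in SAS Enterprise Miner. As a rough
illustration of the idea outside SAS, the Python sketch below shuffles record ids with a fixed seed and cuts
the list at 70%. The function name `partition`, the seed and the example ids are assumptions made for
illustration only; they are not part of the assignment or of SAS Enterprise Miner, and the node's own
sampling method may differ.

```python
import random

def partition(records, train_fraction=0.7, seed=12345):
    """Split records into (training, validation) by simple random sampling."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical record ids standing in for the property dataset.
train, valid = partition(range(1, 101))
print(len(train), len(valid))  # 70 30
```

The validation part is held back from model fitting so that it can be used to compare models later.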
Q2. Decision tree-based modeling and analysis (25%)
Carry out the following modeling tasks for the selected property value dataset.
a. Create two separate Decision Tree models, one based on two-way splits and one based on three-way
splits. Provide the relevant diagrams of the decision trees.
For each decision tree,
I. How many leaves are in the optimal tree?
II. Which variable was used for the first split?
III. What were the competing splits for this first split?
b. Which of the decision tree models appears to be better? Justify your answer.
c. Refer to the selected decision tree model in part (b) and
I. Identify two leaf nodes which have good predictive performance and two leaf nodes with
poor predictive performance.
II. Provide justifications for your selections.
III. Write down the rules for the pathways leading up to each selected leaf node.
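A rule for a leaf node is the conjunction of the split conditions met on the way from the root to that leaf.
The Python sketch below shows one possible way to write such a rule down and to check whether a record
follows it. The variable names come from the dataset description, but the thresholds, the leaf label and
the helper names `rule_text` and `follows_path` are hypothetical and do not come from any fitted tree.

```python
import operator

# Comparison operators that may appear in a split condition.
OPS = {"<": operator.lt, ">=": operator.ge}

# Hypothetical path to one leaf: (variable, operator, threshold) per split.
# The thresholds are made up for illustration.
path = [("Distance", "<", 10.0), ("Bedroom", ">=", 3)]

def rule_text(path, leaf_label):
    """Render a root-to-leaf path as a readable IF ... THEN rule."""
    conditions = " AND ".join(f"{var} {op} {value}" for var, op, value in path)
    return f"IF {conditions} THEN {leaf_label}"

def follows_path(record, path):
    """Return True when the record satisfies every condition on the path."""
    return all(OPS[op](record[var], value) for var, op, value in path)

print(rule_text(path, "leaf A"))
# IF Distance < 10.0 AND Bedroom >= 3 THEN leaf A
print(follows_path({"Distance": 6.5, "Bedroom": 4}, path))  # True
```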
Q3. Regression-based modeling and analysis (25%)
a. In preparation for regression, is any missing value imputation needed? If yes, should you do this
imputation before generating the decision tree models? Why or why not?
b. Use an Impute node connected to the Data Partition node to handle missing values. Which variables
have been imputed?
c. Are there any ordinal variables? Use the Replacement node to assign relevant values.
d. Conduct data exploration with the Variable Clustering node to select the best variables for the model.
Describe and justify how you ascertained the best variables for the model.
e. Create a Regression model using the set of variables you identified as suitable in part (d). You can
choose the stepwise selection and use validation error as the selection criterion.
f. Run the Regression node and view the results.
I. Which variables are included in the final model? Explain what this means to the real estate
company (very briefly).

II. What is the validation Average Squared Error (ASE) (or Mean Squared Error (MSE))?
What does this mean in a predictive model?
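For reference, the average squared error is the mean of the squared differences between the actual and the
predicted target values, here computed over the validation partition. The Python sketch below is a minimal
illustration of that definition; the function name `average_squared_error` and the prices are made up, and
the value reported by SAS Enterprise Miner should be taken from the Regression node results.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Made-up validation prices and predictions, for illustration only.
actual = [500_000, 750_000, 620_000]
predicted = [530_000, 720_000, 620_000]

ase = average_squared_error(actual, predicted)
print(ase)         # 600000000.0
print(ase ** 0.5)  # square root, back in the same units as Price
```

Because the errors are squared, ASE is expressed in squared price units and penalises large misses more
heavily than small ones; a lower validation ASE indicates predictions that are closer to the actual prices.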
Q4. Model Comparison and Scoring (25%)
a. Use the Model Comparison node to compare and contrast the results from the decision tree and
regression-based analyses. Provide a summary table for comparison. Describe and justify how
you ascertained the better model.
b. Would it have been sufficient to use only one modeling technique (decision tree or regression)?
Provide justifications for your answer, using the results from 4a.
c. Use the best predictive model to score the scoring dataset. Explain the output using plots.
Q5. Extending current knowledge with additional reading – SEMMA (15%)
The solution for Q5 should not exceed two pages.
Relate the predictive analytics life cycle from your lectures, the SAS diagram created in this case study and
the SEMMA analytics methodology proposed by SAS. You can use diagrams with brief explanations.
You can refer to the link https://en.wikipedia.org/wiki/SEMMA and also read the article SAS_SEMMA in
your assignment folder.
(This section is based on your understanding of the process flow diagram in this case study. The
objective of this question is to get you to think deeper and ‘connect’ the generic predictive analytics life
cycle discussed in the lectures with the vendor- and tool-specific SEMMA methodology (which is generic
within SAS), and then to relate both to a specific project using the SAS diagram for the project.)
**If you need to add additional diagrams and tables, they should be placed in an Appendix with the
relevant question numbers. However, the Appendix should be limited to fewer than five pages.
**Answers to each section and subsection should be clearly numbered. All diagrams, charts and tables
should be clearly visible.