
CISC 5790: Data Mining
Fordham University, Spring 2023 Prof. Yijun Zhao
Assignment 2
Due: March 20
Submission Instructions
If you use Python, create a README file with simple, clear instructions on how to compile and run your code. If the TA cannot run your program by following the instructions, you will receive 50% of the programming score.
If you use Weka, submit screenshots to show your work.
Zip all your files (code, README, written answers, etc.) in a zip file named {firstname} {lastname} CS5790 HW2.zip and upload it to Blackboard.
Note: the max score for this assignment will be 100 for Weka solutions, with an extra 20 bonus points for programming solutions. 10 of these 20 bonus points are for Q1 and Q2, and the other 10 are for Q3.
Q1 and Q2 are based on the following 3 datasets. Each dataset has a training and a test file.
Specifically, these files are:

dataset 1: train-100-10.csv / test-100-10.csv
dataset 2: train-100-100.csv / test-100-100.csv
dataset 3: train-1000-100.csv / test-1000-100.csv

Start the experiment by creating 3 additional training files from train-1000-100.csv by taking the first 50, 100, and 150 instances, respectively. Call them train-50(1000)-100.csv, train-100(1000)-100.csv, and train-150(1000)-100.csv. The corresponding test file for these datasets is test-1000-100.csv; no modification is needed.
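The subset files above can be produced with a short script. This is a minimal sketch; it assumes each CSV has a single header row (the actual layout of the provided files may differ):

```python
import csv

def write_subset(src_path, dst_path, n):
    """Copy the header row and the first n data rows of src_path
    into dst_path. Assumes the CSV has exactly one header row."""
    with open(src_path, newline="") as f:
        rows = list(csv.reader(f))
    with open(dst_path, "w", newline="") as f:
        csv.writer(f).writerows(rows[:1] + rows[1:n + 1])

# Usage, run in the directory containing the data files:
#   for n in (50, 100, 150):
#       write_subset("train-1000-100.csv", f"train-{n}(1000)-100.csv", n)
```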
1. (30 points) Implement the L2 regularized linear regression algorithm with λ ranging from 0 to 150 (integers only). For each of the 6 datasets, plot both the training set MSE and the test set MSE as a function of λ (x-axis) in one graph.
(a) For each dataset, which λ value gives the least test set MSE?
(b) For each of datasets 100-100, 50(1000)-100, and 100(1000)-100, provide an additional graph with λ ranging from 1 to 150.
(c) Explain why λ = 0 (i.e., no regularization) gives abnormally large MSEs for those three datasets in (b).
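The L2 regularized (ridge) solution has a closed form, w = (XᵀX + λI)⁻¹Xᵀy, which makes sweeping λ over 0..150 cheap. A minimal sketch, assuming the design matrix X already includes a bias column of ones:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized linear regression:
    solves (X'X + lam*I) w = X'y for the weight vector w.
    Assumes X already contains a bias column of ones."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    """Mean squared error of predictions X @ w against targets y."""
    resid = X @ w - y
    return float(resid @ resid) / len(y)
```

Computing `mse` on the training and test matrices for each integer λ in 0..150 gives the two curves to plot per dataset.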
2. (40 points) From the plots in question 1, we can tell which value of λ is best for each dataset once we know the test data and its labels. This is not realistic in real-world applications.

In this part, we use cross validation (CV) to set the value of λ. Implement the 10-fold CV technique discussed in class (pseudo code given in Appendix A) to select the best λ value from the training set.
(a) Using the CV technique, what is the best choice of λ, and what is the corresponding test set MSE, for each of the six datasets?
(b) How do the values of λ and MSE obtained from CV compare to the choices of λ and MSE in question 1(a)?
3. (30 points) Implement Feature Selection
You will apply the filter method to perform feature selection on a variant of the UCI vehicle dataset in the file veh-prim.arff.
(a) Make the class labels numeric (set "noncar" = 0 and "car" = 1) and calculate the Pearson Correlation Coefficient (PCC) of each feature with the numeric class labels. The PCC value is commonly referred to as r. List the features from highest |r| (the absolute value of r) to lowest, along with their |r| values.
Note: For a simple method to calculate the PCC that is both computationally efficient and numerically stable, see the pseudo code in the pearson.html file.
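One stable way to compute r is to center both vectors first and work with the centered sums. This is a generic two-pass sketch, not necessarily the exact method in pearson.html:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences,
    computed from centered sums: r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2)).
    Centering first avoids the catastrophic cancellation that the raw
    sum-of-products formula can suffer."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy)))
```

Ranking the features then amounts to sorting them by `abs(pearson_r(feature_column, labels))` in descending order.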
(b) Why would one be interested in the absolute value of r rather than the raw value?
(c) From the sorted list obtained in (a), select the top m features from the list, and run your KNN algorithm on the dataset restricted to only those m features. Use LOOCV to measure the performance and fix the KNN parameter to k = 7 for all runs of LOOCV. Which value of m gives the highest LOOCV classification accuracy, and what is the value of this optimal accuracy?
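The LOOCV loop for KNN can be sketched as follows. This assumes Euclidean distance and the 0/1 labels from part (a); with odd k = 7 and binary labels, majority voting cannot tie:

```python
import numpy as np

def loocv_knn_accuracy(X, y, k=7):
    """Leave-one-out CV accuracy for binary KNN with majority vote.
    For each instance, the k nearest remaining instances (Euclidean
    distance) vote on its label. Labels y are assumed to be 0/1."""
    n = len(y)
    correct = 0
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                    # exclude the held-out point
        nearest = np.argsort(dist)[:k]      # indices of the k nearest
        pred = 1 if y[nearest].mean() > 0.5 else 0
        correct += int(pred == y[i])
    return correct / n

# Restricting to the top-m features (with `order` = feature indices
# sorted by descending |r|):  loocv_knn_accuracy(X[:, order[:m]], y, k=7)
```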
Weka users can follow this tutorial to learn how to perform feature selection in Weka:
https://machinelearningmastery.com/perform-feature-selection-machine-learning-data-weka/
Appendix A
10-Fold Cross Validation for Parameter Selection
Cross Validation is the standard method for evaluation in empirical machine learning. It can also
be used for parameter selection if we make sure to use the training set only.
To select parameter λ of algorithm A over an enumerated range λ ∈ [λ1, ..., λk] using dataset D, we do the following:

1. Split the data D into 10 disjoint folds.
2. For each value of λ ∈ [λ1, ..., λk]:
(a) For i = 1 to 10:
    Train A on all folds but the i-th fold
    Test on the i-th fold and record the test MSE on fold i
(b) Compute the average test MSE across all 10 folds as the performance measure for the chosen λ
3. Pick the value of λ with the best performance (i.e., smallest average test MSE)
Now, in the above, D only includes the training data, and the parameter is chosen without knowledge of the test data. We then perform a final round of training on the entire training set D using the selected λ value and evaluate the model on the test set.
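The steps above can be sketched directly in code. This is a minimal sketch for the ridge setting of Q1/Q2; it assumes the design matrix X already includes a bias column of ones:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I) w = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_select_lambda(X, y, lams, n_folds=10):
    """Steps 1-3 above: split the training set into n_folds disjoint
    folds, and for each lambda average the held-out MSE over the folds;
    return the lambda with the smallest average MSE."""
    folds = np.array_split(np.arange(len(y)), n_folds)  # step 1
    avg_mse = []
    for lam in lams:                                    # step 2
        fold_mse = []
        for i in range(n_folds):                        # step 2(a)
            test_idx = folds[i]
            train_idx = np.hstack([folds[j] for j in range(n_folds) if j != i])
            w = ridge_fit(X[train_idx], y[train_idx], lam)
            resid = X[test_idx] @ w - y[test_idx]
            fold_mse.append(float(resid @ resid) / len(test_idx))
        avg_mse.append(np.mean(fold_mse))               # step 2(b)
    return lams[int(np.argmin(avg_mse))]                # step 3
```

After `cv_select_lambda` returns, retrain on all of X, y with the selected λ and report the MSE on the held-out test file.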