Data Mining

June 6, 2023

School of Computer Science and Software Engineering
University of Wollongong
INFO411/911 Data Mining
Assignment 1 – 7.5 Marks
Due: Week 7, on 13/April/2017 before 23:55 AEST (Australian Eastern Standard Time).
Submission of the answers must be done online by using the submission link that is associated with this
subject for assignment 1 on Moodle. One PDF document is to be submitted. The PDF must contain
typed text of your answer (do not submit a scan of a handwritten document). The document can
include computer generated graphics and illustrations (hand drawn graphics and illustrations will be
ignored). The size limit for this PDF document is 20MB. All questions are to be answered. A clear and
complete explanation and analysis needs to be provided with each answer.
Submissions made after the due time will be assessed as late submissions and are counted in full day
increments (i.e. 1 minute late counts as a 1 day late submission). There is a 25% penalty for each day
after the due date. The submission site closes four days after the due date. No submission will be
accepted after the submission site has closed.
This is an individual assignment. Plagiarism of any part of the assignment will result in 0
marks for all students involved.
You may need to do some research on background information for this assignment. For example, you
may need to develop a deeper understanding of writing code in R.
What you need:
The R software package, the file assignment1.zip from the Moodle site, and the file taxi.csv.zip from
http://teaching.cs.uow.edu.au/~markus/data/taxi.csv.zip. Caution: The file taxi.csv.zip is 314MB in size;
uncompressed, the file is 1.2GB in size!
Task 1
Preface: The analysis of results from urban mobility simulations provides very valuable data for
identifying and addressing problems in an urban road network. Public transport vehicles such as buses
and taxis are often equipped with GPS location devices and the location data is submitted to a central server
for analysis.
The metropolitan city of Rome, Italy collected location data from 320 taxi drivers that work in the center of
Rome. Data was collected during the period from 01/Feb/2014 until 02/March/2014. An extract of the dataset
is found in taxi.csv. The dataset contains 4 attributes:
1. ID of a taxi driver. This is a unique numeric ID.
2. Date and time in the format Y:m:d H:m:s.msec+tz, where msec is the microsecond part and tz is a timezone adjustment. (You may have to change the format of the date into one that R can understand.)
3. Latitude
4. Longitude
For a further description of this dataset: http://crawdad.org/roma/taxi/20140717/
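As an illustration of the date conversion mentioned in attribute 2, the following is a minimal sketch in Python (the assignment itself is to be done in R, where the same idea applies). It assumes a timestamp that looks like 2014-02-01 00:00:00.739166+01; this sample value, the use of dashes in the date, and the helper name parse_taxi_timestamp are assumptions for illustration and should be checked against the actual contents of taxi.csv.

```python
import re
from datetime import datetime


def parse_taxi_timestamp(text):
    """Parse a timestamp such as '2014-02-01 00:00:00.739166+01' (assumed layout).

    Returns a timezone-aware datetime.
    """
    text = text.strip()
    # strptime's %z expects hours and minutes, so pad an hours-only offset (+01 -> +0100).
    if re.search(r"[+-]\d{2}$", text):
        text += "00"
    # Assumption: some rows may lack the fractional-second part.
    if "." in text:
        fmt = "%Y-%m-%d %H:%M:%S.%f%z"
    else:
        fmt = "%Y-%m-%d %H:%M:%S%z"
    return datetime.strptime(text, fmt)


# Elapsed time between two hypothetical consecutive records of one driver.
first = parse_taxi_timestamp("2014-02-01 00:00:00.739166+01")
second = parse_taxi_timestamp("2014-02-01 00:00:15+01")
elapsed_seconds = (second - first).total_seconds()
```

Once timestamps are parsed into proper date-time values, differences between consecutive records of one driver can be summed to estimate time driven.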
Purpose of this task:
Perform a general analysis of this dataset. Learn to work with large datasets. Obtain general information of
the behaviour of some taxi drivers. Analyse and interpret results. This task also serves as a preparation for a
project that will be based on this dataset.
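Because the uncompressed file is large, one possible way to work with it is to stream it row by row instead of loading it whole. The sketch below (in Python for illustration; the assignment is to be done in R) keeps running minimum, maximum, and mean values for latitude and longitude. It assumes a comma-separated file without a header row and with columns in the order ID, date, latitude, longitude; the delimiter, the column positions, and the function name stream_location_stats are assumptions, not facts about taxi.csv.

```python
import csv
import io


def stream_location_stats(lines, lat_col=2, lon_col=3):
    """Compute (min, max, mean) of latitude and longitude in one pass.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    count = 0
    lat_min = lon_min = float("inf")
    lat_max = lon_max = float("-inf")
    lat_sum = lon_sum = 0.0
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        lat, lon = float(row[lat_col]), float(row[lon_col])
        lat_min, lat_max = min(lat_min, lat), max(lat_max, lat)
        lon_min, lon_max = min(lon_min, lon), max(lon_max, lon)
        lat_sum += lat
        lon_sum += lon
        count += 1
    if count == 0:
        raise ValueError("no data rows found")
    return {
        "lat": (lat_min, lat_max, lat_sum / count),
        "lon": (lon_min, lon_max, lon_sum / count),
    }


# Tiny made-up sample standing in for the real file.
sample = io.StringIO("1,t1,41.0,12.0\n2,t2,43.0,14.0\n")
stats = stream_location_stats(sample)
```

For the real data, the same function could be called on an open file handle so that only one row is held in memory at a time. Note that such statistics are only meaningful after outliers and noise points have been removed.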

Questions:
By using the data in taxi.csv perform the following tasks:
(4 marks)

(a) Plot the location points (2D plot) and clearly indicate the points that are outliers or noise points. The plot
should be informative! Remove outliers and noise points before answering the subsequent subquestions. Explain why you classified the removed points as noise points.
(b) Compute the minimum, maximum, and mean location values.
(c) Obtain the most active, least active, and average activity of the taxi drivers (most time driven, least
time driven, and mean time driven)
(d) Look at the file Student_Taxi_Mapping.txt. The file contains two columns. The first column is a 4-digit
student code, the second column is the ID of a taxi driver. Form your 4-digit student code from the first
digit and the last three digits of your student number, locate that code in the first column of the file
Student_Taxi_Mapping.txt, then use the ID of the taxi driver listed in column 2. For example, if your
student number is 52345678 then you would look up 5678 in file Student_Taxi_Mapping.txt to find that the
corresponding taxi ID is 50. Use the taxi ID that matches your 4-digit student code to answer the
following questions:
i. Plot the location points of taxi=ID
ii. Compare the mean, min, and max location value of taxi=ID with the global mean, min, and max.
iii. Compare total time driven by taxi=ID with the global mean, min, and max values.
iv. Compute the distance traveled by taxi=ID. To compute the distance between two points on the
surface of the earth use the following method (with latitudes and longitudes converted to radians):
dlon = lon2 - lon1
dlat = lat2 - lat1
a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2
c = 2 * atan2( sqrt(a), sqrt(1-a) )
distance = R * c (where R is the radius of the Earth)
Assume that R=6,371,000 meters.
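The distance method above can be sketched as follows. This is a minimal illustration in Python (the assignment is to be done in R, where the same steps apply). The function names haversine_m and path_length_m are hypothetical. The sketch assumes coordinates are given in degrees, so they are converted to radians first, and that the points of a taxi are already sorted by time.

```python
import math

EARTH_RADIUS_M = 6_371_000  # R as given in the assignment


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    # The trigonometric functions expect radians.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_M * c


def path_length_m(points):
    """Sum the distances between consecutive (lat, lon) points.

    Assumes `points` is ordered by time.
    """
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


# One degree of latitude along a meridian, then back again (made-up points).
one_degree = haversine_m(41.0, 12.0, 42.0, 12.0)
round_trip = path_length_m([(41.0, 12.0), (42.0, 12.0), (41.0, 12.0)])
```

The total distance traveled by one taxi is then the sum over consecutive location points, so any noise points left in the data would inflate the result.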
Task 2
Preface: Banks often face the problem of deciding whether or not a client is creditworthy. Banks commonly
employ data mining techniques to classify a customer into risk categories such as category A (highest rating)
or category C (lowest rating).
A bank collects data from past credit assessments. The file creditworthiness.csv contains 2500 such
assessments. Each assessment lists 46 attributes of a customer. The last attribute (the 47-th attribute) is the
result of the assessment. Open the file and study its contents. You will notice that the columns are coded by
numeric values. The meaning of these values is defined in the file definitions.txt. For example, a value 3 in
the 47-th column means that the customer credit worthiness is rated “C”. Any value of attributes not listed in
definitions.txt is “as is”.
This poses a “prediction” problem. A machine is to learn from the outcomes of past assessments and, once
the machine has been trained, to assess any customer who has not yet been assessed. For example, the value
0 in column 47 indicates that this customer has not yet been assessed.
Purpose of this task:
You are to start with an analysis of the general properties of this dataset by using visualization and clustering
techniques (e.g. those introduced during the lectures), and you are to obtain an insight into the degree
of difficulty of this prediction task. Then you are to design and deploy an appropriate supervised prediction
model (e.g. an MLP as will be used in the lab of week 5) to obtain a prediction of customer ratings.
Question 1: (2 marks)
Analyse the general properties of the dataset and obtain an insight into the difficulty of the prediction task.
Create a statistical analysis of the attributes, then list 5 of the most interesting (or most valuable) attributes.
Explain the reasons that make these attributes interesting.
Note: A set of R-script files is provided with this assignment (included in the zip-file). These are similar to the
scripts used in lab1. The scripts provided will allow you to produce some first results. However, virtually
none of the parameters used in these scripts are suitable for obtaining a good insight into the general
properties of the given dataset. Hence your task is to modify the scripts such that informative results are
obtained from which conclusions about the learning problem can be made. Note that finding a good set of
parameters is often very time consuming in data mining.
An additional challenge is to make a correct interpretation of the results.
This is what you need to do: Find a good set of parameters (e.g. through a trial and error approach), obtain
informative results, then offer an interpretation of the results. Write down your approach to conducting the
experiments, explain your results, and offer a comprehensive interpretation of the results. Do not forget that
you are also to provide an insight into the degree of difficulty of this learning problem (e.g. from the results
that you obtained, can it be expected that a prediction model will be able to obtain 100% prediction
accuracy?). Always explain your answers.
Question 2: (1.5 marks)
Deploy a prediction model to predict the credit worthiness of customers who have not yet been assessed.
The prediction capabilities of the MLP in lab4 were very poor. Your task is to:
a.) Describe a valid strategy that maximises the accuracy of predicting the credit rating. Explain why
your strategy can be expected to maximise the prediction capabilities.
b.) Use your strategy to train MLP(s) then report your results. Give an interpretation of your results.
What is the best classification accuracy (expressed in % of correctly classified data) that you can
obtain for data that were not used during training (i.e. the test set)?
What you need:
The R software package (RStudio is optional) and the file assignment1.zip. Successful completion of lab 4 in
week 5. You may use the R-script of lab4 as a basis for attempting this question.
Note that in this assignment the term “prediction capabilities” refers to a model’s ability to predict the credit
rating of samples that were not used to train the model (i.e. samples in a test set).
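To make the notion of test-set accuracy concrete, here is a minimal sketch in Python (for illustration only; the assignment is to be done in R). It separates customers that have not yet been assessed (rating 0 in the last column) from the assessed ones, holds out part of the assessed customers as a test set, and computes the percentage of correctly classified samples. The function names, the 30% hold-out fraction, and the fixed random seed are assumptions chosen for the example, not requirements of the assignment.

```python
import random


def split_labelled(rows, label_index=-1, test_fraction=0.3, seed=0):
    """Return (train, test, unlabelled) lists of rows.

    Rows whose label is 0 have not been assessed and are kept apart;
    the remaining rows are shuffled and split into training and test sets.
    """
    labelled = [r for r in rows if r[label_index] != 0]
    unlabelled = [r for r in rows if r[label_index] == 0]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    n_test = int(round(len(labelled) * test_fraction))
    return labelled[n_test:], labelled[:n_test], unlabelled


def accuracy_percent(predicted, actual):
    """Percentage of predictions that match the actual ratings."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Made-up rows: one attribute followed by a rating in {0, 1, 2, 3}.
rows = [[i, i % 4] for i in range(20)]
train, test, unlabelled = split_labelled(rows, test_fraction=0.2)
example_accuracy = accuracy_percent([1, 2, 3, 1], [1, 2, 1, 1])
```

The accuracy reported for a model should be computed on the held-out test rows only, since accuracy on the training rows says little about how the model will rate unseen customers.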
Submit one single PDF document that contains your answers to both tasks of this assignment.
Submit before the due date and follow the submission procedure as described in the header of this
assignment.
