FIT3152 Data analytics


Faculty of Information Technology
FIT3152 Data analytics – 2023 Quiz and Practical Activity – Sample Answers

Your task
• You will be given a set of multiple choice and longer questions to answer. The questions will cover topics taught during Weeks 1 – 9.

Value and Structure
• This assignment is worth 20% of your total marks for the unit.
• It has 30 marks in total, comprised of:
• 6 multiple choice questions of 1 Mark each,
• 3 free responses of 2 Marks each, and
• 3 grouped free responses of 6 Marks each.

Time
• You will have 1 hour during tutorial time to complete the test.

Due Date
• Your scheduled tutorial during Week 11.

Submission
• Via Moodle Quiz.

Generative AI Use
• In this assessment, you must not use generative artificial intelligence (AI) to generate any materials or content in relation to the assessment task.

Late Penalties
• This activity can only be deferred/re-scheduled on medical or other serious grounds with relevant documentation.

Instructions
• Answer the questions on the Moodle Quiz.
• The activity is closed book; lecture and tutorial notes or online references are not permitted.
• You may use any calculator (physical or digital).
• You must keep your camera on if you are in an online tutorial.

NOTE: You will be asked to stop this activity early and submit what you have done if:
• You are found to be using any software other than that permitted.
• You are found to be accessing web sites or online resources other than the Moodle Quiz.
• You are found to be communicating with any other student.
• You are found to be cheating in any way.

Multiple Choice (1 Mark)
The following points (P1 – P6) are to be clustered using hierarchical clustering, applying MIN (single linkage) to the distance matrix below. Which pair of points is in the first merge?
A. P2, P4   B. P3, P4   C. P1, P6   D. P1, P4   E. P4, P5

     P1   P2   P3   P4   P5   P6
P1  0.0  0.4  2.5  1.5  1.4  0.2
P2       0.0  0.4  3.9  1.7  0.6
P3            0.0  2.8  0.8  1.9
P4                 0.0  0.1  2.0
P5                      0.0  1.3
P6                           0.0

Multiple Choice (1 Mark)
The table below shows a classification model for 10 customers based on whether or not they did buy a new product (did buy = 1, did not buy = 0), and the confidence level of the prediction.

Customer  Confidence-buy  Did-buy
C01       0.8823          0
C02       0.5547          0
C03       0.6469          1
C04       0.1252          0
C05       0.7050          0
C06       0.7065          1
C07       0.1441          0
C08       0.7398          1
C09       0.7865          1
C10       0.4874          0

What is the lift value if you target the top 50% of customers that the classifier is most confident of?
A. 0.2   B. 0.5   C. 1.5   D. 2.0   E. 2.5

Multiple Choice (1 Mark)
The ROC chart for a classification problem is given below. Give an estimate of classifier performance (AUC).
[ROC chart not reproduced here.]
A. 0.1   B. 0.2   C. 0.5   D. 0.6   E. 0.8
Sample answer: AUC = 0.7917 (exact).

Multiple Choice (1 Mark)
15 observations were sampled at random from the Iris data set. The dendrogram resulting from clustering, based on their sepal and petal measurements, is below. What is the smallest number of clusters that would put all of species Setosa (observations 1:50) in a cluster of their own?
[Dendrogram not reproduced here.]
A. 1   B. 2   C. 3   D. 5   E. 15

Multiple Choice (1 Mark)
Predict the output from the following commands:
> X <- c(1, 2)
> Y <- c(3, 4)
> X + Y
A. 4, 6   B. 3, 7   C. 10   D. 1234   E. 1, 2, 3, 4
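In single-linkage (MIN) clustering the first merge is simply the pair with the smallest off-diagonal distance. As a quick check of the first multiple-choice question, here is a minimal R sketch (assuming the distance matrix has been re-typed correctly from the question) that reproduces the merge order with hclust:

# Distance matrix from the first multiple-choice question (re-typed)
m <- matrix(c(0.0, 0.4, 2.5, 1.5, 1.4, 0.2,
              0.4, 0.0, 0.4, 3.9, 1.7, 0.6,
              2.5, 0.4, 0.0, 2.8, 0.8, 1.9,
              1.5, 3.9, 2.8, 0.0, 0.1, 2.0,
              1.4, 1.7, 0.8, 0.1, 0.0, 1.3,
              0.2, 0.6, 1.9, 2.0, 1.3, 0.0),
            nrow = 6, dimnames = list(paste0("P", 1:6), paste0("P", 1:6)))
fit <- hclust(as.dist(m), method = "single")  # "single" = MIN linkage
fit$merge[1, ]   # first merge joins points 4 and 5 (shown as -4, -5)
fit$height[1]    # at height 0.1, the smallest pairwise distance
plot(fit)        # draws the full dendrogram

Changing method to "complete" or "average" gives the MAX and group-average versions of the same agglomeration.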
Multiple Choice (1 Mark)
An artificial neural network (ANN) is to be used to classify whether or not to Buy a certain product based on Popularity, Sales and Performance. An extract of the data is below.

ID  Popularity  Sales   Performance  Buy
1   low         330000  0.87         Maybe
2   medium      40000   0.22         No
3   low         50000   NA           Yes
4   high        30000   0            Yes
5   low         100000  0.1          No
6   medium      NA      0.06         No
…   …           …       …            …

(a) How many input nodes does the ANN require for this problem? [1 Mark]
A. 1   B. 2   C. 3   D. 4   E. 5
Sample answer: Popularity (3 levels) + Sales (1) + Performance (1) = 5, so E.

Free Response (2 Marks)
The table below shows a classification model for 10 customers based on whether or not they did buy a new product (did buy = 1, did not buy = 0), and the confidence level of the prediction.

Customer  Confidence-buy  Did-buy  50%CL
C01       0.8823          0        1
C02       0.5547          0        1
C03       0.6469          1        1
C04       0.1252          0        0
C05       0.7050          0        1
C06       0.7065          1        1
C07       0.1441          0        0
C08       0.7398          1        1
C09       0.7865          1        1
C10       0.4874          0        0

If a confidence level of 50% or greater is required for a positive classification, what is the Accuracy of the model?
TP = 4; FP = 3; TN = 3; FN = 0 [1 Mark all correct]
Acc = (TP + TN)/(TP + FP + TN + FN) = 7/10 = 0.7 [1 Mark or H]

Free Response (2 Marks)
A k-means clustering algorithm is fitted to the iris data, as shown below.

rm(list = ls())
data("iris")
ikfit = kmeans(iris[, 1:2], 4, nstart = 10)
ikfit
table(actual = iris$Species, fitted = ikfit$cluster)

Based on the R code and output below, answer the following questions.

> ikfit
K-means clustering with 4 clusters of sizes 24, 53, 41, 32

Cluster means:
  Sepal.Length Sepal.Width
1     4.766667    2.891667
2     5.924528    2.750943
3     6.880488    3.097561
4     5.187500    3.637500

Within cluster sum of squares by cluster:
[1]  4.451667  8.250566 10.634146  4.630000
 (between_SS / total_SS = 78.6 %)

> table(actual = iris$Species, fitted = ikfit$cluster)
            fitted
actual        1  2  3  4
  setosa     18  0  0 32
  versicolor  5 34 11  0
  virginica   1 19 30  0

If clustering was used to discriminate between the irises, what would be the accuracy of the model? Explain your reasoning.
Assign each cluster to the species contributing the greatest number of its members. Treat those as the correct classifications and then work out accuracy as usual. [1 Mark]
For example: assume clusters 1 and 4 are setosa, cluster 2 is versicolor and cluster 3 is virginica. Correctly classified = 18 + 34 + 30 + 32 = 114 out of 150, so Accuracy = 114/150 = 0.76. Accept any reasonable similar approach. [1 Mark]

Free Response (2 Marks)
Use the data below and Naïve Bayes classification to predict whether the following test instance will be happy or not.
Test instance: (Age Range = Young, Occupation = Professor, Gender = F, Happy = ?)

ID  Age Range    Occupation  Gender  Happy
1   Young        Tutor       F       Yes
2   Middle-aged  Professor   F       No
3   Old          Tutor       M       Yes
4   Middle-aged  Professor   M       Yes
5   Old          Tutor       F       Yes
6   Young        Lecturer    M       No
7   Middle-aged  Lecturer    F       No
8   Old          Tutor       F       No

Class  P(class)  P(Young|class)  P(Professor|class)  P(F|class)  Product
Yes    0.5       0.250           0.250               0.500       0.016
No     0.5       0.250           0.250               0.750       0.023

Correct calculations [1 Mark]
The score for No is larger, so classify as Happy = No. [1 Mark or H]
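The Naïve Bayes table above can be verified with a few lines of R. This is only a sketch: the survey data frame below is the 8-row table re-typed (with capitalisation normalised), and score() multiplies the class prior by the conditional probabilities for the test instance (Age Range = Young, Occupation = Professor, Gender = F):

survey <- data.frame(
  AgeRange   = c("Young", "Middle-aged", "Old", "Middle-aged",
                 "Old", "Young", "Middle-aged", "Old"),
  Occupation = c("Tutor", "Professor", "Tutor", "Professor",
                 "Tutor", "Lecturer", "Lecturer", "Tutor"),
  Gender     = c("F", "F", "M", "M", "F", "M", "F", "F"),
  Happy      = c("Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No"))

# Prior for the class times the product of the conditional probabilities
score <- function(class) {
  sub <- survey[survey$Happy == class, ]
  (nrow(sub) / nrow(survey)) *
    mean(sub$AgeRange == "Young") *
    mean(sub$Occupation == "Professor") *
    mean(sub$Gender == "F")
}

score("Yes")   # 0.015625, matching the 0.016 in the table
score("No")    # 0.0234375, matching 0.023 -- larger, so Happy = No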
Free Response (6 Marks)
The DunHumby (DH) data frame records the Date a customer shops at a store, the number of Days since their last shopping visit, and the amount Spent, for 20 customers. The first 4 rows are shown below.

> head(DH)
  customer_id visit_date visit_delta visit_spend
        <int>      <chr>       <int>       <dbl>
1          40   04-04-10          NA        44.8
2          40   06-04-10           2        69.7
3          40   19-04-10          13        44.6
4          40   01-05-10          12        30.4

The following R code is run:

DHY = DH[as.Date(DH$visit_date, "%d-%m-%y") < as.Date("01-01-11", "%d-%m-%y"), ]
CustSpend = as.table(by(DHY$visit_spend, DHY$customer_id, sum))
CustSpend = sort(CustSpend, decreasing = TRUE)
CustSpend = head(CustSpend, 12)
CustSpend = as.data.frame(CustSpend)
colnames(CustSpend) = c("customer_id", "amtspent")
DHYZ = DHY[(DHY$customer_id %in% CustSpend$customer_id), ]
write.csv(DHYZ, "DHYZ.csv", row.names = FALSE)
g = ggplot(data = DHYZ) +
  geom_histogram(mapping = aes(x = visit_spend)) +
  facet_wrap(~ customer_id, nrow = 3)

Describe the data contained in the data frame "CustSpend". [2 Marks]
Total spend for each customer, using only visits before 01-01-2011 [1 Mark]
Restricted to the 12 customers with the highest total spend [1 Mark]

Describe the data contained in the data frame "DHYZ". [2 Marks]
The rows of the DH data frame for visits before 01-01-2011 (i.e. the rows of DHY) [1 Mark]
Restricted to the top 12 customers (those appearing in CustSpend) [1 Mark]

Describe the contents of the graphic shown by plot "g". [2 Marks]
A histogram of visit spend [1 Mark]
Facetted by customer, one panel for each of the top 12 customers [1 Mark]

Free Response (6 Marks)
A World Health study is examining how life expectancy varies between men and women in different countries and at different times in history. The table below shows a sample of the data that has been recorded. There are approximately 15,000 records in all.

Country      Year of Birth  Gender  Age at Death
Australia    1818           M       9
Afghanistan  1944           F       40
USA          1846           F       12
India        1926           F       6
China        1860           F       32
India        1868           M       54
Australia    1900           F       37
China        1875           F       75
England      1807           M       15
France       1933           M       52
Egypt        1836           M       19
USA          1906           M       58
…            …              …       …

Using one of the graphic types from the Visualization Zoo (see formulas and references for a list of types), or another graph type of your choosing, suggest a suitable graphic to help the researcher display as many variables as clearly as possible. Explain your decision. Which graph elements correspond to the variables you want to display?

Appropriate main graphic named [1 Mark] For example, a scatter plot or heat map. Accept another type with justification.
Mapping of variables to the graphic (Country) [1 Mark] Age at death and the other variables are grouped by country using colour, position or labels. Other mappings accepted with justification.
Mapping of variables to the graphic (Year of birth) [1 Mark] Year of birth maps to position or panel. Other mappings accepted with justification.
Mapping of variables to the graphic (Gender) [1 Mark] Panel, position or colour. Other mappings accepted with justification.
Mapping of variables to the graphic (Age at death) [1 Mark] Size, colour or position. Other mappings accepted with justification.
Data reduction or summary calculation [1 Mark] How the data is grouped and reduced, e.g. averaging.
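One concrete way to realise the mappings described above is sketched in ggplot2 below. The data frame name (who) and its column names are assumptions for illustration only; the point is the encoding: year of birth on the x-axis, mean age at death on the y-axis, gender as colour and country as facet, with the roughly 15,000 records reduced by averaging within decades:

library(ggplot2)
library(dplyr)

# Assumed columns: Country, YearOfBirth, Gender, AgeAtDeath
who_summary <- who %>%
  mutate(Decade = 10 * floor(YearOfBirth / 10)) %>%        # data reduction step
  group_by(Country, Decade, Gender) %>%
  summarise(MeanAge = mean(AgeAtDeath), .groups = "drop")

ggplot(who_summary, aes(x = Decade, y = MeanAge, colour = Gender)) +
  geom_line() +
  facet_wrap(~ Country)   # one small-multiple panel per country

A heat map (geom_tile with decade on x, country on y and mean age at death as fill, facetted by gender) would be an equally acceptable encoding of the same four variables.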
Free Response (6 Marks)
A researcher wants to predict the prevalence of crime in towns, using the following data.

Crm: Crime rate in the town
Ind: Proportion of the town zoned industrial
Pol: Air pollution in the town (ppm)
Rms: Number of main rooms in the house
Tax: Land tax paid ($)
Str: Student-to-teacher ratio in local schools
Zone: Socio-economic zone of house location
Val: Value of the house ($000)

> head(Cdata)
      Crm  Ind   Pol Rms Tax  Str Zone  Val
1 0.00632 2.31 0.538   6 296 15.3    0 2400
2 0.02731 7.07 0.469   6 242 17.8    1 2160
3 0.02729 7.07 0.469   7 242 17.8    0 3470
4 0.03237 2.18 0.458   6 222 18.7    0 3340

Based on the R code and output below, answer the following questions.

> contrasts(Cdata$Zone) = contr.treatment(3)
> Crime = lm(Crm ~ ., data = Cdata); summary(Crime)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.162875    5.324457   -0.97    0.333
Ind         -0.160716    0.078481   -2.05    0.041 *
Pol          4.791271    4.443372    1.08    0.281
Rms          0.051432    0.500037    0.10    0.918
Tax          0.025699    0.002902    8.86   <2e-16 ***
Str          0.041439    0.177346    0.23    0.815
Zone1       -1.843825    1.198360   -1.54    0.125
Zone2        3.244316    1.702931    1.91    0.057 .
Val         -0.001216    0.000582   -2.09    0.037 *
---

> contrasts(Cdata$Zone)
  2 3
0 0 0
1 1 0
2 0 1

How does the proportion of the town zoned industrial affect crime rate? How reliable is the evidence?
Increasing the proportion zoned industrial reduces the crime rate (negative coefficient). [1 Mark]
Reliable: the coefficient is statistically significant (p = 0.041 < 0.05). [1 Mark]

How does air pollution affect crime rates? How reliable is the evidence?
The coefficient is positive, but we can't really tell. [1 Mark]
Reliability is low: the p-value is large (0.281), so the coefficient is not significant. [1 Mark]

Why is Zone '0' not defined in the regression output? How is it included in the model?
Zone 0 is the baseline (default) level of the treatment contrasts, so its coefficient is 0. [1 Mark]
It is implicitly included in the intercept. [1 Mark]

Free Response (6 Marks) – Extra Example!!
The table below shows the survey results from 12 people, who were asked whether they would accept a job offer based on the attributes Salary, Distance and Social. We want to build a decision tree to assist with future decisions on whether a person would accept a Job or not.

ID  Salary  Distance  Social  Job
1   Medium  Far       Poor    No
2   High    Far       Good    Yes
3   Low     Near      Poor    No
4   Medium  Moderate  Good    Yes
5   High    Far       Poor    Yes
6   Medium  Far       Good    Yes
7   Medium  Moderate  Poor    No
8   Medium  Near      Good    Yes
9   High    Moderate  Poor    Yes
10  Medium  Near      Poor    Yes
11  Medium  Moderate  Poor    Yes
12  Low     Moderate  Good    No

What is the entropy of Job?
Yes = 8 instances, No = 4 instances. [1 Mark]
Entropy = -(8/12) log2(8/12) - (4/12) log2(4/12) = 0.9184 [1 Mark]

Without calculating information gain, which attribute would you choose to be the root of the decision tree? Explain why.
Salary [1 Mark]
It gives the purest leaves (the High and Low branches are homogeneous). [1 Mark]

What is the information gain of the attribute you chose for the previous question?
Entropy(Salary = High) = 0; Entropy(Salary = Low) = 0
Entropy(Salary = Medium) = -(5/7) log2(5/7) - (2/7) log2(2/7) = 0.8632 [1 Mark]
Expected entropy after the split = (7/12) × 0.8632 = 0.5035 [1 Mark or H]
Information gain = 0.9184 - 0.5035 = 0.4149 [1 Mark or H, up to a maximum of 2 Marks]
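The entropy and information-gain arithmetic above is easy to double-check in R. The sketch below re-types only the Salary and Job columns from the table and implements the Entropy and Gain formulas given on the formula sheet:

# Entropy of a vector of class labels
entropy <- function(x) {
  p <- table(x) / length(x)   # class proportions
  -sum(p * log2(p))
}

job <- data.frame(
  Salary = c("Medium", "High", "Low", "Medium", "High", "Medium",
             "Medium", "Medium", "High", "Medium", "Medium", "Low"),
  Job    = c("No", "Yes", "No", "Yes", "Yes", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "No"))

entropy(job$Job)   # 0.9183, the entropy of Job

# Expected entropy after splitting on Salary, then the information gain
weights <- table(job$Salary) / nrow(job)          # 3/12 (High), 2/12 (Low), 7/12 (Medium)
branch  <- tapply(job$Job, job$Salary, entropy)   # 0, 0, 0.8631
entropy(job$Job) - sum(weights * branch)          # gain = 0.4148 (0.4149 with the rounded figures above)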
Formulas and references

The Visualization Zoo – Graphic Types
Time-Series Data
• Index Charts
• Stacked Graphs
• Small Multiples
• Horizon Graphs
Statistical Distributions
• Stem-and-Leaf Plots
• Q-Q Plots
• SPLOM
• Parallel Coordinates
Maps
• Flow Maps
• Choropleth Maps
• Graduated Symbol Maps
• Cartograms
Hierarchies
• Node-Link Diagrams
• Adjacency Diagrams
• Enclosure Diagrams
Networks
• Force-Directed Layouts
• Arc Diagrams
• Matrix Views

Entropy
If S is an arbitrary collection of examples with a binary class attribute, then:

Entropy(S) = -p_C1 log2(p_C1) - p_C2 log2(p_C2)
           = -(n_C1 / n) log2(n_C1 / n) - (n_C2 / n) log2(n_C2 / n)

where C1 and C2 are the two classes, p_C1 and p_C2 are the probabilities of being in Class 1 or Class 2 respectively, n_C1 and n_C2 are the number of examples in each class, and n is the total number of examples.
Note: log2(x) = log10(x) / log10(2) = log10(x) / 0.301

Information gain
The Gain(S, A) of an attribute A relative to a collection of examples S, where A splits S into groups S_v each having |S_v| elements, is:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) × Entropy(S_v)

Accuracy
Acc = (TP + TN) / (TP + FP + TN + FN)

ROC
TPR = TP / (TP + FN),  FPR = FP / (FP + TN)

Naïve Bayes
For inputs a1, a2, …, an and class c, the classification probability is proportional to the class prior times the product of the conditional probabilities:

P(c | a1, a2, …, an) ∝ P(c) × P(a1 | c) × P(a2 | c) × … × P(an | c)
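As a worked example of the Accuracy and ROC formulas, the short R sketch below applies them to the ten-customer table from the earlier free-response question, using the 50% confidence threshold (values re-typed from that table):

conf   <- c(0.8823, 0.5547, 0.6469, 0.1252, 0.7050,
            0.7065, 0.1441, 0.7398, 0.7865, 0.4874)
didbuy <- c(0, 0, 1, 0, 0, 1, 0, 1, 1, 0)
pred   <- as.integer(conf >= 0.5)    # positive prediction if confidence >= 50%

TP <- sum(pred == 1 & didbuy == 1)   # 4
FP <- sum(pred == 1 & didbuy == 0)   # 3
TN <- sum(pred == 0 & didbuy == 0)   # 3
FN <- sum(pred == 0 & didbuy == 1)   # 0

(TP + TN) / (TP + FP + TN + FN)      # Accuracy = 0.7
TP / (TP + FN)                       # TPR = 1.0
FP / (FP + TN)                       # FPR = 0.5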
