Data Science assignment

March 12, 2023

This is a Data Science assignment consisting mostly of math-related questions. Please show all calculations for each question; no coding is required.

1)

We would like to know whether a child's age is related to the number of cavities he or she has. The data are shown below. If there is a significant relationship, predict the number of cavities for an 11-year-old child.

(20 points)

Age of child (x)       6    8    9    10   12   14
No. of cavities (y)    2    1    3    4    6    5
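Although the assignment asks for hand calculations, a short script can be used to check the arithmetic. The sketch below computes the Pearson correlation, a t statistic for its significance, and the least-squares prediction at age 11. The choice of a two-tailed test at alpha = 0.05 (critical t of 2.776 for 4 degrees of freedom) is an assumption; the question does not state a significance level.

```python
import math

ages = [6, 8, 9, 10, 12, 14]       # x: age of child
cavities = [2, 1, 3, 4, 6, 5]      # y: number of cavities

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(cavities) / n

# Sums of squares and cross-products about the means
s_xx = sum((x - mean_x) ** 2 for x in ages)
s_yy = sum((y - mean_y) ** 2 for y in cavities)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, cavities))

# Pearson correlation and its t statistic with n - 2 degrees of freedom
r = s_xy / math.sqrt(s_xx * s_yy)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# ASSUMPTION: two-tailed test at alpha = 0.05; 2.776 is the t-table value for df = 4
T_CRITICAL = 2.776
significant = abs(t_stat) > T_CRITICAL

# Least-squares line y = intercept + slope * x, evaluated at x = 11
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
predicted_at_11 = intercept + slope * 11

print(f"r = {r:.4f}, t = {t_stat:.3f}, significant = {significant}")
print(f"y = {intercept:.4f} + {slope:.4f}x, prediction at 11 = {predicted_at_11:.2f}")
```

Under these assumptions the script gives r of about 0.84 and a predicted value of about 4.14 cavities, which can be compared against the hand-worked answer.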

2)

Assume we gathered a random sample, shown in the following dataset, where the independent variable (x) represents the number of hours a student studies and the dependent variable (y) represents the student's exam score. Is there a correlation between the two variables, and if so, how strong is it?

(20 points)

Hours of study (X)    Exam score (Y)
6                     40
10                    50
18                    100
15                    80
12                    65
16                    90
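One way to check the hand calculation is the computational form of the Pearson correlation coefficient, which uses only the raw sums that would appear in a worked table. The helper name pearson_r below is illustrative and not part of the assignment.

```python
import math

hours = [6, 10, 18, 15, 12, 16]       # X: hours of study
scores = [40, 50, 100, 80, 65, 90]    # Y: exam score

def pearson_r(xs, ys):
    """Pearson correlation from raw sums (the usual hand-calculation formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(hours, scores)
print(f"r = {r:.4f}")
```

A value of r close to 1 (here roughly 0.99) indicates a strong positive linear relationship.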

3)

The average age of a vehicle registered in the United States is 8 years, or 96 months. Assume that the standard deviation is 16 months. If a random sample of 36 vehicles is selected, find the probability that the mean of their ages is between 90 and 100 months. (10 points)

Hint: use the normal distribution and the z-score.
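The calculation can be sketched as follows: the sample mean has standard error sigma / sqrt(n), each bound is converted to a z-score, and the probability is the difference of the two normal CDF values. The helper normal_cdf is an illustrative stand-in for a z-table lookup, built from the standard library's math.erf.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu = 96        # population mean age in months
sigma = 16     # population standard deviation in months
n = 36         # sample size

standard_error = sigma / math.sqrt(n)     # 16 / 6
z_low = (90 - mu) / standard_error        # -2.25
z_high = (100 - mu) / standard_error      # 1.50

probability = normal_cdf(z_high) - normal_cdf(z_low)
print(f"z_low = {z_low:.2f}, z_high = {z_high:.2f}, P = {probability:.4f}")
```

The result, about 0.921, should match P(-2.25 < z < 1.50) read from a z-table.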

4)

Assume we gathered a random sample, shown in the following dataset. Each column represents the weekly sales of one of two stores. We would like to decide which store (A or B) can predict its weekly sales with more certainty. (20 points)

Store A    Store B
2000       2500
4500       6500
3000       2000
1500       5000
6000       1200
4200       7000
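One common reading of "more certainty" is lower relative variability. The sketch below compares the two stores by sample standard deviation and coefficient of variation (standard deviation divided by the mean). Treating the coefficient of variation as the deciding measure is an assumption; the question does not name a statistic.

```python
import statistics

store_a = [2000, 4500, 3000, 1500, 6000, 4200]
store_b = [2500, 6500, 2000, 5000, 1200, 7000]

def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean."""
    return statistics.stdev(values) / statistics.mean(values)

sd_a = statistics.stdev(store_a)
sd_b = statistics.stdev(store_b)
cv_a = coefficient_of_variation(store_a)
cv_b = coefficient_of_variation(store_b)

# ASSUMPTION: the store with the smaller coefficient of variation is the more predictable one
more_predictable = "A" if cv_a < cv_b else "B"
print(f"A: sd = {sd_a:.1f}, CV = {cv_a:.3f}")
print(f"B: sd = {sd_b:.1f}, CV = {cv_b:.3f}")
print(f"More predictable store: {more_predictable}")
```

Using the coefficient of variation instead of the raw standard deviation keeps the comparison fair when the two stores have different average sales.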

5)

Assume we gather a random sample of the following dataset. We are trying to predict the body fat % of a person based on his/her weight in kg.

(30 points)

[Data table missing: the embedded image was not preserved in the source.]

a)

Find the best-fit line for the data given above.

b)

Find the R-squared value.

c)

Find the F value of the best-fit line.

d)

Explain why your best-fit line predicts better than the following line:

y = 0.5x + 3.
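The quantities asked for in parts a) to d) can be sketched as below. Because the original data table is not available, the weights and body fat values in the script are hypothetical placeholders, not the assignment's data, and the function names are illustrative. The key point for part d) is that the least-squares line minimizes the sum of squared errors (SSE), so no other line, including y = 0.5x + 3, can have a smaller SSE on the same data.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

def sum_squared_errors(xs, ys, slope, intercept):
    """SSE of the line y = intercept + slope * x on the data."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# HYPOTHETICAL placeholder data: the assignment's table was not preserved
weights_kg = [60, 70, 80, 90, 100]
body_fat_pct = [18, 22, 23, 29, 33]

n = len(weights_kg)
mean_y = sum(body_fat_pct) / n

slope, intercept = fit_line(weights_kg, body_fat_pct)                    # part a
sse_fit = sum_squared_errors(weights_kg, body_fat_pct, slope, intercept)
sst = sum((y - mean_y) ** 2 for y in body_fat_pct)                       # total sum of squares

r_squared = 1 - sse_fit / sst                                            # part b
f_value = (sst - sse_fit) / (sse_fit / (n - 2))                          # part c: 1 and n - 2 df

sse_alt = sum_squared_errors(weights_kg, body_fat_pct, 0.5, 3)           # part d: y = 0.5x + 3

print(f"y = {intercept:.3f} + {slope:.3f}x, R^2 = {r_squared:.4f}, F = {f_value:.2f}")
print(f"SSE (best fit) = {sse_fit:.2f}, SSE (y = 0.5x + 3) = {sse_alt:.2f}")
```

Replacing the placeholder lists with the real data would give the values to compare against the hand calculation.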

6)

Build a decision tree classifier based on the following dataset. There are three independent variables (a1, a2, a3) that will help with the prediction, and the ‘Classification’ column is the dependent variable. (40 points)

[Data table missing: the embedded image was not preserved in the source.]
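A decision tree is typically built by choosing, at each node, the attribute whose split gives the largest information gain. The sketch below shows that calculation under two assumptions: that an entropy-based (ID3-style) criterion is intended, which the question does not state, and that the rows shown are hypothetical placeholders, since the original table is not available. Only the column names a1, a2, and Classification come from the assignment.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, attribute, target="Classification"):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    parent_entropy = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted_child_entropy = sum(
        len(group) / len(rows) * entropy(group) for group in groups.values()
    )
    return parent_entropy - weighted_child_entropy

# HYPOTHETICAL placeholder rows: the assignment's table was not preserved
rows = [
    {"a1": "T", "a2": "T", "Classification": "+"},
    {"a1": "T", "a2": "F", "Classification": "+"},
    {"a1": "F", "a2": "T", "Classification": "-"},
    {"a1": "F", "a2": "F", "Classification": "-"},
]

gains = {attr: information_gain(rows, attr) for attr in ("a1", "a2")}
root_attribute = max(gains, key=gains.get)
print(gains, "-> split first on", root_attribute)
```

In this placeholder example a1 separates the classes perfectly, so it would be chosen as the root; the same procedure is then repeated on each branch until the leaves are pure.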

7)

Consider the following confusion matrix:

(10 points)

              Predicted Yes    Predicted No
Actual Yes    95               5
Actual No     5                45

a)

Calculate the sensitivity, precision, and accuracy from the confusion matrix.

b)

Identify the type I and type II errors in the given confusion matrix (give their values) and explain the difference between the two.
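The metrics in parts a) and b) follow directly from the four cells of the matrix. The sketch below assumes that "Yes" is the positive class, so a type I error is a false positive (actual No, predicted Yes) and a type II error is a false negative (actual Yes, predicted No).

```python
# Cells of the confusion matrix, ASSUMING "Yes" is the positive class
true_positive = 95     # actual Yes, predicted Yes
false_negative = 5     # actual Yes, predicted No  (type II error)
false_positive = 5     # actual No,  predicted Yes (type I error)
true_negative = 45     # actual No,  predicted No

total = true_positive + false_negative + false_positive + true_negative

sensitivity = true_positive / (true_positive + false_negative)   # recall
precision = true_positive / (true_positive + false_positive)
accuracy = (true_positive + true_negative) / total

type_i_errors = false_positive
type_ii_errors = false_negative

print(f"sensitivity = {sensitivity:.3f}, precision = {precision:.3f}, accuracy = {accuracy:.3f}")
print(f"type I errors = {type_i_errors}, type II errors = {type_ii_errors}")
```

With these counts, sensitivity and precision are both 0.95 and accuracy is 140/150, or about 0.933.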