STAT6030 - Australia Assignments

STAT6030 GENERALISED LINEAR MODELLING
The Australian National University
Assignment 2
2023 Summer Session

Instructions

• This assignment is worth 55 marks in total and 25% of your overall mark for this course. The assignment is compulsory and must be submitted by 5pm on Monday 6 March 2023.
• You must write your answers to this assignment individually and by yourself. If you copy someone else's work or allow your work to be copied, you will receive a mark of zero for the assignment and risk severe academic consequences.
• Your answers should be submitted individually through Turnitin on Wattle as a single PDF or Word document (less than 50MB) including the following:
  1. The assignment Cover Sheet (available on Wattle).
  2. Your answers (no more than 10 pages, including graphs, summaries, tables, etc., but excluding the Appendix and Cover Sheet, and respecting the other requirements for each part).
  3. An Appendix including all the R commands you used (no page limit).
• Assignments should be typed and not handwritten. Your assignment may include some carefully edited R output (e.g., graphs, summaries, tables, etc.) and appropriate discussion of these results, as well as some selected R commands. Please be selective about what you present and only include as many pages and as much R output as necessary to justify your solution. Clearly label each part and question of your assignment and appendix with the corresponding numbers.
• Unless otherwise advised, use a significance level of 5%.
• Round numeric answers to 4 decimal places (e.g., 0.00115 is rounded to 0.0012).
• Marks will be deducted if these instructions are not strictly respected, especially when the total report is of an unreasonable length, i.e., more than the above page limit. The Appendix will generally not be marked, but will be checked if what you have written or done needs clarification.
• Name your submission “CourseCode Uid”, e.g., “STAT6030 u1234567”.
• Try to submit your assignment at least 30 minutes before the deadline in case something unexpected happens, for instance an internet connection problem.
• Late submissions will NOT be accepted. Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but must receive the lecturer's approval at least 24 hours before the deadline.

Part 1 [16 Marks]

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 3 pages on your answers for Part 1.

(a) [1 mark] What is the definition of canonical link function in the context of generalised linear models?

(b) [1 mark] Explain in words and/or by drawing a plot when a link function of a generalised linear model is valid.

(c) [1 mark] In the context of generalised linear models, does the value of the maximised log-likelihood for the saturated model depend on the choice of link function and why?

(d) [1 mark] The mean of a generalised linear model is known to lie between 1 and 2 whatever the value of the linear predictor $\eta_i = x_i^\top \beta$ is, i.e. $1 < \mu_i < 2$. Let $\Phi$ denote the cumulative distribution function of the standard normal distribution $N(0,1)$ and $\Phi^{-1}$ denote the inverse function of $\Phi$. Which function below is an appropriate link function in this setting?
Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores full marks for the question.
  A. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i - 1)$.
  B. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i/2)$.
  C. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i - 1)$.
  D. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i/2)$.

(e) [1 mark] The gamma distribution has probability density function
  $f(y; \alpha, \beta) = \{\beta^{\alpha}/\Gamma(\alpha)\}\, y^{\alpha-1} \exp(-\beta y)$,
where $y > 0$, $\alpha > 0$ is a shape parameter, $\beta > 0$ is a rate parameter and $\Gamma(\cdot)$ is the gamma function.
You may assume that (i) the mean $\mu$ of the gamma distribution is given by $\mu = \alpha/\beta$; (ii) the gamma distribution is a generalised linear model with dispersion parameter $\varphi = 1/\alpha$, in the notation of equation (4.1) of Topic 4. What is the canonical link function when the generalised linear model is gamma?

(f) [3 marks] The geometric distribution has probability mass function
  $f(y; p) = (1 - p)\, p^{y}$, for $y = 0, 1, \ldots$, where $0 < p < 1$.
What are the canonical link function and variance function of the geometric distribution?
The deviance residual for observation $i$ is given by $\mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{d_i^2}$, where
  $d_i^2 = \dfrac{2}{\varphi}\left\{ y_i\left[h(y_i) - h(\hat{\mu}_i)\right] - \left[b\left(h(y_i)\right) - b\left(h(\hat{\mu}_i)\right)\right] \right\}$
is the deviance associated with observation $i$, which is written as a function of the response variable $y_i$ and of the fitted value $\hat{\mu}_i$, while $\mathrm{sign}(\cdot)$ is the sign function defined in the lecture notes. Also recall that $b'^{-1}(\mu) \triangleq h(\mu)$.
What is the expression for $d_i^2$, as a function of $y_i$ and $\hat{\mu}_i$, when the generalised linear model is geometric? Please simplify your expression as much as you can.

(g) [1 mark] Consider a generalised linear model with linear predictor $\eta_i = \upsilon_i + x_i^\top \beta$, where $\upsilon_i$ is an offset, $x_i$ is a vector of covariates of length $p$ and $\beta$ is a parameter vector of length $p$ to be estimated. Assuming that the model's dispersion parameter $\varphi = 1$ is known, how many free parameters (i.e., parameters to estimate) are there in this model?

(h) [1 mark] A logistic regression model was fitted to a dataset consisting of a binary outcome variable, $y_i$, taking values 0 and 1, and a single numerical covariate $x_i$. The estimated intercept and slope on the linear predictor scale were found to be $-0.47$ and $1.3$, respectively, so that the linear predictor as a function of $x_i$ is given by $\hat{\eta}(x_i) = -0.47 + 1.3 x_i$. Recall the estimated probability $\mathrm{Prob}[y_i = 1 \mid x_i]$ is given by
  $\mathrm{Prob}[y_i = 1 \mid x_i] = \exp\{\hat{\eta}(x_i)\}/[1 + \exp\{\hat{\eta}(x_i)\}]$
and so the estimated probability $\mathrm{Prob}[y_i = 0 \mid x_i]$ is given by $1 - \mathrm{Prob}[y_i = 1 \mid x_i]$.
What is the value of $x_i$ such that the odds of the event $y_i = 1$ is 0.75? Recall that the odds of an event that occurs with probability $\pi$ is given by $\pi/(1 - \pi)$.

(i) [2 marks] Consider a distribution with the probability density function
  $f(y; \mu) = [1/(2\pi y^3)]^{-1/2} \exp[-(y - \mu)^2/(2\mu^2 y)]$,
where $\mu$ is the mean of the distribution and $y > 0$. What is the variance function, $V(\mu)$, of this distribution?

(j) [1 mark] The following output from a linear regression model fit in R was obtained. Calculate the value for ++++ that the R program would give if the sample size is 10.

    Call:
    lm(formula = y ~ x)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.08888    0.66793  -0.133    0.897
    x            1.06903    0.10765    ????     ++++

(k) [1 mark] Suppose we fit a Poisson regression model A with log link to a dataset whose response variable is a count. No offset is included. In the fitted model we have included a covariate $x$ and the estimated coefficient of $x$ is $\hat{\beta}_A$. Suppose that we then decide to fit a second model B which is the same as model A but with $x$ included as an offset as well as included in the linear predictor as before. Suppose the estimated coefficient of $x$ is $\hat{\beta}_B$ in model B. Which of the following statements about the second fitted model is correct?
Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores full marks for the question.
  A. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will (usually) change compared to that of model A.
  B. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will not change compared to that of model A.
  C. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will (usually) change compared to that of model A.
  D. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will not change compared to that of model A.
(l) [2 marks] Suppose we have fitted a Poisson log-linear regression with extra-Poisson variation and the estimate of the dispersion parameter $\varphi$ is greater than 1. If the standard Poisson model was used in this situation, would this be likely to be a case of underdispersion or overdispersion, and which assumption about the relationship between the mean and variance of the Poisson distribution would fail? What would happen to the estimates of the $\beta$ parameters for the standard Poisson model?

Part 2 [12 Marks]

Different doses of two chemicals, A and B, were used in a trial whose purpose was to reduce cockroach numbers. The variable x1 gives the dose of chemical A and the variable x2 gives the dose of chemical B. In the R code below, the first column of c gives the number of cockroaches killed and the second column of c gives the number of cockroaches that survived. The following R outputs were obtained:

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 2 pages on your answers for Part 2.

(a) [1 mark] What type of generalised linear model is being fitted here and what link function is being used?

(b) [5 marks] Determine the missing information indicated by the letters A, B, C, D, E, F, G, H, J and K. Note that for E you are required to specify the link function.

(c) [2 marks] Write down the relevant model in mathematical form, focusing on the contribution of observation $i$ to the model.

(d) [2 marks] Briefly indicate your impressions of the results of the statistical analysis provided above.

(e) [2 marks] What are the next questions you would investigate in the statistical analysis? State what your next two steps would be.

Part 3 [12 Marks]

The presence of sprouted or diseased kernels in wheat can reduce the value of a wheat producer's entire crop. It is important to identify these kernels after harvest but prior to sale.
To facilitate this identification process, automated systems have been developed to separate healthy kernels from the rest. Improving these systems requires a better understanding of the measurable ways in which healthy kernels differ from kernels that have sprouted prematurely or are infected with a fungus. To this end, Martin et al. (1998) conducted a study examining numerous physical properties of kernels (density, hardness, size, weight, and moisture) measured on a sample of wheat kernels from two different classes of wheat, hard red winter (hrw) and soft red winter (srw) (represented by the categorical variable class) in the wheat.csv dataset on Wattle. Each kernel's condition was also classified as “Healthy”, “Partly Diseased” or “Diseased” by human visual inspection (represented by the categorical variable type2).

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 3 pages on your answers for Part 3.

Throughout the following questions, treat type2 as the response variable. Suppose that we have conducted the following R analysis and obtained the R output below:

(a) [2 marks] Describe the interpretations of coefficient estimates -10.95451 and -0.6480912 in the summary() output.

(b) [2 marks] What are the null and alternative hypotheses corresponding to the p-value 0.0291 in the Anova() output? What conclusion can you obtain based on the p-value?

(c) [2 marks] Suppose we have a new observation of the following form:

    > xnew = data.frame(class = 'srw', density = 1, hardness = 25, size = 2,
    +                   weight = 25, moisture = 12)
    > xnew
      class density hardness size weight moisture
    1   srw       1       25    2     25       12

If we use predict(), what are the predicted probabilities for the different categories of the response type2 and what is the prediction of the response type2 for this new observation?
Suppose that we conducted further R analysis and obtained the R output below:

(d) [2 marks] Describe the interpretations of coefficient estimates -0.17370 and 13.50540 in the summary() output, respectively.

(e) [2 marks] What are the null and alternative hypotheses corresponding to the p-value 0.65749 in the Anova() output? What conclusion can you obtain based on the p-value?

(f) [2 marks] Fit a nominal logistic regression model and an ordinal logistic regression model, respectively, with covariates class, density, hardness, size, weight, moisture, class:density, class:hardness, class:size, class:weight and class:moisture. Based on the model fitting results, which model is better? Please explain why this model is better.

Part 4 [15 Marks]

An analysis of some ship damage data is presented below. The data consists of a factor typ, corresponding to ship type, with 3 levels, A, B and C; a factor cons, corresponding to the period of construction of the ship, with 3 levels, 1960-1964, 1965-1969 or 1975-1979; a factor opr, corresponding to years of operation of the ship, with 2 levels, either 1960-1975 or 1975-1979; a numerical variable mnths, corresponding to the total number of months at risk; and dmge, corresponding to the number of damage incidents reported for the ship. The following R output was obtained.

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 2 pages on your answers for Part 4.

(a) [1 mark] What type of generalised linear model is being fitted here to obtain the output out1 and what link function is being used?

(b) [7 marks] Determine the missing information indicated by the letters A, B, C, D, E, F, G, H, J, K, M, N, P and Q. Note that F should consist of either a blank, a dot, one star, two stars or three stars; and for J you should specify the link function that was used. All the other letters apart from A represent a number.
(c) [2 marks] Explain what is meant by an offset and the motivation for offsetting L=log(mnths) rather than mnths itself.

(d) [2 marks] Using the R printout for out1, give the value of the linear predictor for a ship of Type A that was constructed in the period 1965-1969 and operated in the period 1975-1979, assuming that mnths=1095.

(e) [3 marks] Write down brief notes on what you would conclude about the wave damage data from the R output. Can we draw any conclusions as to whether overdispersion is present in this dataset? What action would you consider taking if overdispersion were suspected to be present?
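
The way an offset enters a log-link model, as raised in Part 4(c) and in the linear predictor of Part 1(g), can be sketched with a short worked equation. This is an illustrative sketch only and not a model answer: the exposure symbol $t_i$ and the numerical values are hypothetical and are not taken from any R output in this assignment.

```latex
% Hypothetical notation: t_i is an exposure (e.g. time at risk) and
% \upsilon_i = \log t_i is the offset in \eta_i = \upsilon_i + x_i^\top \beta.
\begin{align*}
\log \mu_i &= \upsilon_i + x_i^\top \beta = \log t_i + x_i^\top \beta \\
\mu_i      &= t_i \exp(x_i^\top \beta) \\
\mu_i / t_i &= \exp(x_i^\top \beta)
\end{align*}
% Hypothetical numbers: t_i = 100 and x_i^\top \beta = -5 give
% \mu_i = 100\, e^{-5} \approx 0.6738 .
```

Under these assumptions the offset term has its coefficient fixed at 1, so the coefficients $\beta$ act on the expected count per unit of exposure rather than on the raw expected count.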