Visualization and data processing

June 29, 2023

Assessment 2

The following exercises are designed to assess your understanding of concepts, implementation, and interpretation of topics in Visualization and Data Processing. Some questions may require you to search for and use R functions that we have not used so far. For all of the following questions, submit your code and output.

Note: The questions in this assessment may have multiple correct solutions, so submitting your R code is essential. Almost no statistical background is presumed for this assessment. All methods required for the solutions are available on the content pages of Weeks 2-10 of this subject. Some of them have been covered in detail during Collaborate sessions.

Answer sheet and script (code).

For full marks, every output must be followed by a short statement. You may use one of the following alternatives when responding to the questions.

  1. Use a new Word document to provide your responses sequentially, using Section and Question numbers. You could paste your annotated R code, followed by the output and discussion, into the Word document.
  2. Provide your output and discussion in a Word document and submit the annotated (commented) R code separately. Code without annotation will not receive full marks.
  3. If you know R Markdown, you could use it to create an integrated report. However, R Markdown is NOT a requirement for this subject.

M. Function: Section Marks 5

An R package is simply a collection of functions. Each function does a specific task. Write a (single) function that computes

  1. Arithmetic mean,
  2. Standard deviation, and
  3. Range (the difference between the maximum and minimum values of a variable)

for any variable of an R data frame. Apply this function to the airquality data available in base R. Show your code and output.
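As a rough illustration of the kind of function being asked for, here is a minimal sketch. The function name `summarise_var` and its arguments are hypothetical, and dropping missing values by default is an assumption (airquality contains NAs in Ozone and Solar.R); other designs are equally valid.

```r
# Hypothetical helper: mean, standard deviation, and range (max - min) of one column
summarise_var <- function(data, var, na.rm = TRUE) {
  x <- data[[var]]
  stopifnot(is.numeric(x))
  c(
    mean  = mean(x, na.rm = na.rm),
    sd    = sd(x, na.rm = na.rm),
    range = max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
  )
}

# Apply to a single variable, then to every column of airquality
summarise_var(airquality, "Ozone")
sapply(names(airquality), function(v) summarise_var(airquality, v))
```

Taking the column name as a string keeps the function usable on any data frame; whether to remove NAs silently or report them is a choice you should state in your annotation.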

A. Categorising and Visualization: Section Marks 10

  • Load the LifeCycleSavings data, from the datasets package, into your R session.

Marks (1)

  • Use the values of the variables pop15 and ddpi to create two new categorical variables named Pop15Cat and ddpiCat, respectively, each with three categories: “High” (top 25%), “Medium” (middle 50%), and “Low” (bottom 25%). Show your code.
    • Hint: To do this, you must find a function to obtain the required cut-off points and then another function to implement the binning of the variables. Marks (4)
  • Count the cases (frequencies) within each category of the new categorical variables. Show code and output. Marks (2)
  • Use a visualization tool to display the relationship between the variables sr and Pop15Cat, stratified by ddpiCat, on a single plot.

Comment on the plot – do you observe any relationship between the three variables?

You must use ggplot2 for visualization. Wherever applicable, your plots must have proper labelling and legends. Marks (5)
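The binning and counting steps earlier in this section can be sketched in base R as follows. This is one possible approach, not the required one: the helper name `bin_quartiles` and the working copy `lcs` are hypothetical, and the choice of `quantile()` for the cut-off points and `cut()` for the binning is an assumption about what the hint has in mind. The plot itself is not shown here.

```r
data(LifeCycleSavings, package = "datasets")
lcs <- LifeCycleSavings  # working copy (hypothetical name)

# Hypothetical helper: bottom 25% -> "Low", middle 50% -> "Medium", top 25% -> "High"
bin_quartiles <- function(x) {
  cuts <- quantile(x, probs = c(0, 0.25, 0.75, 1), na.rm = TRUE)
  cut(x, breaks = cuts, labels = c("Low", "Medium", "High"),
      include.lowest = TRUE, ordered_result = TRUE)
}

lcs$Pop15Cat <- bin_quartiles(lcs$pop15)
lcs$ddpiCat  <- bin_quartiles(lcs$ddpi)

# Frequency of cases in each category
table(lcs$Pop15Cat)
table(lcs$ddpiCat)
```

With the default right-closed intervals, a value exactly equal to a cut-off falls into the lower category; `include.lowest = TRUE` keeps the minimum from being dropped as NA.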

B. Data Processing: Section Marks 15

  • Pain is often used as an indicator to prioritize (rank) patients by urgency of attention (triage) at hospitals. Consider pain levels reported by 6 patients at a hospital on a weekday (say Wednesday).

Using R, create an ordered categorical (ordinal) vector called “pain”, of length 6, representing the data reported by the 6 patients. This vector can take any of the values “low”, “moderate”, and “severe”, with the values ordered as

  • ‘low’ < ‘moderate’ < ‘severe’. Show your code and the output as evidence.
    • Convert the above vector “pain” into an integer variable. Show your code and the output as evidence. Marks (2)

Hint: You can choose any values for the six observations.
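For illustration, an ordered factor and its integer conversion might look like the sketch below. The six values are arbitrary, and the name `pain_int` is hypothetical.

```r
# Illustrative values only; any six observations are acceptable
pain <- factor(
  c("low", "severe", "moderate", "moderate", "low", "severe"),
  levels = c("low", "moderate", "severe"),
  ordered = TRUE
)
pain
is.ordered(pain)

# Integer codes follow the level order: low = 1, moderate = 2, severe = 3
pain_int <- as.integer(pain)
pain_int
```

Setting `levels` explicitly matters: without it, R orders the levels alphabetically, which happens to agree here but would not in general.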

  • Now consider that two observations in the vector “pain” were lost due to a data entry error. Additionally, it is given that the observations on pain were collected for:
    • Australian males
    • in the age group 70-80.

How would you deal with the missing observations on this vector? Please provide your response in no more than two sentences, highlighting two possibilities, with proper justification.

Hint: This is a hypothetical question. You do not need to use a dataset, but you must justify your approach to dealing with missing values in this context. Marks (4)

  • Load the iris data into your session and:
    • Show the dimensions and the variable names of this dataset using a single R function.
    • Drop the variable “Species”, using a dplyr function, and store the new dataset in a new data frame called iris1.
    • Change the first four column names of the new data frame to sep_len, sep_wid, pet_len, and pet_wid.

Show your code for all steps. Print the first six rows of the data frame iris1. Marks (3)
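A possible sketch of these steps is given below. It assumes the dplyr package is installed, and the use of `str()` as the single function reporting both dimensions and variable names is one reasonable reading of the question rather than the only acceptable answer.

```r
library(dplyr)

# str() reports the number of observations and variables, plus each variable name
str(iris)

# Drop Species with select(), then rename the four remaining columns
iris1 <- iris %>%
  select(-Species) %>%
  rename(sep_len = Sepal.Length, sep_wid = Sepal.Width,
         pet_len = Petal.Length, pet_wid = Petal.Width)

head(iris1)
```

Renaming by old name, rather than assigning to `names(iris1)` by position, guards against a silent mix-up if the column order ever changes.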

  • Proximity analysis on iris flowers.
    • Recommend an R function to compute the appropriate dissimilarity coefficient for the data frame iris1 that you created in the previous question. Justify your choice.
    • Compute the dissimilarity coefficient that you recommended and save it as an object. Show your R code.
    • How many distinct dissimilarities were computed in QB4.2 for flower number 50?
    • Based on your results in QB4.3, find the flowers that are most and least similar to flower 50 in this data. Show your code and output for identification, along with the statement. Marks (5)
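The proximity steps above could be approached along the following lines. This sketch assumes Euclidean distance via base R's `dist()`, which is one defensible choice for four numeric measurements on the same scale, not the only one. The object names `d` and `d50` are hypothetical, and iris1 is rebuilt in base R here only so the example runs on its own.

```r
# Rebuild iris1 in base R so this sketch is self-contained
iris1 <- setNames(iris[, 1:4], c("sep_len", "sep_wid", "pet_len", "pet_wid"))

# Euclidean distance: all four variables are numeric and measured in the same unit
d <- dist(iris1, method = "euclidean")

# Row 50 of the full distance matrix, dropping the zero self-distance
d50 <- as.matrix(d)[50, -50]
length(d50)                 # number of pairwise dissimilarities involving flower 50

names(d50)[which.min(d50)]  # most similar flower (smallest dissimilarity)
names(d50)[which.max(d50)]  # least similar flower (largest dissimilarity)
```

`which.min()` and `which.max()` return only the first match, so check for ties (for example with `which(d50 == min(d50))`) before writing your statement.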

C. Data Processing: Section Marks 20

Clinician scientists at Royal Melbourne Hospital are investigating fecal calprotectin (FC) as a non-invasive diagnostic alternative for Inflammatory Bowel Disease (IBD) and Acute Severe Ulcerative Colitis (ASUC). This would also contribute to the standard of care. A subset of the dataset is provided as bowel.csv. The data have several missing values, and the causes of missingness are often unknown.

In the following questions, show your R working and output.

  1. In the bowel.csv dataset,
    1. Count and state the number of missing observations on the variable Hb and the total number of missing observations in the data bowel.csv. Marks (1)
    2. Perform a univariate imputation on the variable Hb. Marks (3) Your solution should include:
      1. code,
      2. result, and
      3. justification for the choice of imputation value you used. Hint: For justification you may consider using descriptive statistics and/or visualization techniques.
  2. Select two variables that may have an association with Hb. Hint: These are real data, and relationships may appear different from standard textbook patterns.
  3. Justify your choice using dplyr or base R tool(s). Show your working in R: code and any output you produce in support of your argument. Marks (4)
  4. Use your investigation in QC 2.a.) to replace all missing values in the variable Hb. Show your code. Please do not use any automatic imputation package or routines such as Hmisc. Marks (6)
  5. Present a comparison, with discussion, between the two types of imputation in the context of the variable Hb. Hint: You may use appropriate proximity and/or visualisation tools learned in the topics of Weeks 3-7 for comparison. Marks (5)
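Because bowel.csv is not reproduced here, the sketch below uses a small toy data frame to illustrate counting missing values and the two styles of imputation contrasted in this section. Everything in it is hypothetical apart from the variable name Hb: the values, the grouping column `Sex`, and the choice of the median as the imputation value are assumptions for illustration only, and the real data may call for different choices.

```r
# Toy stand-in for bowel.csv; all values and the Sex column are hypothetical
bowel_toy <- data.frame(
  Hb  = c(120, NA, 135, 98, NA, 142, 110, 128),
  Sex = c("F", "F", "M", "F", "M", "M", "F", "M")
)

# Missing values in Hb, and in the whole data frame
sum(is.na(bowel_toy$Hb))
sum(is.na(bowel_toy))

# Univariate imputation: replace every NA with the overall median of Hb
hb_median <- median(bowel_toy$Hb, na.rm = TRUE)
bowel_toy$Hb_uni <- ifelse(is.na(bowel_toy$Hb), hb_median, bowel_toy$Hb)

# Conditional imputation: replace each NA with the median of Hb within its group
group_median <- ave(bowel_toy$Hb, bowel_toy$Sex,
                    FUN = function(x) median(x, na.rm = TRUE))
bowel_toy$Hb_cond <- ifelse(is.na(bowel_toy$Hb), group_median, bowel_toy$Hb)

bowel_toy
```

Keeping the imputed values in new columns (`Hb_uni`, `Hb_cond`) rather than overwriting Hb makes the later comparison between the two imputations straightforward.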