Exam 1
INFO 2950 – Spring 2023
Important
This exam is due Monday, March 6 at 11:59pm ET. There will be a fifteen-minute
grace period for students who wait until the last minute to submit on Gradescope. Any
submissions received after 12:14am ET will receive an automatic 20% deduction. Submissions will not be accepted after 11:59pm ET on March 7.
Overview
The exam covers all techniques taught thus far in the class. It consists of four (4) data analysis
exercises in R/RStudio and is similar in structure to a homework assignment.
Important
• See the course site for all the requirements in terms of academic integrity.
• You must submit a rendered PDF of your submission. The code must work in order
to render. If you are unable to fully complete an exercise, partially complete
submissions are highly encouraged. Better to earn some credit rather than
none.
• If you have clarification questions as you complete the exam, email info2950@cornell.edu.
Getting started
• Go to the info2950-s23 organization on GitHub. Click on the repo with the prefix
exam-01. It contains the starter documents you need to complete the exam.
• Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details
on cloning a repo and starting a new R project.
Workflow + formatting
Make sure to
• Update author name on your document.
• Label all code chunks informatively and concisely.
• Follow the Tidyverse code style guidelines.
• Make at least 3 commits.
• It is assumed that all visualizations will follow best practices as learned in the class. This
includes (but is not limited to):
– Resize figures where needed, avoid tiny or huge plots.
– Use informative labels for plot axes, titles, etc.
– Adopt optimal color palettes when a variable is mapped to the color or fill
aesthetic.
• Turn in an organized, well formatted document.
Packages
We’ll use the tidyverse package for much of the data wrangling and visualization, the scales
package for better formatting of labels on visualizations, and lubridate for working with date
and time columns.
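As a quick orientation, the sketch below loads the three packages and calls one helper from scales and lubridate. The values are made up for illustration and are not part of the exam data.

```r
library(tidyverse)  # data wrangling and visualization (dplyr, tidyr, ggplot2, ...)
library(scales)     # label formatting helpers
library(lubridate)  # date and time helpers

# scales: format a proportion as a percentage label
pct_label <- label_percent()(0.25)

# lubridate: parse a date string and pull out its components
due_date <- ymd("2023-03-06")
due_year <- year(due_date)
due_month <- month(due_date)
```

Here pct_label is a character string suitable for an axis label, and due_year and due_month are plain numbers.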
Part 1: Tidying messy data
data/wb contains a set of CSV files with data from World Bank Open Data. Each file contains
data on a different country that is a current member of the UN Security Council. All of the
files share the exact same structure.
Exercise 1
For current UN Security Council members, how has infant mortality changed
over time? Construct a faceted line chart to visualize the relevant data, then provide a brief
(no more than one paragraph) answer to the question. Utilize your graph to support your
answer.
Tip
• The data is split across 15 CSV files. Fortunately, as long as every file contains the
same structure (which they do), you can specify the file argument of read_csv()
as a character vector containing the relative file paths for each of the CSV files. Provide that character vector as the file argument to import all 15 files simultaneously.
For this exercise, that would be something like
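# list every CSV file in data/wb, keeping the folder in each path
wb_files <- list.files(
  path = "data/wb",
  # pattern is a regular expression, not a glob: match names ending in .csv
  pattern = "\\.csv$",
  full.names = TRUE
)
wb_files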
 [1] "data/wb/API_ALB_DS2_en_csv_v2_4784591.csv"
 [2] "data/wb/API_ARE_DS2_en_csv_v2_4799288.csv"
 [3] "data/wb/API_BRA_DS2_en_csv_v2_4782858.csv"
 [4] "data/wb/API_CHE_DS2_en_csv_v2_4771312.csv"
 [5] "data/wb/API_CHN_DS2_en_csv_v2_4773379.csv"
 [6] "data/wb/API_ECU_DS2_en_csv_v2_4783022.csv"
 [7] "data/wb/API_FRA_DS2_en_csv_v2_4782878.csv"
 [8] "data/wb/API_GAB_DS2_en_csv_v2_4801532.csv"
 [9] "data/wb/API_GBR_DS2_en_csv_v2_4784641.csv"
[10] "data/wb/API_GHA_DS2_en_csv_v2_4783027.csv"
[11] "data/wb/API_JPN_DS2_en_csv_v2_4775881.csv"
[12] "data/wb/API_MLT_DS2_en_csv_v2_4782026.csv"
[13] "data/wb/API_MOZ_DS2_en_csv_v2_4782930.csv"
[14] "data/wb/API_RUS_DS2_en_csv_v2_4783339.csv"
[15] "data/wb/API_USA_DS2_en_csv_v2_4782210.csv"
• The source data files have an unusual structure. You may need to adjust your
read_csv() parameters to successfully import them.
• The original data files are untidy. After successfully importing them, you will need
to tidy them before you can construct a plot. Your tidied data frame should contain
930 rows.
• The specific variable you need to use is "Mortality rate, infant (per 1,000 live
births)". Its code in the dataset is "SP.DYN.IMRT.IN".
• Ideally, facets are arranged in a meaningful order. In the example below, the facets
are ordered based on each country's first observed infant mortality rate (highest to
lowest).
• The chart below is color-coded to distinguish permanent members from nonpermanent members of the Security Council.
• Follow best practices in constructing your visualization.
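The import-then-tidy pattern described in these tips can be sketched on toy data. Everything below is an assumption made for illustration: the folder toy_wb, the file names, the country names, the two metadata lines, and the column layout are invented and do not reproduce the real World Bank files, so the skip value and column selection will likely differ for the exam data.

```r
library(tidyverse)

# Write two toy files to a temporary folder (hypothetical stand-ins for data/wb)
toy_dir <- file.path(tempdir(), "toy_wb")
dir.create(toy_dir, showWarnings = FALSE)

write_lines(
  c(
    "Toy metadata line",
    "Another metadata line",
    "Country Name,Indicator Code,1960,1961",
    "Aland,SP.DYN.IMRT.IN,100.5,98.2",
    "Aland,SP.POP.TOTL,1000,1010"
  ),
  file = file.path(toy_dir, "toy_a.csv")
)
write_lines(
  c(
    "Toy metadata line",
    "Another metadata line",
    "Country Name,Indicator Code,1960,1961",
    "Borduria,SP.DYN.IMRT.IN,80.1,79.4",
    "Borduria,SP.POP.TOTL,2000,2020"
  ),
  file = file.path(toy_dir, "toy_b.csv")
)

toy_files <- list.files(path = toy_dir, pattern = "\\.csv$", full.names = TRUE)

# Import all files at once; skip = 2 drops the two toy metadata lines
# at the top of each file (an assumption that fits the toy files only)
toy_raw <- read_csv(toy_files, skip = 2, show_col_types = FALSE)

# Tidy: one row per country-year, keeping only the infant mortality indicator
infant_long <- toy_raw |>
  filter(`Indicator Code` == "SP.DYN.IMRT.IN") |>
  pivot_longer(
    cols = matches("^\\d{4}$"),            # columns named like a four-digit year
    names_to = "year",
    values_to = "value",
    names_transform = list(year = as.integer)
  )
```

With two toy countries and two toy years, infant_long has four rows. The same row-count check is a useful sanity test against the expected size of your own tidied data frame.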
Your visualization might look like this:
[Figure: faceted line chart titled "Infant mortality rates over time" / "For current UN Security Council members". Panels: France, United Kingdom, Switzerland, Japan, Russian Federation, United States, China, Albania, Malta, Ghana, Ecuador, Gabon, Mozambique, United Arab Emirates, Brazil. x-axis: Year (1960 to 2020). y-axis: Mortality rate, infant (per 1,000 live births) (0 to 150). Color legend: Status (Non-permanent member, Permanent member). Source: The World Bank.]
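One possible way to order facets by each group's first observed value is to convert the faceting variable to a factor with forcats. The sketch below uses invented countries and rates; fct_reorder() with dplyr's first() is only one of several approaches, and it relies on the rows being sorted by year within each country.

```r
library(tidyverse)

# Toy data: three made-up countries observed in two years
toy_rates <- tribble(
  ~country,     ~year, ~rate,
  "Aland",       1960,    40,
  "Aland",       2020,     5,
  "Borduria",    1960,   150,
  "Borduria",    2020,    30,
  "Costaguana",  1960,    90,
  "Costaguana",  2020,    12
) |>
  # sort so that the first row per country is its earliest observation
  arrange(country, year) |>
  # order factor levels by that first value, highest to lowest
  mutate(country = fct_reorder(country, rate, .fun = first, .desc = TRUE))

# facet_wrap() lays out panels in factor-level order
facet_plot <- ggplot(toy_rates, aes(x = year, y = rate)) +
  geom_line() +
  facet_wrap(vars(country)) +
  labs(x = "Year", y = "Rate (toy values)")
```

Printing facet_plot draws the panels in the order Borduria, Costaguana, Aland, matching their 1960 values from highest to lowest.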
Tip
Now is a good time to render, commit (with a descriptive and concise commit message),
and push again. Make sure that you commit and push all changed documents and your
Git pane is completely empty before proceeding.
Part 2: Wrangling and visualizing messy(ish) data
The Supreme Court Database contains detailed information about every published decision of the
U.S. Supreme Court since its creation in 1791. It is perhaps the most utilized database in the
study of judicial politics.
In the repository's data folder, you will find two data files:
1. scdb-case.csv
2. scdb-vote.csv
These contain the exact same data you would obtain if you downloaded the files from the
original website, but reformatted to be stored as relational data files. That is, scdb-case.csv
contains all case-level variables, whereas scdb-vote.csv contains all vote-level variables.
The data is structured in a tidy fashion.
• scdb-case.csv contains one row for every case and one column for every variable
• scdb-vote.csv contains one row for every vote by a justice in every case and one column
for every variable
The current dataset contains information on every case decided from the 1791-2021 terms.[1]
There are several ID variables which can be used to join the data frames, specifically caseId,
docketId, caseIssuesId, and term. Variables you will want to familiarize yourself with
include:
• dateDecision
• decisionType
• direction
• issueArea
• justice
• justiceName
• majVotes
• minVotes
• term
Note
Each variable above is linked to the relevant documentation page in the online code book.
Once you import the data files, use your data wrangling and visualization skills to answer the
following exercises.
[1] Terms run from October through June, so the 2021 term contains cases decided from October 2021 – September
2022.
Tip
Pay careful attention to the unit of analysis required to answer each question. Some questions only require case-level variables, others only require vote-level variables, and some
may require combining the two data frames together. Be sure to choose an appropriate
relational join function as necessary.
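To illustrate the unit-of-analysis point, here is a minimal sketch of attaching case-level variables to vote-level rows. The two tibbles are invented, the justice names are placeholders, and the direction values are arbitrary toy codes rather than the database's documented coding. Only the column names come from the variable list above.

```r
library(tidyverse)

# Toy case-level table: one row per case
toy_cases <- tribble(
  ~caseId, ~term, ~majVotes, ~minVotes,
  "c1",     2000,         5,         4,
  "c2",     2000,         9,         0
)

# Toy vote-level table: one row per justice per case
toy_votes <- tribble(
  ~caseId, ~justiceName, ~direction,
  "c1",    "JusticeA",            1,
  "c1",    "JusticeB",            2,
  "c2",    "JusticeA",            1
)

# Vote-level question that needs a case-level variable:
# keep every vote and attach the matching case columns
votes_with_case <- left_join(toy_votes, toy_cases, by = "caseId")

# Case-level question: no join needed, work from the case table directly
cases_per_term <- count(toy_cases, term)
```

The joined table keeps the vote as its unit of analysis (three rows), while cases_per_term stays at the case level. Which join function is appropriate for a given exercise depends on which rows you need to keep.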
Exercise 2
How does the percentage of cases in each term that are decided by a one-vote margin
(i.e. 5-4, 4-3, etc.) change over time? Generate an appropriate visualization, then provide
a brief (no more than one paragraph) answer to the question. Utilize your graph to support
your answer.
Your visualization could look like this:
[Figure: chart titled "Percent of U.S. Supreme Court cases decided by 1-vote margin". x-axis: Term (1800 to 2000). y-axis: Percent of total cases decided (0% to 50%). Source: The Supreme Court Database.]
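A grouped proportion of this kind can be sketched as the mean of a logical condition within each group. The toy table below is invented; it assumes, for illustration only, that a one-vote margin can be identified as majVotes minus minVotes equal to 1.

```r
library(tidyverse)
library(scales)

# Toy case-level data: two terms with four cases each
toy_cases <- tribble(
  ~term, ~majVotes, ~minVotes,
  1990,          5,         4,
  1990,          9,         0,
  1990,          6,         3,
  1990,          5,         4,
  1991,          7,         2,
  1991,          4,         3,
  1991,          8,         1,
  1991,          9,         0
)

# The mean of a TRUE/FALSE vector is the share of TRUE values
one_vote <- toy_cases |>
  group_by(term) |>
  summarize(pct_one_vote = mean(majVotes - minVotes == 1))

# Percent-formatted axis labels via scales
one_vote_plot <- ggplot(one_vote, aes(x = term, y = pct_one_vote)) +
  geom_line() +
  scale_y_continuous(labels = label_percent()) +
  labs(x = "Term", y = "Percent of total cases decided")
```

In the toy data, half of the 1990 cases and a quarter of the 1991 cases have a one-vote margin.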
Tip
Once again, render, commit, and push. Make sure that you commit and push all changed
documents and your Git pane is completely empty before proceeding.
Exercise 3
For justices currently serving on the Supreme Court, how often have they voted
in the conservative direction in cases involving criminal procedure, civil rights,
economic activity, and federal taxation? Generate an appropriate visualization, then
provide a brief (no more than one paragraph) answer to the question. Utilize your graph to
support your answer.
Tip
• The Supreme Court’s website maintains a list of active members of the Court.
Retired justices should not be included in your analysis.
• Make sure to organize the resulting graph by justice in descending order of seniority.
Seniority is based on when a justice is appointed to the Court, so the justice who
has served the longest is the most “senior” justice.
• Note that the chief justice is always considered the most senior member of the court,
regardless of appointment date.
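The seniority rule in these tips (chief justice first, then by appointment date) can be encoded as factor levels. The sketch below uses placeholder names, invented dates, and a hypothetical is_chief column. None of these come from the dataset, so you would need to supply the real appointment information yourself.

```r
library(tidyverse)
library(lubridate)

# Toy lookup table: placeholder justices with made-up appointment dates
toy_justices <- tribble(
  ~justiceName, ~appointed,    ~is_chief,
  "JusticeA",   "1995-06-01",  FALSE,
  "JusticeB",   "2005-09-29",  TRUE,
  "JusticeC",   "2010-08-07",  FALSE
) |>
  mutate(appointed = ymd(appointed)) |>
  # chief justice first (TRUE sorts before FALSE with desc()), then earliest appointment
  arrange(desc(is_chief), appointed) |>
  # freeze the current row order as the factor level order
  mutate(justiceName = fct_inorder(justiceName))

seniority_levels <- levels(toy_justices$justiceName)
```

Here JusticeB comes first despite a later appointment date than JusticeA, because of the chief flag. Plots that map justiceName to an axis or facet will follow this level order.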
Your visualization might look like one of these:
[Figure, option 1: chart faceted by justice (JGRoberts, CThomas, SAAlito, SSotomayor, EKagan, NMGorsuch, BMKavanaugh, ACBarrett), with issue area (Criminal Procedure, Civil Rights, Economic Activity, Federal Taxation) on the y-axis and "Percent of votes cast" (0% to 100%) on the x-axis. Titles: "U.S. Supreme Court" / "Percent of cases decided in a conservative direction". Source: The Supreme Court Database.]
[Figure, option 2: chart faceted by issue area (Criminal Procedure, Civil Rights, Economic Activity, Federal Taxation), with justice (JGRoberts, CThomas, SAAlito, SSotomayor, EKagan, NMGorsuch, BMKavanaugh, ACBarrett) on the y-axis and "Percent of votes cast" (0% to 100%) on the x-axis. Titles: "U.S. Supreme Court" / "Percent of cases decided in a conservative direction". Source: The Supreme Court Database.]
Tip
Once again, render, commit, and push. Make sure that you commit and push all changed
documents and your Git pane is completely empty before proceeding.
Exercise 4
In each term, how many of the term’s decisions (decided after oral arguments)
were announced in a given month? Generate an appropriate visualization, then provide
a brief (no more than one paragraph) answer to the question. Utilize your graph to support
your answer.
Tip
• Most, but not all, of the Court’s decisions are published following a set of oral
arguments. One of the variables in the dataset indicates how the Court arrived
at its decision. Any case which is explicitly labeled as “orally argued” should be
included on this graph.
• The Supreme Court's calendar runs on the federal government's fiscal year. That
means the first month of the court's term is October, running through September
of the following calendar year.
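Ordering months by the Court's calendar rather than the calendar year can be handled with explicit factor levels. The sketch below uses invented decision dates that are already in ISO format. The format of the real dateDecision column is not assumed here, so a different lubridate parser may be needed.

```r
library(tidyverse)
library(lubridate)

# Month names in term order: October first, September last
term_months <- c(month.name[10:12], month.name[1:9])

# Toy decisions with made-up dates
toy_decisions <- tibble(
  term = c(2020, 2020, 2020, 2021),
  dateDecision = ymd(c("2020-11-09", "2021-06-17", "2021-06-25", "2021-10-18"))
) |>
  mutate(
    # month() returns 1-12; index into base R's month.name for the label
    month_name = month.name[month(dateDecision)],
    term_month = factor(month_name, levels = term_months)
  )

# One row per term-month with the number of decisions announced
decisions_per_month <- count(toy_decisions, term, term_month)
```

Because term_month is a factor, count() and ggplot2 both respect the October-to-September order. In the toy data, November sorts ahead of June within the 2020 term.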
A plot similar to the one below would be ideal:
[Figure: plot titled "U.S. Supreme Court" / "Number of decisions announced post-oral arguments per month, by term". y-axis: month, October through September. x-axis: Number of decisions announced in a term-month (0 to 80). Source: The Supreme Court Database.]
Tip
Render, commit, and push one last time. Make sure that you commit and push all
changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
• Go to http://www.gradescope.com and click Log in in the top right corner.
• Click School Credentials → Cornell University NetID and log in using your NetID credentials.
• Click on your INFO 2950 course.
• Click on the assignment, and you’ll be prompted to submit it.
• Mark all the pages associated with each exercise. All the pages of your submission should be
associated with at least one question (i.e., should be "checked").
Grading
• Exercise 1: 15 points
• Exercise 2: 10 points
• Exercise 3: 10 points
• Exercise 4: 15 points
• Total: 50 points