Numerical Linear Algebra with Applications

Numerical Linear Algebra with Applications S23
HW 4
Due March 20
1. This exercise involves Tikhonov regularization, which uses the error function
ET (v) = jjAv – yjj2 2 + λjjDvjj2 2;
where D is a given n × m matrix. Note that this reduces to ridge regularization if D = I.
(a) Show that the minimizer of ET (v) is found by solving (AT A + λDT D)v = AT y.
(b) Tikhonov regularization can be used to smooth the curve shown in Figure 1. The points on
this curve are (xi; yi), for i = 1; 2; · · · ; n, where the xi’s are equally space. Now, the curvature
at xi is approximately (yi+1 – 2yi + yi–1)=h2, where h is the xi spacing. Less noisy curves are
associated with smaller values of the curvature, which means we want to reduce the value of
jyi+1–2yi +yi–1j=h2. The objective is to find v that is close to y but which also has curvature
values that are not too big. Explain why this can be achieved by finding the minimizer of
ET (v), where A = I is n × n, and D is the (n – 2) × n matrix given as
D =
0BBBB@
1 –2 1
1 –2 1
.
.
.
.
.
.
.
.
.
1 –2 1
1CCCCA
:
Note that the h4 term has been absorbed into the λ.
For part (c) you need curve data, which you generate. The curve in Figure 1 was obtained
using the MATLAB commands (with n = 80):
xd=linspace(0,1,n); yd=exp(-10*(xd-0.6).^2).*cos(6*pi*xd); yd=yd+0.4*(rand(1,n)-0.5);
(c) On the same axes, plot the noisy and smoothed versions of the curve. What value of λ did
you use and why did you make this choice? In answering this question you should include
two additional plots: one when λ = λ=20 and one when λ = 20λ.
0 0.2 0.4 0.6 0.8 1
x-axis
-1
-0.5
0
0.5
1
y-axis
Figure 1: Noisy curve that is smoothed using Tikhonov regularization in Exercise 1.
MATLAB portion to turn in 1(c): The three plots and a copy of your code.
2. The question considered in this problem is: can you predict the number of new cases of COVID
each day in a county based on the number of new cases in the other counties of the state? To
answer this, suppose there are m counties in the state, and the respective number of new COVID
cases are x1, x2, · · · , xm. If xm refers to the county of interest, then the assumption (i.e., the model
function) is
xm = v1x1 + v2x2 + · · · + vm–1xm–1 :
a) Suppose the values of the xj’s are measured every day for n days. Let xij be the number of
new COVID cases for county xj on the ith day. The error function is still E(v) = jjAv – yjj2 2,
but now A is n×(m–1) and y is a n-vector. What are A and y in this problem in terms of the xij’s?
You need two data matrices T (the training data) and S (the testing data). How to obtain these
matrices is explained on the next page (which are for New York state with xm corresponding to
Rensselaer County).
b) Once T and S are known, then A=T(:,1:m-1); and y=T(:,m);. Compute v using the MoorePenrose pseudo-inverse.
c) For the testing data set, the predicted values for xm are: yp=S(:,1:m-1)v;. Also, the measured
values for xm are: ym=S(:,m);. Plot yp versus ym (this should be a scatter plot). Include in this
plot the 45o line (if the yp’s are correct they should lie on this line).
d) Redo parts (b) and (c) but use the QR approach to compute v.
e) The agreement between the measured values and the predicted values is really poor, really good,
or something in-between. Which is it? Try to explain this result. For example, is it realistic to
expect that you can predict what is happening on any given day in county xm using what is happening in other places on that same day across the state? Keep in mind that the counties in the
state vary significantly (some very rural, some densely populated, some with different rules about
quarantining, etc).
f) Comment on any differences between the predictions using the Moore-Penrose pseudo-inverse
and the QR approach.
MATLAB portion to turn in 2(b-d): The two plots, a copy of your code used to answer (b), and
the command window giving the v values.
Steps to obtain the data matrices T and S
In the COVID data spreadsheet, remove everything except the data for the counties for New York
state from 5/1/20 onwards. You are to then read the data into MATLAB, and then transform the
data into the format assumed in the model. A step by step approach to this is below.
Step 1: Either use the included spreadsheet (which was downloaded on 12/2/22) or download the
current data for CASES from the website:
https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
The rows in this spreadsheet list the counties for each state and the columns give the daily cumulative number of COVID cases for the respective county (starting with 1/22/20). Delete the columns
for all dates from 1/22/20 to 4/30/20 (so the first date is now 5/1/20). After this remove all rows
except those for the counties of NY. Let the row number for Rensselaer County be ic. With this,
then remove the first 3 columns so the first column now corresponds to the cases on 5/1/20. Save
the resulting spreadsheet as a tab delimited text” file. In what follows it’s assumed that this file
is named data.txt.
Step 2: Read in the data file into MATLAB using the command: D=importdata(’data.txt’);.
The columns of D are the daily cumulative number of cases in each county. You need N, which
is the matrix giving the number of new cases. If nj is the jth column of N, and dj is the jth
column of D, then nj = dj+1 – dj (so, N has one less column than D). In MATLAB this can
be accomplished using the command: N(:,j)=D(:,j+1)-D(:,j);. After determining N, take its
transpose, and then interchange column ic with column m (so now the data for Rensselaer County
is in the last column). Call the resulting matrix X.
Step 3: Split the rows of X into a training set T, which consists of a random selection of approximately 2/3 of the rows from X, and a testing set S which are the other 1/3. In MATLAB, this
can be done as follows:
rows=randperm(n);
nn=round(2*n/3);
T=X(rows(1:nn),:);
S=X(rows(nn+1:n),:);
Step 4: Center the data based on the daily average for the training data. The MATLAB commands
for this are:
meanT=mean(T);
nT=length(T(:,1));
for j=1:m

a
a
end

T(:,j)=T(:,j)-meanT(j)*ones(nT,1);
S(:,j)=S(:,j)-meanT(j)*ones(n-nT,1);