UNCORRECTED PROOF
Chapter 8
Optimization: Regression
0 The problem central to this chapter, and the one that follows, is easy to state:
1 given a function F(v), find the point vm ∈ Rn where F achieves its minimum
2 value. This can be written as
F(vm) = min
v∈Rn
3 F(v). (8.1)
4 The point vm is called the minimizer or optimal solution. The function F(v)
5 is referred to as the objective function, or the error function, or the cost
6 function (depending on the application). Also, because the minimization is
7 over all of Rn, this is a problem in unconstrained optimization.
8 This chapter concerns what is known as regression, which involves finding
9 a function that best fits a set of data points. Typical examples are shown
10 in Figure 8.1. To carry out the regression analysis, you need to specify two
11 functions: the model function and the error function. With this you then need
12 to specify what method to use to calculate the minimizer.
Figure 8.1 Examples of regression fits from Schutt and Moseley [2020] (L), and
Lusso, E. et al. [2019] (R).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. H. Holmes, Introduction to Scientific Computing and Data Analysis,
Texts in Computational Science and Engineering 13,
https://doi.org/10.1007/978-3-031-22430-0 8
1
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
2 8 Optimization: Regression
13 8.1 Model Function
14 We begin with the model function, which means we specify the function that
15 will be fitted to the data. Examples are:
16 • Linear Regression
17 g(x) = v1 + v2x (8.2)
18 g(x) = v1 + v2x + v3x2 + v4x3
19 g(x) = v1ex + v2e–x
g(x) = v1
1 + x
20 + v2√1 + x
21
22 • Non-linear Regression
23 g(x) = v1 + v2ev3x – asymptotic regression function (8.3)
g(x) = v1x
v2 + x
24 – Michaelis-Menten function (8.4)
g(x) = v1
25 26 1 + v2 exp(–v3x) – logistic function (8.5)
27 In the above functions, the vj’s are the parameters determined by fitting the
28 function to the given data, and x is the independent variable in the problem.
29 What determines whether the function is linear or nonlinear is how it depends
30 on the vj’s. Specifically, linear regression means that the model function g(x)
31 is a linear function of the vj’s. If you are unsure what this means, a simple
32 test for linear regression is that for every parameter vj, the derivative
∂g
33 ∂vj
34 does not depend on any of the vi’s. If even one of the derivatives depends on
35 one or more of the vi’s then the function qualifies for non-linear regression.
36 The question often arises as to exactly which model function to use. Typ-
37 ically, its dependence on x is evident from the data. For example, in Figure
38 8.1, the spindle length data looks to be approximately linear, and so one
39 would be inclined to use (8.2). The function also might be known from the-
40 oretical considerations. Examples of linear relationships as in (8.2) include
41 Hooke’s law, Ohm’s law and Fick’s law. The nonlinear functions can also
42 come from mathematical models, and because of this they are often given
43 names connected from the original problem they came from. For example,
44 the one in (8.4) gets its name from a model for an enzyme catalyzed reaction
45 in biochemistry known as the Michaelis-Menten mechanism Holmes 2019.
46 The one in (8.5) gets its name from the fact that it is the solution of the
47 logistic equation given in (7.6).
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.2 Error Function 3
48 8.2 Error Function
49 The next decision concerns the error, or objective, function, which is to be
50 used to fit the data. This is not a simple question and the choice can have a
51 significant affect on the final solution. To illustrate the situation, three data
52 points (x1, y1), (x2, y2), (x3, y3), and a potential linear fit to these points, are
53 shown in Figure 8.2. To measure the error, one can use the vertical distance
54 di, as in the left figure, or the true distances Di, as in the right figure. It is
55 easy to determine the vertical distance, and it is
56 di = |g(xi) – yi|.
57 The formula for the true distance depends on the model function. In what
58 follows we will concentrate on the linear function (8.2), in which case
59 Di = 1 + di v22 .
60 Both of these are used. The di’s are used in traditional least squares
61 data fitting, while the Di’s are used in what is called orthogonal regres-
62 sion. It’s necessary to be careful with the dimensional units using Di. For
63 example, the usual distance between two points requires the calculation
64 of D = (x0 – x1)2 + (y0 – y1)2. If the x’s are measured, say, in terms of
65 length, and the y’s are measured in terms of, say, time then the formula
66 for D has no meaning. The simplest way to avoid this problem is to scale,
67 or nondimensionalize, the variables before using orthogonal regression. An
68 example of this can be found in Section ??.
69 8.2.1 Vertical Distance
70 The usual reason for data fitting is to obtain a function g(x) that is capable
71 of giving a representative, or approximate, value of the data as a function of
72 x. This is a situation where using the vertical distances is appropriate. To
Figure 8.2 Different measures of error for the model function g(x) = v1 + v2x.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
4 8 Optimization: Regression
73 examine how this can be done, we start with the linear model g(x) = v1 + v2x.
74 Suppose there are three data points, as is illustrated in Figure 8.2. We want
75 to find the values of v1 and v2 that will minimize d1, d2 and d3. One way this
76 can be done is to use the maximum error, which is
E∞(v1, v2) = max{d1, d2, d3} = max
i=1,2,3
77 |g(xi) – yi|
78 79 = ||g – y||∞ , (8.6)
80 where y = (y1, y2, y3)T , g = (g(x1), g(x2), g(x3))T , and the ∞-norm is defined
81 in Section (3.5). Other choices are the sum
E1(v1, v2) = d1 + d2 + d3 =
3 i
=1
82 |g(xi) – yi|
83 84 = ||g – y||1 , (8.7)
85 and the sum of squares
E2(v1, v2) = d2 1 + d2 2 + d2 3 =
3 i
=1
86 [g(xi) – yi]2
87 88 = ||g – y||2 2 . (8.8)
89 An important observation here is that a vector norm provides a natural way
90 to measure the error.
91 Of the three error functions given above, E2 is the one most often used,
92 producing what is called least squares. This function has the advantage of
93 being differentiable if g(xi) is a differentiable function of the vj’s. This is
94 illustrated in Figure 8.3 in the case of when g(x) = v1 + v2x. The same data
95 set is used in each plot. It is evident that E∞ has edges and corners, while
96 E2 is smooth. Another observation is that the location of the minimizer for
97 E∞ is different than it is for E2. In other words, the best fit values for v1
98 and v2 depend on how you measure the error. This will be discussed in more
99 detail in Section 8.5.
100 8.3 Linear Least Squares
101 The assumption is that the model function g(x) depends linearly on the
102 parameters to be fitted to the data. For example, it could be g(x) = v1 + v2x,
103 g(x) = v1ex + v2e–x, or g(x) = v1 + v2x3. Because g(x) = v1 + v2x is probably
104 the most used model function involving least squares, we will begin with it.
105 In what follows, the data are (x1, y1), (x2, y2), · · · , (xn, yn). If the model
106 function has m parameters, then it is assumed that m ≤ n.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.3 Linear Least Squares 5
-4 -2 0 2 4 6
v
1-axis
-2
8
6 4 2 0 v 2-axis |
6 4 2 0 v 2-axis |
-4 -2 0 2 4 6
v
1-axis
-2
8 Figure 8.3 Plots of E∞(v1, v2) and E2(v1, v2), both as a surface and as a contour plot,
when g(x) = v1 + v2x. The red * indicates the location of the minimizer.
107 8.3.1 Two Parameters
108 We begin with the case of when g(x) = v1 + v2x, which means that the asso-
109 ciated error function is
E(v1, v2) =
n i
=1
g(xi) – yi2 =
n i
=1
110 (v1 + v2xi – yi)2. (8.9)
111 To find the minimum, the first derivatives are
∂E
∂v1 = 2
n i
=1
112 (v1 + v2xi – yi) = 2nv1 + v2 i=1 n xi – i=1 n yi ,
113 and
∂E
∂v2 = 2
n i
=1
114 (v1 + v2xi – yi)xi = 2v1 i=1 n xi + v2 i=1 n x2 i – i=1 n xiyi .
115 Setting these to zero yields the system of equations
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
6 8 Optimization: Regression
116 nxi xx2 ii vv1 2 = xyiyi i . (8.10)
117 This system is called the normal equation for the vj’s. Using the formula for
118 the inverse matrix in (3.26), we have that
119 vv1 2 = n x2 i –1 xi 2 –xx2 ii –nxi xyiyi i . (8.11)
120 With this, we have solved the least squares problem when using a linear
121 model to fit the data.
122 Examples 8.1
123
124 To find the linear fit to the data in Table 8.1, note that n = 4, xi = 2,
125 x2 i = 6, yi = 3, and xiyi = –2. In this case, (8.11) gives
126 vv1 2 = 20 1 –6 2 4 –2–3 2
=
1
127 128 10 –11 7 .
129 The linear fit g(x) = 10 1 (11 – 7x), as well as the data, are shown in Figure
130 8.4.
131 A couple of mathematical comments are in order. First, the solution in
132 (8.11) requires that the denominator be nonzero. It is not hard to prove
133 that this holds except when all the xi’s are the same (also recall that we
134 are assuming n ≥ 2). Geometrically this makes sense because if all of the
135 data points fall on a vertical line, then it is impossible to write the line as
136 g(x) = v1 + v2x.
137 Another comment concerns whether the solution corresponds to the min-
138 imum value. It is possible to determine this using the second derivative test,
139 but it is more informative to look at the situation geometrically. The error
140 function E(v1, v2) in (8.9) is a quadratic function of v1 and v2. The assump-
141 tions made on the xi’s mean that E(v1, v2) is an elliptic paraboloid that opens
upwards (i.e., it’s an upward facing bowl). It looks similar to the surface for
xi | –1 | 2 | 0 | 1 |
yi | 1 | –1 | 2 | 1 |
Table 8.1 Data used in Example 8.1.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.3 Linear Least Squares 7
-1 -0.5 0 0.5 1 1.5 2
x-axis
-1 |
2 1 y-axis |
0
Figure 8.4 Linear fit obtained using the least squares formula in (8.11).
142 E2 shown in Figure 8.3. In such cases, the critical point is the location of the
143 minimum point for the function. Looking at the error function as a surface in
144 this way can be used to derive some rather clever ways to find the minimizer,
145 and this is the basis of most of the methods derived in the next chapter.
146 8.3.2 General Case
147 The model function is now taken to have the general form
148 g(x) = v1φ1(x) + v2φ2(x) + · · · + vmφm(x), (8.12)
149 and the error function is
E(v1, v2, · · · , vm) =
n i
=1
150 g(xi) – yi2. (8.13)
151 Letting v = (v1, v2, · · · , vm)T and g = (g(x1), g(x2), · · · , g(xm))T , then g =
152 Av, where
A =
⎛⎜⎜⎜⎝
φ1(x1) φ2(x1) · · · φm(x1)
φ1(x2) φ2(x2) · · · φm(x2)
…
…
…
φ1(xn) φ2(xn) · · · φm(xn)
⎞⎟⎟⎟⎠
153 . (8.14)
154 In this case, the above error function (8.13) can be written as
155 E(v) = ||Av – y||2 2. (8.15)
156 There are various ways to find the v that minimizes E, and we will begin
with the most direct method. It is assumed in what follows that the columns
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8 8 Optimization: Regression
157 of A are independent, which means that the matrix has rank m. It is also
158 assumed that m ≤ n.
159 Normal Equation
160 Using the calculus approach, one simply calculates the first derivatives of E
161 with respect to the vj’s, and then sets them to zero. Using the gradient, the
162 equation to solve is ∇vE(v) = 0, where ∇v ≡ ∂∂v1 , ∂∂v2 , · · · , ∂v∂m T .
163 To determine ∇vE(v), recall that ||x||2 2 = x • x and the dot product can
164 be written as u • w = uT w. So,
165 ||Av – y||2 2 = (Av – y) • (Av – y)
166 = (Av)T (Av) – 2(Av)T y + y • y
167 168 = vT (AT A)v – 2vT AT y + y • y.
169 Given a m-vector b and a m × m matrix B, it is straightforward to show that
170 ∇v vT b = b and ∇v vT Bv = (B + BT )v. (8.16)
171 Therefore, since AT A is symmetric, ∇vE(v) = 0 reduces to solving
172 (AT A)v = AT y. (8.17)
173 This is the m-dimensional version of the matrix problem in (8.10). As before,
174 it is called the normal equation for v. Because the columns of A are assumed
175 to be independent, it is possible to prove that the symmetric m × m matrix
176 AT A is positive definite. This means that the normal equation can be solved
177 using a Cholesky factorization. However, as will be explained later, there are
178 better ways to compute the solution of (8.17).
179 Examples 8.2
180
181 To fit a parabola to the data set (–1, 2), (0, –1), (1, 1), (1, 3), the model
182 function is g(x) = v1 + v2x + v3x2. So, φ1 = 1, φ2 = x, and φ3 = x2. Since
183 x1 = –1, x2 = 0, x3 = 1, and x4 = 1, then, from (8.14),
A =
⎛⎜⎜⎝
1 x1 x2 1
1 x2 x2 2
1 x3 x2 3
1 x4 x2 4
⎞⎟⎟⎠
=
⎛⎜⎜⎝
1 –1 1
1 0 0
1 1 1
1 1 1
⎞⎟⎟⎠
184 .
185 Solving (8.17) it is found that v1 = –1, v2 = 0, and v3 = 3.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.3 Linear Least Squares 9
186 Examples 8.3
187
188 In the case of when
189 g(x) = v1 + v2x + · · · + vmxm–1, (8.18)
190 then
A =
⎛⎜⎜⎜⎝
1 x1 x2 1 · · · xm 1 –1
1 x2 x2 2 · · · xm 2 –1
…
…
…
…
1 xn x2 n · · · xm n –1
⎞⎟⎟⎟⎠
191 . (8.19)
192 This is fit to the data shown in Figure 8.5, taking m = 4, 8, 12, 16. Looking
193 at the resulting curves, it is evident that the fit for m = 8 is better than for
194 m = 4, and the method fails for m = 12 and m = 16. The reason for failure
195 is because AT A is ill-conditioned in these two cases, and to check on this
196 the condition numbers are given in Table 8.2. This is not a surprise because
197 A strongly resembles the Vandermonde matrix derived in Chapter 5 when
198 taking the direct approach for polynomial interpolation.
199 Although the ill-conditioned nature of the matrix in the above example
200 is due to its connection with the Vandermonde matrix, the normal equa-
201 tion has a propensity to be ill-conditioned for other model functions. As
202 an indication of how bad this is, if A is a square diagonal matrix then
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 |
6 0 |
6 0 |
6 0
12
y-axis
m = 4
12
m = 8
0 0.2 0.4 0.6 0.8 1
x-axis
6 0
12
y-axis
m = 12
0 0.2 0.4 0.6 0.8 1
x-axis
12
m = 16
Figure 8.5 Fitting data with the polynomial (8.18) using the normal equation approach
(green curve), and the QR approach (red curve). In the top row the two curves are indistinguishable.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
10 8 Optimization: Regression
m | κ2(AT A) | κ2(Rm) |
4 8 12 16 |
8.87e+04 1.36e+12 6.57e+17 3.40e+19 |
3.38e+02 1.20e+06 1.14e+10 5.59e+14 |
Table 8.2 Condition number of the matrices that arise when fitting the polynomial (8.18)
to the data in Figure 8.5. The matrix Rm is obtained when using the QR approach, which
is explained in Section 8.3.3.
203 κ∞(AT A) = κ∞(A)2. So, if κ∞(A) = 108, then κ∞(AT A) = 1016. In this
204 case, even though A is not ill-conditioned, AT A certainly is. Fortunately,
205 there are other ways to solve the least squares problem, and one using a
206 QR factorization is discussed below. Another option is to use regularization,
207 which involves modifying the error function, and this is explained in Section
208 8.3.6.
209 8.3.3 QR Approach
210 It is possible to use the QR factorization to solve the least square problem.
211 To explain, as shown in Section 4.4.3, an invertible matrix B can be factored
212 as B = QR, where Q is an orthogonal matrix, and R is an upper triangular
213 matrix. The procedure used to find this factorization can be used on n × m
214 matrices if the matrix has m linearly independent columns. For the matrix
215 A in the least squares problem, this means that it is possible to factor it as
216 A = QR, where Q is an n × n orthogonal matrix and R is an n × m matrix
217 of the form
218 R = ROm , (8.20)
219 where Rm is an upper triangular m × m matrix and O is a (n – m) × m
220 matrix containing only zeros. With this, the least squares error function
221 becomes
222 E(v) = ||Av – y||2 2 = ||(QR)v – y||2 2
223 224 = ||Q Rv – QT y ||2 2.
225 Using the fact that orthogonal matrices have the property that ||Qx||2 =
226 ||x||2, it follows that
227 E(v) = ||Rv – QT y||2 2.
228 This can be rewritten as
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.3 Linear Least Squares 11
229 E(v) = ||Rmv – ym||2 2 + ||yn–m||2 2, (8.21)
230 where ym is the vector coming from the first m rows of QT y and yn–m is
231 the vector coming from the remaining n – m rows of QT y. The latter does
232 not depend on v. So, to minimize E(v) we need to minimize ||Rmv – ym||2 2.
233 This is easy as the minimum is just zero, and this happens when
234 Rmv = ym. (8.22)
235 The fact that Rm is upper triangular means that solving this equation is
236 relatively simple. It also needs to be pointed out that Rm is invertible as
237 long as the columns of A are independent.
238 One way to determine Rm is to apply the Gram-Schmidt procedure to the
239 columns of A to obtain a n × m matrix G. With this, ym = GT y and
240 Rm = GT A. (8.23)
241 This is usually the easiest way when solving the problem by hand (i.e., when
242 using exact arithmetic). When computing the solution, the Gram-Schmidt
243 procedure (either version) can have trouble producing an orthogonal matrix
244 when the condition number gets large (see Table 4.11 on page xxx). Conse-
245 quently, when floating-point arithmetic is involved, it is better to use the QR
246 factorization method described in Section 8.4. If this is done, so A = QR,
247 then the above formulas hold with G being the first m columns from Q.
248 | Examples 8.4 |
249 | |
250 | For the A given in (8.26), applying the modified Gram-Schmidt procedure |
251 to its columns, one obtains
G = 1
2
⎛⎜⎜⎝
1 –3/√5
1 3/√5
1 –1/√5
1 1/√5
⎞⎟⎟⎠
252 .
253 With this, and (8.23),
254 Rm = GT A = 2 1 0 √5 ,
255 and
256 ym = GT y = –7/3(2 /2√5) .
257 Solving Rmv = ym yields the same solution found in Example 8.1.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
12 8 Optimization: Regression
258 Example 8.3 (cont’d)
259 The model function is g(x) = v1 + v2x + · · · + vmxm–1, and the least squares
260 fits using the normal equation, and the QR approach, are shown in Figure
261 8.5. The two approaches produce basically the same result for m = 4 and
262 m = 8, but not for the two higher degree polynomials. Given the condition
263 numbers for Rm in Table 8.2, the QR solution is expected to be fairly accu-
264 rate for m = 12 and marginally accurate when m = 16. This is mentioned
265 because the significant over and under-shoots appearing in the QR solution
266 for m = 16 are not due to inaccuracies of the solution method but, rather, to
267 the model function. It is the similar to what happened when using a higher
268 degree polynomial for interpolation (e.g., see Figure 5.4 on page xxx). As
269 will be explained in Section 8.3.7, over- and undershoots like this are often
270 an indication that the data are being over-fitted.
271
272 8.3.4 Moore-Penrose Pseudo-Inverse
273 The assumption that the columns of A are independent means that AT A is
274 invertible. So, (8.17) can be written as
275 v = A+y, (8.24)
276 where
277 A+ ≡ (AT A)–1AT (8.25)
278 is known as the Moore-Penrose pseudo-inverse. It has properties similar to
279 those for an inverse matrix. For example, (A+)+ = A, (AB)+ = B+A+, and
280 if A is invertible then A+ = A–1.
281 Examples 8.5
282
283 In Example 8.1, φ1(x) = 1, φ2(x) = x, and from (8.14)
A =
⎛⎜⎜⎝
φ1(x1) φ2(x1)
φ1(x2) φ2(x2)
φ1(x3) φ2(x3)
φ1(x4) φ2(x4)
⎞⎟⎟⎠
=
⎛⎜⎜⎝
1 –1
1 2
1 0
1 1
⎞⎟⎟⎠
284 . (8.26)
285 Since
AT A = –1 1 1 1 1 2 0 1
⎛⎜⎜⎝
1 –1
1 2
1 0
1 1
⎞⎟⎟⎠
286 = 4 2 2 6 ,
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.3 Linear Least Squares 13
287 and
(AT A)–1 = 1
288 10 –31 2 –1 ,
289 it follows that
A+ = 1
290 10 –31 2 –1 –1 1 1 1 1 2 0 1 = 10 1 –4 1 3 2 3 3 3 1 .
291 It is easily verified that A+y yields the same solution as obtained in Example
292 8.1.
293 For by-hand calculations, (8.25), or the formula in Exercise 8.7(a) using
294 Gram-Schmidt to determine G, are usually the easiest ways to determine
295 A+. However, for computational problems it is better to either use the for-
296 mula in Exercise 8.7(a) using a Householder QR factorization (see Section
297 8.4) to determine G, or the SVD-based formula given in Exercise 8.7(b). If
298 Householder is used, then G is the first m columns from Q.
299 8.3.5 Overdetermined System
300 In data fitting the ideal solution is the one when the model function passes
301 through all data points. So, if g(x) = v1 + v2x, and the data points are
302 (x1, y1), (x2, y2), · · · , (xn, yn), then we would like
303 v1 + v2x1 = y1
304 v1 + v2x2 = y2
.. . = .. . |
305 | (8.27) |
306 | v1 + v2xn = yn. | |
307 |
308 Given that n > 2, we have more equations than unknowns. Consequently, it
309 is expected that there is no solution, and this means that the above system
310 is overdetermined.
311 Undeterred by the possibility that there is no solution, we write (8.27) as
312 Av = y, where
A =
⎛⎜⎜⎜⎝
1 x1
1 x2
…
…
1 xn
⎞⎟⎟⎟⎠
, v = vv1 2 , and y =
⎛⎜⎜⎜⎝
y1
y2
… yn
⎞⎟⎟⎟⎠
313 .
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
14 8 Optimization: Regression
314 Solving Av = y is equivalent to finding v that satisfies ||Av – y||2 2 = 0. With
315 data fitting, it is unlikely that there is a v that satisfies this equation. So,
316 instead, the goal is to find v that gets ||Av – y||2 2 as close to zero as possible
317 (i.e., it minimizes ||Av – y||2 2). This is the same function that appears in
318 (8.15), and so the solution is
319 v = A+y, (8.28)
320 where A+ is the Moore-Penrose pseudo-inverse defined in (8.25). In the ver-
321 nacular of the subject, (8.28) is said to be the solution of (8.27) in the least
322 squares sense, or the least squares solution.
323 Examples 8.6
324
325 Find the least-squares solution of
326 x – y + z = 2
327 x = –1
328 x + y + z = 1
329 330 x + y + z = 3
331 Answer: As shown in Exercise 8.7(a), A+ = R– m1GT . In this problem
A =
⎛⎜⎜⎝
1 –1 1
1 0 0
1 1 1
1 1 1
⎞⎟⎟⎠
, and y =
⎛⎜⎜⎝
2 –
1
13
⎞⎟⎟⎠
332 .
333 Applying Gram-Schmidt to the columns of A we get that
G =
⎛⎜⎜⎜⎜⎝
12
–
5
2√11
√22
11
12
–
1
2√11 –
2√22
11
12
3
2√11
1
√22
12
3
2√11
1
√22
⎞⎟⎟⎟⎟⎠
, and so Rm = GT A =
⎛⎜⎝
2 1
2
32
0 √11
2
1
2√11
0 0 2√22
11
⎞⎟⎠
334 .
335 After finding R– m1 you get that
A+ = R–1
336 m GT = ⎛ ⎝–0 1 0 0 1 21 2 –01 1 4 1 4 14 14 ⎞ ⎠ .
337 Evaluating A+y one finds that x = –1, y = 0, z = 3, which is equivalent to
338 the solution found in Example 8.2.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.3 Linear Least Squares 15
339 8.3.6 Regularization
340 Both fitting methods in Figure 8.5 produce unwieldy polynomials for larger
341 values of m. For example, when m = 16 the QR approach results in a polyno-
342 mial of the form (4 × 105)x15 – (2 × 107)x14 + (3 × 108)x13 – (3 × 109)x14 +
343 · · · . Such large coefficients, with alternating signs, means that the resulting
344 model function is very sensitive to both how accurately the vj’s are computed
345 as well as any potential errors in the data.
346 A way to reduce the likelihood of having this happen is to modify the error
347 function, and include a penalty for large values for the vj’s. The usual way
348 this is done is to take
349 ER(v) = ||Av – y||2 2 + λ||v||2 2, (8.29)
350 where λ > 0 is a parameter that you select. So, the larger the value of ||v||2,
351 the larger the penalty term contributes to the error function. This idea is
352 called regularization.
353 With (8.29), employing the same argument used to derive the normal
354 equation, you find that instead of (8.17), you now get
355 (AT A) + λIv = AT y. (8.30)
356 This is referred to as the regularized normal equation. So, regularization has
357 resulted in a minor change in the equation that must be solved. Moreover,
358 AT A + λI is not just invertible, it is positive definite (see Exercise 4.5) and
359 typically has a better condition number than AT A.
360 The big question is how the modified error function affects the fit to the
361 data. One way to answer this is to note that ER(v) approaches the error
362 function used in least squares as λ → 0. So, it is reasonable to expect that
363 the fit is similar to that obtained using least squares for small values of λ.
364 Another way to answer the question is to try it on our earlier example where
365 the normal equation approach failed.
366 Example 8.3 (cont’d)
367 The model function is g(x) = v1 + v2x + · · · + vmxm–1, and the least squares
368 fits using the normal equation, and the QR approach, are shown in Figure 8.5.
369 We are interested in m = 12 and m = 16 as these are values that the normal
370 equation has trouble with. Taking λ = 10–4, the fits using the regularized
371 normal equation are shown in Figure 8.6. The improvement is dramatic. The
372 regularized fit provides what looks to be a reasonable fit to the data and does
373 not contain the over- and undershoots that the QR solution has. Moreover,
374 the condition number for AT A + λI, for both m = 12 and m = 16, is about
375 5 × 104. In other words, it is better conditioned than the matrix used in the
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
16 8 Optimization: Regression
0 0.2 0.4 0.6 0.8 1
x-axis
6 y-axis |
6 |
0 | 0 |
12
m = 12
0 0.2 0.4 0.6 0.8 1
x-axis
12
m = 16
Figure 8.6 Fitting data with the polynomial (8.18) using the regularized normal equation
approach (8.30), by the black curves, and the QR approach (8.22), by the red curves. This
is the same data set used in Figure 8.5.
376 QR approach (see Table 8.2). Finally, you should also note that the two green
377 curves are very similar. The significance of this will be discussed in Section
378 8.3.7.
379
380 The regularization used here is known as ridge regularization. Other norms
381 can be used, and one possibility is to replace λ||v||2 2 with λ||v||1. This is called
382 lasso regularization. It is also possible to generalize this idea to get what is
383 called Tikhonov regularization, and this is explained in Exercise 8.14.
384 What is not obvious is what value to use for the regularization parameter
385 λ. One approach is to simply pick a few values and see what produces an
386 acceptable result (which is the method used in the above example). A more
387 systematic approach involves something called cross-validation and this will
388 be discussed a bit more in the next section.
389 The idea underlying regularization is an old one, and arises in other areas
390 in optimization. For example, it is used in the augmented Lagrangian method.
391 It is also related to the method of Lagrange multipliers, although how the
392 value of λ is determined is modified slightly.
393 8.3.7 Over-fitting the Data
394 We need to return to the question of picking the model function. In Example
395 3, the polynomial g(x) = v1 + v2x + · · · + vmxm–1 was considered. The data
396 used to do this, which are shown in Figures 8.5 and 8.7(top), are known
397 as the training data because they are used to determine the vj’s. What is
398 seen in Figure 8.7(bottom), by the blue curve, is that the larger you take
399 m, the better the fit (using the QR approach). In other words, the fitting
400 error decreases with m. If the smallest fitting error possible is your goal, then
401 presumably you should keep increasing m until it is achieved (assuming that
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.3 Linear Least Squares 17
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 x-axis 5 0 y-axis |
50 100 150 200 Error |
Training |
Testing |
10
Training Data
Testing Data
2 4 6 8 10 12 14 16
m-axis
0
Figure 8.7 Top: Training data and the testing data for Example 3. Bottom: The error
E(v) = ||Av – y||2 2 on the training data set, and the error using the trained model function
on the testing data set.
402 this is even possible computationally). In regression problems, which typically
403 involve noisy data, this usually leads to what is called over-fitting the data.
404 To explain why over-fitting the data is something to be avoided, fitting
405 a model function to a given data set is referred to as training. The usual
406 statement is that the optimal parameters are obtained by training the model
407 on the data. Now, suppose the experiment used to produce the training data
408 is run again, producing a new data set. This is known as testing data, and an
409 example is shown in Figure 8.7(top). The data are obtained by rerunning the
410 same code that produced the data in Figure 8.5. So, they are obtained using
411 the same generating function, but appear different because the 20 random xi
412 points are slightly different, and the noise is slightly different. The expectation
413 is that the trained model function should fit the testing data about as well
2 4 6 8 10 12 14 16
m-axis
0
50 100 Error |
Trainin Testing |
Figure 8.8 The resulting fitting error E(v) = ||Av – y||2 2 using the regularized solution
on the training data set (blue curve), and on the testing data set (red curve).
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
18 8 Optimization: Regression
414 as it does the training data. As shown in Figure 8.7(bottom), by the red
415 curve, for smaller values of m the fit to the testing data does improve with
416 m. However, unlike with the training data, for larger values of m the error
417 gets worse. The reason is that some of the testing data are falling in those
418 regions where the model function has significant under or over-shoots, and
419 this results in large errors (see Figure 8.6).
420 We are in the proverbial Goldilocks situation, where we want the model
421 function to be able to describe the qualitative behavior seen in the data, not
422 being too simple or too complex. But, determining what is “just right” can
423 be difficult. As it turns out, regularization can help with this decision. If you
424 look at Figure 8.6 it is evident that the two regularized solutions (i.e., the
425 two black curves) do not have over- and undershoots. Therefore, they should
426 be less susceptible to overfitting. The verification of this is given in Figure
427 8.8. To explain this plot, given m, the fitting parameters are obtained by
428 minimizing ER(v) given in (8.29). What we are interested in is the resulting
429 fitting error E(v) = ||Av – y||2 2. The resulting E(v) for the training data
430 is the blue curve in Figure 8.8. The red curve is the resulting E(v) for the
431 testing data. Unlike what happens in Figure 8.7(bottom), the regularized
432 solution shows a similar dependence on m for the testing data as it does for
433 the training data.
434 The question of what is the preferred model function still has not been
435 answered. For the polynomial functions g(x) = v1 + v2x + · · · + vmxm–1,
436 based on the training curve in Figure 8.8, you would take m = 6. The rea-
437 son is that any larger value of m (i.e., a more complex function) results in
438 almost no improvement in the fitting error. This conclusion is also consistent
439 with what happens with the testing curve. It should be pointed out that this
440 conclusion is relative to having taken λ = 10–4.
441 Using training and testing sets is a form of cross-validation. Because over-
442 fitting, and other choices that might bias the results, are ubiquitous in regres-
443 sion, numerous ways have been developed to address these problems. Besides
444 cross-validation [James et al., 2021; Liu and Dobriban, 2020], there is also
445 data augmentation and ensembling (such as boosting and bagging), etc. More
446 about these can be found in Ghojogh and Crowley [2019].
447 8.4 QR Factorization
448 The QR factorization came up when solving eigenvalue problems in Section
449 4.4.4, as well as solving least square problems in the previous section. What
450 is considered now is a better way to find the factorization than using Gram-
451 Schmidt. The result will be a procedure that is able to derive a QR factor-
452 ization for any matrix. And, unlike the LU factorization, it does not require
453 pivoting to work. What is does require is more computational work, approx-
454 imately twice that of LU.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.4 QR Factorization 19
455 The objective is, given a n × m matrix A, find a factorization of the form
456 A = QR, (8.31)
457 where Q is an n × n orthogonal matrix, and R is an n × m upper triangular
458 matrix. The exact form for R depends on the number of columns and rows
459 in A. For example, the following are possible:
R =
⎛⎜⎜⎜⎜⎝
x x x
0 x x
0 0 x
0 0 0
0 0 0
⎞⎟⎟⎟⎟⎠
, R =
⎛⎜⎜⎝
x x x x
0 x x x
0 0 x x
0 0 0 x
⎞⎟⎟⎠
, R =
⎛⎜⎜⎝
x x x x x
0 x x x x
0 0 x x x
0 0 0 x x
⎞⎟⎟⎠
460 .
461
462 m < n m = n m > n
463 In the above matrices, a x simply indicates a location for a possible nonzero
464 entry.
465 It should be pointed out that Q and R are not necessarily unique. We are
466 not looking for a way to find all of them, all we want is a method to find just
467 one of them.
468 8.4.1 Finding Q and R
469 The approach for finding the factorization is to construct orthogonal matrices
470 Q1, Q2, Q3, · · · so that the sequence R0 = A,
R1 = Q1R0 =
⎛⎜⎜⎜⎜⎜⎝
r11 r12 r13 · · · r1m
0 x x · · · x
0 x x · · · x
…
…
…
…
0 x x · · · x
⎞⎟⎟⎟⎟⎟⎠
471 , (8.32)
472
R2 = Q2R1 =
⎛⎜⎜⎜⎜⎜⎝
r11 r12 r13 · · · r1m
0 r22 r23 · · · r2m
0 0 x · · · x
…
…
…
…
0 0 x · · · x
⎞⎟⎟⎟⎟⎟⎠
473 , (8.33)
474 etc, will determine R after no more than m steps. So, what Q1 must do is
475 produce the zeros in the first column as indicated in (8.32). What Q2 must
476 do is produce the zeros in the second column, as indicated in (8.33), but also
477 preserve the first row and column in R1. A key observation is that if we can
478 find a way to determine Q1, then we can use the same method to determine
479 Q2. The reason is that the (n – 1) × (m – 1) block of x’s in (8.32) that Q2
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
20 8 Optimization: Regression
Figure 8.9 To transform c1, the blue vector, to c e1, the green vector, you can use a
rotation, on the left, or a reflection, on the right.
480 must operate on is just a smaller version of the n × m matrix that Q1 must
481 work on.
482 Finding Q1 and R1
483 The goal is to find a n × n orthogonal matrix Q1 so that
R1 = Q1A = | … 0 x x · · · x |
. |
⎛⎜⎜⎜⎝
r11 r12 r13 · · · r1m
0 x x · · · x
…
…
…
⎞⎟⎟⎟⎠
484 (8.34)
485 More specifically, if c1 is the first column of A, then we want to find Q1 so
486 that
Q1c1 =
⎛⎜⎜⎜⎝
r11
0…0
⎞⎟⎟⎟⎠
487 .
488 In what follows, this will be written as
489 Q1c1 = c e1, (8.35)
490 where e1 = (1, 0, 0, · · · , 0)T . The constant c in this equation is known. This
491 is because orthogonal matrices have the property that ||Qx||2 = ||x||2. Con-
492 sequently, from (8.35), |c| = ||c1||2. In other words, we can take c = ||c1||2,
493 or c = -||c1||2. Both work, and which one is used is not important at the
494 moment.
495 It should be mentioned that if c1 already has the form c e1, then there is
496 nothing to do and we simply take Q1 = I. In what follows it is assumed that
497 this is not the case.
498 Finding Q1 is not hard if you remember that orthogonal matrices are asso-
499 ciated with rotations and reflections. To take advantage of this, the equation
500 (8.35) is illustrated in Figure 8.9 for n = 2, where c1 is shown in blue and
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.4 QR Factorization 21
501 c e1 is shown in green. So, one possibility is to rotate c1 down to c e1, which
502 is shown on the left in Figure 8.9. Another choice, shown on the right, is a
503 reflection through the line that bisects the two vectors. Both methods are
504 used. Rotations give rise to what is known as Givens method. It turns out
505 that, from a computational point of view, it is better to use reflections. The
506 particular method that will be derived is known as Householder’s method.
507 Assuming the bisecting line is y = αx, then a straightforward calculation
508 shows that the reflection shown in Figure 8.9 is obtained using
Q1 = 1
509 1 + α2 1 –2α α α2 22–α 1 . (8.36)
510 This can be written as Q1 = I – P1, where
P1 = 2
511 1 + α2 –αα2 –1α .
512 Introducing the vector v = (–α, 1)T , then
P1 = 2
v • v
513 vvT ,
514 where vvT is an outer product. What is significant here is that the vector v
515 is orthogonal to the bisecting line (see Figure 8.9).
516 Householder Transformation
517 The orthogonal matrix Q1 in (8.36) is an example of what is called a House-
518 holder transformation. The definition of a Householder transformation H in
519 Rn is that it has the form
H = I – 2
v • v
520 vvT , (8.37)
521 where v is nonzero. This is the orthogonal matrix that corresponds to a
522 reflection through the hyperplane that contains the origin and has v as a
523 normal vector. In R2 a hyperplane is a line, and in R3 it is a plane.
524 We will satisfy (8.35) by taking Q1 = H, where
525 v = c1 – c e1. (8.38)
526 In this expression you can take c = ||c1||2 or c = -||c1||2 (both work). To
527 verify that v is normal to the bisecting hyperplane, note that because c1
528 and c e1 have the same length, then the vector c1 + c e1 is on the diagonal
that bisects c1 and c e1. It is easy to verify that v • (c1 + c e1) = 0. Since v
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
22 8 Optimization: Regression
529 is in the plane determined by c1 and c e1, it follows that it is normal to the
530 bisecting hyperplane.
531 With this choice for Q1, we have that
R1 = Q1A =
⎛⎜⎜⎜⎝
c x x · · · x
0 x x · · · x
…
…
…
…
0 x x · · · x
⎞⎟⎟⎟⎠
=
⎛⎜⎜⎜⎝
r
11 | r12 r13 · · · r1m |
0… 0 |
A1 |
⎞⎟⎟⎟⎠
532 . (8.39)
533 The first row of R1 is the first row in R, and A1 is a (n – 1) × (m – 1)
534 matrix.
535 Finding Q2 and R2
536 Our task now is to find an orthogonal matrix Q2 so that R2 = Q2R1 has
537 the form given in (8.33). To preserve the first row and column in R1, the
538 orthogonal matrix Q2 is taken to have the form
Q2 =
⎛⎜⎜⎜⎝
1 0 0
· · · |
H2 |
0
0…
0
⎞⎟⎟⎟⎠
539
540 where H2 is the (n – 1) × (n – 1) Householder matrix for the submatrix A1
541 given in (8.39). This means we want H2c1 = c e1, where c1 is now the first
542 column of A1. This is a smaller version of the problem we solved to find Q1.
543 As before, H2 is given in (8.37) with v = c1 – c e1, and this yields
H2A1 =
⎛⎜⎜⎜⎝
c x x · · · x
0 x x · · · x
…
…
…
…
0 x x · · · x
⎞⎟⎟⎟⎠
=
⎛⎜⎜⎜⎝
r
22 | r23 r24 · · · r2m |
0… 0 |
A2 |
⎞⎟⎟⎟⎠
544 . (8.40)
545 The first row of H2A1 is inserted into R2 as indicated in (8.33).
546 Finding Q and R
547 The steps outlined above are continued until all the columns of R are deter-
548 mined. To write out the general step, suppose Aj–1 is the submatrix in Rj–1
549 (where A0 = A). Letting Hj be the Householder matrix for Aj–1, then
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.4 QR Factorization 23
HjAj–1 =
⎛⎜⎜⎜⎝
c x x · · · x
0 x x · · · x
…
…
…
…
0 x x · · · x
⎞⎟⎟⎟⎠
=
⎛⎜⎜⎜⎝
r
jj | rj,j+1 rj,j+2 · · · rjm |
0… 0 |
Aj |
⎞⎟⎟⎟⎠
550 . (8.41)
551 The associated orthogonal matrix Qj is
552 Qj = Ij0 H –1 0j , (8.42)
553 where Ij–1 is the (j – 1) × (j – 1) identity matrix (with Q1 = H1). Although
554 it is probably obvious, it’s worth mentioning that the vector e1=(1, 0, · · · , 0)T
555 used to determine Hj has the same size as the first column in Aj–1.
556 Now, suppose all of the required columns have been zeroed out, and QK
557 is the last orthogonal matrix used to accomplish this. This means that R =
558 QKRK–1, and this can be written as
559 R = QKQK–1 · · · Q1A.
560 The above expression is useful for determining Q, since
561 A = (QKQK–1 · · · Q1)–1R
562 = (Q– 1 1 · · · Q– K1–1Q– K1)R
563 = QR,
564
565 where
566 Q = Q1 · · · QK–1QK. (8.43)
567 The last expression follows because the Qj’s are symmetric and orthogonal,
568 so that Q– j 1 = QT j = Qj.
569 As mentioned earlier, you can use c = ||c1||2 or c = -||c1||2 when finding
570 the Qj’s. In the following example the positive value is used. When using
571 floating-point arithmetic, there is a possibility that c1 is very close to being a
572 multiple of e1. If this happens then it’s possible that v, as defined in (8.38),
573 is rounded to zero. To prevent this from happening, in most computer codes,
574 the sign of c is chosen to be the opposite of the sign of c11 (the first entry in
575 c1).
576 Examples 8.7
577
578 Find a QR factorization of
579 A = ⎛ ⎝–0 2 1 1 1 1 0 2 –1⎞ ⎠ . (8.44)
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
24 8 Optimization: Regression
580 Finding R1 : Since c1 = ⎛ ⎝–0 11⎞ ⎠ then c = ||c1||2 = √2 and, from (8.38),
v = c1 – c e1 =
⎛⎝
01 –
1
⎞⎠
581 – √2 ⎛ ⎝1 0 0⎞ ⎠ = ⎛ ⎝–√112⎞ ⎠ .
582 So,
583 vvT = ⎛ ⎝-√√222 1 –√1 1 2 √-12⎞ ⎠ ,
584 and, from (8.37),
585 H1 = ⎛ ⎝1 0 0 0 1 0 0 0 1⎞ ⎠ – 12 ⎛ ⎝-√2√22 1 –√1 1 2 -√12⎞ ⎠ = ⎛ ⎝
–1 21 2√√22 1 2 1 2 | . |
01 2√2 – 1 212 12√2⎞ ⎠ 586 With this we have that
587 R1 = H1A = ⎛ ⎝√002 1 2 1 2 + –1 2√√ √22 2 3 2 3 2— +1 2√1 2 1 2√ √22 2⎞ ⎠ .
588 Finding R2 : The lower 2 × 2 block from R1 is
589 A1 = 1 2 1 2 + – √ √2 2 3 2 3 2 – + √ √2 2 .
590 Now, c1 = 1 2 1 2 + – √ √2 2 , which means that c = ||c1||2 = 3/√2 and
591 v = 1 2 1 2 + – √ √2 2 – 3 2√2 1 0 = 12 11—2√√22 .
592 From this we have
vvT = 1
593 4 35 — 23√√2 5 2 9 – – 3 4√ √2 2 ,
594 and
595 H2 = 1 0 0 1 – v2· vvvT = 1 6 –4 + 4 +√√22 – –4 + 4 – √ √2 2 .
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.4 QR Factorization 25
596 Moreover,
597 Q2 = ⎛ ⎝1 0 0 00 H2⎞ ⎠ = ⎛ ⎝1 0 0 0 2 0 –2//3 + 3 +√√22//66 – –2 2/ /3 + 3 – √ √2 2/ /6 6⎞ ⎠
598 Therefore,
599 R = Q2R1 = ⎛ ⎝√0 0 0 2 1 2 3 2√ √2 2 – —1 2 1 6√ √7 3 2 2⎞ ⎠ .
600 Finding Q : Using (8.43),
601 Q = Q1Q2 = ⎛ ⎝–1 221√0√22 2 3 1 6 1 6√ √ √2 2 2 – –1 32 3 2 3⎞ ⎠ .
602 8.4.2 Accuracy
603 It is informative to experiment with the QR factorization, as there are a
604 few surprises, both good and bad. Of interest is how accurately Q and R
605 are computed. The complication is that a QR factorization is not unique.
606 It is, however, possible to prove that if the diagonals of R are required to
607 be positive then the factorization is unique. So, in this example, a random
608 n × n orthogonal matrix Q is generated along with a random n × n upper
609 triangular matrix R that has positive diagonals. With this, the matrix is
610 taken to be A = QR. The QR factorization of A is then computed using
611 the Householder method, giving Qc and Rc, where the diagonals of Rc are
612 positive. The resulting differences between the exact and computed values
613 are given in Table 8.3. The fifth column in this table is used to check on the
614 orthogonality of Qc, as explained in Example 4.12 on page xxx.
615 Two matrices are tested: one which is well conditioned for all values of
616 n, and one for which the condition number becomes large as the matrix size
617 increases. From the second column in Table 8.3 it is seen is that the product
618 QcRc is computed accurately irrespective of the condition number or the size
619 of the matrix. However, from the third and fourth columns, it is evident that
620 Qc and Rc are computed accurately for the well conditioned matrices but
621 not so for the more ill-conditioned matrices. What is surprising about this is
622 that even as the accuracy of Qc and Rc decreases, their product remains very
623 accurate. Moreover, as indicated in column 5, in all cases Qc is an orthogonal
624 matrix (to machine precision).
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
26 8 Optimization: Regression
n | ||QR – QcRc||∞ ||A||∞ |
||Q – Qc||∞ | ||R||-RR||∞c||∞ | ||QT c Qc – I||∞ | κ∞(A) |
Well | |||||
100 | 8.1e–16 | 8.9e–16 | 4.2e–15 | 8.8e–15 | 8e+01 |
500 | 1.1e–15 | 1.1e–15 | 1.2e–14 | 2.7e–14 | 4e+02 |
1000 | 1.3e–15 | 1.3e–15 | 1.9e–14 | 4.5e–14 | 7e+02 |
2000 | 1.5e–15 | 1.5e–15 | 2.9e–14 | 7.2e–14 | 1e+03 |
4000 | 1.6e–15 | 1.6e–15 | 4.3e–14 | 1.1e–13 | 3e+03 |
Ill | |||||
100 | 7.8e–16 | 1.4e–15 | 1.1e–15 | 9.7e–15 | 7e+02 |
500 | 7.5e–16 | 1.3e–14 | 3.7e–15 | 2.6e–14 | 2e+05 |
1000 | 8.6e–16 | 2.3e–13 | 1.2e–13 | 4.5e–14 | 3e+07 |
2000 | 8.4e–16 | 1.8e–10 | 7.4e–11 | 7.2e–14 | 2e+11 |
4000 | 1.3e–15 | 5.2e–04 | 1.2e–04 | 1.1e–13 | 4e+18 |
Table 8.3 Accuracy in computing the QR factorization for a matrix that is wellconditioned (Well), and one that is ill-conditioned (Ill).
625 8.4.3 Parting Comments
626 The QR factorization algorithm, unlike the one for a LU factorization, has no
627 need to introduce something like pivoting to be able to find the factorization.
628 This is reflective of the fact that you can prove that any matrix has a QR
629 factorization, with no pivoting needed [Golub and Van Loan, 2013].
630 In considering the possible rotations and/or reflections that might be used
631 to transform the first column of A to a multiple of e1, Givens method was
632 mentioned. While Householder uses reflections, Givens uses rotations to find
633 the Qj’s and this ends up requiring more flops (about 50% more). Also, it
634 is relatively easy to code Householder to take advantage of the optimized
635 operations available in BLAS. What this means is that for the current gener-
636 ation of multi-core processor computers, Householder is the usual way a QR
637 factorization is computed.
638 8.5 Other Error Functions
639 In some problems using the vertical distance between the data point and
640 model function is not possible, or relevant to the questions being asked. To
641 explore what happens when using other error functions, the one we have been
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.5 Other Error Functions 27
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
x-axis
5
4 3 2 y-axis |
|
1 | E D |
E 2 |
0
Figure 8.10 line. |
Linear fit obtained using ED, solid red line, and using E2, the dashed black |
642 considering is
643 E2 = ||g – y||2 2 .
644 An alternative is to use the true distance, as illustrated in Figure 8.2. Using
645 least squares, with g(x) = v1 + v2x, the error function is
646 ED = Di2 (8.45)
=
1
647 648 1 + v22 ||g – y||2 2 . (8.46)
649 In the case of when g(x) = v1 + v2x, finding the values of v1 and v2 that
650 minimizes ED reduces to solving a cubic equation. This can be solved using
651 a nonlinear equation solver, such as Newton’s method. Another option is to
652 compute the minimizer using a descent method, which are derived in the next
653 chapter. In the example below, the values are calculated using the Nelder-
654 Mead algorithm, which is described in Section 9.6.
655 Examples 8.8
656
657 Shown in Figure 8.4 are data, along with the linear fit using the error function
v1 | v2 | |
E2 | 1.16 | 2.93 |
ED | 0.60 | 4.12 |
E1 | 1.51 | 2.49 |
E∞ | 1.04 | 3.19 |
Table 8.4 Fitted values for g(x) = v1 + v2x for different error functions using the data
shown in Figure 8.10.
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
28 8 Optimization: Regression
4 -2 x |
0 2 4 -axis |
-4
-2
4 2 0
y-axis
Figure 8.11 squares. |
Data to be fitted by a circle, and the resulting circle obtained using least |
658 ED and the linear fit using the error function E2. The corresponding values
659 of the parameters v1 and v2 are given in Table 8.4. The values using E1 and
660 E∞ are also given. The fact that the lines in the figure, or the values given in
661 the table, are different is no surprise. If you use a different error function you
662 should expect a different answer for the minimizer. Given the variation in the
663 computed values, it is important to use an error function that is relevant to
664 the application being considered.
665 | Examples 8.9 |
666 | |
667 | As a second example, consider the data shown in Figure 8.11. The objective |
668 is to find the circle x2 + y2 = r2 that best fits this data, with the parameter
669 in this case being the radius r. The first step is to specify the error function.
670 This is a situation where using the vertical distance runs into some serious
671 complications. This is because for any data point to the right (so, xi > r) or
672 left (so, xi < –r) of the circle, the vertical distance is not defined. Conse-
673 quently, it is more appropriate to use the radial distance between the curve
674 and the data. So, given a data point (xi, yi), the corresponding radial coor-
675 dinates are (ri, θi), where ri2 = x2 i + yi2. The least squares error based on the
676 radial distance is
E(r) =
n i
=1
677 (r – ri)2. (8.47)
678 Taking the derivative of this with respect to r, setting the result to zero, and
679 solving yields the solution
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.6 Nonlinear Regression 29
r =
1 n
n i
=1
680 ri. (8.48)
681 This answer is one that you might have guested before undertaking the regres-
682 sion analysis because it states that the circle that best fits the data is the one
683 whose radius is the average of the radii from the data. A comparison between
684 the resulting circle and the data is shown in Figure 8.11.
685 In Section 8.1 these was a discussion of how a model function might be
686 selected, or preferred, for a particular problem. The same questions arise for
687 the error function, but in this case they are a bit harder to answer. It depends
688 on the problem being solved. As an example, for the error function used in
689 a facial recognition algorithm, you want it to identify the person irrespective
690 if they are upside down, right-side up, or looking in a mirror. What you are
691 asking is that the error function is invariant to rotations or reflections of the
692 data. As it turns out, E2 does not do this but ED does. The point here is
693 that the invariance properties of the error function can play a role in deciding
694 which to use, and more about this can be in Holmes and Caiola [2018].
695 8.6 Nonlinear Regression
696 To illustrate some of the complications that arise with nonlinear regression,
697 we will consider the data shown in Figure 8.12, and given in Table 8.5, which
698 is from Smith et al. [2010]. Based on what they refer to as a model for simple
699 hyperbolic binding, they assumed the data can be fit with the function
g(x) = v1x
v2 + x
700 . (8.49)
701 This is known as the Michaelis-Menten model function, and it is used exten-
702 sively in biochemistry. Using a least squares error function, then
E(v1, v2) =
n i
703 =1v2v1+xixi – yi 2 . (8.50)
704 As is typical with nonlinear regression, it is not possible to find a formula
705 for a minimizer of E(v1, v2) as we did for linear regression. However, there are
706 various ways to compute a minimizer. One option is to use a decent method,
707 and these are the subject of the next chapter (of particular relevance is the
708 Levenberg-Marquardt method discussed in Section 9.5). A second option is
709 to set the first partial derivatives of E to zero, and then solve the resulting
710 nonlinear equations using Newton’s method. A third option is to transform
711 the problem into one for linear regression, and then find a formula for the
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
30 8 Optimization: Regression
712 solution. This is the option we will consider here. If you think that this
713 contradicts the first sentence of this paragraph, it does not and this will be
714 explained in the development.
715 Examples 8.10
716
717 Some nonlinear regression problems can be transformed into linear ones,
718 although this usually only works when the model function is not too compli-
719 cated. To illustrate, the Michaelis-Menten function, given in (8.49),
y =
v1x
v2 + x
720 ,
721 can be written as
1 y
=
v2 + x
v1x
722
=
v2
v1
1 x
+
1 v1
723 .
724
725 Setting Y = 1/y and X = 1/x, then the above model function takes the form
726 G(X) = V1 + V2X , (8.51)
727 where V1 = 1/v1 and V2 = v2/v1. If (xi, yi) is one of the original data points,
728 then the corresponding point for the transformed problem is (Xi, Yi) =
729 (1/xi, 1/yi).
730 Using least squares, then the error function is
E(V1, V2) =
n i
=1
731 (V1 + V2Xi – Yi)2. (8.52)
732 According to (8.11), the resulting solution is
Figure 8.12 Data from Smith et al. [2010].
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.6 Nonlinear Regression 31
xi | 0.1 | 0.5 | 1 | 2 | 5 | 10 |
yi | 0.5 | 1.8 | 3.4 | 6.1 | 8.7 | 11.1 |
Xi | 10 | 2 | 1 | 0.5 | 0.2 | 0.1 |
Yi | 2 | 0.5556 | 0.2941 | 0.1639 | 0.1149 | 0.0901 |
Table 8.5 Data for Example 8.10.
733 VV1 2 = n Xi2 –1 ( Xi)2 –XXii2 – nXi XYii Yi . (8.53)
734 The last step is to transform back, using the formulas v1 = 1/V1 and v2 =
735 V2/V1. With this, we have been able to find a solution without the need for
736 Newton’s method.
737 The linearizing transformation approach is widely used but it has a poten-
738 tial, rather serious, flaw. To illustrate what this is, if you use the data in
739 Table 8.5, then the curve shown in Figure 8.13 is obtained. The poor fit
740 occurs because, for this particular model function,
g(xi) – yi = – 1
741 YiG(Xi)G(Xi) – Yi .
742 So, having G(Xi) – Yi close to zero does not necessarily mean g(xi) – yi is
743 close to zero, and the reason is the multiplicative factor 1/(YiGi). As an exam-
744 ple, the last data point in Figure 8.13 is (x6, y6) = (10, 11.1), which results
745 in Y6G(X6) ≈ 0.01 and g(x6) – y6 ≈ 100[G(X6) – Y6]. This magnification of
746 the error is the cause for the poor fit in Figure 8.13.
747 There is a simple fix, and it is to use the relative error ES, which is given
748 as
0 1 2 3 4 5 6 7 8 9 10
x-axis
9 6 y-axis |
3 |
0
12
Figure 8.13 Fit of data from Smith et al. [2010] when using the error function in (8.52).
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
32 8 Optimization: Regression
0 1 2 3 4 5 6 7 8 9 10
x-axis
9 6 y-axis |
3 |
0
12
Figure 8.14 Fit of data from Smith et al. [2010] when using the error function in (8.54).
ES(V1, V2) = | . | (8.54) |
i =1 |
Yi |
n 749 V1 + V2Xi – Yi 2 750 Using the solution for the minimum given in Exercise 8.17, the resulting fit
751 is shown in Figure 8.14. As is clear, the fit is definitely better than what was
752 obtained earlier.
753 Examples 8.11
754
755 This example considers fitting data using the model function
756 g(x) = v1xv2, (8.55)
757 which is known as a power law function, and also as an allometric function.
758 It is assumed that the data are (x1, y1), (x2, y2), · · · , (xn, yn), where the
759 xi’s and yi’s are positive. There are a couple of options with this function as
760 | explained below. Both of these options are explored in the exercises. |
761 |
762 Option 1: Transform to Linear Regression. To transform into a linear regres-
763 sion problem, let y = v1xv2. Taking the log of this equation, then ln(y) =
764 ln(v1) + v2 ln(x). Setting Y = ln(y) and X = ln(x), then the transformed
765 model function is G(X) = V1 + V2X, where V1 = ln v1 and V2 = v2. The
766 transformed data points (Xi, Yi) in this case are Xi = ln xi and Yi = ln yi.
767 With this, V1 and V2 are determined using (8.53), or using the formula in
768 Exercise 8.17. Once they are computed, then v1 = eV1 and v2 = V2.
769
770 Option 2: Transform to Reduced Non-Linear Regression. It is possible to
771 obtain a simplified non-linear regression problem using this model function.
772 To explain, taking the first partials of E(v1, v2) = (v1xv i 2 – yi)2, and set-
773 ting them to zero, one finds from the first equation that
333861˙2˙En˙8˙Chapter Disp.:6/2/2023 Pages: xxx Layout: T1-Standard |
UNCORRECTED PROOF
8.7 Fitting IVPs to Data 33
v1 =
yixv i 2
774 x2 i v2 .
775 Substituting this into the second equation, one ends up with an equation
776 of the form F(v2) = 0. This is easy to solve using the secant method. The
777 benefit of this is that this minimizes the original error function, and not the
778 transformed error function used for linear regression.
The transformation to a linear regression problem needs more discussion. First, it requires that the change of variables is defined for the data under consideration. Second, the connection between the original and transformed minimization problems is tenuous. As explained in Section 8.5, different error functions usually produce different minimizers. Indeed, this is evident in the different solutions shown in Figures 8.13 and 8.14. If you are intent on determining the minimizer for (8.50), you can use the solution from the linear regression problem as the starting point when using one of the descent methods considered in the next chapter.
788 8.7 Fitting IVPs to Data
789 Problems in mechanics, electrodynamics, and chemical kinetics often involve
790 solving differential equations. The complication is that the coefficients in
791 the equations are unknown, and they are usually determined by fitting the
792 solution to experimental data. This results in a regression problem which
793 includes solving the differential equations numerically. What follows is an
794 approach for solving such problems.
795 8.7.1 Logistic Equation
To give an example of this type of problem, it is often assumed that a population y(t) of a species satisfies the logistic equation

\frac{dy}{dt} = \alpha y(\beta - y), \quad for 0 < t,    (8.56)

where y(0) = a. Note that although it is possible to solve this problem in closed form, in the discussion to follow this will not be used because in most problems such a solution is not available.
802 Consider the data shown in Figure 8.15. Suppose it is assumed that (8.56)
803 applies to this problem. The goal is then to determine α and β in (8.56) from
804 the data. So, suppose the data are (t1, y1), (t2, y2), · · · , (tn, yn). The error
function to be used is

E(\alpha, \beta) = \sum_{i=1}^{n} \big[ y(t_i) - y_i \big]^2 .    (8.57)
807 To use our minimization methods we need to be able to evaluate this function
808 for given values of α and β. There are various strategies on how this can be
809 done, and we will consider two: a direct method, and an interpolation method.
810 Direct Method: For this one solves (8.56) numerically, and makes sure that
811 the solver calculates the solution at all of the ti’s in the data set. If there is a
812 large number of points, as is the case in Figure 8.15, this can add considerable
813 computing time to the procedure. Nevertheless, because this type of data
814 fitting is so common, many IVP solvers have the ability to control the time
815 points. For example, in MATLAB this is accomplished using the tspan option
816 in ode45.
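A minimal MATLAB sketch of the direct method is given below. The variable names td and yd (column vectors of data times and values), the assumption that y(0) = a equals the first data value, and the starting guess for fminsearch are placeholders, not part of the text.

    % Direct method for fitting the logistic equation (8.56) to data (td, yd).
    a    = yd(1);                           % assumes the first data point is y(0)
    Efun = @(p) logisticErr(p, td, yd, a);  % p = [alpha, beta]
    p    = fminsearch(Efun, [1, 2]);        % Nelder-Mead, hypothetical starting guess
    alpha = p(1);  beta = p(2);

    function E = logisticErr(p, td, yd, a)
        % solve the IVP once, then evaluate it at every data time
        sol = ode45(@(t,y) p(1)*y*(p(2) - y), [0, td(end)], a);
        yt  = deval(sol, td)';              % y(t_i) as a column vector
        E   = sum((yt - yd).^2);            % error function (8.57)
    end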
817 Interpolation Method: The idea is to only use a small number of time steps
818 when computing the numerical solution of the IVP. These points are then
819 connected using an interpolation function, producing an approximation for
820 y(t) over the entire interval. The procedure consists of the following steps:
821 1. Solve the IVP using a coarse time step. It is assumed that the time step
822 required for an accurate numerical solution is significantly larger than the
823 distance between the time points in the data set. In the example below,
824 RK4 is used with a time step of 0.2, while the average distance between
825 the data time points is 0.02.
2. With the computed values of the solution of the IVP, use interpolation to produce a function that approximates y(t) over the interval. In the example below, a clamped cubic spline is used. Note that this requires the value of y'(t) at the two endpoints. However, given that we know y(0), and we have just computed y at the right endpoint, we can compute y'(t) at the endpoints using the differential equation (the merits of this are discussed in Section 7.7).

Figure 8.15 Data and curves from Raghavan et al. [2012].
Figure 8.16 Synthetic data for the logistic equation, shown with the blue dots. The red curve is the solution of the logistic equation using the parameter values obtained from the data fit.
3. Evaluate the interpolation function at the t_i's in the data set, and from this calculate the error in (8.57). A code sketch of these steps is given below.
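The error-function evaluation used by the interpolation method might look as follows. The coarse grid, the variable names, and the use of ode45 (in place of the RK4 routine in the text) are assumptions made for the sketch.

    % Interpolation method for evaluating E(alpha,beta) in (8.57), p = [alpha, beta].
    function E = logisticErrInterp(p, td, yd, a)
        f  = @(t,y) p(1)*y.*(p(2) - y);        % logistic right-hand side
        tc = linspace(0, td(end), 7);          % coarse grid (6 time steps)
        [~, yc] = ode45(f, tc, a);             % coarse solution of the IVP
        s1 = f(tc(1),   yc(1));                % endpoint slopes from the ODE,
        s2 = f(tc(end), yc(end));              %   needed for the clamped spline
        yt = spline(tc, [s1, yc', s2], td);    % clamped cubic spline at the data times
        E  = sum((yt(:) - yd(:)).^2);          % error function (8.57)
    end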
836 Whichever method is used to evaluate E(α, β), it is also necessary to decide
837 on what method to use to find the minimum. Several minimization algorithms
838 are presented in the next chapter. In the examples to follow the Nelder-Mead
839 procedure, which is described in Section 9.6, is used. It is a multivariable
840 version of the secant method in the sense that it locates the minimum without
841 requiring the computation of a derivative. This is available in MATLAB using
842 the command fminsearch.
Example 8.12
845 To demonstrate the procedure we will use what is called synthetic data. This
846 is obtained by selecting α = 2 and β = 3, and then evaluating the solution at
847 79 random points in the interval 0 < t < 2. Each value yi is then randomly
848 perturbed a small amount, so the data resembles what is typically seen in
849 experiments. An example of such a data set is shown in Figure 8.16. This
850 data is then fit using the interpolation method. To do this, RK4 with 6 time
851 steps is used to produce the points needed to determine the clamped cubic
852 spline. Using the Nelder-Mead algorithm one finds that, after 60 iteration
853 steps, α = 1.93 and β = 3.04. Using these values the IVP was solved numeri-
cally, and the resulting curve is given in Figure 8.16. Note that if you use the
855 direct method to fit the parameters, then the computing time for this example
856 increases by a factor of about 5, and the final solution is approximately the
857 same.
858 Although there are several steps in the minimization process, the method
859 is straightforward. It is essential to point out that even though it worked very
860 well in the example, the procedure can easily fail. One complication is that it
861 can be unpredictable what values the algorithm might use for α and β when
862 searching for the minimum. For example, it might try to use unphysical values
863 for the parameters, or values that cause the solution to behave in an unstable
864 manner. In effect, this means that for many problems there are constraints
865 on the minimization process, and one needs to build this into the algorithm.
866 To illustrate, for the logistic equation, it is known that α must be positive. If
867 the minimizer attempts to use a negative value this must be overridden, and
868 an alternative put in its place.
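One simple way to build such a constraint into the procedure, sketched below under the same assumptions as the earlier code, is to have the error function return a large penalty value whenever the minimizer wanders into an unphysical region.

    % Returning a large penalty when alpha or beta is not positive.
    function E = logisticErrSafe(p, td, yd, a)
        if p(1) <= 0 || p(2) <= 0                          % require alpha > 0, beta > 0
            E = 1e10;                                      % penalty value (an assumption)
            return
        end
        sol = ode45(@(t,y) p(1)*y*(p(2) - y), [0, td(end)], a);
        E   = sum((deval(sol, td)' - yd).^2);              % error function (8.57)
    end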
869 8.7.2 FitzHugh-Nagumo Equations
870 In computational neuroscience, the FitzHugh-Nagumo equations are used as
871 a model for the potential in the nerve. The equations in this case are
v' = c\left( v - \tfrac{1}{3} v^3 + w \right),    (8.58)

w' = -\frac{1}{c} \left( v - a - b w \right).    (8.59)
Synthetic data obtained from these equations when v(0) = -1, w(0) = 1, a = b = 0.2, and c = 3 are shown in Figure 8.17. The error function in this case is
E(a, b, c) = \sum_{i=1}^{n} \Big( [v(t_i) - v_i]^2 + [w(t_i) - w_i]^2 \Big),    (8.60)
879 where (ti, vi) and (ti, wi) are the data, and v(t) and w(t) are the solution of the
880 FitzHugh-Nagumo equations. The objective is to use a fitting method to find
881 the values of a, b, and c. We will use the interpolation method, which means
882 that to evaluate the error function for a particular (a, b, c), the FitzHugh-
883 Nagumo equations are first solved numerically using a relatively coarse time
884 step. In this particular example, the RK4 method is used with k ≈ 0.34.
885 This solution is then used to construct a clamped cubic spline interpolation
886 function that can be evaluated to obtain values for v and w at the various
887 ti’s appearing in the error function. Finally, the Nelder-Mead algorithm is
888 used to find the minimum. The computed values are a = 0.22, b = 0.37, and
889 c = 2.7. The resulting solution curves are shown in Figure 8.17.
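For the record, a sketch of the corresponding error-function evaluation for the system (8.58), (8.59) is given below. The variable names td, vd, wd, the coarse grid, and the use of ode45 with a not-a-knot spline (rather than RK4 with a clamped spline, as in the text) are simplifications made for the sketch.

    % Error function (8.60) for the FitzHugh-Nagumo system, p = [a, b, c].
    function E = fhnErr(p, td, vd, wd)
        f  = @(t,u) [ p(3)*(u(1) - u(1)^3/3 + u(2));
                     -(u(1) - p(1) - p(2)*u(2))/p(3) ];
        tc = linspace(0, td(end), 60);              % coarse time grid
        [~, uc] = ode45(f, tc, [-1; 1]);            % v(0) = -1, w(0) = 1 from the text
        vt = spline(tc, uc(:,1)', td);              % interpolate v and w at the
        wt = spline(tc, uc(:,2)', td);              %   data times
        E  = sum((vt(:) - vd(:)).^2 + (wt(:) - wd(:)).^2);
    end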
Figure 8.17 Synthetic data and corresponding numerical solution of the FitzHugh-Nagumo equations in (8.58), (8.59).
890 8.7.3 Parting Comments
891 We used the blunt approach to find the coefficients of the IVPs, which means
892 we just coded everything and then computed the answer. A better way would
893 be to first analyze the problem, determine its basic mathematical properties,
and then use a computational approach if necessary. As an example, for the logistic equation (8.56) a simple analysis shows that y = β is a steady state which is asymptotically stable if the parameters are positive. What this means is that it is possible to just look at the data in Figures 8.15 and 8.16 and produce a reasonably accurate value for β (it is the value the solution approaches as t → ∞). One can also get an estimate of α by noting that at t = 0, y'(0) = αa(β - a). From the estimate of β, and using the approximation y'(0) ≈ (y(t_1) - y(0))/t_1 = (y_1 - a)/t_1, one can obtain an estimate for α. This information can then be used to produce reasonably good guesses for the minimization algorithm.
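Written out, the two estimates just described are

\beta \approx \lim_{t \to \infty} y(t), \qquad
\alpha \approx \frac{y'(0)}{a(\beta - a)} \approx \frac{y_1 - a}{t_1\, a (\beta - a)},

where the value of β is read off from the large-t behavior of the data.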
Numerous methods have been proposed for finding material parameters from data. For example, Varah [1982] first fits splines to the data, and then uses the fitted spline to find the parameters. This is known as data smoothing, and more recent work on this can be found in Ramsay et al. [2007]. Other possibilities are discussed in Tönsing et al. [2019], Dattner [2020], and Matsuda and Miyatake [2021].
910 Exercises
911 Section 8.3
912 8.1. Suppose the data set is (–1, 0), (–1, 2), (0, 2), (1,–1).
913 (a) Fit a line to the data, and then on the same axes sketch the line and the
914 data points.
915 (b) Fit a parabola to the data, and then on the same axes sketch the parabola
916 and the data points.
917 8.2. Do Exercise 8.1 but use the data set (0, –3), (–1, 1), (–1, 3), (1, 1).
8.3. Fit the following functions to the data: (-1, y_1), (0, y_2), (1, y_3). With this, sketch the resulting model function, and include the given data points.
(a) g(x) = v_1 x + v_2 (x - 1)^3 ;   y_1 = 7, y_2 = 1, y_3 = 1
(b) g(x) = v_1 (x + 1)^2 + v_2 (x - 1)^2 ;   y_1 = -1/2, y_2 = 1, y_3 = 0
(c) g(x) = \frac{v_1}{x + 2} + \frac{v_2}{x - 2} ;   y_1 = 4/3, y_2 = 1, y_3 = 4/3
(d) g(x) = v_1 x + v_2 (1 - x^2) ;   y_1 = 1, y_2 = 2, y_3 = -1
8.4. Find the least squares solution of the following:
(a) x + y = 1
    2x - y = 2
    -x + 2y = 5
(b) x - 2y = 1
    x + y = 4
    -2x + y = 7
(c) -2x + y - z = -3
    x - y + z = 2
    x - z = 3
    y - 2z = 1
(d) x = 1
    2x = -1
    -x = 1
    x = -5
954 8.5. For this exercise you must pick one of the following matrices:
A_1 = \begin{pmatrix} -1 & 0 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad
A_2 = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & -1 \end{pmatrix}, \quad
A_3 = \begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix}.
957 (a) Using (8.25), find A+.
958 (b) Using the formula in Exercise 8.7(a), find A+.
959 (c) It is said that A+ is a left inverse for A, but it is not necessarily a right
960 inverse. By calculating A+A and AA+, explain what this means.
961 8.6. This problem concerns the constant that best approximates the data
962 (x1, y1), (x2, y2), · · · , (xn, yn). So, the model function is g(x) = v.
(a) Find the value of v that minimizes E(v) = ||g - y||_2^2.
(b) Find the value of v that minimizes E(v) = ||g - y||_∞. Suggestion: first
965 answer this for n = 2, and then n = 3. From this, explain what the
966 answer is for general n.
8.7. In this exercise formulas for the Moore-Penrose pseudo-inverse are derived. Assume that the columns of the n × m matrix A are independent.
(a) Using (8.22), show that A^+ = R_m^{-1} G^T, where R_m is given in (8.23).
(b) Using the SVD, A = UΣV^T. Use this to show that A^+ = VΣ^{-T}U^T, where Σ^{-T} is the m × n diagonal matrix with diagonals 1/σ_1, 1/σ_2, · · · , 1/σ_m.
973 8.8. This exercise considers what happens when there is only one parameter
974 when data fitting using linear least squares. Assume here that f(x) is a given
975 function.
976 (a) Suppose the model function is g(x) = vf(x). Using least squares error,
977 what is the resulting formula for v? In doing this, assume that there is
978 at least one data point with f(xi) nonzero.
979 (b) Suppose the model function is g(x) = v + f(x). Using least squares error,
980 what is the resulting formula for v?
981 (c) Suppose that f(x) = x3, and the data are: (–1, 1), (0, 2), (2, 3). Sketch
982 the model function from part (a), and include the data points in the
983 sketch.
984 (d) Do part (c) but use the model function from part (b).
8.9. This exercise concerns fitting the data: (-1, 8), (0, 4), (1, 0), (2, 2), (3, 10) using the model function (8.12) with m = 3.
(a) Plot (by hand) the data. Use this to select a three parameter (v_1, v_2, v_3) model function of the form given in (8.12). Make sure to explain why it is an appropriate choice for this data set.
(b) Compute v_1, v_2, v_3 using (8.25). Also, report on the value of κ_∞(A^T A). Finally, plot the resulting model function and data set on the same axes.
(c) Compute v_1, v_2, v_3 using the formula in Exercise 8.7(a). Also, report on the value of κ_∞(R_m). Finally, plot the resulting model function and data set on the same axes.
995 8.10. This exercise concerns fitting the data: (–4, 2.1), (–3, 2.2), (–2, 2.5),
996 (–1, 3), (0, 6), (1, 9), (2, 24) using the model function (8.12) with m = 2.
997 (a) Plot (by hand) the data. Use this to select a two parameter (v1, v2)
998 model function of the form given in (8.12). Make sure to explain why it
999 is an appropriate choice for this data set.
(b) Compute v_1, v_2 using (8.25). Also, report on the value of κ_∞(A^T A). Finally, plot the resulting model function and data set on the same axes.
(c) Compute v_1, v_2 using the formula in Exercise 8.7(a). Also, report on the value of κ_∞(R_m). Finally, plot the resulting model function and data set on the same axes.
8.11. In the case where the model function is g(x) = v, (8.29) takes the form E_R(v) = \sum_{i=1}^{n} (v - y_i)^2 + λv^2. Also, let E(v) = \sum_{i=1}^{n} (v - y_i)^2.
(a) By minimizing E_R, show that v = v_M, where v_M = αy_M, y_M is the average of the y_i's, and α = 1/(1 + λ/n).
(b) Show that the minimum value of E(v) occurs when v = y_M.
(c) The regularization parameter λ is usually chosen to be fairly small, with the objective that E(v_M) is close to the minimum value of E(v). With this in mind, suppose that the goal is that |E(v_M) - E(y_M)| ≈ 10^{-6}. Explain why the resulting value of λ depends on the number of data points. Hint: Show that E(v_M) = E(y_M) + n(α - 1)^2 y_M^2.
8.12. This exercise explores how regularization can be used with the QR approach.
(a) Starting with (8.29), modify the argument used to obtain (8.21) to show that

E_R(v) = ||R_m v - y_m||_2^2 + λ||v||_2^2 + ||y_{n-m}||_2^2 .

(b) Show that the minimizer of E_R(v) is determined by solving

(R_m^T R_m + λ I)v = R_m^T y_m .

(c) Show that the equation in part (b) reduces to (8.22) when λ = 0.
8.13. Suppose that the error function is E_b(v) = ||Av - y||_2^2 + λ||v - b||_2^2.
(a) In this version of regularization, what is a reason, or what is the objective, for including b in this way?
(b) Find the minimizer.
8.14. This exercise involves Tikhonov regularization, which uses the error function

E_T(v) = ||Av - y||_2^2 + λ||Dv||_2^2 ,

where D is a given n × m matrix. Note that this reduces to ridge regularization if D = I.
(a) Show that the minimizer of E_T(v) is found by solving (A^T A + λD^T D)v = A^T y.
(b) Tikhonov regularization can be used to smooth the curve shown in Figure 8.18. The points on this curve are (x_i, y_i), for i = 1, 2, · · · , n, where the x_i's are equally spaced. Now, the curvature at x_i is approximately (y_{i+1} - 2y_i + y_{i-1})/h^2, where h is the x_i spacing (see Section 7.2.1). Less noisy curves are associated with smaller values of the curvature, which means we want to reduce the value of |y_{i+1} - 2y_i + y_{i-1}|/h^2. The objective is to find v that is close to y but which also has curvature values that are not too big. Explain why this can be achieved by finding the minimizer of E_T(v), where A = I is n × n, and D is the (n - 2) × n matrix given as
D = \begin{pmatrix}
1 & -2 & 1 & & & \\
 & 1 & -2 & 1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1 & -2 & 1
\end{pmatrix} .
Figure 8.18 Noisy curve that is smoothed using Tikhonov regularization in Exercise 8.14.
Note that the h^4 term has been absorbed into the λ.
1049 For part (c) you need curve data, which is either provided or you gen-
1050 erate. The curve in Figure 8.18 was obtained using the MATLAB com-
1051 mands: x=linspace(0,1,n); y=exp(-10*(x-0.6).^2).*cos(6*pi*x);
1052 y=y+0.4*(rand(1,n)-0.5);.
(c) On the same axes, plot the noisy and smoothed versions of the curve. What value of λ did you use, and why did you make this choice? In answering this question you should include two additional plots: one using λ/20 and one using 20λ in place of your chosen λ. Why is your λ so much larger than what was used in Figure 8.8?
1058 8.15. (a) The normal equation has a unique solution if the columns of A are
1059 independent. Explain why this requires that there are at least m distinct
1060 (i.e., different) xi’s in the data set.
1061 (b) In the case of when the number of data points is the same as the number
1062 of parameters show that least squares reduces to interpolation. This is
1063 assuming, of course, that the columns of A are independent.
(c) Explain why the normal equation (8.17) does not have a unique solution, and A^+ is not defined, if one of the fitting functions φ_j(x) in (8.12) is zero at all of the x_i's.
1067 (d) Suppose that all of the fitting functions φj(x) in (8.12) are zero at x1.
1068 Explain why the resulting solution for v does not depend on y1.
1069
8.16. (a) Show that the system of normal equations (8.10) can be written as

v_1 + v_2 x_M = y_M ,

v_1 + v_2 \left( x_M + \frac{1}{n x_M} \sum_{i=1}^{n} (x_i - x_M)^2 \right) = y_M + \frac{1}{n x_M} \sum_{i=1}^{n} (x_i - x_M)(y_i - y_M) ,

where x_M = \frac{1}{n}\sum_{i=1}^{n} x_i and y_M = \frac{1}{n}\sum_{i=1}^{n} y_i.
1075 (b) By solving the equations in part (a), show that the resulting model
1076 function can be written as g(x) = yM + m(x – xM), where the slope is
m = \frac{\sum_{i=1}^{n} (x_i - x_M)(y_i - y_M)}{\sum_{i=1}^{n} (x_i - x_M)^2} .

The numerator and denominator are proportional to the covariance and the square of the x-variable standard deviation, respectively.

x_i | -π | 0 | π
y_i |  1 | 0 | 2

Table 8.6 Data for Exercise 8.18.
8.17. This exercise considers a least squares problem using the relative error. As usual, the data are (x_1, y_1), (x_2, y_2), · · · , (x_n, y_n). The error function in this case is

E_S = \sum_{i=1}^{n} \left( \frac{g(x_i) - y_i}{y_i} \right)^2 .

It is assumed here that the y_i's are nonzero.
(a) In the case where g(x) = v_1 + v_2 x, show that the minimum occurs when

\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \frac{1}{a d - b^2}
\begin{pmatrix} a & -b \\ -b & d \end{pmatrix}
\begin{pmatrix} \sum 1/y_i \\ \sum x_i/y_i \end{pmatrix} ,

where a = \sum x_i^2/y_i^2, b = \sum x_i/y_i^2, and d = \sum 1/y_i^2.
(b) Calculate v_1 and v_2 for the data in Table 8.1. On the same axes, plot the resulting line, the data points, and the line found using (8.11).
8.18. This exercise considers the least squares problem using the general linear model function g(x) = v_1 p(x) + v_2 q(x). As usual, the data are (x_1, y_1), (x_2, y_2), · · · , (x_n, y_n), where n > 2.
(a) Using the error function E(v_1, v_2) = \sum_{i=1}^{n} [g(x_i) - y_i]^2, show that

\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \frac{1}{\sum p_i^2 \sum q_i^2 - \left( \sum p_i q_i \right)^2}
\begin{pmatrix} \sum q_i^2 & -\sum p_i q_i \\ -\sum p_i q_i & \sum p_i^2 \end{pmatrix}
\begin{pmatrix} \sum p_i y_i \\ \sum q_i y_i \end{pmatrix} ,

where p_i = p(x_i) and q_i = q(x_i).
(b) Show that the solution in part (a) reduces to the one given in (8.11) in the case where p(x) = 1 and q(x) = x.
(c) Suppose the model function is g(x) = v_1 x + v_2 x^2 and the data is given in Table 8.6. Find v_1 and v_2.
(d) Suppose the model function is g(x) = v_1 sin(x) + v_2 sin(2x) and the data is given in Table 8.6. Explain why it is not possible to determine v_1 and v_2.
(e) The question is, what requirement is needed to prevent the problem in part (d), or more generally what is needed so the solution in part (a) is defined. Setting p = (p_1, p_2, · · · , p_n) and q = (q_1, q_2, · · · , q_n), explain why the needed condition is that the angle θ between p and q is not 0 or π. Note that if either p or q is the zero vector then θ = 0.
(f) What is the matrix A, as defined in (8.14), for this problem? Explain why the condition derived in part (e) is equivalent to the requirement that the columns of A are independent.
(g) What are p and q for part (d)?
(h) If n = 3 and p = (1, -1, 1), give two nonzero examples for q where the solution in part (a) is not defined.
1115 Section 8.4
1116 8.19. In R3, find an orthogonal matrix that reflects vectors through the given
1117 plane: (a) x – 3y + 4z = 0, (b) 4x – y – 2z = 0, (c) x = 0.
1118 8.20. Use the Householder method to find a QR factorization for the follow-
1119 ing:
1120 (a) 0 0 2 –2 1121 (b) 1 2 1 2
1122 (c) ⎛ ⎝1 1 0 1 1 – –1 1 1 0⎞ ⎠ 1123 (d) ⎛ ⎝0 1 1 0 0 0 2 –1 1⎞ ⎠
1124
(e) | 0 | 2 | 1125 |
0 | 1 |
⎛ ⎝1 –1⎞ ⎠ (f) ⎛ ⎝1 1 1⎞ ⎠
8.21. How many QR factorizations are there of

A = \begin{pmatrix} 2 & 0 \\ 1 & -1 \end{pmatrix} ?
1128 8.22. The following are true or false. If true, provide an explanation why it
1129 is true, and if it is false provide an example demonstrating this along with
1130 an explanation why it shows the statement is incorrect.
1131 (a) If Q is orthogonal then so is –Q.
1132 (b) If Q is orthogonal then so is I + Q.
1133 (c) If Q is orthogonal then so is Q2.
1134 (d) If Q and A are n × n, and Q is orthogonal, then ||QA||2 = ||A||2.
8.23. Using the fact that if Q is an n × n orthogonal matrix then ||Qx||_2 = ||x||_2, for any x ∈ R^n, show the following:
1137 (a) κ2(Q) = 1.
1138 (b) For any n × n invertible matrix A, κ2(AQ) = κ2(A).
1139 (c) For any n × n invertible matrix A, κ2(QA) = κ2(A).
8.24. Suppose

Q = \begin{pmatrix} a & b \\ c & d \end{pmatrix}

is an orthogonal matrix.
(a) Explain why it is always possible to find θ and φ so that a = cos θ, c = sin θ, and b = cos φ, d = sin φ.
(b) Using the result from part (a), show that the only two possibilities for Q are

Q = \begin{pmatrix} \cos θ & -\sin θ \\ \sin θ & \cos θ \end{pmatrix}
\quad or \quad
Q = \begin{pmatrix} \cos θ & \sin θ \\ \sin θ & -\cos θ \end{pmatrix} .

For the record, the one on the left is a rotation and the one on the right is a reflection.
8.25. Find the connection(s) between the numbers a, b, c, and d so the following is a Householder matrix:

\begin{pmatrix} a & b \\ c & d \end{pmatrix} .
1153 8.26. This problem concerns some of the properties of the Householder
1154 matrix in (8.37).
1155 (a) Explain why H is symmetric.
1156 (b) Using (8.37), show that H2 = I.
1157 (c) Find Hv.
1158 (d) If v is a normal vector then so is αv for any nonzero number α. Show
1159 that the Householder matrix obtained using αv is exactly the same as
1160 the one obtained using v.
8.27. If A is an invertible n × n matrix, then A^T A is symmetric and positive definite. Because of this, it is possible to write the Cholesky factorization of the product as A^T A = LDL^T, where L is a unit lower triangular matrix and D is diagonal with positive entries along the diagonal (see Exercise 3.14). Show that a QR factorization of A has Q = AL^{-T}D^{-1/2} and R = D^{1/2}L^T. To do this you need to show that the matrices are orthogonal and upper triangular, respectively, and they multiply to give A.
1168 8.28. This problem explores using QR instead of LU when solving Ax = b.
1169 In what follows assume that A is an n × n invertible matrix.
1170 (a) Assuming the QR factorization of A is known, using an explanation
1171 similar to that used in Section 3.1, show that Ax = b can be solved
1172 by first computing y = QT b, and then solving Rx = y. In this problem
1173 this is called the QR approach.
Figure 8.19 Example data set to be fitted by an ellipse in Exercise 8.30.
1174 (b) Explain why R is invertible if A is invertible. Also, using the fact that
1175 ||Qx||2 = ||x||2, ∀ x, show that κ2(R) = κ2(A).
1176 (c) Use the QR approach to solve, by hand, Ax = b when A is given in
1177 (8.44) and b = (–3, 1, 1)T .
1178 (d) Use the QR approach to solve the matrix equation in (3.11), for n =
1179 4, 8, 12, 16, 160, 1600, and then report the error as in Table 3.3. Comment
1180 on any differences you observe with the LU results.
1181 (e) Use the QR approach to solve the Vandermonde matrix equation in
1182 (3.12), for n = 4, 8, 12, 16, 160, and then report the error as in Table 3.4.
1183 Comment on any differences you observe with the LU results.
1184 Section 8.5
1185 8.29. Assume the data are (x1, y1), (x2, y2), · · · , (xn, yn), with y1 < y2 <
1186 · · · < yn. Also, the model function is a constant, so g(x) = v1.
(a) For n = 3, find v_1 that minimizes ||g - y||_1.
(b) For n = 3, find v_1 that minimizes ||g - y||_∞.
(c) How do your answers from (a) and (b) change for general n, with n ≥ 3?
(d) In data analysis, words such as mean, median, harmonic mean, midrange, and root mean square are used. Do any of these apply to your answers in parts (a)-(c)?
(e) Answer part (d) for Exercise 8.6.
1194 8.30. This exercise explores fitting an ellipse to a data set, and an example
1195 of such data is given in Figure 8.19. The formula for the ellipse is
\left( \frac{x}{a} \right)^2 + \left( \frac{y}{b} \right)^2 = 1 ,
1197 where a and b are positive and to be determined from the curve fitting. In
1198 what follows you are to find an error function and from this find a and b.
1199 (a) For a given a and b, write down an expression that can be used as a
1200 measure of the error between a data point (xi, yi) and the curve. The
1201 requirements on this expression are: it is non-negative and only zero if
1202 (xi, yi) is on the ellipse.
1203 (b) Using your expression from part (a), sum over the data points to produce
1204 an error function E(a, b).
1205 (c) Minimize E(a, b) and find explicit formulas for a and b. Note that if your
1206 error function does not result in you being able to find explicit formulas,
1207 then you need to redo parts (a) and (b) so you can.
1208 Section 8.6
8.31. Given the model function and the (x_i, y_i) data values, do the following: (a) Transform the model function into one of the form G(X) = V_1 + V_2 X, and give the corresponding (X_i, Y_i) data values (similar to Table 8.5). (b) Calculate V_1, V_2 and the corresponding values of v_1, v_2 using the error function (8.52), and also using (8.54). (c) Plot the (x_i, y_i) data and the two resulting model functions on the same axes for 0 ≤ x ≤ 2. Which is the better error function, (8.52) or (8.54)?
(a) g(x) = ln(v_1 + v_2 x^2) ;   data: (0, -1), (1, 1), (2, 2)
(b) g(x) = 1/\big( v_1 + v_2 sin(πx/4) \big)^2 ;   data: (0, 1), (1, 4), (2, 9)
(c) g(x) = v_1 10^{v_2 x} ;   data: (0, 2), (1, 5), (2, 12)
(d) g(x) = (x + v_1)/(x + v_2) ;   data: (0, 5), (1, 3), (2, 2)
8.32. The elliptical path of a comet is described using the equation

r = \frac{p}{1 + ε \cos θ} ,

where r is the radial distance to the Sun, θ is the angular position, ε is the eccentricity of the orbit, and p is a rectum parameter. In this exercise you are to use data for the comet 27P/Crommelin, which is given in Table 8.7.
θi | 0.00 | 1.57 | 3.14 | 4.71 | 6.28 |
ri | 0.62 | 1.23 | 16.14 | 1.35 | 0.62 |
Table 8.7 Data for the comet 27P/Crommelin, used in Exercise 8.32.
(a) Writing the model function as r = g(θ), and using least squares, the error function is

E(p, ε) = \sum_{i=1}^{n} \big[ g(θ_i) - r_i \big]^2 .

What two equations need to be solved to find the value(s) of p and ε that minimize this function?
(b) By writing the model function as

\frac{1}{r} = \frac{1 + ε \cos θ}{p} ,

explain how the nonlinear regression problem can be transformed into one with the model function R = V_1 + V_2 \cos(θ). Also, what happens to the data values?
(c) Writing the model function in part (b) as R = G(θ), and using the least squares error function

E(V_1, V_2) = \sum_{i=1}^{n} \big[ G(θ_i) - R_i \big]^2 ,

compute V_1 and V_2. Using these values, determine p and ε.
(d) Redo part (c) but use the relative least squares error function

E_S(V_1, V_2) = \sum_{i=1}^{n} \left( \frac{G(θ_i) - R_i}{R_i} \right)^2 .
1247 (e) Does part (c) or does part (d) produce the better answer? You need to
1248 provide an explanation for your conclusion, using a quantified compari-
1249 son, graphs, and/or other information.
8.33. This exercise explores using the power law function (8.55) with the data in Table 8.8. What is given, for each animal, is its typical mass and its maximum relative speed. The latter is the animal's maximum speed divided by its body length, where the body length does not include the tail and is measured in meters.
(a) Taking x to be the mass, fit the power law to the data in Table 8.8 by transforming it into a problem in linear regression. Compute the parameters using (8.52). With this, plot the data and power law curve using a log-log plot. Finally, compute the value of the nonlinear error function E = \sum_{i=1}^{n} (v_1 x_i^{v_2} - y_i)^2.
1260 (b) Do part (a) but, instead of (8.52), use (8.54).
1261 (c) What was the running speed of a Tyrannosaurus rex? Assume here that
1262 its tail was approximately 1/3 of its total body length. Explain which
1263 regression method you used to answer this question, and why you made
1264 this choice.
Animal | Mass (kg) | Relative Speed (1/s) |
canyon mouse | 1.37e–02 | 39.1 |
chipmunk | 5.10e–02 | 42.9 |
red squirrel | 2.20e–01 | 20.5 |
snowshoe hare | 1.50e+00 | 35.8 |
red fox | 4.80e+00 | 28.7 |
human | 7.00e+01 | 7.9 |
reindeer | 1.00e+02 | 12.7 |
wildebeest | 3.00e+02 | 11.0 |
giraffe | 1.08e+03 | 3.8 |
rhinoceros | 2.00e+03 | 1.8 |
elephant | 6.00e+03 | 1.4 |
Table 8.8 Data for Exercise 8.33 adapted from Iriarte-Díaz [2002].
1265 8.34. The computing times for three matrix algorithms are given in Table
1266 8.9, which is adapted from Table 4.12. The assumption is that the computing
1267 time T can be modeled using the power law function
T = αN^β .
1269 The goal of this exercise is to find α and β. Note that you do not need to
1270 know anything about how the matrix methods work to do this exercise.
1271 (a) By transforming this into a problem in linear regression, fit the model
1272 function to the LU times, and then plot the model function and data on
1273 the same axes.
(b) For larger matrices, the flop count for LU is \frac{2}{3}n^3. Is there any connection
1275 between the values in this formula and your values from part (a)? Should
1276 there be?
(c) Show that to minimize the least squares error E = \sum_i [T(N_i) - T_i]^2 one gets that

α = \frac{\sum_i T_i N_i^{β}}{\sum_i N_i^{2β}} .
1280 (d) The result in part (c) means E(α, β) can be written as only a function
1281 of β, in other words E(α, β) = G(β). It is not unreasonable to expect
1282 that the computing time is a reflection of the flops used to calculate the
1283 factorization. If so, then β should be a positive integer. What positive
integer value of β minimizes G(β)? What is the resulting value for α?
N | LU | QR | SVD |
200 | 0.0003 | 0.0006 | 0.0050 |
400 | 0.0011 | 0.0021 | 0.0192 |
600 | 0.0021 | 0.0046 | 0.0392 |
800 | 0.0037 | 0.0085 | 0.0769 |
1000 | 0.0061 | 0.0143 | 0.1248 |
2000 | 0.0280 | 0.0890 | 0.9859 |
4000 | 0.1585 | 0.5242 | 5.6501 |
Table 8.9 Data for Exercise 8.34. Computing time, in seconds, for an LU, QR, and SVD factorization of a random N × N matrix using MATLAB.
1285 Also, using these values, plot the model function and data on the same
1286 axes. Make sure to explain how you find β.
1287 (e) Redo (a) for the QR times.
1288 (f) Redo (a) for the SVD times.
1289 Section 8.7
1290 The exercises to follow require synthetic data, which is either provided or
1291 you generate. For the latter, suppose that by using either the exact solution,
1292 or a numerical solution, you determine values y = (y1, y2, · · · , yn)T of the
1293 solution at the time points t1, t2, · · · , tn. By using the MATLAB commands:
1294 q=a*(2*rand(n,1)-1); yd=(1+q).*y; you can produce synthetic data. In
1295 doing this, the synthetic data values will satisfy (1 – a)y ≤ yd ≤ (1 + a)y.
1296 For example, in Figure 8.16, a = 0.1 while in Figure 8.17, a = 0.2.
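As an illustration of this recipe (not part of any exercise), the following generates a synthetic data set from the logistic solution used in Example 8.12; the values of n, a, and the initial condition y(0) = 1 are assumptions.

    % Generating synthetic data from the logistic solution with alpha = 2, beta = 3.
    n  = 79;  a = 0.1;
    t  = sort(2*rand(n,1));                 % random time points in (0, 2)
    y  = 3./(1 + 2*exp(-6*t));              % exact solution with y(0) = 1
    q  = a*(2*rand(n,1) - 1);
    yd = (1 + q).*y;                        % satisfies (1-a)*y <= yd <= (1+a)*y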
8.35. This problem concerns the Bernoulli equation

y' + y^3 = \frac{y}{a + t} , \quad for t > 0,

where y(0) = 1. The exact solution is

y(t) = \frac{a + t}{\sqrt{ β + \frac{2}{3}(a + t)^3 }} ,

where β = a^2(1 - 2a/3).
1302 (a) Taking a = 0.01, generate synthetic data for this problem using 400
1303 equally spaced time points over the interval 0 < t ≤ 2.
1304 (b) Using either the data from part (a), or a data set provided, find a. Make
1305 sure to explain the method you use to solve this problem. Also, plot the
1306 data and numerical solution using these parameter values.
8.36. In the description of the dynamics of excitons in semiconductor physics, one finds the following problem:

n' = γ - α n^2 x ,
x' = α n^2 x - \frac{x}{1 + x} ,

where n(0) = 1 and x(0) = 1. Also, α and γ are constants that satisfy 0 < γ < 1 and α > 0. Note, x(t) and n(t) are densities and are therefore non-negative.
1315 (a) Taking α = 0.1 and γ = 0.2, generate synthetic data for n and x using
1316 500 equally spaced time points over the interval 0 < t ≤ 200.
1317 (b) Using either the data from part (a), or a data set provided, find α and γ.
1318 Also, plot the data and numerical solution using these parameter values.
1319 If you modify the minimization code make sure to explain why you did
1320 so, and what exactly you did change.
1321 Additional Questions
8.37. Typically, in reporting experimental data, at any given x_i, the mean y_i and standard deviation σ_i are given. Examples are shown in Figures 5.1 and 8.12. The objective of this exercise is to derive a regression procedure that produces a predicted value y that: i) of foremost importance, falls within the interval y_i - σ_i < y < y_i + σ_i, and ii) of secondary importance, has y ≈ y_i.
(a) Setting E_i = \left( \frac{y - y_i}{σ_i} \right)^p, on the same axes, sketch E_i as a function of y for p = 2, 4, 8. Use this to explain why taking a larger p has the effect of giving more weight to objective (i).
(b) Taking p = 2, y = a + bx, and the error function E = \sum_{i=1}^{n} E_i, find a and b that minimize E.
1333 (c) Suppose one gets better, and presumably more expensive, experimental
1334 equipment so the standard deviations are all reduced by a factor of, say,
1335 10. Assuming the same yi values are obtained, explain why the answer
1336 in part (b) is unchanged.
8.38. This exercise considers using a cubic spline as the model function when using least squares. An example of this is shown in Figure 8.20. Assume the data points are (x_1, y_1), (x_2, y_2), · · · , (x_n, y_n). These are going to be fitted with the cubic spline function

s(x) = \sum_{j=0}^{m+1} a_j B_j(x),

where B_j(x) is given in (5.23). The nodes for the B-splines are assumed to be \bar{x}_1, \bar{x}_2, · · · , \bar{x}_m, where the \bar{x}_j's are equally spaced between \bar{x}_1 and \bar{x}_m.
Figure 8.20 Temperature data over a two year period, and the cubic spline fit using least squares as considered in Exercise 8.38 [NCEI, 2015].
Note that in (5.23), h is the spacing between the \bar{x}_j points. In this exercise, m is taken to be given, and then least squares is used to determine the a_j's. Also, the reason for including j = 0 and j = m + 1 in the sum is explained in Section 5.4.1.
(a) Show that to obtain the minimum of the error function

E(a_0, a_1, · · · , a_{m+1}) = \sum_{i=1}^{n} \big[ s(x_i) - y_i \big]^2

one needs to solve Ca = d, where a = (a_0, a_1, · · · , a_{m+1})^T, C = BB^T, d = By, y = (y_1, y_2, · · · , y_n)^T, and B is a (m + 2) × n matrix. In particular, the (ℓ, k) entry of B is B_{ℓ-1}(x_k).
(b) What value should you take for \bar{x}_1? What about \bar{x}_m?
1353 (c) Taking m = 3, compute the aj’s using the data in Table 8.10. With this,
1354 plot s(x) and the data on the same axes, for 1 ≤ x ≤ 715. Also, compute
1355 ||C||∞ and comment on whether the matrix is ill-conditioned.
1356 (d) Redo part (c), but take m = 4, m = 5, m = 6, and m = 13.
1357 (e) Based on your results from parts (c) and (d), for a data set with n points,
1358 what do you recommend to use for m? In answering this, keep in mind
1359 that the model function should be capable of reproducing the more sig-
1360 nificant trends in the data, as well as provide a reasonable approximation
1361 over the entire interval.
x | 1 | 48 | 94 | 144 | 193 | 242 | 286 | 334 | 382 | 430 | 474 | 532 | 592 | 651 | 715 |
y | -5.6 | -5 | 10.6 | 17.2 | 25 | 27.2 | 17.2 | 6.1 | -4.3 | -3.3 | 23.3 | 22.8 | 31.7 | 25.6 | 12.8 |
Table 8.10 Temperature data for Exercise 8.38(c) [NCEI, 2015]. Note, x is measured in
days, with x = 1 corresponding to January 1, and x = 715 corresponding to December 15
of the following year.