
author: niplav, created: 2022-10-19, modified: 2022-11-25, language: english, status: in progress, importance: 2, confidence: likely

Solutions to the textbook “Introduction to Optimization”.

Contents

Solutions to “Introduction to Optimization”

4.0.1

See here.

4.11

(i)

Some standard numpy code:

import numpy as np

def givec(x, c=10):
    # diagonal matrix with entries c^(i/(n-1)) for i=0,...,n-1,
    # i.e. eigenvalues ranging from 1 to c (condition number c)
    n=len(x)
    iexp=np.array(range(0,n))
    cdiag=c**(iexp/((n-1)*np.ones(n)))
    C=np.diag(cdiag)
    return C

def fsq(x, c=10):
    C=givec(x, c=c)
    return x.T@C@x

def fhole(x, c=10, a=0.1):
    # squashes fsq into [0,1): steep near the minimum, nearly flat far away
    retval=fsq(x, c=c)
    return retval/(a**2+retval)

The derivative of f_{\text{sq}}=x^{\top}Cx is (C+C^{\top})x (as per identity 81 from the Matrix Cookbook).

This is excellent, since identity 2.3 of Maths for Intelligent Systems (Toussaint 2022) gives \frac{\partial}{\partial x} x^{\top}Cx=x^{\top}C \frac{\partial}{\partial x}x+x^{\top} C^{\top} \frac{\partial}{\partial x} x=x^{\top}C+x^{\top}C^{\top}, which, transposed from a row vector into a column vector, is C^{\top}x+Cx=(C^{\top}+C)x and therefore the same as above (and as identity 97 from the Matrix Cookbook).

So then

def fsqgrad(x, c=10):
    C=givec(x, c=c)
    return (C.T+C)@x

Since both the numerator and the denominator of f_{\text{hole}} are scalars, one can compute its gradient with the simple quotient rule:

\nabla f_{\text{hole}}(x)=\\ \frac{(C^{\top}+C)x\cdot(a^2+x^{\top}Cx)-x^{\top}Cx\cdot (C^{\top}+C)x}{(a^2+x^{\top}Cx)^2}= \\ \frac{(C^{\top}+C)x\cdot(a^2+x^{\top}Cx-x^{\top}Cx)}{(a^2+x^{\top}Cx)^2}= \\ \frac{a^2 \cdot (C^{\top}+C)x}{(a^2+x^{\top}Cx)^2}

The implementation is then straightforward:

def fholegrad(x, c=10, a=0.1):
    retval_fsq=fsq(x, c=c)
    retval_fsqgrad=fsqgrad(x, c=c)
    # quotient rule as above; this simplifies to a**2*retval_fsqgrad/(a**2+retval_fsq)**2
    return (retval_fsqgrad*(a**2+retval_fsq)-retval_fsq*retval_fsqgrad)/(a**2+retval_fsq)**2
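
To gain some confidence in the analytical gradients, one can compare them against a central finite-difference approximation (this check is not part of the exercise, and the helper numgrad is mine):

def numgrad(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    g=np.zeros(len(x))
    for i in range(len(x)):
        e=np.zeros(len(x))
        e[i]=eps
        g[i]=(f(x+e)-f(x-e))/(2*eps)
    return g

xtest=np.array([0.3, -1.2])
print(np.allclose(fsqgrad(xtest), numgrad(fsq, xtest)))      # should print True
print(np.allclose(fholegrad(xtest), numgrad(fhole, xtest)))  # should print True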

(ii)

The function can be pretty directly translated from the pseudocode in the book to Python (no surprise here):

import math

def graddesc(f, fgrad, alpha=0.001, theta=10e-20, x0=[1,1], c=10, a=0.1):
    x=np.array(x0)
    i=0
    prevcost=-math.inf
    curcost=f(x)
    while abs(prevcost-curcost)>theta:
        print("#iteration: ", i)
        print("current cost: ", curcost)
        prevcost=curcost
        x=x-alpha*fgrad(x)   # fixed-stepsize gradient step
        curcost=f(x)
        i=i+1
    print("solution: ", x)
    print("iterations: ", i)
    print("cost: ", f(x))
    return x

Executing graddesc with fsq/fsqgrad and fhole/fholegrad takes ~9.5k iterations to find the minimum of fsq, and ~184k iterations for fhole. But it does find the minimum (per-iteration printouts omitted below):

print("fsq:")
graddesc(fsq, fsqgrad)
print("fhole:")
graddesc(fhole, fholegrad)

fsq:
solution:  [4.98354923e-09 1.65118900e-84]
iterations:  9549
cost:  2.4835762963261225e-17
fhole:
solution:  [ 4.13240000e-11 -7.98149419e-19]
iterations:  184068
cost:  1.7076729729684466e-19
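
As an additional sanity check (not part of the exercise), one can compare against an off-the-shelf optimizer; the snippet below is a sketch and assumes scipy is installed:

from scipy.optimize import minimize

# BFGS with the analytical gradients; both runs should end up
# very close to the origin, the analytical minimum
print(minimize(fsq, np.array([1.0, 1.0]), jac=fsqgrad).x)
print(minimize(fhole, np.array([1.0, 1.0]), jac=fholegrad).x)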

(iii)

This is gradient descent again, but with backtracking line search over the step length α (shrinking α as long as the sufficient-decrease condition is violated, and growing it again after each accepted step):

def backtrack_graddesc(f, fgrad, alpha=0.05, theta=10e-15, x0=[1,1], c=10, a=0.1, rho_ls=0.01, rho_plus=1.2, rho_minus=0.5, delta_max=math.inf):
    x=np.array(x0)
    i=0
    delta=-fgrad(x)/np.linalg.norm(fgrad(x))
    while np.linalg.norm(alpha*delta)>=theta:
        # backtracking: shrink alpha until the sufficient-decrease condition holds
        while f(x+alpha*delta)>f(x)+rho_ls*fgrad(x)@(alpha*delta):
            alpha=rho_minus*alpha
        x=x+alpha*delta
        alpha=min(rho_plus*alpha, delta_max)
        delta=-fgrad(x)/np.linalg.norm(fgrad(x))  # recompute the normalized descent direction
        i=i+1
    return x
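
Called the same way as before (a usage sketch; the function only returns the solution, so there is no iteration count to report):

print(backtrack_graddesc(fsq, fsqgrad))
print(backtrack_graddesc(fhole, fholegrad))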

(iv)

First we compute the derivative of the objective ||y-Xβ||^2+λ||β||^2 (whose \underset{β}{\text{argmin}} we are looking for):

\frac{\partial (||y-Xβ||^2+λ||β||^2)}{\partial β}= \\ \frac{\partial ||y-Xβ||^2}{\partial β} + \frac{\partial λ||β||^2}{\partial β}= \\ \frac{\partial (y-Xβ)^{\top}(y-Xβ)}{\partial β}+2λβ=\\ -2X^{\top}(y-Xβ)+2λβ
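
As a small numerical sketch (the names fridge/fridgegrad and the random test data are mine, not from the book), this gradient can again be checked against the finite-difference helper numgrad from above:

lam=0.5
rng=np.random.default_rng(0)
X=rng.normal(size=(20, 3))
y=rng.normal(size=20)

def fridge(beta):
    # ridge regression objective ||y - X beta||^2 + lam*||beta||^2
    return np.sum((y-X@beta)**2)+lam*np.sum(beta**2)

def fridgegrad(beta):
    # the gradient derived above: -2 X^T (y - X beta) + 2 lam beta
    return -2*X.T@(y-X@beta)+2*lam*beta

beta=rng.normal(size=3)
print(np.allclose(fridgegrad(beta), numgrad(fridge, beta)))  # should print True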

4.3.1

4.3.3

f(x)=\frac{1}{2}x^{\top} A x+b^{\top}x+c=\\ \frac{1}{2} x^{\top} A x=\\ \frac{1}{2} x^{\top} \left [\matrix{2 & 0 \cr 0 & 1}\right ] x

(since b=\mathbf{0} and c=0 ). So the negative gradient is

-\nabla f(x)= \\ -\frac{1}{2} x^{\top} (A+A^{\top})=\\ -\frac{1}{2} x^{\top} \left [\matrix{4 & 0 \cr 0 & 2}\right ]

(i)

x_0=\left [ \matrix{1 \cr 1} \right ] , so

δ_0=\\ -\nabla f(x_0)=\\ -\frac{1}{2} \left [ \matrix{1 & 1} \right ] \left [\matrix{4 & 0 \cr 0 & 2}\right ]=\\ -\left [ \matrix{2 & 1} \right ]

Then

f(x+αδ)=\\ \frac{1}{2} (\left [ \matrix{1 \cr 1} \right]-α\left [ \matrix{2 \cr 1} \right ])^{\top}\left [\matrix{2 & 0 \cr 0 & 1}\right ](\left [ \matrix{1 \cr 1} \right]-α\left [ \matrix{2 \cr 1} \right ])=\\ \frac{1}{2} (\left [ \matrix{1 \cr 1} \right]-α\left [ \matrix{2 \cr 1} \right ])^{\top}(\left [ \matrix{2 \cr 1} \right]-α\left [ \matrix{4 \cr 1} \right ])=\\ \frac{1}{2} (\left [ \matrix{1 & 1} \right]-α\left [ \matrix{2 & 1} \right ])(\left [ \matrix{2 \cr 1} \right]-α\left [ \matrix{4 \cr 1} \right ])=\\ \frac{1}{2}(3-10\cdot α+9 \cdot α^2)=\\ 1.5-5\cdot α+4.5 \cdot α^2

The derivative of that is -5+9 \cdot α . Setting it to zero gives α=\frac{5}{9} .

The next x then is (going by the values we computed) \left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]\approx\left [ \matrix{-0.111 \cr 0.444} \right ] .

The new gradient then is

g=\\ -\nabla f(x)=\\ -\frac{1}{2} \left [ \matrix{-\frac{1}{9} & \frac{4}{9}} \right ] \left [\matrix{4 & 0 \cr 0 & 2}\right ]=\\ \left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]

Now computing β :

\frac{g^{\top}(g-g')}{g'^{\top}g'}=\\ \frac{\left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]^{\top}(\left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]-\left [ \matrix{-2 & -1} \right ])}{\left [ \matrix{-2 & -1} \right ]^{\top}\left [ \matrix{-2 & -1} \right ]}= \\ \frac{20/81}{5}=\frac{4}{81}\approx 0.049

which is greater than zero, so we can keep it (otherwise we would have set β to 0 ).

Then δ=g+β\cdot δ_0=\left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]+\frac{4}{81}\cdot\left [ \matrix{-2 & -1} \right ]=\left [ \matrix{\frac{10}{81} & -\frac{40}{81}} \right ] .

We now go through the whole exercise a second time:

f(x+αδ)=\\ \frac{1}{2} (\left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]+α\left [ \matrix{\frac{10}{81} \cr -\frac{40}{81}} \right ])^{\top}\left [\matrix{2 & 0 \cr 0 & 1}\right ](\left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]+α\left [ \matrix{\frac{10}{81} \cr -\frac{40}{81}} \right ])=\\ \frac{1}{9}-\frac{20}{81}\cdot α+\frac{100}{729}\cdot α^2

Then the derivative is -\frac{20}{81}+\frac{200}{729}\cdot α , and setting it to zero gives α=\frac{9}{10} .

Then the new x is \left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]+\frac{9}{10}\cdot\left [ \matrix{\frac{10}{81} \cr -\frac{40}{81}} \right ]=\left [ \matrix{0 \cr 0} \right ] , which is exactly the minimum of f : the gradient there is zero, so the method has converged after two iterations (as expected for conjugate gradient on a two-dimensional quadratic).

…And we're done. I'm not going to do this for x_0=(-1,2) , in case you were wondering; I hope I've shown that I can do this, and it's tedious to type out.
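
As a sanity check on the arithmetic above (not part of the exercise), here is a minimal numeric version of the two conjugate gradient iterations, using the standard closed-form step size α=-\frac{∇f(x)^{\top}δ}{δ^{\top}Aδ} for exact line search on a quadratic (that formula is not derived in the text):

A=np.array([[2.0, 0.0], [0.0, 1.0]])

def fquad(x):
    return 0.5*x@A@x

def fquadgrad(x):
    return A@x  # A is symmetric, so the gradient is Ax

x=np.array([1.0, 1.0])
g=-fquadgrad(x)
delta=g
for step in range(2):
    alpha=-(fquadgrad(x)@delta)/(delta@A@delta)  # exact line search along delta
    x=x+alpha*delta
    gnew=-fquadgrad(x)
    beta=max(0.0, gnew@(gnew-g)/(g@g))  # Polak-Ribière beta, clipped at zero
    delta=gnew+beta*delta
    g=gnew
    print(step, alpha, x)
# prints α≈0.556 with x≈(-0.111, 0.444), then α=0.9 with x≈(0, 0) (up to floating point)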

(ii)

Intuitively, this makes sense: we walked all the way to the minimum along the current search direction, so the next direction should neither go "back" in the direction we came from nor go further than we needed to.

4.4.1

Not gonna show you the sketch, of course. (I also don't have time right now to learn manim for that, though it would be cool).

(i)

The solution is \frac{-π}{2} .

(ii)

The solution is \left [ \matrix{-\frac{\sqrt{2}}{2} \cr -\frac{\sqrt{2}}{2}} \right ] .

(iii)

Since the gradient is "pulling" towards the bottom left, the best one can do is to set x_1 to zero. So x=\left [ \matrix{0 \cr -1} \right ] .

(iv)

The second constraint can be pictured the following way: along the x_2 axis there is a parabola, extruded into a parabolic cylinder, which is zero where x_2 is zero and grows symmetrically in both directions. For x_2≥1 or x_2≤-1 this cylinder is greater than x_1 , which we want to avoid. So we know that x_2=-1 .

But what is x_1 ? It should be as small as possible. Plugging in x_1=0 tells us that this is the best we can do.