author: niplav, created: 2022-10-19, modified: 2022-11-25, language: english, status: in progress, importance: 2, confidence: likely
Solutions to the textbook “Introduction to Optimization”.
See here.
Some standard numpy code:
import numpy as np
import math

def givec(x, c=10):
    # diagonal matrix C with entries c^(i/(n-1)) for i = 0, ..., n-1
    n=len(x)
    iexp=np.array(range(0,n))
    cdiag=c**(iexp/((n-1)*np.ones(n)))
    C=np.diag(cdiag)
    return C

def fsq(x, c=10):
    C=givec(x, c=c)
    return x.T@C@x

def fhole(x, c=10, a=0.1):
    retval=fsq(x, c=c)
    return retval/(a**2+retval)
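As a quick sanity check (my addition, not part of the exercise), evaluating both at $x=(1,1)^\top$ with the defaults gives $f_{\text{sq}}(x)=1+10=11$ and $f_{\text{hole}}(x)=11/11.01\approx 0.9991$, illustrating how the hole function squashes everything into $[0,1)$:

x=np.array([1.0, 1.0])
print(fsq(x))    # 11.0
print(fhole(x))  # ~0.9991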
The derivative of $f_{\text{sq}}(x)=x^\top C x$ is $\nabla f_{\text{sq}}(x)=(C+C^\top)x$ (as per identity 81 from the Matrix Cookbook).
Which is excellent, since identity 2.3 of Maths for Intelligent Systems (Toussaint 2022) gives $\frac{\partial}{\partial x}\left(x^\top C x\right)=x^\top C^\top + x^\top C$, which would be $x^\top(C+C^\top)=\left((C+C^\top)x\right)^\top$ and therefore the same as above (and as in identity 97 from the Matrix Cookbook).
So then
def fsqgrad(x, c=10):
    # gradient of fsq: (C^T+C)x, which equals 2Cx since C is diagonal
    C=givec(x, c=c)
    return (C.T+C)@x
Since $f_{\text{sq}}(x)$ and $a^2+f_{\text{sq}}(x)$ are scalars, one can use a simple quotient rule:

$$\nabla f_{\text{hole}}(x)=\frac{\nabla f_{\text{sq}}(x)\left(a^2+f_{\text{sq}}(x)\right)-f_{\text{sq}}(x)\,\nabla f_{\text{sq}}(x)}{\left(a^2+f_{\text{sq}}(x)\right)^2}$$
The implementation is straightforward:

def fholegrad(x, c=10, a=0.1):
    retval_fsq=fsq(x, c=c)
    retval_fsqgrad=fsqgrad(x, c=c)
    # quotient rule; the numerator simplifies to a^2 * fsqgrad(x)
    return (retval_fsqgrad*(a**2+retval_fsq)-retval_fsq*retval_fsqgrad)/(a**2+retval_fsq)**2
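To check the analytical gradients (my addition, not part of the exercise), one can compare them against central finite differences; at an arbitrary test point both differences should be close to zero:

def numgrad(f, x, eps=1e-6):
    # central finite differences as a reference gradient
    grad=np.zeros(len(x))
    for j in range(len(x)):
        e=np.zeros(len(x))
        e[j]=eps
        grad[j]=(f(x+e)-f(x-e))/(2*eps)
    return grad

xtest=np.array([0.3, -0.7])
print(fsqgrad(xtest)-numgrad(fsq, xtest))      # should be ~[0, 0]
print(fholegrad(xtest)-numgrad(fhole, xtest))  # should be ~[0, 0]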
The function can be pretty directly translated from the pseudocode in the book to Python (no surprise here):
def graddesc(f, fgrad, alpha=0.001, theta=10e-20, x0=[1,1], c=10, a=0.1):
    x=np.array(x0)
    i=0
    prevcost=-math.inf
    curcost=f(x)
    # iterate until the improvement in cost falls below theta
    while abs(prevcost-curcost)>theta:
        print("#iteration: ", i)
        print("current cost: ", curcost)
        prevcost=curcost
        x=x-alpha*fgrad(x)
        curcost=f(x)
        i=i+1
    print("solution: ", x)
    print("iterations: ", i)
    print("cost: ", f(x))
    return x
Executing graddesc with fsq/fsqgrad and fhole/fholegrad takes ~9.5k iterations to find a local minimum for fsq, and ~185k iterations for fhole. But it finds the minimum:
print("fsq:")
graddesc(fsq, fsqgrad)
print("fhole:")
graddesc(fhole, fholegrad)
fsq:
solution: [4.98354923e-09 1.65118900e-84]
iterations: 9549
cost: 2.4835762963261225e-17
fhole:
solution: [ 4.13240000e-11 -7.98149419e-19]
iterations: 184068
cost: 1.7076729729684466e-19
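As a further sanity check (my addition, not from the book), an off-the-shelf optimizer given the same analytic gradients should land on (approximately) the same minimizer at the origin:

from scipy.optimize import minimize

res_fsq=minimize(fsq, np.array([1.0, 1.0]), jac=fsqgrad)
res_fhole=minimize(fhole, np.array([1.0, 1.0]), jac=fholegrad)
print(res_fsq.x, res_fhole.x)  # both should be close to [0, 0]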
def backtrack_graddesc(f, fgrad, alpha=0.05, theta=10e-15, x0=[1,1], c=10, a=0.1, rho_ls=0.01, rho_plus=1.2, rho_minus=0.5, delta_max=math.inf):
    x=np.array(x0)
    i=0
    prevcost=-math.inf
    curcost=f(x)
    delta=-fgrad(x)/np.linalg.norm(fgrad(x))
    # stop once the accepted step becomes smaller than theta
    while np.linalg.norm(alpha*delta)>=theta:
        # recompute the normalized steepest-descent direction at the current point
        delta=-fgrad(x)/np.linalg.norm(fgrad(x))
        # backtracking: shrink alpha until the sufficient-decrease condition holds
        while f(x+alpha*delta)>f(x)+rho_ls*fgrad(x)@(alpha*delta).T:
            alpha=rho_minus*alpha
        prevcost=curcost
        x=x+alpha*delta
        alpha=min(rho_plus*alpha, delta_max)
        curcost=f(x)
        i=i+1
    return x
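For completeness (my addition, not part of the exercise), the same two objectives can be run through it; both calls should again return a point close to the origin:

print(backtrack_graddesc(fsq, fsqgrad))
print(backtrack_graddesc(fhole, fholegrad))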
First we compute the derivative of :
(since and ). So the negative gradient is
, so
Then
The derivative of that is . Setting it to zero gives .
The next then is (going by the values we computed) .
The new gradient then is
Now computing :
which is smaller than zero, so our new is .
Then is .
We now go through the whole exercise a second time:
Then the derivative is approximately , and the new is the solution to that, namely .
Then the new is . The new is , and , which is this time greater than zero so we can keep it.
The new therefore is .
…And we're done with the second round. I'm not going to do this for , in case you were wondering; I hope I've shown that I can do it, and it's tedious to type out.
Intuitively, this makes sense: We walked all the way to the minimum in the direction of the current gradient, so the next direction should not go "back" in the direction where we came from, or further than we needed to go.
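A small numerical illustration of this (my addition, using $f_{\text{sq}}$): for a quadratic, the exact line-search step along a direction $\delta$ is $\alpha^*=-\frac{\delta^\top\nabla f(x)}{2\,\delta^\top C\delta}$, and after taking it the new gradient is orthogonal to $\delta$:

x=np.array([1.0, 1.0])
C=givec(x)
g=fsqgrad(x)                  # gradient at x, equals 2Cx
d=-g/np.linalg.norm(g)        # normalized steepest-descent direction
alpha_star=-(d@g)/(2*d@C@d)   # exact minimizer of fsq(x+alpha*d) over alpha
x_new=x+alpha_star*d
print(fsqgrad(x_new)@d)       # ~0: the next gradient is orthogonal to the last step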
Not gonna show you the sketch, of course. (I also don't have time right now to learn manim for that, though it would be cool).
The solution is .
The solution is .
Since the gradient is "pulling" to the bottom left, the best one can do is to set to zero. So .
The second constraint can be pictured the following way: On the axis there is a parabola forming a parabolic cylinder which is zero where is zero, and grows symmetrically in both other directions. At and this cylinder is greater than , which we want to avoid. So we know that .
But what is ? It should be as small as possible. Plugging in tells us that this is the best we can do.