author: niplav, created: 2022-10-19, modified: 2022-12-20, language: english, status: in progress, importance: 2, confidence: likely
Solutions to the textbook “Maths for Intelligent Systems”.
Let me start with an example: We have three real-valued quantities $x$, $g$ and $f$ which depend on each other. Specifically, $f(x,g)=3x+2g$ and $g(x)=2x$.

Question: What is the “derivative of $f$ w.r.t. $x$”?
Intuitively, I'd say that $\frac{\partial}{\partial x}f(x,g)=3$. But then I notice that $g$ is allegedly a “real-valued quantity”: what is that supposed to mean? Is it not a function?
Alas, plugging $g$ into $f$ gives $f(x)=3x+2(2x)=7x$ and $\frac{d}{dx}f(x)=3+4=7$.
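Both answers make sense: the first is the partial derivative, the second the total derivative, and the chain rule connects them:

$$\frac{df}{dx}=\frac{\partial f}{\partial x}+\frac{\partial f}{\partial g} \cdot \frac{dg}{dx}=3+2 \cdot 2=7$$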
I… I don't know what the skew matrix is :-/, and Wikipedia isn't very helpful (I don't think it's the skew-Hermitian matrix or the skew-symmetric matrix or the skew-Hamiltonian matrix).
Writing code: This I can do.
using Random, LinearAlgebra

function gradient_check(x, f, df)
    n = length(x)
    d = length(f(x))
    ε = 1e-6                # 10^-6 would throw a DomainError for an Int base
    J = zeros(d, n)
    for i in 1:n
        unit = zeros(n)     # i-th standard basis vector
        unit[i] = 1
        # central difference approximates the i-th column of the Jacobian
        J[:,i] = (f(x + ε*unit) - f(x - ε*unit)) / (2*ε)
    end
    return norm(J - df(x), Inf) < 1e-4
end
julia> A=rand(Float64, (10, 15))
julia> f(x)=A*x
julia> df(x)=A
julia> x=randn(15)
15-element Vector{Float64}:
1.536516645971545
1.0136394994998532
-0.09863977762813898
1.3510191388362935
0.84503226122143
0.09296670831415606
-1.5390337565597376
1.4679194319980104
-0.7085023577127753
-0.10676335224166593
-0.8686753109089055
1.2912744597257453
0.7364123079861109
0.5736005534388826
0.5332386427039576
julia> gradient_check(x, f, df)
true
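This works because a linear map is its own Jacobian: $f(x)=Ax$ means $f_i(x)=\sum_j A_{ij}x_j$, so $\frac{\partial f_i}{\partial x_j}=A_{ij}$ and hence $df(x)=A$.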
And now the cooler $f$:
julia> f(x)=transpose(x)*x
f (generic function with 1 method)
julia> df(x)=2*transpose(x)
df (generic function with 1 method)
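Here $f(x)=x^Tx=\sum_i x_i^2$, so $\frac{\partial f}{\partial x_i}=2x_i$, and since the Jacobian of a scalar-valued function is a row vector, $df(x)=2x^T$. Running gradient_check(x, f, df) on this pair should again return true.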
The derivative of $σ(W_0 \times x_0)$ w.r.t. $x_0$, using the chain rule and the derivative of $\frac{dσ}{dx}=σ'$, is $σ'(W_0 \times x_0) \times W_0$ (here $σ'$ is applied elementwise, and the resulting vector is read as a diagonal matrix).

Applying this again for $W_1 \times σ(W_0 \times x_0)$, we get $W_1 \times σ'(W_0 \times x_0) \times W_0$.

Again: $\frac{d}{d x_0} σ(W_1 \times σ(W_0 \times x_0))=σ'(W_1 \times σ(W_0 \times x_0)) \times W_1 \times σ'(W_0 \times x_0) \times W_0$.

And finally:

$\frac{d}{d x_0} W_2 \times σ(W_1 \times σ(W_0 \times x_0))=W_2 \times σ'(W_1 \times σ(W_0 \times x_0)) \times W_1 \times σ'(W_0 \times x_0) \times W_0$.
Then the general formula for computing $\frac{d f}{d x_0}$ is $W_{m-1} \times \prod_{l=0}^{m-2} σ'(z_{l+1}) \times W_l$, where $m$ is the number of matrices, $z_1=W_0 \times x_0$ and $z_{l+1}=W_l \times σ(z_l)$ are the pre-activations, and $\prod$ is left matrix multiplication (each new factor multiplies from the left, so higher $l$ stands further to the left).
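As a sanity check of that formula, here's a minimal sketch, assuming $σ=\tanh$ as a concrete activation and made-up layer sizes; the weight scale, the shapes, and the helper σ′ are my own choices, and gradient_check is reused from above:

using LinearAlgebra    # for Diagonal

σ(x) = tanh(x)
σ′(x) = 1 - tanh(x)^2          # derivative of tanh

# small random weights keep σ away from saturation; sizes are arbitrary
W0, W1, W2 = 0.1*randn(20, 15), 0.1*randn(20, 20), 0.1*randn(10, 20)

f(x) = W2 * σ.(W1 * σ.(W0 * x))

function df(x)
    z1 = W0 * x                # pre-activation of the first layer
    z2 = W1 * σ.(z1)           # pre-activation of the second layer
    # W_2 × diag(σ'(z_2)) × W_1 × diag(σ'(z_1)) × W_0
    W2 * Diagonal(σ′.(z2)) * W1 * Diagonal(σ′.(z1)) * W0
end

x0 = randn(15)
gradient_check(x0, f, df)      # should return true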