This PDF 1.5 document has been generated by LaTeX with hyperref package / pdfTeX-1.40.12, and has been sent on pdf-archive.com on 23/07/2013 at 20:09, from IP address 87.2.x.x.
The current document download page has been viewed 2260 times.
File size: 202.45 KB (13 pages).
Privacy: public file
Matrix Calculus
Massimiliano Tomassoli
23/7/2013
Matrix Calculus is a set of techniques which let us differentiate functions of matrices without computing the single partial derivatives by hand. What this means will be clear in a moment.
1
The derivative
Let’s talk about notation a little bit. If f (X) is a scalar function of an m × n matrix X, then
∂f (X)
∂f (X)
·
·
·
∂x1n
∂f (X) ∂x.11 .
.. .
..
= ..
.
∂X
∂f (X)
∂f (X)
· · · ∂xmn
∂xm1
The definition above is valid even if m = 1 or n = 1, that is if f is a function of a row vector, a column
vector or a scalar, in which case the result is a row vector, a column vector or a scalar, respectively.
If f (X) is an m × n matrix function of a matrix, then
∂f (X)
∂f1n (X)
11
·
·
·
∂X
∂f (X) ∂X
..
...
= ...
.
∂X
∂fmn (X)
∂fm1 (X)
···
∂X
∂X
where the matrix above is a block matrix. The definition above is valid even when m = 1 or n = 1,
that is when f is a row vector function, a column vector function or a scalar function, in which case
the block matrix is a row of blocks, a column of blocks or just a single block, respectively.
If f is
• a scalar function of a scalar, vector or matrix, or
• a vector function of a scalar or vector, or
• a matrix function of a scalar,
then the derivative, also called Jacobian matrix, of f is
Df (x) =
∂f (x)
∂xT
For instance, if f (x) is a vector function of a vector, then
∂f (x) ∂f1 (x)
1
∂x1
∂xT
Df (x) :=
∂f (x) . .
= .. = ..
∂xT
∂fm (x)
∂fm (x)
∂xT
∂x1
···
..
.
···
∂f1 (x)
∂xn
..
.
∂fm (x)
∂xn
As another example, if f (X) is a scalar function of a matrix, then
∂f (X)
(X)
· · · ∂f
∂x11
∂xm1
∂f (X) .
.. .
..
Df (X) :=
= ..
.
.
T
∂X
∂f (X)
∂f (X)
· · · ∂xmn
∂x1n
1
.
Some authors prefer to find the gradient of a function, which is defined as the transpose of the Jacobian
matrix.
Now, how do we define the derivative of an m × n matrix function f of a p × q matrix? It’s clear
that we have mnpq partial derivatives. We could just define the derivative of f as we did above for the
other cases, but we’ll follow Magnus’s way [1, 2] and give the following definition:
Df (X) =
∂ vec f (X)
∂(vec X)T
The result is an mn × pq matrix. We’ll talk about the vec operation in a moment. For now let’s just
say that it vectorizes a matrix by stacking its columns on top of one another. More formally,
T
vec(A) = a11 · · · am1 a12 · · · am2 · · · · · · a1n · · · amn
This means that vec f (X) is a vector function and vec X is a vector.
So what about a scalar function of a matrix? According to this last definition, the derivative should
be a row vector, while according to our first definition of derivative it should be a matrix. Both are
viable options and we’ll see how easy it is to go from one to the other.
The second derivative in the multidimensional case is the Hessian matrix (or just Hessian), a square
matrix of second-order partial derivatives of a scalar function. More precisely, if f is a scalar function
of a vector, then the Hessian is defined as
∂ 2 f (x) ∂ 2 f (x)
∂ 2 f (x)
· · · ∂x
∂x1 ∂x2
∂x21
1 ∂xn
∂ 2 f (x) ∂ 2 f (x)
∂ 2 f (x)
2
· · · ∂x2 ∂xn
∂ f (x)
∂x2 ∂x1
∂x22
=
Hf (x) =
.
.
..
T
..
∂x∂x
.
.
.
.
.
.
∂ 2 f (x)
∂ 2 f (x)
∂ 2 f (x)
···
∂xn ∂x1
∂xn ∂x2
∂x2
n
Note that the Hessian is the Jacobian of the gradient (i.e. the transpose of the Jacobian) of f :
Hf (x) =
∂ 2 f (x)
∂ ∂f (x)
=
= J(∇f )(x)
∂x∂xT
∂xT ∂x
If f is a scalar function of a matrix (rather than of a vector), we vectorize the matrix:
Hf (X) =
∂ 2 f (X)
∂(vec X)∂(vec X)T
In this article we assume that the second-order partial derivatives are continuous so the order of differentiation is irrelevant and the Hessian is always symmetric.
2
The differential
The method described here is based on differentials, therefore let’s recall what a differential is.
In the one-dimensional case, the derivative of f at x is defined as
f (x + u) − f (x)
= f 0 (x)
u→0
u
lim
This can be rewritten as
f (x + u) = f (x) + f 0 (x)u + rx (u)
where rx (u) is o(u), i.e. rx (u)/u → 0 as u → 0. The differential of f at x with increment u is
df (x; u) = f 0 (x)u.
In the vector case, we have
f (x + u) = f (x) + (Df (x))u + rx (u)
2
and the differential of f at x with increment u is df (x; u) = (Df (x))u. As we can see, the differential
is the best linear approximation of f (x + u) − f (x) at x. In practice, we write dx instead of u, so,
for instance, df (x; u) = f 0 (x)dx. We’ll justify this notation in a moment, but first let’s introduce the
so-called identification results. They are needed to get the derivatives from the differentials.
The first result says that, if f is a vector function of a vector,
df (x) = A(x)dx ⇐⇒ Df (x) = A(x).
More generally, if f is a matrix function of a matrix,
d vec f (X) = A(X)d vec X ⇐⇒ Df (x) = A(X).
The second result is about the second differential and says that if f is a scalar function of a vector,
then
1
d2 f (x) = (dx)T B(x)dx ⇐⇒ Hf (x) = (B(x) + B(x)T )
2
where Hf (x) denotes the Hessian matrix.
Another important result is Cauchy’s rule of invariance which says that if h(x) = g(f (x)), then
dh(x; u) = dg(f (x); df (x; u))
This is related to the chain rule for the derivatives; in fact, it can be proved by making use of the chain
rule:
dh(x; u) = Dh(x)u
= D(g ◦ f )(x)u
= Dg(f (x))Df (x)u
= Dg(f (x))df (x; u)
= dg(f (x); df (x; u))
Now we can justify the abbreviated notation dx. If y = f (x), then we write
dy = df (x; dx)
where x and y are variables. Basically, we name the differential after the variables rather than after the
functions. But now suppose that x = g(t) and dx = dg(t; dt). We now have y = f (g(t)) = h(t) and,
therefore,
dy = dh(t; dt).
For our abbreviated notation to be consistent, it must be the case that df (x; dx) = dh(t; dt). Fortunately,
thanks to Cauchy’s rule of invariance, we can see that it is so:
dh(t; dt) = d(f ◦ g)(t; dt) = df (g(t); dg(t; dt)) = df (x; dx).
3
Two important operators
Before proceeding with the actual computation of differentials and derivatives, we need to introduce
two important operators: the Kronecker product and the vec operator. We’ve already talked a little
about the vec operator but here we’ll see and prove some useful results for manipulating expressions
involving these two operators.
As we said before, the vec operator is defined as follows:
T
vec(A) = a11 · · · am1 a12 · · · am2 · · · · · · a1n · · · amn
For instance,
T
1 2 3
vec
= 1 4 2 5 3 6
4 5 6
3
The Kronecker product between two matrix A and B is defined as follows:
a11 B · · · a1n B
..
...
A ⊗ B = ...
.
am1 B · · ·
amn B
For instance,
1
1 2
1 1
1
⊗
=
3
3 4
1 1
3
1
1
3
3
2
2
4
4
2
2
4
4
Here’s a list of properties of the Kronecker product and the vec operator:
1. A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
(associativity)
2. A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C)
(A + B) ⊗ C = (A ⊗ C) + (B ⊗ C)
3. ∀a ∈ R,
(distributivity)
a ⊗ A = A ⊗ a = aA
4. ∀a, b ∈ R,
aA ⊗ bB = ab(A ⊗ B)
(A ⊗ B)(C ⊗ D) = AC ⊗ BD
5. For conforming matrices,
6. (A ⊗ B)T = AT ⊗ B T ,
7. For all vectors a and b,
(A ⊗ B)H = AH ⊗ B H
aT ⊗ b = baT = b ⊗ aT
8. (A ⊗ B)−1 = A−1 ⊗ B −1
9. The vec operator is linear.
10. For all vectors a and b,
vec(abT ) = b ⊗ a
11. vec(AXC) = (C T ⊗ A) vec(X)
12. tr(AB) = vec(AT )T vec(B)
We’ll denote with {f (i, j)} the matrix whose generic (i, j) element is f (i, j). Note that if A is a
block matrix with blocks Bij , then
B11 ⊗ C · · · B1n ⊗ C
..
..
..
A⊗C =
.
.
.
Bm1 ⊗ C · · · Bmn ⊗ C
Point (3) follows directly from the definition, while point (4) can be proved as follows:
xA ⊗ yB = {xaij (yB)} = {xyaij B} = xy{aij B} = xy(A ⊗ B)
Now let’s prove the other points one by one.
1. A ⊗ (B ⊗ C) = {aij (B ⊗ C)} = {aij B ⊗ C} = {aij B} ⊗ C = (A ⊗ B) ⊗ C
2. A ⊗ (B + C) = {aij (B + C)} = {aij B + aij C} = {aij B} + {aij C} = A ⊗ B + A ⊗ C
The other case is analogous.
3. See above.
4. See above.
4
P
P
P
5. (A ⊗ B)(C ⊗ D) = {aij B}{cij D} = { k aik Bckj D} = {( k aik ckj ) BD} = { k aik ckj } ⊗ BD =
AC ⊗ BD
6. (A ⊗ B)T = {aij B}T = {aji B T } = AT ⊗ B T
The other case is analogous.
7. This is very easy.
8. By (5),
(A ⊗ B)(A−1 ⊗ B −1 ) = AA−1 ⊗ BB −1 = I ⊗ I = I
9. This is very easy.
a1 b 1
..
T
10. vec(ab ) = vec .
am b 1
· · · a1 b n
.. = vec b a · · ·
..
.
1
.
· · · am b n
b1 a
bn a = ... = b ⊗ a
bn a
11. Let x1 , . . . , xn be the columns of the matrix X and P
e1 , . . . , en the columns of the identity matrix
of order n. You should convince yourself that X = k xk eTk . Here we go:
! !
X
vec(AXC) = vec A
xk eTk C
k
!
X
= vec
(Axk )(eTk C)
k
=
X
vec((Axk )(eTk C))
(by (9))
((eTk C)T ⊗ (Axk ))
(by (10))
k
=
X
k
=
X
((C T ek ) ⊗ (Axk ))
=
X
k
((C T ⊗ A)(ek ⊗ xk ))
(by (5))
k
= (C T ⊗ A)
X
(ek ⊗ xk )
k
= (C T ⊗ A)
X
vec(xk eTk )
(by (10))
k
= (C T ⊗ A) vec
X
(xk eTk )
(by (9))
k
= (C T ⊗ A) vec(X)
12. The trace of the product AB is just the sum of all the products aij bij :
XX
X
XX
tr(AB) =
[AB]ii =
aik bki =
[AT ]ki bki = vec(AT )T vec(B)
i
i
i
k
k
Since we’ll be using the trace operator quite a bit, recall that:
1. tr is linear.
2. tr(A) = tr(AT )
3. tr(AB) = tr(BA)
5
The first two properties are trivial (but still very useful). Let’s see why the last one is also true:
X
XX
XX
X
tr(AB) =
[AB]ii =
aik bki =
bki aik =
[BA]kk = tr(BA)
i
i
k
k
i
k
Also note that the trace of a scalar is the scalar itself. This means that when we have a scalar function
we can add a trace. This simple observation will be very useful.
4
Basic rules of differentiation
In order to be able to differentiate expressions, we’ll need a set of simple rules:
1. dA = 0,
where A is constant
2. d(αX) = αdX,
where α is a scalar
3. d(X + Y ) = dX + dY
4. d(tr(X)) = tr(dX)
5. d(XY ) = (dX)Y + XdY
6. d(X ⊗ Y ) = (dX) ⊗ Y + X ⊗ dY
7. d(X −1 ) = −X −1 (dX)X −1
8. d|X| = |X| tr(X −1 dX)
9. d log |X| = tr(X −1 dX)
10. d(X ? ) = (dX)? ,
vec
where
?
is any operator which rearranges elements such as transpose and
A matrix is really a matrix of scalar functions and the differential of a matrix is the matrix of
the differentials of the single scalar functions. MorePformally, [dX]ij = d(Xij ). Remember that if
f is a scalar function of a vector, then df (x; u) =
i Di f (x)ui , where the Di f (x) are the partial
derivativesP
of f at x. If f is, instead, a function of a matrix, we just generalize the previous relation:
df (x; u) = ij Dij f (x)uij . That said, many of the rules above can be readily proved.
As an example, let’s prove the product rule (5). Let f and g be two scalar functions of a matrix.
Then
X
d(f g)(x; u) =
Dij (f g)(x)uij
i,j
=
X
=
X
((Dij f (x))g(x) + f (x)Dij g(x))uij
i,j
(Dij f (x))uij g(x) +
i,j
X
f (x)Dij g(x)uij
i,j
= df (x; u)g(x) + f (x)dg(x; u)
6
where we used the usual product rule for derivatives. Now we can use this result to prove the general
one about matrices of scalar functions:
[d(XY )]ij = d[XY ]ij
!
X
=d
xik ykj
k
=
X
d(xik ykj )
k
=
X
X
(dxik ykj +
xik dykj
k
k
X
X
=
[dX]ik ykj +
xik [dY ]kj
k
k
= [(dX)Y ]ij + [XdY ]ij
= [(dX)Y + XdY ]ij
Now we can prove (7) by using the product rule:
0 = dI = d(XX −1 ) = (dX)X −1 + Xd(X −1 ) =⇒ dX −1 = −X −1 (dX)X −1
P
To prove (8) we observe that, for any i = 1, . . . , n, we have |X| = j xij Cij , where Cij is the cofactor
of the element xij , i.e. (−1)i+j times the determinant of the matrix obtained by removing from X the
i-th row and the j-th column. Because Cij doesn’t depend on xij , then
∂|X| X ∂xij Cij
=
= Cij .
∂xij
∂x
ij
j
1
Now note that C is the matrix of the cofactors and recall that X −1 = |X|
C T and thus C T = |X|X −1 .
This last result will be used in the following derivation:
X
X
X
d|X| = d| · |(X; dX) =
Cij [dX]ij =
[C T ]ji [dX]ij =
[C T dX]jj = tr(C T dX) = |X| tr(X −1 dX)
i,j
i,j
j
Note that (9) follows directly from (8).
5
The special form tr(AdX)
At the beginning of this article we gave two definitions of the derivative of a scalar function of a matrix.
The first is
∂f (X)
Df (X) =
∂X T
and the second is
∂f (X)
Df (X) =
.
∂(vec X)T
In the first case the result is a matrix whereas in the second the result is a row vector. Magnus suggests
to use the second definition, but we’ll opt for the first one.
Let’s consider the differential tr(AdX). We can find the derivative according to the second definition above by using property (12) in the section “Two important operators”. We’ll also use rule of
differentiation (10) with ? = vec. We get
tr(AdX) = vec(AT )T vec dX = vec(AT )T d vec X
and, therefore,
∂ tr(AdX)
= vec(AT )T
T
∂(vec X)
7
To get the result in matrix form (first definition) we need to unvectorize the result above. We’ll proceed
step by step:
∂ tr(AdX)
∂ tr(AdX)
∂ tr(AdX)
∂ tr(AdX)
= vec(AT ) =⇒
= AT =⇒
= vec(AT )T =⇒
= A.
T
∂(vec X)
∂ vec X
∂X
∂X T
So, the result is simply A. As we saw in the previous section,
d|X| = |X| tr(X −1 dX) = tr(|X|X −1 dX).
This means that the derivative is |X|X −1 = C T which agree with the result
∂|X|
= Cij
∂xij
that we derived in the previous section.
6
Examples
In practice, the derivative of an expression involving matrices can be computed by using the rules of
differentiation to get a result of the form φ(X)dX, tr(φ(X)dX) or φ(X)d vec X. After having done
that, we can read off the derivative which is simply φ(X).
To find the Hessian we must differentiate a second time, that is we must differentiate f (x) and find
df (x; dx), and then differentiate df (x; dx) itself by keeping in mind that dx is a constant increment and
thus, for instance, d(aT dx) = 0 and d(xT Adx) = dxT Adx.
According to the second identification result, we must get a result of the form dxT φ(x)dx or of the
form (d vec X)T φ(X)d vec X and the Hessian is 21 (φ(x) + φ(x)T ). Note that if φ(x) is symmetric then
the Hessian is just φ(x).
From time to time we’ll need to use the commutation matrix Kmn , which is the permutation matrix
satisfying
Kmn vec X = vec(X T )
where X is m × n. It can be shown that
T
−1
Kmn
= Kmn
= Knm
In particular, Knn = Kn is symmetric. Another useful fact is the following. If A is an m × n matrix, B
a p × q matrix and X a q × n matrix, then
(B ⊗ A)Kqn = Kpm (A ⊗ B)
Let’s prove that:
Kpm (A ⊗ B) vec X = Kpm vec(BXAT )
= vec((BXAT )T )
= vec(AX T B T )
= (B ⊗ A) vec(X T )
= (B ⊗ A)Kqn vec X
which is true for all vec X ∈ Rqn×1 and hence the proof is complete.
In this section we’ll see some examples of computation of the derivative, of the Hessian, and then
an example of maximum likelihood estimation with a multivariate Gaussian distribution,
Matrices will be written in uppercase and vectors in lowercase.
8
Let’s get started!
f (x) = aT x
d(aT x) = aT dx
=⇒ Df (x) = aT
f (x) = xT Ax
d(xT Ax) = d(xT )Ax + xT d(Ax)
= (dx)T Ax + xT Adx
= xT AT dx + xT Adx
= xT (AT + A)dx
=⇒ Df (x) = xT (AT + A)
f (X) = aT Xb
d(aT Xb) = d tr(aT Xb)
= tr(aT d(X)b)
= tr(baT dX)
=⇒ Df (X) = baT
f (X) = aT XX T a
d(aT XX T a) = tr(aT d(XX T )a)
= tr(aaT d(XX T ))
= tr(aaT ((dX)X T + XdX T ))
= tr(aaT (dX)X T ) + tr((dX)X T aaT )
= tr(X T aaT dX) + tr(X T aaT dX)
= 2 tr(X T aaT dX)
=⇒ Df (X) = 2X T aaT
f (X) = tr(AX T BXC)
d tr(AX T BXC) = tr(Ad(X T )BXC) + tr(AX T B(dX)C)
= tr(C T X T B T (dX)AT ) + tr(CAX T BdX)
= tr((AT C T X T B T + CAX T B)dX)
=⇒ Df (X) = AT C T X T B T + CAX T B
f (X) = tr(AX −1 B)
f (X) = |X T X|
d tr(AX −1 B) = tr(BAd(X −1 ))
= − tr(BAX −1 (dX)X −1 )
= − tr(X −1 BAX −1 dX)
=⇒ Df (X) = −X −1 BAX −1
d|X T X| = |X T X| tr((X T X)−1 d(X T X))
= |X T X| tr((X T X)−1 (d(X T )X + X T dX))
= |X T X| tr((X T X)−1 d(X T )X) + tr((X T X)−1 X T dX)
= |X T X| tr(X T (dX)(X T X)−1 ) + tr((X T X)−1 X T dX)
= 2|X T X| tr((X T X)−1 X T dX)
=⇒ Df (X) = 2|X T X|(X T X)−1 X T
9
f (X) = tr(X p )
f (X) = Xa
d tr(X p ) = tr((dX)X p−1 + X(dX)X p−2 + · · · + X p−1 dX)
= tr(X p−1 dX + X p−1 dX + · · · + X p−1 dX)
= p tr(X p−1 dX)
=⇒ Df (X) = pX p−1
d(Xa) = (dX)a
= vec((dX)a)
= vec(In (dX)a)
(the vec of a vector is the vector itself)
= (aT ⊗ In )d vec X
=⇒ Df (X) = aT ⊗ In
To differentiate a matrix we need to vectorize it:
f (x) = xxT
d vec(xxT ) = vec((dx)xT + xdxT )
= vec(In (dx)xT ) + vec(x(dx)T In )
= (x ⊗ In )d vec x + (In ⊗ x)d vec(xT )
= (x ⊗ In )d vec x + (In ⊗ x)d vec x
=⇒ Df (x) = x ⊗ In + In ⊗ x
f (X) = X 2
d vec(X 2 ) = vec((dX)X + XdX)
= vec(In (dX)X + X(dX)In )
= (X T ⊗ In + In ⊗ X)d vec X
=⇒ Df (X) = X T ⊗ In + In ⊗ X
Now let’s compute some Hessians.
f (x) = aT x
d(aT x) = aT dx
d(aT dx) = 0
=⇒ Hf (x) = 0
f (X) = tr(AXB)
f (x) = xT Ax
d tr(AXB) = tr(a(dX)B)
= tr(BAdX)
d tr(BAdX) = 0
=⇒ Hf (x) = 0
d(xT Ax) = dxT Ax + xT Adx
= xT AT dx + xT Adx
= xT (AT + A)dx
d(xT (AT + A)dx) = dxT (AT + A)dx
=⇒ Hf (x) = AT + A
10
f (X) = tr(X T X)
d tr(X T X) = tr(dX T X + X T dX)
= tr(X T dX + X T dX)
= 2 tr(X T dX)
d(2 tr(X T dX)) = 2 tr(dX T dX)
= 2(d vec X)T d vec X
=⇒ Hf (X) = 2Imn
f (X) = tr(AX T BX)
(X is m × n)
d tr(AX T BX) = tr(AdX T BX + AX T BdX)
= tr(X T B T dXAT + AX T BdX)
= tr(AT X T B T dX + AX T BdX)
d(tr(AT X T B T dX + AX T BdX)) = tr(AT dX T B T dX + AdX T BdX)
= tr(dX T BdXA + AdX T BdX)
= 2 tr(AdX T BdX)
= 2 tr(dX T B(dX)A)
= 2(d vec X)T vec(B(dX)A)
= 2(d vec X)T (AT ⊗ B)d vec X
1
=⇒ Hf (X) = (2AT ⊗ B + 2A ⊗ B T )
2
=⇒ Hf (X) = AT ⊗ B + A ⊗ B T
Here we’ll make use of the commutation matrix Kmn . Remember that Knn = Kn is symmetric.
f (X) = tr(X 2 )
d tr(X 2 ) = tr((dX)X + XdX)
= 2 tr(XdX)
d(2 tr(XdX)) = 2 tr(dXdX)
(X is n × n)
= 2(vec(dX T ))T d vec X
= 2(Kn vec(dX))T d vec X
= 2(d vec X)T Kn d vec X
=⇒ Hf (X) = 2Kn
f (X) = tr(AXBX)
d tr(AXBX) = tr(A(dX)BX + AXBdX)
= tr(BXAdX + AXBdX)
d tr(BXAdX + AXBdX) = tr(B(dX)AdX + A(dX)BdX)
= 2 tr(A(dX)BdX)
= 2 tr((dX)B(dX)A)
(X is m × n)
= 2(vec(dX T ))T vec(B(dX)A)
= 2(vec(dX T ))T (AT ⊗ B)d vec X
= 2(Kmn vec(dX))T (AT ⊗ B)d vec X
= 2(d vec X)T Knm (AT ⊗ B)d vec X
=⇒ Hf (X) = Knm (AT ⊗ B) + (A ⊗ B T )Kmn
=⇒ Hf (X) = Knm (AT ⊗ B + B T ⊗ A)
11
In the last step above we used the fact that (A ⊗ B T )Kmn = Knm (B T ⊗ A).
f (X) = aT XX T a
d(aT XX T a) = aT (dX)X T a + aT X(dX)T a
= aT (dX)X T a + aT (dX)X T a
= 2aT (dX)X T a
d(2aT (dX)X T a) = 2aT (dX)(dX)T a
= 2 tr(aT (dX)(dX)T a)
= 2 tr((dX)T aaT dX)
= 2(d vec X)T vec(aaT dX)
= 2(d vec X)T vec(aaT (dX)I)
= 2(d vec X)T (I ⊗ aaT )d vec X
=⇒ Hf (X) = 2(I ⊗ aaT )
f (X) = tr(X −1 )
d tr(X −1 ) = tr(d(X −1 ))
= − tr(X −1 (dX)X −1 )
d(− tr(X −1 (dX)X −1 )) = − tr(d(X −1 )(dX)X −1 + X −1 (dX)d(X −1 ))
= − tr(−X −1 (dX)X −1 (dX)X −1 − X −1 (dX)X −1 (dX)X −1 )
= 2 tr((dX)X −1 (dX)X −2 )
= 2(vec(dX T ))T vec(X −1 (dX)X −2 )
= 2(Kn vec(dX))T vec(X −1 (dX)X −2 )
= 2(d vec X)T Kn (X −2T ⊗ X −1 )d vec X
=⇒ Hf (X) = Kn (X −2T ⊗ X −1 ) + (X −2 ⊗ X −T )Kn
=⇒ Hf (X) = Kn (X −2T ⊗ X −1 + X −T ⊗ X −2 )
f (X) = |X|
d|X| = |X| tr(X −1 dX)
d(|X| tr(X −1 dX)) = d|X| tr(X −1 dX) + |X| tr(d(X −1 )dX)
= |X|(tr(X −1 dX))2 − |X| tr(X −1 (dX)X −1 dX)
= |X| tr((dX)T X −T ) tr(X −1 dX) − |X| tr((dX)X −1 (dX)X −1 )
= |X|(d vec X)T vec(X −T )(vec(X −T ))T d vec X
− |X|(vec(dX T ))T (X −T ⊗ X −1 )d vec X
= |X|(d vec X)T vec(X −T )(vec(X −T ))T − Kn (X −T ⊗ X −1 ) d vec X
=⇒ Hf (X) = |X| vec(X −T )(vec(X −T ))T − Kn (X −T ⊗ X −1 )
Note that the Hessian above (like all the others) is symmetric; in fact,
(Kn (X −T ⊗ X −1 ))T = (X −1 ⊗ X −T )Kn = Kn (X −T ⊗ X −1 )
We conclude this section with an example about MLE.
Given a set of vectors x1 , . . . , xN drawn independently from a multivariate Gaussian distribution, we
want to estimate the parameters of the distribution by maximum likelihood. Note that the covariance
matrix Σ, and thus Σ−1 , is symmetric. The log likelihood function is
N
N
1X
ND
ln(2π) − ln |Σ| −
(xn − µ)T Σ−1 (xn − µ).
ln p(x1 , . . . , xN |µ, Σ) = −
2
2
2 n=1
12
We first find the derivative with respect to µ:
!
N
N
1X
1X
T −1
(xn − µ) Σ (xn − µ) = −
d(xn − µ)T Σ−1 (xn − µ) + (xn − µ)T Σ−1 d(xn − µ)
d −
2 i=1
2 i=1
N
1X
=−
−dµT Σ−1 (xn − µ) − (xn − µ)T Σ−1 dµ
2 i=1
N
1X
=
(xn − µ)T Σ−1 dµ + (xn − µ)T Σ−1 dµ
2 i=1
!
N
X
=
(xn − µ)T Σ−1 dµ
i=1
N
∂ ln p X
=⇒
=
(xn − µ)T Σ−1
∂µT
i=1
Therefore,
N
X
i=1
T
(xn − µ) Σ
−1
= 0 ⇐⇒
N
X
xTn
n=1
= Nµ
T
N
1 X
⇐⇒ µ
ˆ=
xn
N i=1
Finally, we find the derivative with respect to Σ:
!
N
N
1X
1X
N
N
tr (xn − µ)T d(Σ−1 )(xn − µ)
(xn − µ)T Σ−1 (xn − µ) = − tr(Σ−1 dΣ) −
d − ln |Σ| −
2
2 i=1
2
2 n=1
N
N
1X
−1
= − tr(Σ dΣ) +
tr (xn − µ)T Σ−1 (dΣ)Σ−1 (xn − µ)
2
2 n=1
N
N
1X
−1
tr Σ−1 (xn − µ)(xn − µ)T Σ−1 dΣ
= − tr(Σ dΣ) +
2
2 n=1
N
∂ ln p
N −1 1 X −1
=⇒
=
−
Σ +
Σ (xn − µ)(xn − µ)T Σ−1
∂ΣT
2
2 i=1
Therefore,
N
N
N
1 X −1
N −1 1 X −1
Σ =
− Σ−1 +
Σ (xn − µ)(xn − µ)T Σ−1 = 0 ⇐⇒
Σ (xn − µ)(xn − µ)T Σ−1
2
2 i=1
2
2 i=1
−1
⇐⇒ N = Σ
ˆ= 1
⇐⇒ Σ
N
N
X
(xn − µ)(xn − µ)T
i=1
N
X
(xn − µ
ˆ)(xn − µ
ˆ)T
i=1
References
[1] Abadir, K. M., & Magnus, J. R. (2005). Matrix algebra. Cambridge, UK: Cambridge University
Press.
[2] Magnus, J. R. & Neudecker, H. (1999). Matrix differential calculus with applications in statistics
and economics. New York, NY: John Wiley & Sons
[3] Steven W. Nydick (2012). A Different(ial) Way: Matrix Derivatives Again.
[4] Thomas Minka (2000). Old and New Matrix Algebra Useful for Statistics.
13
MatrixCalculus.pdf (PDF, 202.45 KB)
Use the permanent link to the download page to share your document on Facebook, Twitter, LinkedIn, or directly with a contact by e-Mail, Messenger, Whatsapp, Line..
Use the short link to share your document on Twitter or by text message (SMS)
Copy the following HTML code to share your document on a Website or Blog