This PDF 1.5 document has been generated by LaTeX with hyperref package / pdfTeX-1.40.12, and has been sent on pdf-archive.com on 23/07/2013 at 20:09, from IP address 87.2.x.x.
The current document download page has been viewed 2260 times.

File size: 202.45 KB (13 pages).

Privacy: public file

Matrix Calculus

Massimiliano Tomassoli

23/7/2013

Matrix Calculus is a set of techniques which let us differentiate functions of matrices without computing the single partial derivatives by hand. What this means will be clear in a moment.

1

The derivative

Let’s talk about notation a little bit. If f (X) is a scalar function of an m × n matrix X, then

∂f (X)

∂f (X)

·

·

·

∂x1n

∂f (X) ∂x.11 .

.. .

..

= ..

.

∂X

∂f (X)

∂f (X)

· · · ∂xmn

∂xm1

The definition above is valid even if m = 1 or n = 1, that is if f is a function of a row vector, a column

vector or a scalar, in which case the result is a row vector, a column vector or a scalar, respectively.

If f (X) is an m × n matrix function of a matrix, then

∂f (X)

∂f1n (X)

11

·

·

·

∂X

∂f (X) ∂X

..

...

= ...

.

∂X

∂fmn (X)

∂fm1 (X)

···

∂X

∂X

where the matrix above is a block matrix. The definition above is valid even when m = 1 or n = 1,

that is when f is a row vector function, a column vector function or a scalar function, in which case

the block matrix is a row of blocks, a column of blocks or just a single block, respectively.

If f is

• a scalar function of a scalar, vector or matrix, or

• a vector function of a scalar or vector, or

• a matrix function of a scalar,

then the derivative, also called Jacobian matrix, of f is

Df (x) =

∂f (x)

∂xT

For instance, if f (x) is a vector function of a vector, then

∂f (x) ∂f1 (x)

1

∂x1

∂xT

Df (x) :=

∂f (x) . .

= .. = ..

∂xT

∂fm (x)

∂fm (x)

∂xT

∂x1

···

..

.

···

∂f1 (x)

∂xn

..

.

∂fm (x)

∂xn

As another example, if f (X) is a scalar function of a matrix, then

∂f (X)

(X)

· · · ∂f

∂x11

∂xm1

∂f (X) .

.. .

..

Df (X) :=

= ..

.

.

T

∂X

∂f (X)

∂f (X)

· · · ∂xmn

∂x1n

1

.

Some authors prefer to find the gradient of a function, which is defined as the transpose of the Jacobian

matrix.

Now, how do we define the derivative of an m × n matrix function f of a p × q matrix? It’s clear

that we have mnpq partial derivatives. We could just define the derivative of f as we did above for the

other cases, but we’ll follow Magnus’s way [1, 2] and give the following definition:

Df (X) =

∂ vec f (X)

∂(vec X)T

The result is an mn × pq matrix. We’ll talk about the vec operation in a moment. For now let’s just

say that it vectorizes a matrix by stacking its columns on top of one another. More formally,

T

vec(A) = a11 · · · am1 a12 · · · am2 · · · · · · a1n · · · amn

This means that vec f (X) is a vector function and vec X is a vector.

So what about a scalar function of a matrix? According to this last definition, the derivative should

be a row vector, while according to our first definition of derivative it should be a matrix. Both are

viable options and we’ll see how easy it is to go from one to the other.

The second derivative in the multidimensional case is the Hessian matrix (or just Hessian), a square

matrix of second-order partial derivatives of a scalar function. More precisely, if f is a scalar function

of a vector, then the Hessian is defined as

∂ 2 f (x) ∂ 2 f (x)

∂ 2 f (x)

· · · ∂x

∂x1 ∂x2

∂x21

1 ∂xn

∂ 2 f (x) ∂ 2 f (x)

∂ 2 f (x)

2

· · · ∂x2 ∂xn

∂ f (x)

∂x2 ∂x1

∂x22

=

Hf (x) =

.

.

..

T

..

∂x∂x

.

.

.

.

.

.

∂ 2 f (x)

∂ 2 f (x)

∂ 2 f (x)

···

∂xn ∂x1

∂xn ∂x2

∂x2

n

Note that the Hessian is the Jacobian of the gradient (i.e. the transpose of the Jacobian) of f :

Hf (x) =

∂ 2 f (x)

∂ ∂f (x)

=

= J(∇f )(x)

∂x∂xT

∂xT ∂x

If f is a scalar function of a matrix (rather than of a vector), we vectorize the matrix:

Hf (X) =

∂ 2 f (X)

∂(vec X)∂(vec X)T

In this article we assume that the second-order partial derivatives are continuous so the order of differentiation is irrelevant and the Hessian is always symmetric.

2

The differential

The method described here is based on differentials, therefore let’s recall what a differential is.

In the one-dimensional case, the derivative of f at x is defined as

f (x + u) − f (x)

= f 0 (x)

u→0

u

lim

This can be rewritten as

f (x + u) = f (x) + f 0 (x)u + rx (u)

where rx (u) is o(u), i.e. rx (u)/u → 0 as u → 0. The differential of f at x with increment u is

df (x; u) = f 0 (x)u.

In the vector case, we have

f (x + u) = f (x) + (Df (x))u + rx (u)

2

and the differential of f at x with increment u is df (x; u) = (Df (x))u. As we can see, the differential

is the best linear approximation of f (x + u) − f (x) at x. In practice, we write dx instead of u, so,

for instance, df (x; u) = f 0 (x)dx. We’ll justify this notation in a moment, but first let’s introduce the

so-called identification results. They are needed to get the derivatives from the differentials.

The first result says that, if f is a vector function of a vector,

df (x) = A(x)dx ⇐⇒ Df (x) = A(x).

More generally, if f is a matrix function of a matrix,

d vec f (X) = A(X)d vec X ⇐⇒ Df (x) = A(X).

The second result is about the second differential and says that if f is a scalar function of a vector,

then

1

d2 f (x) = (dx)T B(x)dx ⇐⇒ Hf (x) = (B(x) + B(x)T )

2

where Hf (x) denotes the Hessian matrix.

Another important result is Cauchy’s rule of invariance which says that if h(x) = g(f (x)), then

dh(x; u) = dg(f (x); df (x; u))

This is related to the chain rule for the derivatives; in fact, it can be proved by making use of the chain

rule:

dh(x; u) = Dh(x)u

= D(g ◦ f )(x)u

= Dg(f (x))Df (x)u

= Dg(f (x))df (x; u)

= dg(f (x); df (x; u))

Now we can justify the abbreviated notation dx. If y = f (x), then we write

dy = df (x; dx)

where x and y are variables. Basically, we name the differential after the variables rather than after the

functions. But now suppose that x = g(t) and dx = dg(t; dt). We now have y = f (g(t)) = h(t) and,

therefore,

dy = dh(t; dt).

For our abbreviated notation to be consistent, it must be the case that df (x; dx) = dh(t; dt). Fortunately,

thanks to Cauchy’s rule of invariance, we can see that it is so:

dh(t; dt) = d(f ◦ g)(t; dt) = df (g(t); dg(t; dt)) = df (x; dx).

3

Two important operators

Before proceeding with the actual computation of differentials and derivatives, we need to introduce

two important operators: the Kronecker product and the vec operator. We’ve already talked a little

about the vec operator but here we’ll see and prove some useful results for manipulating expressions

involving these two operators.

As we said before, the vec operator is defined as follows:

T

vec(A) = a11 · · · am1 a12 · · · am2 · · · · · · a1n · · · amn

For instance,

T

1 2 3

vec

= 1 4 2 5 3 6

4 5 6

3

The Kronecker product between two matrix A and B is defined as follows:

a11 B · · · a1n B

..

...

A ⊗ B = ...

.

am1 B · · ·

amn B

For instance,

1

1 2

1 1

1

⊗

=

3

3 4

1 1

3

1

1

3

3

2

2

4

4

2

2

4

4

Here’s a list of properties of the Kronecker product and the vec operator:

1. A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C

(associativity)

2. A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C)

(A + B) ⊗ C = (A ⊗ C) + (B ⊗ C)

3. ∀a ∈ R,

(distributivity)

a ⊗ A = A ⊗ a = aA

4. ∀a, b ∈ R,

aA ⊗ bB = ab(A ⊗ B)

(A ⊗ B)(C ⊗ D) = AC ⊗ BD

5. For conforming matrices,

6. (A ⊗ B)T = AT ⊗ B T ,

7. For all vectors a and b,

(A ⊗ B)H = AH ⊗ B H

aT ⊗ b = baT = b ⊗ aT

8. (A ⊗ B)−1 = A−1 ⊗ B −1

9. The vec operator is linear.

10. For all vectors a and b,

vec(abT ) = b ⊗ a

11. vec(AXC) = (C T ⊗ A) vec(X)

12. tr(AB) = vec(AT )T vec(B)

We’ll denote with {f (i, j)} the matrix whose generic (i, j) element is f (i, j). Note that if A is a

block matrix with blocks Bij , then

B11 ⊗ C · · · B1n ⊗ C

..

..

..

A⊗C =

.

.

.

Bm1 ⊗ C · · · Bmn ⊗ C

Point (3) follows directly from the definition, while point (4) can be proved as follows:

xA ⊗ yB = {xaij (yB)} = {xyaij B} = xy{aij B} = xy(A ⊗ B)

Now let’s prove the other points one by one.

1. A ⊗ (B ⊗ C) = {aij (B ⊗ C)} = {aij B ⊗ C} = {aij B} ⊗ C = (A ⊗ B) ⊗ C

2. A ⊗ (B + C) = {aij (B + C)} = {aij B + aij C} = {aij B} + {aij C} = A ⊗ B + A ⊗ C

The other case is analogous.

3. See above.

4. See above.

4

P

P

P

5. (A ⊗ B)(C ⊗ D) = {aij B}{cij D} = { k aik Bckj D} = {( k aik ckj ) BD} = { k aik ckj } ⊗ BD =

AC ⊗ BD

6. (A ⊗ B)T = {aij B}T = {aji B T } = AT ⊗ B T

The other case is analogous.

7. This is very easy.

8. By (5),

(A ⊗ B)(A−1 ⊗ B −1 ) = AA−1 ⊗ BB −1 = I ⊗ I = I

9. This is very easy.

a1 b 1

..

T

10. vec(ab ) = vec .

am b 1

· · · a1 b n

.. = vec b a · · ·

..

.

1

.

· · · am b n

b1 a

bn a = ... = b ⊗ a

bn a

11. Let x1 , . . . , xn be the columns of the matrix X and P

e1 , . . . , en the columns of the identity matrix

of order n. You should convince yourself that X = k xk eTk . Here we go:

! !

X

vec(AXC) = vec A

xk eTk C

k

!

X

= vec

(Axk )(eTk C)

k

=

X

vec((Axk )(eTk C))

(by (9))

((eTk C)T ⊗ (Axk ))

(by (10))

k

=

X

k

=

X

((C T ek ) ⊗ (Axk ))

=

X

k

((C T ⊗ A)(ek ⊗ xk ))

(by (5))

k

= (C T ⊗ A)

X

(ek ⊗ xk )

k

= (C T ⊗ A)

X

vec(xk eTk )

(by (10))

k

= (C T ⊗ A) vec

X

(xk eTk )

(by (9))

k

= (C T ⊗ A) vec(X)

12. The trace of the product AB is just the sum of all the products aij bij :

XX

X

XX

tr(AB) =

[AB]ii =

aik bki =

[AT ]ki bki = vec(AT )T vec(B)

i

i

i

k

k

Since we’ll be using the trace operator quite a bit, recall that:

1. tr is linear.

2. tr(A) = tr(AT )

3. tr(AB) = tr(BA)

5

The first two properties are trivial (but still very useful). Let’s see why the last one is also true:

X

XX

XX

X

tr(AB) =

[AB]ii =

aik bki =

bki aik =

[BA]kk = tr(BA)

i

i

k

k

i

k

Also note that the trace of a scalar is the scalar itself. This means that when we have a scalar function

we can add a trace. This simple observation will be very useful.

4

Basic rules of differentiation

In order to be able to differentiate expressions, we’ll need a set of simple rules:

1. dA = 0,

where A is constant

2. d(αX) = αdX,

where α is a scalar

3. d(X + Y ) = dX + dY

4. d(tr(X)) = tr(dX)

5. d(XY ) = (dX)Y + XdY

6. d(X ⊗ Y ) = (dX) ⊗ Y + X ⊗ dY

7. d(X −1 ) = −X −1 (dX)X −1

8. d|X| = |X| tr(X −1 dX)

9. d log |X| = tr(X −1 dX)

10. d(X ? ) = (dX)? ,

vec

where

?

is any operator which rearranges elements such as transpose and

A matrix is really a matrix of scalar functions and the differential of a matrix is the matrix of

the differentials of the single scalar functions. MorePformally, [dX]ij = d(Xij ). Remember that if

f is a scalar function of a vector, then df (x; u) =

i Di f (x)ui , where the Di f (x) are the partial

derivativesP

of f at x. If f is, instead, a function of a matrix, we just generalize the previous relation:

df (x; u) = ij Dij f (x)uij . That said, many of the rules above can be readily proved.

As an example, let’s prove the product rule (5). Let f and g be two scalar functions of a matrix.

Then

X

d(f g)(x; u) =

Dij (f g)(x)uij

i,j

=

X

=

X

((Dij f (x))g(x) + f (x)Dij g(x))uij

i,j

(Dij f (x))uij g(x) +

i,j

X

f (x)Dij g(x)uij

i,j

= df (x; u)g(x) + f (x)dg(x; u)

6

where we used the usual product rule for derivatives. Now we can use this result to prove the general

one about matrices of scalar functions:

[d(XY )]ij = d[XY ]ij

!

X

=d

xik ykj

k

=

X

d(xik ykj )

k

=

X

X

(dxik ykj +

xik dykj

k

k

X

X

=

[dX]ik ykj +

xik [dY ]kj

k

k

= [(dX)Y ]ij + [XdY ]ij

= [(dX)Y + XdY ]ij

Now we can prove (7) by using the product rule:

0 = dI = d(XX −1 ) = (dX)X −1 + Xd(X −1 ) =⇒ dX −1 = −X −1 (dX)X −1

P

To prove (8) we observe that, for any i = 1, . . . , n, we have |X| = j xij Cij , where Cij is the cofactor

of the element xij , i.e. (−1)i+j times the determinant of the matrix obtained by removing from X the

i-th row and the j-th column. Because Cij doesn’t depend on xij , then

∂|X| X ∂xij Cij

=

= Cij .

∂xij

∂x

ij

j

1

Now note that C is the matrix of the cofactors and recall that X −1 = |X|

C T and thus C T = |X|X −1 .

This last result will be used in the following derivation:

X

X

X

d|X| = d| · |(X; dX) =

Cij [dX]ij =

[C T ]ji [dX]ij =

[C T dX]jj = tr(C T dX) = |X| tr(X −1 dX)

i,j

i,j

j

Note that (9) follows directly from (8).

5

The special form tr(AdX)

At the beginning of this article we gave two definitions of the derivative of a scalar function of a matrix.

The first is

∂f (X)

Df (X) =

∂X T

and the second is

∂f (X)

Df (X) =

.

∂(vec X)T

In the first case the result is a matrix whereas in the second the result is a row vector. Magnus suggests

to use the second definition, but we’ll opt for the first one.

Let’s consider the differential tr(AdX). We can find the derivative according to the second definition above by using property (12) in the section “Two important operators”. We’ll also use rule of

differentiation (10) with ? = vec. We get

tr(AdX) = vec(AT )T vec dX = vec(AT )T d vec X

and, therefore,

∂ tr(AdX)

= vec(AT )T

T

∂(vec X)

7

To get the result in matrix form (first definition) we need to unvectorize the result above. We’ll proceed

step by step:

∂ tr(AdX)

∂ tr(AdX)

∂ tr(AdX)

∂ tr(AdX)

= vec(AT ) =⇒

= AT =⇒

= vec(AT )T =⇒

= A.

T

∂(vec X)

∂ vec X

∂X

∂X T

So, the result is simply A. As we saw in the previous section,

d|X| = |X| tr(X −1 dX) = tr(|X|X −1 dX).

This means that the derivative is |X|X −1 = C T which agree with the result

∂|X|

= Cij

∂xij

that we derived in the previous section.

6

Examples

In practice, the derivative of an expression involving matrices can be computed by using the rules of

differentiation to get a result of the form φ(X)dX, tr(φ(X)dX) or φ(X)d vec X. After having done

that, we can read off the derivative which is simply φ(X).

To find the Hessian we must differentiate a second time, that is we must differentiate f (x) and find

df (x; dx), and then differentiate df (x; dx) itself by keeping in mind that dx is a constant increment and

thus, for instance, d(aT dx) = 0 and d(xT Adx) = dxT Adx.

According to the second identification result, we must get a result of the form dxT φ(x)dx or of the

form (d vec X)T φ(X)d vec X and the Hessian is 21 (φ(x) + φ(x)T ). Note that if φ(x) is symmetric then

the Hessian is just φ(x).

From time to time we’ll need to use the commutation matrix Kmn , which is the permutation matrix

satisfying

Kmn vec X = vec(X T )

where X is m × n. It can be shown that

T

−1

Kmn

= Kmn

= Knm

In particular, Knn = Kn is symmetric. Another useful fact is the following. If A is an m × n matrix, B

a p × q matrix and X a q × n matrix, then

(B ⊗ A)Kqn = Kpm (A ⊗ B)

Let’s prove that:

Kpm (A ⊗ B) vec X = Kpm vec(BXAT )

= vec((BXAT )T )

= vec(AX T B T )

= (B ⊗ A) vec(X T )

= (B ⊗ A)Kqn vec X

which is true for all vec X ∈ Rqn×1 and hence the proof is complete.

In this section we’ll see some examples of computation of the derivative, of the Hessian, and then

an example of maximum likelihood estimation with a multivariate Gaussian distribution,

Matrices will be written in uppercase and vectors in lowercase.

8

Let’s get started!

f (x) = aT x

d(aT x) = aT dx

=⇒ Df (x) = aT

f (x) = xT Ax

d(xT Ax) = d(xT )Ax + xT d(Ax)

= (dx)T Ax + xT Adx

= xT AT dx + xT Adx

= xT (AT + A)dx

=⇒ Df (x) = xT (AT + A)

f (X) = aT Xb

d(aT Xb) = d tr(aT Xb)

= tr(aT d(X)b)

= tr(baT dX)

=⇒ Df (X) = baT

f (X) = aT XX T a

d(aT XX T a) = tr(aT d(XX T )a)

= tr(aaT d(XX T ))

= tr(aaT ((dX)X T + XdX T ))

= tr(aaT (dX)X T ) + tr((dX)X T aaT )

= tr(X T aaT dX) + tr(X T aaT dX)

= 2 tr(X T aaT dX)

=⇒ Df (X) = 2X T aaT

f (X) = tr(AX T BXC)

d tr(AX T BXC) = tr(Ad(X T )BXC) + tr(AX T B(dX)C)

= tr(C T X T B T (dX)AT ) + tr(CAX T BdX)

= tr((AT C T X T B T + CAX T B)dX)

=⇒ Df (X) = AT C T X T B T + CAX T B

f (X) = tr(AX −1 B)

f (X) = |X T X|

d tr(AX −1 B) = tr(BAd(X −1 ))

= − tr(BAX −1 (dX)X −1 )

= − tr(X −1 BAX −1 dX)

=⇒ Df (X) = −X −1 BAX −1

d|X T X| = |X T X| tr((X T X)−1 d(X T X))

= |X T X| tr((X T X)−1 (d(X T )X + X T dX))

= |X T X| tr((X T X)−1 d(X T )X) + tr((X T X)−1 X T dX)

= |X T X| tr(X T (dX)(X T X)−1 ) + tr((X T X)−1 X T dX)

= 2|X T X| tr((X T X)−1 X T dX)

=⇒ Df (X) = 2|X T X|(X T X)−1 X T

9

f (X) = tr(X p )

f (X) = Xa

d tr(X p ) = tr((dX)X p−1 + X(dX)X p−2 + · · · + X p−1 dX)

= tr(X p−1 dX + X p−1 dX + · · · + X p−1 dX)

= p tr(X p−1 dX)

=⇒ Df (X) = pX p−1

d(Xa) = (dX)a

= vec((dX)a)

= vec(In (dX)a)

(the vec of a vector is the vector itself)

= (aT ⊗ In )d vec X

=⇒ Df (X) = aT ⊗ In

To differentiate a matrix we need to vectorize it:

f (x) = xxT

d vec(xxT ) = vec((dx)xT + xdxT )

= vec(In (dx)xT ) + vec(x(dx)T In )

= (x ⊗ In )d vec x + (In ⊗ x)d vec(xT )

= (x ⊗ In )d vec x + (In ⊗ x)d vec x

=⇒ Df (x) = x ⊗ In + In ⊗ x

f (X) = X 2

d vec(X 2 ) = vec((dX)X + XdX)

= vec(In (dX)X + X(dX)In )

= (X T ⊗ In + In ⊗ X)d vec X

=⇒ Df (X) = X T ⊗ In + In ⊗ X

Now let’s compute some Hessians.

f (x) = aT x

d(aT x) = aT dx

d(aT dx) = 0

=⇒ Hf (x) = 0

f (X) = tr(AXB)

f (x) = xT Ax

d tr(AXB) = tr(a(dX)B)

= tr(BAdX)

d tr(BAdX) = 0

=⇒ Hf (x) = 0

d(xT Ax) = dxT Ax + xT Adx

= xT AT dx + xT Adx

= xT (AT + A)dx

d(xT (AT + A)dx) = dxT (AT + A)dx

=⇒ Hf (x) = AT + A

10

f (X) = tr(X T X)

d tr(X T X) = tr(dX T X + X T dX)

= tr(X T dX + X T dX)

= 2 tr(X T dX)

d(2 tr(X T dX)) = 2 tr(dX T dX)

= 2(d vec X)T d vec X

=⇒ Hf (X) = 2Imn

f (X) = tr(AX T BX)

(X is m × n)

d tr(AX T BX) = tr(AdX T BX + AX T BdX)

= tr(X T B T dXAT + AX T BdX)

= tr(AT X T B T dX + AX T BdX)

d(tr(AT X T B T dX + AX T BdX)) = tr(AT dX T B T dX + AdX T BdX)

= tr(dX T BdXA + AdX T BdX)

= 2 tr(AdX T BdX)

= 2 tr(dX T B(dX)A)

= 2(d vec X)T vec(B(dX)A)

= 2(d vec X)T (AT ⊗ B)d vec X

1

=⇒ Hf (X) = (2AT ⊗ B + 2A ⊗ B T )

2

=⇒ Hf (X) = AT ⊗ B + A ⊗ B T

Here we’ll make use of the commutation matrix Kmn . Remember that Knn = Kn is symmetric.

f (X) = tr(X 2 )

d tr(X 2 ) = tr((dX)X + XdX)

= 2 tr(XdX)

d(2 tr(XdX)) = 2 tr(dXdX)

(X is n × n)

= 2(vec(dX T ))T d vec X

= 2(Kn vec(dX))T d vec X

= 2(d vec X)T Kn d vec X

=⇒ Hf (X) = 2Kn

f (X) = tr(AXBX)

d tr(AXBX) = tr(A(dX)BX + AXBdX)

= tr(BXAdX + AXBdX)

d tr(BXAdX + AXBdX) = tr(B(dX)AdX + A(dX)BdX)

= 2 tr(A(dX)BdX)

= 2 tr((dX)B(dX)A)

(X is m × n)

= 2(vec(dX T ))T vec(B(dX)A)

= 2(vec(dX T ))T (AT ⊗ B)d vec X

= 2(Kmn vec(dX))T (AT ⊗ B)d vec X

= 2(d vec X)T Knm (AT ⊗ B)d vec X

=⇒ Hf (X) = Knm (AT ⊗ B) + (A ⊗ B T )Kmn

=⇒ Hf (X) = Knm (AT ⊗ B + B T ⊗ A)

11

In the last step above we used the fact that (A ⊗ B T )Kmn = Knm (B T ⊗ A).

f (X) = aT XX T a

d(aT XX T a) = aT (dX)X T a + aT X(dX)T a

= aT (dX)X T a + aT (dX)X T a

= 2aT (dX)X T a

d(2aT (dX)X T a) = 2aT (dX)(dX)T a

= 2 tr(aT (dX)(dX)T a)

= 2 tr((dX)T aaT dX)

= 2(d vec X)T vec(aaT dX)

= 2(d vec X)T vec(aaT (dX)I)

= 2(d vec X)T (I ⊗ aaT )d vec X

=⇒ Hf (X) = 2(I ⊗ aaT )

f (X) = tr(X −1 )

d tr(X −1 ) = tr(d(X −1 ))

= − tr(X −1 (dX)X −1 )

d(− tr(X −1 (dX)X −1 )) = − tr(d(X −1 )(dX)X −1 + X −1 (dX)d(X −1 ))

= − tr(−X −1 (dX)X −1 (dX)X −1 − X −1 (dX)X −1 (dX)X −1 )

= 2 tr((dX)X −1 (dX)X −2 )

= 2(vec(dX T ))T vec(X −1 (dX)X −2 )

= 2(Kn vec(dX))T vec(X −1 (dX)X −2 )

= 2(d vec X)T Kn (X −2T ⊗ X −1 )d vec X

=⇒ Hf (X) = Kn (X −2T ⊗ X −1 ) + (X −2 ⊗ X −T )Kn

=⇒ Hf (X) = Kn (X −2T ⊗ X −1 + X −T ⊗ X −2 )

f (X) = |X|

d|X| = |X| tr(X −1 dX)

d(|X| tr(X −1 dX)) = d|X| tr(X −1 dX) + |X| tr(d(X −1 )dX)

= |X|(tr(X −1 dX))2 − |X| tr(X −1 (dX)X −1 dX)

= |X| tr((dX)T X −T ) tr(X −1 dX) − |X| tr((dX)X −1 (dX)X −1 )

= |X|(d vec X)T vec(X −T )(vec(X −T ))T d vec X

− |X|(vec(dX T ))T (X −T ⊗ X −1 )d vec X

= |X|(d vec X)T vec(X −T )(vec(X −T ))T − Kn (X −T ⊗ X −1 ) d vec X

=⇒ Hf (X) = |X| vec(X −T )(vec(X −T ))T − Kn (X −T ⊗ X −1 )

Note that the Hessian above (like all the others) is symmetric; in fact,

(Kn (X −T ⊗ X −1 ))T = (X −1 ⊗ X −T )Kn = Kn (X −T ⊗ X −1 )

We conclude this section with an example about MLE.

Given a set of vectors x1 , . . . , xN drawn independently from a multivariate Gaussian distribution, we

want to estimate the parameters of the distribution by maximum likelihood. Note that the covariance

matrix Σ, and thus Σ−1 , is symmetric. The log likelihood function is

N

N

1X

ND

ln(2π) − ln |Σ| −

(xn − µ)T Σ−1 (xn − µ).

ln p(x1 , . . . , xN |µ, Σ) = −

2

2

2 n=1

12

We first find the derivative with respect to µ:

!

N

N

1X

1X

T −1

(xn − µ) Σ (xn − µ) = −

d(xn − µ)T Σ−1 (xn − µ) + (xn − µ)T Σ−1 d(xn − µ)

d −

2 i=1

2 i=1

N

1X

=−

−dµT Σ−1 (xn − µ) − (xn − µ)T Σ−1 dµ

2 i=1

N

1X

=

(xn − µ)T Σ−1 dµ + (xn − µ)T Σ−1 dµ

2 i=1

!

N

X

=

(xn − µ)T Σ−1 dµ

i=1

N

∂ ln p X

=⇒

=

(xn − µ)T Σ−1

∂µT

i=1

Therefore,

N

X

i=1

T

(xn − µ) Σ

−1

= 0 ⇐⇒

N

X

xTn

n=1

= Nµ

T

N

1 X

⇐⇒ µ

ˆ=

xn

N i=1

Finally, we find the derivative with respect to Σ:

!

N

N

1X

1X

N

N

tr (xn − µ)T d(Σ−1 )(xn − µ)

(xn − µ)T Σ−1 (xn − µ) = − tr(Σ−1 dΣ) −

d − ln |Σ| −

2

2 i=1

2

2 n=1

N

N

1X

−1

= − tr(Σ dΣ) +

tr (xn − µ)T Σ−1 (dΣ)Σ−1 (xn − µ)

2

2 n=1

N

N

1X

−1

tr Σ−1 (xn − µ)(xn − µ)T Σ−1 dΣ

= − tr(Σ dΣ) +

2

2 n=1

N

∂ ln p

N −1 1 X −1

=⇒

=

−

Σ +

Σ (xn − µ)(xn − µ)T Σ−1

∂ΣT

2

2 i=1

Therefore,

N

N

N

1 X −1

N −1 1 X −1

Σ =

− Σ−1 +

Σ (xn − µ)(xn − µ)T Σ−1 = 0 ⇐⇒

Σ (xn − µ)(xn − µ)T Σ−1

2

2 i=1

2

2 i=1

−1

⇐⇒ N = Σ

ˆ= 1

⇐⇒ Σ

N

N

X

(xn − µ)(xn − µ)T

i=1

N

X

(xn − µ

ˆ)(xn − µ

ˆ)T

i=1

References

[1] Abadir, K. M., & Magnus, J. R. (2005). Matrix algebra. Cambridge, UK: Cambridge University

Press.

[2] Magnus, J. R. & Neudecker, H. (1999). Matrix differential calculus with applications in statistics

and economics. New York, NY: John Wiley & Sons

[3] Steven W. Nydick (2012). A Different(ial) Way: Matrix Derivatives Again.

[4] Thomas Minka (2000). Old and New Matrix Algebra Useful for Statistics.

13

MatrixCalculus.pdf (PDF, 202.45 KB)

Download PDF

Use the permanent link to the download page to share your document on Facebook, Twitter, LinkedIn, or directly with a contact by e-Mail, Messenger, Whatsapp, Line..

Use the short link to share your document on Twitter or by text message (SMS)

Copy the following HTML code to share your document on a Website or Blog

This file has been shared publicly by a user of

Document ID: 0000114640.