# PDF Archive

## Deep Learning.pdf

Original filename: Deep Learning.pdf

This PDF 1.4 document was generated by 3-Heights(TM) PDF Optimization Shell 4.6.23.0 (http://www.pdf-tools.com) and was sent to pdf-archive.com on 07/04/2016 at 14:19 from IP address 45.55.x.x. The document download page has been viewed 10994 times.
File size: 52.8 MB (802 pages).
Privacy: public file

### Document preview

Deep Learning

Ian Goodfellow
Yoshua Bengio
Aaron Courville

Contents

Website . . . vii
Acknowledgments . . . viii
Notation . . . xi

1 Introduction . . . 1
1.1 Who Should Read This Book? . . . 8
1.2 Historical Trends in Deep Learning . . . 11

I Applied Math and Machine Learning Basics . . . 29

2 Linear Algebra . . . 31
2.1 Scalars, Vectors, Matrices and Tensors . . . 31
2.2 Multiplying Matrices and Vectors . . . 34
2.3 Identity and Inverse Matrices . . . 36
2.4 Linear Dependence and Span . . . 37
2.5 Norms . . . 39
2.6 Special Kinds of Matrices and Vectors . . . 40
2.7 Eigendecomposition . . . 42
2.8 Singular Value Decomposition . . . 44
2.9 The Moore-Penrose Pseudoinverse . . . 45
2.10 The Trace Operator . . . 46
2.11 The Determinant . . . 47
2.12 Example: Principal Components Analysis . . . 48

3 Probability and Information Theory . . . 53
3.1 Why Probability? . . . 54
3.2 Random Variables . . . 56
3.3 Probability Distributions . . . 56
3.4 Marginal Probability . . . 58
3.5 Conditional Probability . . . 59
3.6 The Chain Rule of Conditional Probabilities . . . 59
3.7 Independence and Conditional Independence . . . 60
3.8 Expectation, Variance and Covariance . . . 60
3.9 Common Probability Distributions . . . 62
3.10 Useful Properties of Common Functions . . . 67
3.11 Bayes' Rule . . . 70
3.12 Technical Details of Continuous Variables . . . 71
3.13 Information Theory . . . 72
3.14 Structured Probabilistic Models . . . 75

4 Numerical Computation . . . 80
4.1 Overflow and Underflow . . . 80
4.2 Poor Conditioning . . . 82
4.3 Gradient-Based Optimization . . . 82
4.4 Constrained Optimization . . . 93
4.5 Example: Linear Least Squares . . . 96

5 Machine Learning Basics . . . 98
5.1 Learning Algorithms . . . 99
5.2 Capacity, Overfitting and Underfitting . . . 110
5.3 Hyperparameters and Validation Sets . . . 120
5.4 Estimators, Bias and Variance . . . 122
5.5 Maximum Likelihood Estimation . . . 131
5.6 Bayesian Statistics . . . 135
5.7 Supervised Learning Algorithms . . . 139
5.8 Unsupervised Learning Algorithms . . . 145
5.9 Stochastic Gradient Descent . . . 150
5.10 Building a Machine Learning Algorithm . . . 152
5.11 Challenges Motivating Deep Learning . . . 154

II Deep Networks: Modern Practices . . . 165

6 Deep Feedforward Networks . . . 167
6.1 Example: Learning XOR . . . 170
6.2 Gradient-Based Learning . . . 176
6.3 Hidden Units . . . 190
6.4 Architecture Design . . . 196
6.5 Back-Propagation and Other Differentiation Algorithms . . . 203
6.6 Historical Notes . . . 224

7 Regularization for Deep Learning . . . 228
7.1 Parameter Norm Penalties . . . 230
7.2 Norm Penalties as Constrained Optimization . . . 237
7.3 Regularization and Under-Constrained Problems . . . 239
7.4 Dataset Augmentation . . . 240
7.5 Noise Robustness . . . 242
7.6 Semi-Supervised Learning . . . 244
7.7 Multi-Task Learning . . . 245
7.8 Early Stopping . . . 246
7.9 Parameter Tying and Parameter Sharing . . . 251
7.10 Sparse Representations . . . 253
7.11 Bagging and Other Ensemble Methods . . . 255
7.12 Dropout . . . 257
7.13 Adversarial Training . . . 267
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier . . . 268

8 Optimization for Training Deep Models . . . 274
8.1 How Learning Differs from Pure Optimization . . . 275
8.2 Challenges in Neural Network Optimization . . . 282
8.3 Basic Algorithms . . . 294
8.4 Parameter Initialization Strategies . . . 301
8.5 Algorithms with Adaptive Learning Rates . . . 306
8.6 Approximate Second-Order Methods . . . 310
8.7 Optimization Strategies and Meta-Algorithms . . . 318

9 Convolutional Networks . . . 331
9.1 The Convolution Operation . . . 332
9.2 Motivation . . . 336
9.3 Pooling . . . 340
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . 346
9.5 Variants of the Basic Convolution Function . . . 348
9.6 Structured Outputs . . . 359
9.7 Data Types . . . 361
9.8 Efficient Convolution Algorithms . . . 363
9.9 Random or Unsupervised Features . . . 364
9.10 The Neuroscientific Basis for Convolutional Networks . . . 365
9.11 Convolutional Networks and the History of Deep Learning . . . 372

10 Sequence Modeling: Recurrent and Recursive Nets . . . 374
10.1 Unfolding Computational Graphs . . . 376
10.2 Recurrent Neural Networks . . . 379
10.3 Bidirectional RNNs . . . 396
10.4 Encoder-Decoder Sequence-to-Sequence Architectures . . . 397
10.5 Deep Recurrent Networks . . . 399
10.6 Recursive Neural Networks . . . 401
10.7 The Challenge of Long-Term Dependencies . . . 403
10.8 Echo State Networks . . . 406
10.9 Leaky Units and Other Strategies for Multiple Time Scales . . . 409
10.10 The Long Short-Term Memory and Other Gated RNNs . . . 411
10.11 Optimization for Long-Term Dependencies . . . 415
10.12 Explicit Memory . . . 419

11 Practical methodology . . . 424
11.1 Performance Metrics . . . 425
11.2 Default Baseline Models . . . 428
11.3 Determining Whether to Gather More Data . . . 429
11.4 Selecting Hyperparameters . . . 430
11.5 Debugging Strategies . . . 439
11.6 Example: Multi-Digit Number Recognition . . . 443

12 Applications . . . 446
12.1 Large Scale Deep Learning . . . 446
12.2 Computer Vision . . . 455
12.3 Speech Recognition . . . 461
12.4 Natural Language Processing . . . 464
12.5 Other Applications . . . 480

III Deep Learning Research . . . 489

13 Linear Factor Models . . . 492
13.1 Probabilistic PCA and Factor Analysis . . . 493
13.2 Independent Component Analysis (ICA) . . . 494
13.3 Slow Feature Analysis . . . 496
13.4 Sparse Coding . . . 499
13.5 Manifold Interpretation of PCA . . . 502

14 Autoencoders . . . 505
14.1 Undercomplete Autoencoders . . . 506
14.2 Regularized Autoencoders . . . 507
14.3 Representational Power, Layer Size and Depth . . . 511
14.4 Stochastic Encoders and Decoders . . . 512
14.5 Denoising Autoencoders . . . 513
14.6 Learning Manifolds with Autoencoders . . . 518
14.7 Contractive Autoencoders . . . 524
14.8 Predictive Sparse Decomposition . . . 526
14.9 Applications of Autoencoders . . . 527

15 Representation Learning . . . 529
15.1 Greedy Layer-Wise Unsupervised Pretraining . . . 531
15.2 Transfer Learning and Domain Adaptation . . . 539
15.3 Semi-Supervised Disentangling of Causal Factors . . . 544
15.4 Distributed Representation . . . 549
15.5 Exponential Gains from Depth . . . 556
15.6 Providing Clues to Discover Underlying Causes . . . 557

16 Structured Probabilistic Models for Deep Learning . . . 561
16.1 The Challenge of Unstructured Modeling . . . 562
16.2 Using Graphs to Describe Model Structure . . . 566
16.3 Sampling from Graphical Models . . . 583
16.4 Advantages of Structured Modeling . . . 584
16.5 Learning about Dependencies . . . 585
16.6 Inference and Approximate Inference . . . 586
16.7 The Deep Learning Approach to Structured Probabilistic Models . . . 587

17 Monte Carlo Methods . . . 593
17.1 Sampling and Monte Carlo Methods . . . 593
17.2 Importance Sampling . . . 595
17.3 Markov Chain Monte Carlo Methods . . . 598
17.4 Gibbs Sampling . . . 602
17.5 The Challenge of Mixing between Separated Modes . . . 602

18 Confronting the Partition Function . . . 608
18.1 The Log-Likelihood Gradient . . . 609
18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 610
18.3 Pseudolikelihood . . . 618
18.4 Score Matching and Ratio Matching . . . 620
18.5 Denoising Score Matching . . . 622
18.6 Noise-Contrastive Estimation . . . 623
18.7 Estimating the Partition Function . . . 626

19 Approximate inference . . . 634
19.1 Inference as Optimization . . . 636
19.2 Expectation Maximization . . . 637
19.3 MAP Inference and Sparse Coding . . . 638
19.4 Variational Inference and Learning . . . 641
19.5 Learned Approximate Inference . . . 653

20 Deep Generative Models . . . 656
20.1 Boltzmann Machines . . . 656
20.2 Restricted Boltzmann Machines . . . 658
20.3 Deep Belief Networks . . . 662
20.4 Deep Boltzmann Machines . . . 665
20.5 Boltzmann Machines for Real-Valued Data . . . 678
20.6 Convolutional Boltzmann Machines . . . 685
20.7 Boltzmann Machines for Structured or Sequential Outputs . . . 687
20.8 Other Boltzmann Machines . . . 688
20.9 Back-Propagation through Random Operations . . . 689
20.10 Directed Generative Nets . . . 694
20.11 Drawing Samples from Autoencoders . . . 712
20.12 Generative Stochastic Networks . . . 716
20.13 Other Generation Schemes . . . 717
20.14 Evaluating Generative Models . . . 719
20.15 Conclusion . . . 721

Bibliography . . . 723
Index . . . 780

Website

www.deeplearningbook.org

This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.

Acknowledgments

This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Kari Pulli, Tapani Raiko, Anurag Ranjan, Johannes Roith, Halis Sak, César Salgado, Grigory Sapunov, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:

• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.
• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett,
