PDF Archive

Easily share your PDF documents with your contacts, on the Web and Social Networks.

Share a file Manage my documents Convert Recover PDF Search Help Contact

the game of chess and searches in protein sequence space .pdf

Original filename: the-game-of-chess-and-searches-in-protein-sequence-space.pdf
Title: PII: S0167-7799(98)01188-3

This PDF 1.2 document has been generated by QuarkXPressª: LaserWriter 8 B1-8.1.1 / Acrobat Distiller 3.01 for Power Macintosh, and has been sent on pdf-archive.com on 24/10/2016 at 23:24, from IP address 98.163.x.x. The current document download page has been viewed 355 times.
File size: 70 KB (3 pages).
Privacy: public file

Download original PDF file

Document preview


The game of chess and searches in
protein sequence space
The victory of the computer Deep
Blue 2 over the world chess champion, Gary Kasparov, symbolizes the
triumph of sophisticated calculations
on a superpowerful computer over
some of the activities in which the
human mind was thought to excel.
The amazing fact is, however, that
Deep Blue was not unbeatable; on
the contrary, it lost the first game, did
not seek a draw in the second (and
lost) and struggled numerous times in
positions in which Kasparov had the
advantage. Deep Blue simply played
better than Kasparov in that particular place, at that particular time,
under the rules adopted in the match.
Deep Blue’s computational abilities
are indeed awesome: it can analyse
2 3 108 positions sec21, or about
4 3 1010 positions during the time
allotted for a move, typically 3 min
(Ref. 1). Brute-force calculation,
however, is not the main strength of
Deep Blue; this lies rather in the skilfully written software, which quickly
eliminates the majority of moves at
each position as weak and thus not
acceptable. The tree of possible
moves and countermoves is searched
to a depth of 12 half-moves on each
side (12 ply in chess terminology).
The best three moves in the position
being analysed are then subjected to
even deeper analysis to a depth of
22 ply, after which the best move is
selected by the computer. Bruteforce calculation would enable the
computer to analyse the tree of positions to a depth of only 6 ply, because
406 (≈ 4 3 109) is close to the number of positions that can be analysed
in 3 min. However, the tree is actually searched to a depth of 12 ply as
a result of applying the a–b-pruning
algorithm, which effectively doubles
the ply level2,3. (This is, of course, a
greatly simplified summary of how
the program functions, and is far
from an adequate reflection of the
many person-years of programmers’
and engineers’ time and effort.)
Deep Blue did not have to play
perfect chess to beat the world champion; it simply needed to play better
than Kasparov. What would it take
for the computer to play a ‘perfect’
chess game? The computer would


need to look ahead for at least 40
moves, or 80 ply, which is the length
of a typical chess game. As the number of legal moves in a typical position on the chessboard is about 40
(M. Campbell, pers. commun.), this
translates into roughly 4080 5 10128
possible final positions to be analysed,
and into the same number of possible
chess games. For all practical purposes,
this number is not achievable (for
comparison, the estimated number
of atoms in the universe is ‘only’
about 1080).

Searches in protein sequence
space and phage display
What does all of this have to do
with searches in protein sequence
space? It turns out that the numerical complexities of both sequence
space and ‘chess space’ are roughly
the same. There are 20100 (≈ 10130)
possible protein sequences for a protein 100 amino acid residues in length
with 20 amino acids available at any
position of the polypeptide chain. A
complete and exhaustive search for
the fittest of all possible proteins, a
‘perfect’ protein, is equivalent to the
search for a perfect chess game, and
similarly unachievable, both for the
scientist working in the lab and for
nature. It can be estimated that less
than 1050 protein molecules could
have been made during the 4 billion
years of the history of life on the
Earth – [(4 3 109 yr) 3 weight of
carbon in living biomass on the

Earth] 4 weight of one protein molecule, assuming that the average lifetime of a protein molecule is 1 year
– and thus that less than 1050 proteins
can ever have been tested for function. Nevertheless, the complexity of
life is astounding. Certainly, farfrom-perfect search strategies in the
protein sequence space are sufficient
for the creation of a large number of
proteins and the maintenance of life.
But just how thorough must the
searches be to yield functional proteins that are indistinguishable from
natural proteins? Before we can find
the answer, we must pinpoint the
analogy between the search in the
protein sequence space and the game
of chess. Consider an attempt to construct a protein as a series of ‘moves’,
starting from the N terminus. The
‘sequence-space game’ continues by
selecting the second amino acid in
the polypeptide chain (second ply),
then the third, and so on, until the
100th amino acid is placed at the
C terminus (Table 1). If amino acids
are chosen at random, the peptide
generated would have few, if any, of
the attributes of natural protein
sequences. Proteins of 100 amino
acids are used here as a model system
because this size (10 kDa) is the
beginning of the molecular weight
size range characteristic of natural
proteins4 and is also typical of many
functional protein domains. Larger
proteins are frequently composed of
several smaller domains and have
evolved by mixing and matching the
smaller domains, rather than by
selection from random sequences.
Proteins can have different functions, including the ability to bind
to small molecules or other proteins
and enzymatic activity. Any given

Table 1. Analogy between choices of the moves in the chess game
and the choices of amino acids in the protein sequence.














Copyright © 1998, Elsevier Science Ltd. All rights reserved. 0167 – 7799/98/$19.00. PII: S0167-7799(98)01188-3







protein may perform several functions, such as binding to two or three
different molecules and having one
or more enzymatic activities. One of
the more sophisticated protein designs
is that of the DNA polymerases: a
single protein (often one polypeptide
chain) is able to bind to both the DNA
template and to single nucleotides, to
catalyse the incorporation of the
nucleotides into a growing DNA
chain, to move along the template,
and to have one or two exonuclease
activities. In the current analysis, we
will be concerned with one of the
simplest functions: the binding of the
protein to another protein.
A model has been proposed for the
sequence space for proteins roughly
100 amino acid residues long5. This
model was based on the observed
frequencies of deleterious mutations
and of mutations that improve the
‘fitness’ of the proteins (i.e. activity,
binding constant, etc.) in natural proteins, and on the frequency of overlapping genes among all genes. The
main conclusion from this model is
that, of 100-residue-long random
proteins, from one in 1014 to one in
1017 are able to bind another peptide.
This estimate seems very high, because
it implies that there are 10113 to 10116
sequences among the 10130 possible
protein sequences of this size that can
bind to a target specifically and tightly.
However, there is a growing body
of data that indicates that this frequency estimate may be accurate.
This includes phage-display experiments, which are rapidly becoming
one of the more popular techniques
in molecular biology6. In a typical
phage library, the phage carries a
recombinant peptide fused to the
phage coat protein pIII. Protein molecules are selected from phage
libraries by biopannning, a process in
which random peptides displayed on
the phage are subjected to affinity
enrichment. The sequence of the
protein is then deduced from the
DNA sequence carried by the phage.
The length of the longest random
peptides displayed is currently in the
40 amino acid range6, and the library
typically contains 108–1010 clones7.
Peptides isolated by phage display
possess many properties of natural
ligands – they bind to targets with
high affinity and can, in some cases,
adopt a unique three-dimensional
(3D) structure when bound8. The
peptides can, additionally, be potent

Peptides derived from phage
libraries of random peptides cannot
be considered to be optimized, and
are thus often subjected to mutagenesis to improve such properties as
their binding constant or biological
activity. A successful approach to
producing peptides with agonistic
properties and the 3D structure
when bound to the receptor involves
the sequential use of two phage
libraries. First, a random peptide
library is used to develop a consensus sequence, then a custom phage
library (composed of mutated
sequences related to the consensus
sequence from the first library) is
screened; the binding properties of
peptides from the primary library
are improved as a result of the secondary screening9,10. The successes
of two-phase screening imply that
the frequency of agonistic peptides
among random peptides has to be at
least one in 1016 to 1020 (i.e. the
product of the two library sizes),
which is close to the range predicted

In the analogy between playing the
game of chess and searching the protein sequence space, the search for a
protein is considered to be a series of
100 choices: choice 1 is for the
amino acid at the N terminus of the
protein, choice 2 is for residue 2, and
so on until amino acid 100 is chosen
for the C terminus of the protein.
This is analogous to playing the game
of chess, in which the selection of
moves at each stage determines the
quality of the game and the final
result. We know from the estimate
provided above5 that among the
10130 possible random ‘games’ in the
protein sequence space, 10113 to 10116
are successful (i.e. produce a binding
protein that is stable and folds into a
stable structure). How restrictive
should the selection of each ‘move’
(i.e. amino acid) have to be at each
position to yield a successful search?
It can be readily calculated (Eqn 1)
that the fraction a of all L-residuelong proteins that can be built using
a subset s of 20 amino acids is equal
a 5 (s 4 20)L


Deriving s from the above equation we get Eqn 2:
s 5 20 3 a1/L


Assuming that a 5 10217 to 10214
(as discussed above) and L 5 100,
then the number of amino acids
allowed at any position of the
polypeptide chain (s) that will lead to
one of 10113 to 10116 proteins is
approximately 14. This means that
the ability to eliminate six of the 20
amino acids during the course of this
protein-sequence-design process is
sufficient to assure the success of the
protein design. The amino acids
eliminated can vary at different positions. To be more precise, for the nth
position in the chain, the sequence of
all residues from 1 to n21 needs to
be evaluated; the six amino acids
deemed least suitable for the function
of the full-length protein may then
be eliminated as candidates for the
nth position and, finally, a random
choice can be made from the 14
remaining amino acids. This process
is then repeated for every n (from
n 5 1 to n 5 100). The reduction of
the number of amino acids needed at
any position is consistent with a
recent finding11 that a five-aminoacid alphabet is sufficient to encode
a small protein, such as the SH3
domain (it should be stressed that the
main conclusion presented here is
not to reduce the total number of
amino acids used to design proteins).
This search can be translated into
the language of chess (Table 2). If the
frequency of games of the desired
skill level is a 5 10214 to 10217 of all
possible games, taking into account
the average number of ply per game
(L 5 80) and the average number of
choices per ply (40), the number of
moves allowed at each ply will be,
from Eqn 2, approximately 26 (i.e.
40 3 a1/L). A computer playing
chess at a level analogous to the level
of successful searches in the protein
sequence space would thus need to
eliminate 14 moves out of the 40
possible moves in a typical position,
and then to choose at random one of
26 remaining moves. The level of
chess played in this fashion would be
very low – in fact, far too low to play
competitive chess. This is the level of
play to be expected from a novice
who has just learned the rules of chess
and barely knows the objectives and
general principles of the game. For
comparison, a grand-master level of
play might require that the player
correctly eliminate the worst 38 out
of 40 moves at every ply being
analysed and choose one of the two
remaining moves.



Table 2. Comparison between a chess game and searches in
protein-sequence space


80 ply (40 moves) per game
40 possible moves per ply
4080 chess games possible (5 10128)

100 positions per chain
20 amino acids used
20100 proteins possible (5 10130)

‘Monkey’ game
Make any of 40 allowed moves, play
one game

Phage display, random libraries
Pick any of 20 aa, play the ‘game’ 109 times

Grandmaster’s game
Identify 2 best moves and make one
of them.

Successful strategy
Eliminate 6 least desirable amino acid, pick
any of the remaining 14 amino acid when
growing the polypeptide chain

Current approaches to searching
the sequence space by selecting
proteins from libraries of random
proteins on filamentous phage are
analogous to playing ‘monkey’ chess,
that is, making mindless moves (but
in accordance with the chess regulations). The game, however, is
played not once but 109 times (the
size of the library) and the best of
these 109 games is selected.
Despite a few similarities, there are
differences that make playing chess
distinct from designing protein
sequences. The game of chess has a
direction, because chess pieces can
take other pieces or force certain
moves as a result of a check. The
search through chess space is such
that the previous moves are
unchangeable (i.e. if you are at a
particular point in the game you
can’t change your previous moves);
protein sequence space lacks this
temporal dimension. Additionally,
the game has an end to it, mate
or a draw. By contrast, protein
sequences can be designed equally
well (it seems) starting from the
N terminus as from the C terminus
– no directionality here. Also, the
number of moves in an average
position on a chess board is 40,
whereas the number of amino acids
available is 20, which is a rather
minor difference. More significantly,
however, the choice of amino acids
is the same regardless of the position
along the polypeptide chain, while
the choice of moves changes greatly
as the chess game progresses. This
makes it cumbersome to define the
distance function in the ‘chess space’,
while the commonly accepted distance function in the protein
sequence space is quite simple, and


equal to the number of amino acid
differences between the two proteins.
Considering that the protein
sequence space and the game of chess
are equally complex with respect to
the number of possible games or
sequences and that the search for a
good game or sequence can be done
according to similar principles (a
series of selections), what, then,
allows a grand master (or a computer
such as Deep Blue) to play the game
of chess at an advanced level, while
protein engineers cannot even play a
novice game in the sequence space?
The principal reason is that protein
engineers do not have at their disposal a fitness function to evaluate
random protein sequences, or parts
of them, for attributes such as folding, structure, stability or binding to
the target. This contrasts with the
highly accurate algorithms to rate
positions on a chessboard that are
implemented in the best computer
programs, such as Deep Blue. Why
then do we not have appropriate
algorithms to rate the fitness of protein sequences or to search the
sequence space as we do for chess?
One evident reason is that the structure of a protein of 100 amino acids
and 700 nonhydrogen atoms in three
dimensions is much more complex
than the positions on a chessboard
with no more than 32 pieces on 64
squares in two dimensions. Interactions between the atoms are incomparably more complex than the
yes–no relationship between the
pieces on the chessboard (a piece can
either take the position of another
piece or it cannot). The fact is that
we do not have the algorithms reliably to predict the 3D structure or
function of a protein from its

sequence and, furthermore, we are
not even close to having one, despite
the fact that this task is the holy grail
of the protein-science community.

The message, therefore, is that we
do not need a perfect algorithm to
predict the structure and function of
a protein from its sequence in order
to explore the sequence space successfully. Even a relatively small improvement in our ability to evaluate
a random protein sequence can add
a lot to our ability to construct proteins with desired properties de novo.
We do not even have to play the
chess equivalent of a novice game
in the sequence space to do well.
An incremental improvement combined with the growing power of
phage-display techniques and other
innovative selection techniques
could speed up the discovery and
applications of novel nature-free
I thank M. Campbell of the
IBM Deep Blue team for advice,
and A. Blume, J. Beasley and
D. Klebanov (DGI BioTechnologies)
for reading the manuscript and
useful comments.
1 Hamilton, S. and Garber, L. (1997)
Computer 30, 29–35
2 de Groot, A. (1965) Thought and Choice in
Chess, Mouton
3 Levy, D. and Newborn, M. (1991) How
Computers Play Chess, Computer Science
4 Savageau, M. (1986) Proc. Natl. Acad. Sci.
U. S. A. 83, 1198–1202
5 Mandecki, W. (1991) in Molecular
Evolution on Rugged Landscapes: Proteins,
RNA and the Immune Systems (Perelson, A. S.
and Kauffman, S. A., eds), pp. 239–254,
6 Ravera, M. W. et al. Oncogene (in
7 Lowman, H. B. (1997) Annu. Rev.
Biophys. Biomol. Struct. 26, 401–424
8 Livnah, O. et al. (1996) Science 273,
9 Cwirla, S. E. et al. (1997) Science 276,
10 Wrighton, N. C. et al. (1996) Science 273,
11 Riddle, D. S. et al. (1997) Nat. Struct. Biol.
4, 805–809

Wlodek Mandecki
DGI BioTechnologies, PO Box 424, 40
Talmadge Rd, Edison, NJ 08818, USA.
(E-mail: mandecki@dgibt.com)


the-game-of-chess-and-searches-in-protein-sequence-space.pdf - page 1/3
the-game-of-chess-and-searches-in-protein-sequence-space.pdf - page 2/3
the-game-of-chess-and-searches-in-protein-sequence-space.pdf - page 3/3

Related documents

the game of chess and searches in protein sequence space
genetic engineering
ghrp 6 peptide
13 protein synthesis and mutations
synthetic dna libraries
o6 proteins carb 1

Related keywords