When we imagine code breaking, we conjure up images of dingy
basements in the 1940s, with mathematicians striving to decipher Nazi
communications during the Second World War. Undoubtedly, this is because
breaking the German Enigma code became pivotal for clinching the Allied victory,
subsequently making it the most famous code (and code-breaking event) in modern
history. Part of this breakthrough can be attributed to Alan Turing, who
recently came back into the limelight due to the centennial anniversary of his
birth. Cited as the father of modern computer science, Turing was a brilliant
mathematician whose vision of intelligent machines was perhaps well ahead of its
time, and whose life was cut short by his untimely suicide.
An Enigma Machine
Gathering intelligence through the mathematics of code breaking
is of course still relevant to us today. However, modern code-breaking is also
surprisingly important in biology, as we have yet to fully decipher the meaning of the genetic code carried by the DNA in every living cell. At first glance this code appears deceptively simple, as it uses only
four letters: A, C, T and G. DNA is basically a long string of these four
letters, which cells read for the instructions to build proteins. Proteins are highly
important to the construction of cells, enabling them to grow, divide and carry
out their respective functions. Every three letters in the DNA sequence encode
one amino acid. Amino acids are the building blocks
of proteins, and they join together in a chain, like beads on a
necklace, following the DNA sequence. As there are 20 different amino acids,
each position in the chain can be only one of 20 different amino acid ‘beads’ forming the protein
‘necklace’ encoded by the DNA. This amino acid sequence is said to be the primary structure of a protein.
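The three-letter reading described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real bioinformatics tool: it reads the DNA letters directly (skipping the RNA copy that cells actually make first), it uses the standard genetic code with one-letter amino acid abbreviations, and the names `CODON_TABLE` and `translate` are chosen purely for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# Each letter is a one-letter amino acid code; "*" marks a stop signal.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every three-letter word (codon) to the amino acid it encodes.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read DNA three letters at a time and return the chain of amino acids."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # a stop signal ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # prints "MAW"
```

There are 64 possible three-letter words but only 20 amino acids, so several words spell the same 'bead'; this is why the table above contains repeated letters.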
Proteins are not merely straight chains of amino acids, however, as the amino
acids attract one another to different degrees, depending on the properties of each of the 20 types.
This means that the beads of the necklace will form coils, twists and loops, as
some of the amino acids attach to others in the chain. This is said to be the secondary structure of the protein. In
turn, these coils and loops interact with each other further, and twist up the
chain of amino acids to form an even more distinct three-dimensional shape,
making the tertiary structure of the
protein. We can picture the final result by imagining the beads of the necklace
being scrunched into a tight ball.
The three-dimensional structure of proteins is governed by the sum of the relationships between the amino acids in the protein
Each protein encoded in DNA has its own unique three-dimensional
structure, which is crucial for the protein to carry out its function. The
three-dimensional shape determines how the protein can interact with and attach to
other proteins to build the structure of cells within the body, and it is equally
vital to the working of many important proteins like insulin and haemoglobin. Just as
the shape of a key determines whether it fits a lock, the shape of a protein
determines whether it can do its job: when this shape is
changed, the protein can no longer perform its allotted task. For instance, in
haemophilia, we know that a protein called factor VIII, which is intrinsic to the blood-clotting process, has
a defective shape, resulting in the prolonged bleeding found in these
individuals.
A diagram showing how the chain of amino acids in defective factor VIII folds up in haemophiliacs
Thanks to the Human Genome Project, we have found that human
DNA encodes about 20,000 different protein sequences. Deciphering the
sequence of amino acids in each is easy, as we know this part of the code; however,
deciphering how these amino acids will interact to form a protein’s three-dimensional structure
is trickier. Traditional experimental techniques can be used, but deciphering the
three-dimensional structure of 20,000 proteins by these methods is not a
feasible task, as they are slow and expensive. This is
where mathematics and computational biology are now increasingly important.
By applying mathematics, we have been able to predict
how the string of amino acids might form coils, twists and turns in a protein’s
secondary structure. In essence, this
involves determining the statistical probability of how each amino acid will behave
with respect to others on the chain. This is because common amino acid sequences
are found in proteins, and we know how those sequences are likely to interact
to make particular types of coil or twist. In addition, by looking at the
probability of how certain amino acids are likely to interact with others, the
most likely secondary structure of a protein can be predicted. These predictions are not
entirely accurate, because the environment in the cell where the protein is manufactured
can also influence its final structure, and common amino acid sequences may
coil or twist in slightly different ways. As we all know, probability only
gives us the most likely outcome, not a certain one, making these predictions
only an approximation. Predicting the three-dimensional tertiary structure is
even more difficult: not only are there more amino acid interactions to
consider, but there may be some error in the predicted secondary structure as well.
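To make the statistical idea concrete, here is a toy sketch in Python. It gives each amino acid a score for how strongly it favours a coil-like helix or a flatter strand, averages those scores over a small sliding window, and labels each position with the more likely structure. The scores in `HELIX` and `STRAND` are invented for illustration and are not measured values, and the window size, threshold and function name `predict` are likewise assumptions; real prediction methods are considerably more sophisticated.

```python
# Hypothetical propensity scores, for illustration only (NOT measured data).
# Above 1 means the amino acid tends to favour that structure; below 1, to avoid it.
HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "V": 1.0,
         "I": 1.0, "Y": 0.7, "S": 0.8, "G": 0.6, "P": 0.6}
STRAND = {"A": 0.8, "E": 0.4, "L": 1.3, "M": 1.0, "V": 1.7,
          "I": 1.6, "Y": 1.5, "S": 0.8, "G": 0.8, "P": 0.6}


def predict(sequence, window=5, threshold=1.05):
    """Label each position H (helix), S (strand) or C (neither)."""
    half = window // 2
    labels = []
    for i in range(len(sequence)):
        # Look at the neighbours on either side, since amino acids
        # influence one another along the chain.
        segment = sequence[max(0, i - half): i + half + 1]
        helix = sum(HELIX.get(r, 1.0) for r in segment) / len(segment)
        strand = sum(STRAND.get(r, 1.0) for r in segment) / len(segment)
        if helix >= threshold and helix > strand:
            labels.append("H")
        elif strand >= threshold and strand > helix:
            labels.append("S")
        else:
            labels.append("C")
    return "".join(labels)


print(predict("AEEAAEEA"))  # prints "HHHHHHHH"
print(predict("VIVYVIV"))   # prints "SSSSSSS"
```

Because every label is only the most probable choice given its neighbours, a single unusual amino acid can tip a window one way or the other, which is exactly why such predictions remain approximations.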
This is why the problem of unlocking the three-dimensional
structure of proteins is where the code breaking of the human genome still
continues. Even though we have unlocked the sequence of amino acids in the
20,000 proteins of the human genome, in many cases their corresponding three-dimensional
structure is unknown. This structure is essential to the function of a protein,
and can tell us what role that protein might have within the human body; we are
effectively at a standstill in deciphering the code until we can do this. This presents
a considerable problem for contemporary science, but thanks to
recent developments in mathematics and computational power, it is now far less
daunting and far more achievable.
It does appear that there is a limited number of ways in which
the twenty amino acids can interact to form particular structures, and it is
believed that there are about 2,000 types of interaction which are common to the
majority of proteins. In addition, we do know that proteins can only fold in
limited ways to form specific geometrical structures: if they bend at
certain angles or turn with certain twists, they cannot form a stable shape
and so would not form naturally. These rules that govern the three-dimensional
structure of proteins have resulted in several mathematical algorithms that can
predict the most probable protein shape from a DNA sequence. However, the
drawback is that these require hefty amounts of computational power, far beyond that
which was available at the time of Alan Turing and the Enigma code; only modern
supercomputers have the ability to calculate all the feasible protein
structures from amino acid sequences within an acceptable time. Nevertheless,
there is an alternative method: utilising the internet and the computing
power of thousands of small computers across the globe. Like the millions who
downloaded the SETI@home software to search for intelligent life on other
planets, we can all be code-breakers like Alan Turing by downloading protein
structure prediction software from projects such as the Human Proteome Folding Project,
Rosetta@home and Folding@home.
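A rough back-of-the-envelope calculation shows why so much computing power is needed, and why sharing the work helps. The sketch below assumes, purely for illustration, that each amino acid in a chain can sit in one of three positions and that a computer can check a fixed number of shapes per second; both figures, and the function names, are assumptions rather than properties of any real project. Real prediction software does not try every shape, but relies on rules like those described above to rule most of them out.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def conformations(n_residues, states_per_residue=3):
    """Number of possible shapes if every amino acid has a few options."""
    return states_per_residue ** n_residues


def years_to_enumerate(n_residues, checks_per_second,
                       machines=1, states_per_residue=3):
    """Years needed to check every shape, sharing the work evenly."""
    total = conformations(n_residues, states_per_residue)
    seconds = total / (checks_per_second * machines)
    return seconds / SECONDS_PER_YEAR


# A modest protein of 100 amino acids already has about 5 x 10^47 shapes.
print(conformations(100))

# Doubling the number of machines halves the time, but brute force on a
# chain this long remains hopeless even with a million computers.
print(years_to_enumerate(100, checks_per_second=1e9, machines=10**6))
```

The time falls in direct proportion to the number of computers taking part, which is the appeal of volunteer projects; the folding rules do the rest by shrinking the number of shapes worth checking in the first place.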
Knowing how all 20,000 proteins in the human genome form
their three-dimensional structure will be incredibly important in combating complex
diseases which have so far proven very difficult to treat. The huge number of
proteins that may change in conditions such as cancer, diabetes and heart
disease may provide new and effective targets for drugs which can be used to
treat and prevent these conditions. Not only this, but we will know much more about
ourselves, as we will have completely decoded the DNA message in every cell in
our bodies. Due to the plummeting costs of sequencing DNA, we may be able to
obtain a sequence for every person on the planet, and therefore the structure
of every protein within that individual – both when they are functional and
when they are defective. This means that drugs may be designed for each
specific individual, unique like their DNA, meaning that we may all be able to
live longer, disease-free lives, all by the power of mathematics and code
breaking.
Obviously, we should still celebrate the famous efforts that
were needed to decipher the Enigma code, as they were pivotal to events in the 20th Century and the Second World War. However, in the
21st Century, we are still fighting a war that requires the
mathematics of code-breaking – the battle against diseases with a real human
cost. As we have celebrated the centenary of the famous code-breaker Alan
Turing, perhaps in another century we will be celebrating another
great code-breaking event: breaking the code that is found within DNA. Not only
this, but it may have been achieved through the efforts of millions of people across the world, thanks
to mathematics and modern computing.
Further Reading:
The Human Genome: Book of Essential Knowledge, John Quackenbush (2011)
The $1000 Genome: The Revolution in DNA Sequencing and the New Era in Personalised Medicine, Kevin Davies (2010)