Thursday, 21 August 2014

Deciphering a New Enigma: The Genetic Code

When we imagine code breaking, we conjure up images of dingy basements in the 1940s, with mathematicians striving to decipher Nazi communications during the Second World War. Undoubtedly, this is because breaking the German Enigma code was pivotal in clinching the Allied victory, making it the most famous code (and code-breaking event) in modern history. Part of this breakthrough can be attributed to Alan Turing, who recently came back into the limelight with the centenary of his birth. Cited as the father of modern computer science, Turing was a brilliant mathematician whose vision of intelligent machines was perhaps well ahead of its time, and whose life ended in an untimely suicide.

An Enigma Machine

Gathering intelligence through the mathematics of code breaking is of course still relevant today. Perhaps more surprisingly, modern code breaking is also important in biology, as we have yet to fully decipher the meaning of the genetic code carried by the DNA in every living cell. At first glance this code appears deceptively simple, as it uses only four letters: A, C, T and G. DNA is basically a long string of these four letters, which cells read for the instructions to build proteins. Proteins are essential to the construction of cells, enabling them to grow, divide and carry out their respective functions. Every three letters in the DNA sequence encode one amino acid, the building block of proteins, and the amino acids join together in a chain, like beads on a necklace, in the order given by the DNA sequence. As there are 20 different amino acids, each ‘bead’ on the protein ‘necklace’ encoded by the DNA can only be one of 20 types. This amino acid sequence is said to be the primary structure of a protein.
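To make the "three letters, one bead" idea concrete, here is a minimal Python sketch of reading a DNA string three letters at a time. It uses the standard genetic code table, in which amino acids are written as single letters and three of the 64 possible triplets mean "stop" (shown as `*`). The function name `translate` and the example sequence are illustrative choices, not something from a particular library.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
# '*' marks the three "stop" triplets that end a protein.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read DNA three letters at a time and return the amino acid chain.

    Stops at the first stop triplet; a trailing incomplete triplet is ignored.
    Letters other than A, C, G, T raise a KeyError.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Illustrative sequence only: ATG -> M, GCC -> A, TGA -> stop.
    print(translate("ATGGCCTGA"))
```

Four letters taken three at a time give 4 × 4 × 4 = 64 possible triplets, which is why several different triplets can stand for the same one of the 20 amino acids.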

Proteins are not merely chains of amino acids, as the amino acids are attracted to each other due to the properties of the 20 different types. This means that the beads of the necklace will form coils, twists and loops, as some of the amino acids attach to others in the chain. This is said to be the secondary structure of the protein. In turn, these coils and loops interact with each other further, and twist up the chain of amino acids to form an even more distinct three-dimensional shape, making the tertiary structure of the protein. We can picture the final result by imagining the beads of the necklace being scrunched into a tight ball.


The three-dimensional structure of proteins is governed by the sum of the relationships between the amino acids in the protein

Each protein encoded in DNA has its own unique three-dimensional structure, which is crucial for the protein to carry out its function. The three-dimensional shape determines how the protein can interact with and attach to other proteins to build the structure of cells within the body, and it underpins the work of many important proteins such as insulin and haemoglobin. Just as the shape of a key determines whether it fits a lock, the shape of a protein determines whether it can do its job: when this shape is changed, the protein can no longer perform its allotted task. For instance, in haemophilia a protein called factor VIII, which is intrinsic to the blood-clotting process, has a defective shape, resulting in the prolonged bleeding found in these individuals.

A diagram showing how the chain of amino acids in defective factor VIII folds up in haemophiliacs

Thanks to the Human Genome Project, we have found that human DNA encodes about 20,000 different protein sequences. Deciphering the sequence of amino acids in each is easy, as we know this part of the code; deciphering how those amino acids will interact to form a protein’s three-dimensional structure is trickier. Traditional techniques can be used, but they are slow and expensive, so working out the three-dimensional structure of 20,000 proteins by these methods is not really feasible. This is where mathematics and computational biology are now increasingly important.

By applying mathematics, we have been able to predict how the string of amino acids might form coils, twists and turns in a protein’s secondary structure. In essence, this involves determining the statistical probability of how each amino acid will behave with respect to others in the chain. This works because certain amino acid sequences are common to many proteins, and we know how those sequences are likely to interact to make particular types of coil or twist. By also looking at the probability of how certain amino acids are likely to interact with others, the most likely secondary structure of a protein can be predicted. This is not entirely accurate, because the environment in the cell where the protein is manufactured can also influence its final structure, and common amino acid sequences may coil or twist in slightly different ways. Probability only gives us the most likely outcome, not a certain one, so these predictions are only an approximation. Predicting the three-dimensional tertiary structure is even more difficult: not only are there more amino acid interactions to consider, but any error in the predicted secondary structure is carried forward as well.
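As a rough illustration of the statistical idea, the Python sketch below estimates how strongly each amino acid "prefers" to sit in a coil (a helix) and then uses those preferences to label a new sequence. It is in the spirit of classic propensity-based approaches such as the Chou-Fasman method, but it is a toy, not that method and not what any real prediction tool does. The training sequences, the function names, the window size and the threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical toy training data, NOT real proteins.
# Each pair is (amino acid sequence, labels); 'H' = seen in a helix, '-' = not.
TRAINING = [
    ("AEELLKKAGGPS", "HHHHHHHH----"),
    ("GSPAEELLKAGG", "---HHHHHH---"),
]


def helix_propensities(training):
    """Return P(helix | amino acid) / P(helix) for each amino acid seen.

    Values above 1 mean the amino acid turns up in helices more often than
    average in the training data; values below 1 mean less often.
    """
    total = Counter()
    in_helix = Counter()
    for sequence, labels in training:
        for amino_acid, label in zip(sequence, labels):
            total[amino_acid] += 1
            if label == "H":
                in_helix[amino_acid] += 1
    overall = sum(in_helix.values()) / sum(total.values())
    return {aa: (in_helix[aa] / total[aa]) / overall for aa in total}


def predict_helix(sequence, propensity, window=5, threshold=1.0):
    """Label a residue 'H' if the average propensity of it and its
    neighbours exceeds the threshold, otherwise '-'.

    Amino acids never seen in training get a neutral score of 1.0.
    """
    half = window // 2
    labels = []
    for i in range(len(sequence)):
        neighbours = sequence[max(0, i - half): i + half + 1]
        scores = [propensity.get(aa, 1.0) for aa in neighbours]
        labels.append("H" if sum(scores) / len(scores) > threshold else "-")
    return "".join(labels)


if __name__ == "__main__":
    propensity = helix_propensities(TRAINING)
    print(predict_helix("GGPAEELLKKAG", propensity))
```

Averaging over neighbouring residues reflects the point that an amino acid's behaviour depends on the others around it in the chain, and the result is still only the most likely labelling, not a certain one.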

This is why unlocking the three-dimensional structure of proteins is where the code breaking of the human genome still continues. Even though we have unlocked the sequence of amino acids in the 20,000 proteins of the human genome, in many cases their corresponding three-dimensional structure is unknown. This structure is essential to the function of a protein, and can tell us what role that protein might have within the human body; we are effectively at a standstill in deciphering the code until we can determine it. This presents a considerable problem for contemporary science, but thanks to recent developments in mathematics and computational power, it is now far less daunting and far more achievable.

It does appear that there are a limited number of ways in which the twenty amino acids can interact to form particular structures, and it is believed that there are about 2,000 types of interaction which are common to the majority of proteins. In addition, we know that proteins can only fold in limited ways to form specific geometrical structures: if they bend at certain angles or turn with certain twists, they cannot form a stable shape and so would not form naturally. These rules that govern the three-dimensional structure of proteins have led to several mathematical algorithms that can predict the most probable protein shape from a DNA sequence. The drawback is that these require hefty amounts of computational power, far beyond that which was available at the time of Alan Turing and the Enigma code; only modern supercomputers can calculate all the feasible protein structures from amino acid sequences within an acceptable time. Nevertheless, there is an alternative: using the internet to harness the computing power of thousands of small computers across the globe. Like the millions who downloaded SETI@home to search for intelligent life on other planets, we can all be code-breakers like Alan Turing by downloading protein structure prediction software from projects such as the Human Proteome Folding Project, Rosetta@home and Folding@home.
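A back-of-the-envelope Python sketch shows why the computing load is so heavy and why sharing it out helps. It assumes, purely for illustration, that each amino acid in the chain can take one of three local shapes, so the number of candidate structures multiplies by three with every extra amino acid. The function names and the figure of three are assumptions, and real projects use far more sophisticated sampling than simply listing every candidate.

```python
def conformation_count(n_residues, states_per_residue=3):
    """Number of candidate shapes if every residue independently takes one of
    `states_per_residue` local shapes (an illustrative assumption)."""
    return states_per_residue ** n_residues


def work_units(total, n_machines):
    """Split candidates 0..total-1 into contiguous, near-equal chunks,
    one chunk per volunteer machine."""
    base, extra = divmod(total, n_machines)
    units = []
    start = 0
    for k in range(n_machines):
        size = base + (1 if k < extra else 0)
        units.append(range(start, start + size))
        start += size
    return units


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, "residues:", conformation_count(n), "candidates")
    # Share the candidates for a 10-residue chain between 3 machines.
    for unit in work_units(conformation_count(10), 3):
        print(unit.start, "to", unit.stop - 1)
```

Each machine can work through its own chunk independently and report back, which is the basic reason thousands of home computers can stand in for a single supercomputer.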

Knowing how all 20,000 proteins in the human genome form their three-dimensional structure will be incredibly important in combating complex diseases which have so far proven very difficult to treat. The huge number of proteins that may change in conditions such as cancer, diabetes and heart disease may provide new and effective targets for drugs that can be used to treat and prevent these conditions. Not only this, but we will also know much more about ourselves, as we will have completely decoded the DNA message in every cell in our bodies. With the cost of sequencing DNA plummeting, we may be able to obtain a sequence for every person on the planet, and therefore the structure of every protein within that individual, both when it is functional and when it is defective. This means that drugs may be designed for each specific individual, as unique as their DNA, so that we may all be able to live longer, disease-free lives, all by the power of mathematics and code breaking.

Obviously, we should still celebrate the famous efforts that were needed to decipher the Enigma code, as they were pivotal to the Second World War and to the events of the 20th century. However, in the 21st century we are still fighting a war that requires the mathematics of code breaking: the battle against diseases with a real human cost. As we have celebrated the centenary of the famous code-breaker Alan Turing, perhaps in another century we will be celebrating another great code-breaking event: breaking the code that is found within DNA. And it may prove to have been the combined effort of millions of people across the world, thanks to mathematics and modern computing.

Further Reading:

The Human Genome: Book of Essential Knowledge, John Quackenbush (2011)

The $1000 Genome: The Revolution in DNA Sequencing and the New Era in Personalised Medicine, Kevin Davies (2010)
