Science has always been a quest to get nature to disclose her secrets, and for a curious scientist, no secret is more seductive than one concealed in a code.

Among secret codes, the most famous is the one that has been around the longest — the code used by the genetic mechanism governing life itself. For billions of years, living cells built themselves from a blueprint encoded in molecules of DNA. Only the cells knew how to read that code, spelled out using just four letters — the abbreviations for DNA’s building blocks.

Half a century ago, humans cracked the code. But it still hides even deeper secrets. Why did life choose that particular code, for instance? And what wizardly protobiotic alchemy brought it into existence?

When the code was revealed, scientists proposed several scenarios for its evolution. Those suggestions remain among the leading ideas today. But that’s because in the intervening decades little progress has been made. Those original explanations provided “far from an adequate understanding of the code’s evolution,” write Eugene Koonin and Artem Novozhilov in the 2017 Annual Review of Genetics. “Notwithstanding the complete transformation of biology that occurred over these decades, we do not seem to be much closer to the solution.”

Biologists do have a solid understanding of how the code embedded in the DNA blueprint works. Based on that code, cells make molecules, typically proteins — long chains created by linking amino acids together. DNA’s molecular cousin, RNA, decrypts the coded blueprint to determine the order in which various amino acids should be linked. The resulting chains contort themselves into elaborately folded molecular multitaskers that provide the structures and perform the functions of life. Virtually all earthly life, from bacteria to baboons, relies on the same genetic lexicon.

It seems like a simple code: three-letter words composed from an alphabet of only four letters. Since four letters can generate 64 distinct three-letter words, nature should in principle be able to cook up proteins incorporating several dozen different amino acids. But the protein recipe book comprises a repertoire of only 20; many different words signify the same amino acid. Scientists have naturally wondered why all life speaks in a code with so many synonyms. Solving that riddle, though, would require the ability to peer deep into the biological past, to discover how the molecular system for translating the DNA blueprint into actual proteins originated and evolved. That task seems hopeless, for the code surely began to evolve even before the sprouting of today’s known branches of the tree of life.
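The arithmetic behind those 64 words is easy to check. The short Python sketch below simply enumerates every three-letter word over the four DNA letters and compares the count with the 20 amino acids; the variable names are illustrative, and the few words that act as punctuation rather than naming an amino acid are ignored here.

```python
from itertools import product

BASES = "ACGT"      # the four DNA letters
AMINO_ACIDS = 20    # size of the standard protein repertoire, per the text

# Every ordered three-letter word over a four-letter alphabet.
codons = ["".join(letters) for letters in product(BASES, repeat=3)]

print(len(codons))                # 4 ** 3 possible words
print(len(codons) / AMINO_ACIDS)  # average number of words per amino acid
```

With more than three words available per amino acid on average, synonyms are unavoidable; the open question is why the code settled on this particular set of them.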

[Figure: The universal code of life]

“The problem appears extremely and unusually hard,” write Koonin and Novozhilov. “This is one of the most fundamental and hardest problems in all of biology.”

There are clues, though. For one thing, the code is very nearly universal, observed by practically all known life-forms. Occasional exceptions are “minor and of secondary origin,” Koonin and Novozhilov say. Second, the code does not match DNA words to amino acids at random: Patterns in the letter combinations help ensure the fidelity of the decoding process, minimizing errors. (The code is not, however, the optimal possible code for avoiding errors, perhaps another clue to the evolution mystery.) Third, it’s clear that the code did not spring into existence all at once; it probably began as a more limited code for a handful of amino acids, and only over time evolved into today’s code for 20.

Understanding the code’s origins, Koonin and Novozhilov emphasize, will require more than just solving a mathematical cryptographic riddle. You also need to know details of the biology. Transforming the hidden DNA message into working proteins is not as simple as decompressing a zip file. Other molecules come into play — chiefly variants of RNA, and proteins that assist in “translating” the messages coded in DNA’s building blocks, or nucleotides.

The code-containing parts of DNA nucleotides are four chemical bases: adenine, guanine, thymine and cytosine. DNA’s structure is maintained by these bases, pairs of which interlock to hold DNA’s two helical strands together (A pairs with T, G with C). Proteins transcribing the bases in one of the DNA strands manufacture RNA molecules composed of a corresponding string of bases (with one tiny variant — RNA uses uracil instead of thymine in its code).
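As a rough illustration of those pairing rules, here is a minimal Python sketch. It treats a strand as a plain string of letters and deliberately ignores strand direction and all of the molecular machinery involved; the function names are made up for this example.

```python
# Pairing rules from the text: A pairs with T, G pairs with C.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

# When RNA is copied from a DNA strand, uracil (U) stands in for thymine (T).
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand: str) -> str:
    """Return the DNA letters that pair with each letter of `strand`."""
    return "".join(DNA_PAIR[base] for base in strand)


def transcribe(template: str) -> str:
    """Return the RNA letters that pair with each letter of a DNA template."""
    return "".join(RNA_PAIR[base] for base in template)


template = "AGC"
print(partner_strand(template))  # TCG
print(transcribe(template))      # UCG
```

The RNA copy matches the partner DNA strand letter for letter, except that every T becomes a U.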

Each set of three consecutive RNA bases is called a codon, a three-letter word specifying an amino acid (or, in a few cases, a command to stop making the protein). The codon UCG, for example, corresponds to the amino acid serine; CCG encodes proline; ACG means threonine. These “messenger” RNA molecules, consisting of strings of codons, travel to the cell’s protein factory, the ribosome, where another member of the RNA family (transfer RNA) reads the messenger RNA and recruits the proper amino acids, in the proper order, to be assembled by the ribosome.
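The lookup step can be sketched in a few lines of Python. The table below is deliberately partial: it holds only the three codons named above, plus the three stop codons of the standard code (UAA, UAG and UGA, which the text does not list). The function name and the "unknown" placeholder are inventions for this example, and a real ribosome does far more than read a dictionary.

```python
# Partial lookup table: only the codons mentioned in the text.
PARTIAL_CODE = {"UCG": "serine", "CCG": "proline", "ACG": "threonine"}

# Stop codons of the standard genetic code (not listed in the text).
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna: str) -> list:
    """Read an RNA string three letters at a time until a stop codon."""
    chain = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3]
        if codon in STOP_CODONS:
            break  # a stop codon ends the protein
        # Codons missing from this partial table are marked, not guessed.
        chain.append(PARTIAL_CODE.get(codon, "unknown"))
    return chain


print(translate("UCGCCGACGUAAUCG"))
# ['serine', 'proline', 'threonine']
```

The final UCG in the example is never read, because the stop codon before it ends the chain.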

While this description is greatly simplified, it captures the essence of how the code works. It does not, however, reveal how or why that specific code evolved in the first place. There are three main proposals, each with varying degrees of supporting evidence. But none, Koonin and Novozhilov conclude, is thoroughly satisfactory.

One proposal suggests that the shape of some primeval RNA molecules matched the structure of certain amino acids, leading them to hook up and establish their relationship in a code.

A second idea posits that the code evolved in step with amino acid biochemistry. At first, only a few amino acids existed, so many “words” encoded the same acid. Then when primitive metabolism created new amino acids, some words shifted their meaning to code for the newcomers. Thus the code would have evolved for the benefit of diversifying the cell’s proteins by incorporating new amino acids. It sounds logical. But if that’s what happened, today’s code should retain certain mathematical patterns that are not seen, Koonin and Novozhilov note.

A third proposal cites the code’s reliability. Perhaps evolutionary pressure led to a code that minimizes errors in matching words to amino acids. In that respect the code does a good job — at least, it does better than most codes chosen at random would. Mathematical analyses show that a random code would have less than a one-in-a-million chance of being better at limiting errors than today’s code. On the other hand, the number of possible codes (using three-letter words, with a four-letter alphabet, for 20 amino acids) is enormous — something like a million times as many such codes exist as there are atoms in the universe. So even though the code is pretty good, many others would be even better. Besides, resistance to error might just naturally evolve as “a neutral by-product of evolution driven by other factors,” Koonin and Novozhilov observe. For that matter, preventing errors altogether might not be especially advisable, because it would limit the prospect for new “accidental” additions of potentially beneficial amino acids.
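To get a feel for how enormous that number is, one crude way to count is to let each of the 64 words take any of 21 meanings (20 amino acids plus a stop signal). That counting scheme is an assumption of this sketch, not necessarily the one behind the figure quoted above, and it overcounts slightly because it also allows codes that leave some amino acids out.

```python
import math

CODONS = 4 ** 3    # three-letter words over a four-letter alphabet
MEANINGS = 20 + 1  # 20 amino acids plus "stop" (assumption of this sketch)

# Each word independently picks one meaning, so the counts multiply.
upper_bound = MEANINGS ** CODONS  # Python integers are exact at any size

print(f"roughly 10^{math.log10(upper_bound):.1f} possible assignments")
print(len(str(upper_bound)), "digits")
```

Even by this rough count, the total is a number 85 digits long, which is why an exhaustive search for the best possible code is out of the question.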

Perhaps all of these proposals contain elements that played a part in the evolution of the code. But the complete story remains hidden in the deep evolutionary past, before the appearance of what is known by the acronym LUCA, the “last universal common ancestor” of all the life-forms around today. Before the triumph of that cellular ancestor, various groups of viruslike genetic elements competed for supremacy. Different groups of such elements no doubt developed their own “codes” for producing molecules. Koonin and Novozhilov cite an important paper from 2006 by Kalin Vetsigian and collaborators which showed that only one such code (or at most very few) was likely to survive. That conclusion was based on a computer simulation of how primitive life-forms exchanged genes with one another. This “horizontal gene transfer” was essential to the early evolution of life, Koonin and Novozhilov write. Without it, “the transition to the cellular level of complexity simply could not have occurred,” and horizontal gene transfer will work only if the codes in both partners exchanging genes are identical.

If this picture is right, the problem of revealing the genetic code’s origins acquires a new aspect. It may not be such a mystery that all Earth’s life, descended from a common ancestor, observes the same code — it may be that a universal code is necessary in the first place to permit the Darwinian descent of various life-forms from a common ancestor. As Vetsigian and colleagues wrote, “a universal code may be a necessary precondition for common ancestry, indeed even for life as we know it.”

So the very existence of the code, and details about its structure (such as its many synonyms), may contain clues to what life was like before the last universal common ancestor. “Extensive duplication within the translation system opens a window into the deep pre-LUCA past,” Koonin and Novozhilov say.

This realization turns the difficulty of discerning the origin of the genetic code on its head. It may seem to be a hopeless problem because solving it would require time travel into the era before life’s last common ancestor. There the trail of life’s history by DNA descent ends. Traveling farther back seems impossible — until you realize that the genetic code itself provides the clues to what happened before LUCA. It’s a great scientific situation when an impossible problem conceals within itself the secret to its solution. You just have to figure out how to crack the code.