a gene. Cells build genes all the time: each time a cell divides, it makes a copy of every gene. If a biochemist could strap himself to the gene-copying enzyme (DNA polymerase), straddling its back as it made a copy of DNA and keeping tabs as the enzyme added base upon base—A, C, T, G, C, C, C, and so forth—the sequence of a gene would become known. It was like eavesdropping on a copying machine: you could reconstruct the original from the copy. Once again, the mirror image would illuminate the original—Dorian Gray would be re-created, piece upon piece, from his reflection.
In 1971, Sanger began to devise a gene-sequencing technique using the copying reaction of DNA polymerase. (At Harvard, Walter Gilbert and Allan Maxam were also devising a system to sequence DNA, although using different reagents. Their method also worked, but was soon outmoded by Sanger’s.) At first, Sanger’s method was inefficient and prone to inexplicable failures. In part, the problem was that the copying reaction was too fast: polymerase raced along the strand of DNA, adding nucleotides at such a breakneck pace that Sanger could not catch the intermediate steps. In 1975, Sanger made an ingenious modification: he spiked the copying reaction with a series of chemically altered bases—ever-so-slight variants of A, C, G, and T—that were still recognized by DNA polymerase, but jammed its copying ability. As polymerase stalled, Sanger could use the slowed-down reaction to map a gene by its jams—an A here, a T there, a G there, and so forth—for thousands of bases of DNA.
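The logic of mapping a gene "by its jams" can be sketched in a few lines of Python. This is a toy model, not Sanger's laboratory protocol: the function names are hypothetical, strand orientation is ignored, and a list index stands in for whatever measurement locates each jam along the copy.

```python
# Toy model of reading a sequence from polymerase "jams".
# All names here are illustrative assumptions, not part of any real library.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def copy_strand(template):
    """Return the strand a polymerase would build: the base-by-base complement.

    Strand direction is ignored in this simplified model.
    """
    return "".join(COMPLEMENT[base] for base in template)


def jam_positions(template):
    """For each altered base, list the positions in the copy where it would stall."""
    jams = {base: [] for base in "ACGT"}
    for position, base in enumerate(copy_strand(template)):
        jams[base].append(position)
    return jams


def read_from_jams(jams):
    """Order every jam by position and read off the base that caused it."""
    ordered = sorted(
        (position, base)
        for base, positions in jams.items()
        for position in positions
    )
    return "".join(base for _, base in ordered)


def recover_template(jams):
    """Reconstruct the original from the copy by complementing it once more."""
    return copy_strand(read_from_jams(jams))


jams = jam_positions("ACTGCCC")
print(read_from_jams(jams))    # the copied strand
print(recover_template(jams))  # the original template
```

For the template ACTGCCC, the jams fall at position 0 for T, positions 1, 4, 5, and 6 for G, position 2 for A, and position 3 for C. Reading them in order gives the copy, TGACGGG, and complementing the copy returns the original.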
On February 24, 1977, Sanger used this technique to reveal the full sequence of a virus—ΦX174—in a paper in Nature. Only 5,386 base pairs in length, phi was a tiny virus—its entire genome was smaller than some of the smallest human genes—but the publication announced a transformative scientific advance. “The sequence identifies many of the features responsible for the production of the proteins of the nine known genes of the organism,” he wrote. Sanger had learned to read the language of genes.
The new techniques of genetics—gene sequencing and gene cloning—immediately illuminated novel characteristics of genes and genomes. The first, and most surprising, discovery concerned a unique feature of the genes of animals and animal viruses. In 1977, two scientists working independently, Richard Roberts and Phillip Sharp, discovered that most animal proteins were not encoded in long, continuous stretches of DNA, but were actually split into modules. In bacteria, every gene is a continuous, uninterrupted stretch of DNA, starting with the first triplet code (ATG) and running contiguously to the final “stop” signal. Bacterial genes do not contain separate modules, and they are not split internally by spacers. But in animals, and in animal viruses, Roberts and Sharp found that a gene was typically split into parts and interrupted by long stretches of stuffer DNA.
As an analogy, consider the word structure. In bacteria, the gene is embedded in the genome in precisely that format, structure, with no breaks, stuffers, interpositions, or interruptions. In the human genome, in contrast, the word is interrupted by intermediate stretches of DNA: s . . . tru . . . ct . . . ur . . . e.
The long stretches of DNA marked by the ellipses (. . .) do not contain any protein-encoding information. When such an interrupted gene is used to generate a message (that is, when DNA is used to build RNA), the stuffer fragments are excised from the RNA message, and the RNA is stitched together again with the intervening pieces removed: s . . . tru . . . ct . . . ur . . . e becomes simplified to structure. Roberts and Sharp later coined a phrase for the process: gene splicing or RNA splicing (since the RNA message of the gene is “spliced” to remove the stuffer fragments).
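The word analogy translates directly into a short Python sketch. The representation below is an assumption made for illustration: the gene is a list of labeled segments, with "module" for the protein-encoding pieces and "stuffer" for the intervening DNA, and the function names are hypothetical.

```python
# Toy model of a split gene, using the word analogy from the text.
# Segment labels and function names are illustrative assumptions.

INTERRUPTED_GENE = [
    ("module", "s"),
    ("stuffer", "..."),
    ("module", "tru"),
    ("stuffer", "..."),
    ("module", "ct"),
    ("stuffer", "..."),
    ("module", "ur"),
    ("stuffer", "..."),
    ("module", "e"),
]


def transcribe(gene):
    """Build the raw message: every segment is copied, stuffers included."""
    return "".join(text for _, text in gene)


def splice(gene):
    """Excise the stuffer segments and stitch the remaining modules together."""
    return "".join(text for kind, text in gene if kind == "module")


print(transcribe(INTERRUPTED_GENE))  # s...tru...ct...ur...e
print(splice(INTERRUPTED_GENE))      # structure
```

A bacterial gene, in this model, would be a single "module" segment, so the raw message and the spliced message would be identical.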
At first, this split structure of genes seemed puzzling: Why would an animal genome waste such long stretches of DNA splitting genes into bits and pieces, only to stitch them back into a continuous message? But the inner logic of split genes soon became evident: by splitting genes into modules, a cell could generate bewildering combinations of messages out of a single gene. The word s . . . tru . . . c . . . t . . . ur . . . e can be spliced to yield cure and true and so forth, thereby creating vast numbers of variant messages—called isoforms—out of a single gene. From g . .