Tuesday, May 06, 2014

Biology's Dirty Little Secret

Biology has a dirty little secret. It's a well-known secret among those who deal intensively with sequenced-genome data, but I suspect many non-biologists are unaware of the problem, which is this: much of the existing genome data, for sequenced genomes ranging from bacteria to human, is either corrupt or misannotated.

"Junk DNA" probably doesn't exist in living cells. But it certainly exists in published genomes.

A substantial portion of published genome data is suspect at this point, whether because of contamination, technical problems with DNA sequencing technology, or faulty gene annotation. An example is the Oryza sativa indica (rice) genome, which inexplicably contains at least 10% of the genome of the bacterium Acidovorax citrulli. There's also a Culex (mosquito) genome with a complete copy of the Wolbachia genome embedded in it. The genome of Rothia mucilaginosa DY-18 contains over 300 genes incorrectly annotated in antisense orientation (as does the genome of Burkholderia pseudomallei strain 1710b, a truly execrable train-wreck of a genome).

Another example of a genome gone wrong (arguably) is that of the bacterium Ktedonobacter racemifer, which is filled with forward and backward copies of transposases. Incredibly, one in 13 Ktedonobacter genes is a transposase, integrase, or resolvase (and that's not counting the many "hypothetical proteins" with "transposase-like" mentioned in the gene ontology notes). Disregarding the 40% of that organism's genes that are marked as hypothetical proteins, one can say that in Ktedonobacter, one in four genes of known function is a transposase, integrase, or resolvase. (Some of the organism's 4000+ "hypothetical proteins" are actually transposases incorrectly annotated in an antisense orientation.) Common sense says something's amiss.

The misannotation problem is getting worse over time. This graph (from Schnoes et al., 2009, PLoS ONE) shows, for each year, the number of sequences deposited into the GenBank NR database (left y-axis, bars), with correctly annotated sequences in green and misannotated sequences in red. The fraction of each year's deposits that were misannotated (right y-axis) is plotted as open circles connected by a black line, to make the overall trend easier to see.

The "dark matter" problem in microbial genetics is widespread and openly acknowledged. At least 20% 28.3% (according to the Joint Genome Institute) of bacterial genes are annotated as "hypothetical protein," and most of these are so annotated because they have no sequence similarity match to any known protein. In many cases, there's no match because many of the sequences are in the wrong reading frame, or have an improperly located start codon (or other serious issues). When Ely and Scott (PLoS ONE, 2014) manually reannotated the genome of the bacterium Caulobacter crescentus, they identified 11 new genes, modified the start site of 113 genes, changed the reading frame of 38 genes, and found that 112 "hypothetical proteins" were actually non-coding DNA (not genes at all). A recent transcriptome analysis of the archaeon Sulfolobus solfataricus resulted in correction of 162 gene annotations and the addition of 80 new open reading frames. But these numbers barely hint at the extent of gene misannotation. In examining the Gene Ontology database (GOSeqLite), Jones et al. found:
Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%.
Surprisingly, the use of sequence similarity as a guide to function identification is less reliable than non-ISS methods. This is no doubt partly a reflection of the fact that gene databases contain a great deal of aberrant data. Gene-annotation programs like the widely used Glimmer (Gene Locator and Interpolated Markov Modeler) have to be trained on a training set of known genes. If the training set contains faulty data, it's a classic garbage-in, garbage-out (GIGO) situation.
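To make the GIGO point concrete, here's a toy sketch. It is not Glimmer's actual interpolated Markov model, and the sequences are invented, but it shows the mechanism: train a simple codon-frequency model on a set of putative genes and score a genuine gene against it. If some of the "genes" in the training set are actually reverse-strand misannotations, the learned codon statistics shift and real genes start to score worse.

```python
# Toy illustration of the GIGO effect in gene-model training (NOT Glimmer's
# actual algorithm). We "train" a codon-frequency table on a set of putative
# genes, then score a test gene. A training set contaminated with
# reverse-annotated genes skews the table. All sequences here are made up.
from collections import Counter
from math import log

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def codons(seq):
    return [seq[i:i+3] for i in range(0, len(seq) - 2, 3)]

def train(training_genes):
    """Build a codon log-frequency table from a list of gene sequences."""
    counts = Counter()
    for gene in training_genes:
        counts.update(codons(gene))
    total = sum(counts.values())
    # add-one smoothing over the 64 possible codons
    return {c: log((n + 1) / (total + 64)) for c, n in counts.items()}

def score(gene, table):
    """Average codon log-likelihood of a gene under the trained table."""
    cs = codons(gene)
    default = log(1 / 64)  # fallback for codons never seen in training
    return sum(table.get(c, default) for c in cs) / len(cs)

# Hypothetical training data: real genes vs. the same set with one gene
# accidentally entered backwards (as happens with antisense misannotation).
real_genes = ["ATGGCTGCTGGTGCTGAAGCTGGTTAA", "ATGGGTGCTGCTGAAGGTGCTGCTTAA"]
contaminated = real_genes[:1] + [revcomp(real_genes[1])]

clean_table = train(real_genes)
dirty_table = train(contaminated)

test_gene = "ATGGCTGAAGCTGCTGGTGCTGCTTAA"
print("score with clean training set:", round(score(test_gene, clean_table), 3))
print("score with contaminated set:  ", round(score(test_gene, dirty_table), 3))
```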

The annotation accuracy problem is getting worse by the year (see graph above). Devos and Valencia estimated in 2001 that misannotation levels could be as high as 37%. More recently, Schnoes et al. (2009) concluded that "function prediction error (i.e., misannotation) is a serious problem in all but the manually curated database Swiss-Prot," and yet Artamonova et al. (2005) found that for five types of annotation entries, even the vaunted UniProt/SwissProt database had an error rate between 33% and 43%. So even the best manually curated database is full of errors.

I've spent many hours examining bacterial genomes and it's been my experience that annotation quality is uniformly poor for all but the best-known genes in the best-curated genomes. By "best-known genes," I mean things like ribosomal protein genes, genes for well-known polymerases and chaperones, well-studied metabolic-pathway genes, and so on. The vast majority of genes encode proteins that have not been studied (and may never be studied) experimentally. All of these are suspect and need to be treated with caution, particularly in high-GC genomes where programs like Glimmer frequently can't distinguish between sense and antisense strands.

As an example: Pseudomonas aeruginosa MPAO1 has a gene, O1Q_25367, encoding a phenol hydroxylase, which is an enzyme required for the catabolic breakdown of phenols (the kind of thing many Pseudomonas species are good at). If you run a BLAST search of this gene against the UniProt.org database, you'll find dozens of good-quality hits against phenol hydroxylases of many organisms (including Rhizobium sp. Pop5, Natronolimnobius innermongolicus JCM 12255, Rhodococcus wratislaviensis IFP 2016, and quite a few others). "Good-quality" here means more than 50% amino acid sequence identity, with E-values of ~10⁻⁵⁰. But there's a problem. If you take the DNA sequence for the Pseudomonas phenol hydroxylase and translate its reverse-complement strand using the online tool at http://web.expasy.org/translate, the resulting protein sequence is a 99% or better match (E=0) for over a hundred Pseudomonas maltose/mannitol transporters. (It's a 100% match in 85 cases.) In other words, the so-called phenol hydroxylase gene is not a phenol hydroxylase gene at all. It's a sugar transporter, backwards. You can do the same trick with the "phenol hydroxylase" of Gordonia terrae C6. Translate its reverse complement and it's a much better sugar permease (dozens of E=0 matches) than it is a phenol hydroxylase.
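If you'd rather not paste sequences into a web form, the same check takes a few lines of Python with Biopython. This is a minimal sketch; the sequence below is a short placeholder, not the actual O1Q_25367 gene, which you would pull from GenBank or UniProt yourself.

```python
# Translate an annotated CDS on both strands, so each translation can be
# BLASTed separately. Requires Biopython.
from Bio.Seq import Seq

cds = Seq("ATGGCCAAAGGCCTGACCGAAGAACTGCGTAAAGCCTGA")  # placeholder CDS, not the real gene

forward_protein = cds.translate(to_stop=True)                          # the annotated reading
antisense_protein = cds.reverse_complement().translate(to_stop=True)   # the opposite strand

print("annotated strand :", forward_protein)
print("antisense strand :", antisense_protein)
# BLAST each translation against UniProt (or NCBI nr). If the antisense
# translation gives far better hits (e.g., E = 0 matches to sugar transporters),
# the annotation is probably on the wrong strand.
```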

This view of a segment of the P. aeruginosa and Streptomyces genomes shows a region of 60% homology (pink band) between the two genomes, spanning two genes. The yellow gene in each case is a "phenol hydroxylase."

How does a program like Glimmer not catch things like this? In this case, Glimmer became confused by an apparent overlapping-gene situation (above). The program found two open reading frames, in the same part of the chromosome, but on opposite strands. Glimmer 3 is supposedly much better at resolving overlaps, but Glimmer 2 (which was used to annotate around half the genomes currently available in public databases) relied on unbelievably crude heuristics to resolve overlap problems. In the above case (perhaps through operator error; these things are configurable, to a degree) Glimmer simply designated each ORF as a legitimate gene, when in fact a BLAST check leaves little doubt the phenol hydroxylase gene is (in reality) a backwards maltose/mannose transporter gene.
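For a sense of how often this kind of ambiguity comes up, here is a rough sketch (not Glimmer's actual heuristic) that enumerates ATG-to-stop ORFs on both strands of a sequence and reports pairs that overlap on opposite strands. These are exactly the calls an annotator, or a BLAST of both translations, has to arbitrate.

```python
# Enumerate ORFs on both strands and flag overlapping antisense pairs.
STOP = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def orfs(seq, min_codons=30):
    """Return (start, end) of ATG-to-stop ORFs on the given strand (0-based, end exclusive)."""
    found = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i+3] == "ATG":
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j+3] not in STOP:
                    j += 3
                if j + 3 <= len(seq) and (j + 3 - i) // 3 >= min_codons:
                    found.append((i, j + 3))
                i = j + 3
            else:
                i += 3
    return found

def overlapping_antisense_pairs(seq, min_codons=30):
    """Pairs of forward/reverse ORFs that overlap on the chromosome."""
    fwd = orfs(seq, min_codons)
    n = len(seq)
    # Map reverse-strand ORF coordinates back onto the forward strand.
    rev = [(n - e, n - s) for (s, e) in orfs(revcomp(seq), min_codons)]
    return [(f, r) for f in fwd for r in rev if f[0] < r[1] and r[0] < f[1]]

# Usage (with a genome or contig loaded as a plain string of A/C/G/T):
# for fwd_orf, rev_orf in overlapping_antisense_pairs(genome, min_codons=100):
#     print("possible strand ambiguity:", fwd_orf, "vs", rev_orf)
```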

My experience has been that Glimmer is very easily confused by high-GC genome data, with organisms like Pseudomonas (typically 63% to 67% GC) showing a greater number of reverse-annotated genes (and no-BLAST-match "hypothetical proteins") than low-GC organisms like Buchnera, although it should be noted that even E. coli strains tend to contain many suspect annotations. The problem is partly due to the fact that high-GC DNA contains relatively few stop codons in any of the six reading frames (all three stop codons, TAA, TAG, and TGA, are AT-rich), whereas low-GC DNA is rich in stop codons when read the wrong way. Also a problem for Glimmer is the fact that in high-GC genomes, codons tend to contain a purine in the first base position whether they're read forward or in reverse.
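The effect is easy to quantify with a crude null model. The sketch below uses random DNA with independent bases (not real genomic sequence) and counts stop codons across all six reading frames at three GC levels; the GC fractions are rough stand-ins for Buchnera-like, average, and Pseudomonas-like composition.

```python
# Why high-GC DNA is a minefield for ORF callers: all three stop codons are
# AT-rich, so random high-GC sequence has far fewer of them in any frame, and
# spurious long "ORFs" appear on both strands.
import random

STOP = {"TAA", "TAG", "TGA"}

def random_dna(length, gc):
    """Random sequence with the given GC fraction (a crude null model)."""
    weights = [(1 - gc) / 2, (1 - gc) / 2, gc / 2, gc / 2]  # A, T, G, C
    return "".join(random.choices("ATGC", weights=weights, k=length))

def stop_codons_per_1000(seq):
    """Stop codons per 1000 codons, averaged over all six reading frames."""
    comp = str.maketrans("ATGC", "TACG")
    total_stops = 0
    total_codons = 0
    for strand in (seq, seq.translate(comp)[::-1]):  # forward and reverse complement
        for frame in range(3):
            codons = [strand[i:i+3] for i in range(frame, len(strand) - 2, 3)]
            total_stops += sum(1 for c in codons if c in STOP)
            total_codons += len(codons)
    return 1000 * total_stops / total_codons

random.seed(1)
for gc in (0.30, 0.50, 0.67):   # roughly Buchnera-like, average, Pseudomonas-like
    rate = stop_codons_per_1000(random_dna(300_000, gc))
    print(f"GC = {gc:.0%}: ~{rate:.0f} stop codons per 1000 codons")
```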

I've written before about the fact that codons tend to occur with frequencies roughly equal to those of their reverse-complement twins. This is another confounding factor for Glimmer (codon statistics look much the same whether the strand is read in the forward direction or the reverse direction), although honestly, I'm beginning to wonder if the codon/anticodon symmetries I've been seeing aren't simply due to widespread reverse annotation of genes (misannotation of a gene's anticoding strand, as with "phenol hydroxylase").
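Checking this on any annotated genome is straightforward: tally codon usage over the annotated coding sequences and compare each codon's count with that of its reverse-complement twin. A minimal sketch follows; it assumes you've already loaded the in-frame CDS strings into a list called cds_list, and how you extract them from GenBank or FASTA is up to you.

```python
# Compare each codon's count with that of its reverse-complement "twin".
from collections import Counter

def revcomp(codon):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(codon))

def codon_symmetry(cds_list):
    """Return (codon, count, twin codon, twin count) rows, most common first."""
    counts = Counter()
    for cds in cds_list:
        counts.update(cds[i:i+3] for i in range(0, len(cds) - 2, 3))
    rows = []
    for codon, n in counts.most_common():
        twin = revcomp(codon)
        rows.append((codon, n, twin, counts.get(twin, 0)))
    return rows

# for codon, n, twin, n_twin in codon_symmetry(cds_list):
#     print(f"{codon} {n:8d}   {twin} {n_twin:8d}")
```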

One might naively suppose that it shouldn't be hard to discriminate the "sense" strand accurately, given that so many genes are preceded by a Shine-Dalgarno sequence just upstream of the start codon. But in reality, it turns out that not very many organisms make extensive use of SD sequences. And short motifs like GAGG, which occur at a high rate purely by chance, are easily mistaken for a Shine-Dalgarno sequence, which only aggravates the false-positive rate.
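Some rough arithmetic shows why a lone GAGG upstream of a candidate start codon is weak evidence. Under a toy model of independent bases (not a real promoter model), the expected number of chance GAGG hits in a typical upstream window is already appreciable:

```python
# Expected chance occurrences of an SD-like 4-mer upstream of a start codon,
# under an i.i.d. base model.
def p_motif(motif, gc):
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in motif:
        prob *= p[base]
    return prob

window = 20  # nt examined upstream of the putative start codon
for gc in (0.50, 0.65):
    per_site = p_motif("GAGG", gc)
    expected = per_site * (window - 3)  # number of possible 4-mer offsets in the window
    print(f"GC = {gc:.0%}: expect ~{expected:.2f} chance GAGG hits per {window} nt window")
```

Allow a mismatch or two, or accept related motifs like AGGA and GGAG, and the chance-hit rate climbs further still.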

Erroneous gene annotations are rampant in bacteria, but some authors have suggested that the problem is far worse in eukaryotic genomes. If that's true, we're in trouble.

All of this puts bioinformatics research at a crisis point. Gene-discovery algorithms are good (maybe a bit too good) at finding open reading frames but poor at identifying and assigning gene functions, in part because we lack the in-depth understanding of protein folding needed to predict 3-dimensional structures (and active-site conformations) programmatically. That capability is sorely needed if we're to progress out of the gene-annotation Stone Age we're now living in. (Protein folding is a Hard Problem requiring supercomputers to untangle.) Maybe in ten years (or twenty?), we'll be able to predict 3D protein structures computationally, and on that basis make better ab initio predictions of protein function. Right now, we have to make do with relatively crude Markov-model pattern-recognition software, aided by human intervention, to come up with even a minimally reliable genome annotation. But we already have the means of doing much better crosschecks. Some of that can be automated. We just need to have the will to do it.
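One example of a cheap, automatable crosscheck (a sketch under my own assumptions, not an established pipeline): for every annotated CDS, ask whether the antisense frame is also wide open. A gene whose reverse complement runs stop-free for its entire length isn't necessarily misannotated, but it has earned a BLAST of both translations. The input file name here is hypothetical; any FASTA of annotated coding sequences will do.

```python
# Flag CDS annotations whose antisense frame contains no stop codons.
STOP = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    return seq[::-1].translate(str.maketrans("ATGC", "TACG"))

def internal_stops(cds):
    """Count in-frame stop codons, excluding the final codon."""
    return sum(1 for i in range(0, len(cds) - 3, 3) if cds[i:i+3] in STOP)

def read_fasta(path):
    """Minimal FASTA reader yielding (header, sequence) pairs."""
    name, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name:
                    yield name, "".join(seq)
                name, seq = line[1:], []
            elif line:
                seq.append(line.upper())
        if name:
            yield name, "".join(seq)

for name, cds in read_fasta("annotated_cds.fna"):   # hypothetical input file
    if internal_stops(revcomp(cds)) == 0:
        print("antisense frame wide open, worth a BLAST crosscheck:", name)
```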