Gene function prediction for complete genomes | Computational Biology and Data Mining

We used GeneQuiz for the analysis of the Haemophilus influenzae genome [1], the first genome of a free-living organism ever fully sequenced.

Within the framework of the EU yeast genome project, we sequenced and analysed a 130 kb fragment of yeast chromosome XV [2]. We found 59 non-overlapping open reading frames (ORFs), 3 tRNA genes, 4 delta elements and one Ty element. Of the ORFs, 36% were already known at the time of the analysis, including a nucleoporin, a ras protein, RNA polymerase III and elongation factor 2. We found homologues in the database for 53% of the ORFs. The main findings were a homologue of the human OCRL gene, an ADP ribosylation factor, a protein with three C2 domains, and a homologue of a B. subtilis cell cycle protein.

We used GeneQuiz for the analysis of the Methanococcus jannaschii genome, the first completely sequenced archaeal genome [3]. Comparison with two previous manual analyses (Bult et al., Science 1996; Kyrpides et al., Microbial and Comparative Genomics, 1996) showed that GeneQuiz, although an automatic system, gave high-quality results. Of a total of 1682 chromosomal ORFs, none of the three groups was able to assign a function in 848 cases, and the same function was assigned in 622 cases. In 23 cases GeneQuiz found new functions.
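The kind of tally behind these counts can be sketched as follows. This is only an illustration, not the procedure used in [3]: the function names, the ORF identifiers and the rule that "same" means all three analyses gave an identical label are assumptions made here.

```python
from collections import Counter

def classify_orf(assignments):
    """Classify one ORF from the function assigned by each analysis.

    assignments: tuple of strings, with None where no function was assigned.
    Assumption: "same" requires every analysis to give the identical label.
    """
    assigned = [a for a in assignments if a is not None]
    if not assigned:
        return "unassigned"
    if len(assigned) == len(assignments) and len(set(assigned)) == 1:
        return "same"
    return "differ"

def tally_agreement(annotations):
    """Count ORFs per category; annotations maps ORF id -> tuple of labels."""
    return Counter(classify_orf(a) for a in annotations.values())

# Hypothetical toy data: three analyses, four ORFs.
toy_annotations = {
    "ORF_1": (None, None, None),
    "ORF_2": ("kinase", "kinase", "kinase"),
    "ORF_3": ("kinase", None, "kinase"),
    "ORF_4": ("kinase", "ligase", "kinase"),
}
toy_counts = tally_agreement(toy_annotations)
```

Applied to the reported numbers, 848 of 1682 ORFs (about 50%) had no function assigned by any group and 622 (about 37%) received the same function, which leaves 212 ORFs in neither of these two categories.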

Large-scale genome projects produce a large set of biochemically uncharacterized hypothetical ORFs (not necessarily expressed as proteins). In the case of the yeast genome, a cut-off of 100 amino acids was set for reporting ORFs in the database. As a result, many interesting smaller sequences could be lost. To check this possibility, we analysed short ORFs from the yeast genome using a statistical method based on biological sequence properties, followed by a GeneQuiz analysis [4]. We report 10 new sequences, among them an alcohol dehydrogenase, a ribosomal L36 protein, and a likely cell-cycle related protein.
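To illustrate how a length cut-off hides short ORFs, the minimal sketch below scans the forward strand of a DNA string for ATG-to-stop ORFs and splits them at a 100-codon threshold. It is not the statistical method of [4]; the function names and the toy sequence are hypothetical, and the reverse strand is ignored for brevity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna):
    """Return (start, end, n_codons) for ATG..stop ORFs on the forward strand.

    start and end are 0-based, end-exclusive, and include the stop codon;
    n_codons counts the coding codons and excludes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, (i - start) // 3))
                start = None  # look for the next ORF in this frame
    return orfs

def split_by_cutoff(orfs, cutoff=100):
    """Separate ORFs that meet the length cut-off from those that do not."""
    long_orfs = [o for o in orfs if o[2] >= cutoff]
    short_orfs = [o for o in orfs if o[2] < cutoff]
    return long_orfs, short_orfs

# Hypothetical toy sequence: a 5-codon ORF followed by a 121-codon ORF.
toy_dna = "ATG" + "GCT" * 4 + "TAA" + "ATG" + "GCT" * 120 + "TGA"
reported, dropped = split_by_cutoff(find_orfs(toy_dna))
```

With a cut-off of 100, only the longer toy ORF would be reported; the shorter one ends up in `dropped`, which is the set that a follow-up analysis like the one described above would re-examine.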

Following the sequencing of the complete yeast genome, we scanned it for homologues of Human Disease Related Proteins (HDRPs) [5]. The yeast homologues can serve as good experimental models for the study of the corresponding diseases. To identify the most closely related homologues, we checked every human-yeast pair for human paralogues closer to the yeast sequence. The automatic GeneQuiz analysis was run for more than 600 HDRPs. We manually checked the human-yeast pair relationships, concentrating on cases where the yeast protein was of uncharacterized function, and especially on those where the relationship was very distant.
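The paralogue check described above can be sketched as follows: a human-yeast pair is flagged when some other human protein scores higher against the same yeast sequence. This is a simplified illustration under stated assumptions, not the criteria used in [5]; the similarity scores, identifiers and function names are invented, and a higher score is assumed to mean a closer relationship.

```python
def closest_human_partner(pairs):
    """Map each yeast protein to its best-scoring human protein.

    pairs: iterable of (human_id, yeast_id, score); higher score = more similar.
    """
    best = {}
    for human_id, yeast_id, score in pairs:
        if yeast_id not in best or score > best[yeast_id][1]:
            best[yeast_id] = (human_id, score)
    return best

def flag_closer_paralogues(disease_pairs, all_pairs):
    """For each HDRP-yeast pair, report a closer human paralogue, if any."""
    best = closest_human_partner(all_pairs)
    report = []
    for human_id, yeast_id, score in disease_pairs:
        # Fall back to the pair itself if the yeast protein has no other hits.
        top_human, top_score = best.get(yeast_id, (human_id, score))
        closer = top_human if top_score > score else None
        report.append((human_id, yeast_id, closer))
    return report

# Hypothetical scores for two disease proteins and two other human proteins.
all_hits = [
    ("HDRP_A", "YEAST_1", 200.0),
    ("HUMAN_X", "YEAST_1", 350.0),
    ("HDRP_B", "YEAST_2", 500.0),
    ("HUMAN_Y", "YEAST_2", 120.0),
]
disease_hits = [("HDRP_A", "YEAST_1", 200.0), ("HDRP_B", "YEAST_2", 500.0)]
paralogue_report = flag_closer_paralogues(disease_hits, all_hits)
```

In this toy case the first pair is flagged because another human protein is closer to the yeast sequence, while the second pair is the closest available match and is left unflagged.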

Protein function has a hierarchical organization. It can be defined for an individual protein, for a subfamily, or for a family. Proteins of different protein families can be further grouped into functional classes and superclasses (e.g., information, energy or communication related functions). We have used the level of functional classes for genome comparison, using the first full genome available from each kingdom, i.e., H. influenzae for bacteria, M. jannaschii for archaea, and S. cerevisiae for eukarya [6]. We have discussed the distribution of proteins belonging to each functional class in these organisms and the presence of homologues in other kingdoms, focusing on three aspects: (i) H. influenzae compared to eukarya, (ii) S. cerevisiae compared to bacteria, and (iii) M. jannaschii compared to bacteria and eukarya.

We re-annotated the complete genome of Mycoplasma pneumoniae four years after its complete sequencing and first analysis [7]. This update shows how new information in the databases can substantially increase what is known about a given genome. We added new open reading frames, changed the length of others, and added or modified annotations.

We have assessed protein function annotation by GeneQuiz through the analysis of 31 available genome sequences [8]. Function could be predicted for almost two-thirds of the 73,500 genes analysed. The variations in function prediction across species and over time are discussed. In a more detailed analysis of GeneQuiz performance, we compared the GeneQuiz annotation of the complete genome of Chlamydia trachomatis (serovar D) with the annotation published by the group that sequenced the organism, and with our own manual annotation [9]. We identified possible sources of error and suggested guidelines for avoiding them.

The genome of Mycobacterium leprae contains a large number of pseudogenes (more than 1000, compared with its 1600 coding genes), likely due to its recent evolution as a pathogenic species. We observed that these pseudogenes have a weak tendency to be located in the last half of operons. Further analyses of more than 600 prokaryotic genomes confirmed that this trend is general. Using studies of essential genes, we observed that essential genes show the complementary trend: they tend to lie in the first half of operons. Taking pseudogenes and essential genes as markers of genes of low and high functional importance, respectively, we interpret these findings as suggesting a significant tendency for the genes in operons to be arranged in decreasing order of importance [10].
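The positional tally described above can be sketched as follows: for each operon, given as an ordered list of genes, every marked gene (for example a pseudogene or an essential gene) is counted as lying in the first or the last half. This is an illustrative sketch, not the analysis of [10]; the function names and gene identifiers are hypothetical, and skipping the middle gene of odd-length operons is an assumption made here.

```python
from collections import Counter

def half_of_operon(position, operon_length):
    """Return 'first' or 'last' for a 0-based position, None for the exact middle."""
    midpoint = (operon_length - 1) / 2
    if position < midpoint:
        return "first"
    if position > midpoint:
        return "last"
    return None  # middle gene of an odd-length operon belongs to neither half

def tally_halves(operons, marked_genes):
    """Count marked genes in the first and last halves of multi-gene operons."""
    counts = Counter()
    for genes in operons:
        if len(genes) < 2:
            continue  # single-gene units have no halves
        for position, gene in enumerate(genes):
            if gene in marked_genes:
                half = half_of_operon(position, len(genes))
                if half is not None:
                    counts[half] += 1
    return counts

# Hypothetical operons (in transcription order) and marked genes.
toy_operons = [["a", "b", "c", "d"], ["e", "f", "g"], ["h"]]
toy_marked = {"a", "d", "f", "g", "h"}
toy_tally = tally_halves(toy_operons, toy_marked)
```

A real analysis would compare such counts against an expectation (for instance, equal numbers in both halves) with a statistical test; that step is omitted here.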

A large fraction of bacterial genomes (40%) contain CRISPR-Cas systems, which they acquire to prevent viral infections and which use small DNA fragments (spacers) to target specific phages. We analysed the spacers in tens of thousands of genomes from six bacterial species and increased the fraction of spacers assigned to a target virus from 10-20% to 85% for some of these species [11]. We also observed that particular sets of membrane proteins appear in genomes with associated CRISPR-Cas types. These results suggest that bacteria can acquire membrane proteins that give them an evolutionary advantage, but if phages use these proteins as receptors, the bacteria are forced to acquire the corresponding CRISPR-Cas system to fight the phage.

References

[1] Casari, G., M.A. Andrade, P. Bork, J. Boyle, A. Daruvar, C. Ouzounis, R. Schneider, J. Tamames, A. Valencia and C. Sander. 1995. Challenging times for bioinformatics. Nature. 376, 647-648.

[2] Voss, H., V. Benes, M.A. Andrade, A. Valencia, S. Rechmann, C. Teodoru, C. Schwager, V. Paces, C. Sander and W. Ansorge. 1997. DNA sequencing and analysis of 130 kilobases from yeast chromosome XV. Yeast. 13, 655-672.

[3] Andrade, M.A., G. Casari, A. Daruvar, C. Sander, R. Schneider, J. Tamames, A. Valencia and C. Ouzounis. 1997. Sequence analysis of the Methanococcus jannaschii genome and the prediction of protein function. CABIOS. 13, 481-483.

[4] Andrade, M.A., A. Daruvar, G. Casari, R. Schneider, M. Termier and C. Sander. 1997. Characterization of new proteins found by analysis of short open reading frames from the full yeast genome. Yeast. 13, 1363-1374.

[5] Andrade, M.A., C. Sander and A. Valencia. 1998. Updated catalogue of homologues to human-disease related proteins in the yeast genome. FEBS Letters. 426, 7-16.

[6] Andrade, M.A., C. Ouzounis, C. Sander, J. Tamames and A. Valencia. 1999. Functional classes in the three domains of life. J. Mol. Evol. 49, 551-557.

[7] Dandekar, T., M. Huynen, J.T. Regula, C.U. Zimmermann, B. Ueberle, M.A. Andrade, T. Doerks, L. Sánchez-Pulido, B. Snel, M. Suyama, Y.P. Yuan, R. Herrmann and P. Bork. 2000. Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames. Nucleic Acids Research. 28, 3278-3288.

[8] Iliopoulos, I., S. Tsoka, M.A. Andrade, P. Janssen, B. Audit, A. Tramontano, A. Valencia, C. Leroy, C. Sander and C.A. Ouzounis. 2001. Genome sequences and great expectations. Genome Biology. 2, INTERACTIONS0001.

[9] Iliopoulos, I., S. Tsoka, M.A. Andrade, A.J. Enright, M. Carroll, P. Poullet, V. Promponas, T. Liakopoulos, G. Palaios, C. Pasquier, S. Hamodrakas, J. Tamames, A.T. Yagnik, A. Tramontano, D. Devos, C. Blaschke, A. Valencia, D. Brett, D. Martin, C. Leroy, I. Rigoutsos, C. Sander and C.A. Ouzounis. 2003. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics. 19, 717-726.

[10] Muro, E.M., N. Mah, G. Moreno-Hagelsieb and M.A. Andrade-Navarro. 2010. The pseudogenes of Mycobacterium leprae reveal the functional relevance of gene order within operons. Nucleic Acids Research. 39, 1732-1738.

[11] Rubio, A., M. Sprang, A. Garzón, A. Moreno-Rodriguez, M.E. Pachón-Ibáñez, J. Pachón, M.A. Andrade-Navarro and A.J. Pérez-Pulido. 2023. Analysis of bacterial pangenomes reduces CRISPR dark matter and reveals strong association between membranome and CRISPR-Cas systems. Science Advances. 9, eadd8911.