Protein classification | Computational Biology and Data Mining

Classification of proteins of a family

We have used an improved Kohonen self-organising map (SOM) to classify the aligned protein sequences of a family [1]. The algorithm tracks the convergence of the map weights and decreases the training neighbourhood when convergence stops. Classifications at different resolution levels give information about the hierarchy of the family. Trees constructed from the set of classifications are not structurally different from those obtained with phylogenetic tree algorithms (e.g., CLUSTAL) but display a more homogeneous branching scheme. Analysing the residues conserved at different levels using the SOM vectors gives insight into the determinant residues of the family.
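The convergence-driven shrinking of the neighbourhood can be illustrated with a minimal sketch. This is not the published implementation: the one-dimensional map, the one-hot encoding, the function names (`one_hot`, `best_unit`, `train_som`) and all parameter values are assumptions chosen for illustration.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol

def one_hot(aligned_seq):
    """Encode an aligned sequence as a flat 0/1 vector, one block per column."""
    vec = []
    for residue in aligned_seq:
        vec.extend(1.0 if residue == a else 0.0 for a in ALPHABET)
    return vec

def best_unit(weights, vec):
    """Index of the map unit whose weight vector is closest to vec."""
    return min(range(len(weights)),
               key=lambda u: sum((w - x) ** 2 for w, x in zip(weights[u], vec)))

def train_som(vectors, n_units=4, radius=2, rate=0.3, tol=1e-4,
              max_epochs=500, seed=0):
    """Train a 1-D map; shrink the neighbourhood each time the weights stop moving."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(max_epochs):
        previous = [w[:] for w in weights]
        for vec in vectors:
            bmu = best_unit(weights, vec)
            lo, hi = max(0, bmu - radius), min(n_units - 1, bmu + radius)
            # Move the winning unit and its neighbours towards the input.
            for u in range(lo, hi + 1):
                weights[u] = [w + rate * (x - w) for w, x in zip(weights[u], vec)]
        # Largest change of any weight between the end of two consecutive epochs.
        shift = max(abs(a - b)
                    for old, new in zip(previous, weights)
                    for a, b in zip(old, new))
        if shift < tol:          # convergence has stopped at this radius
            if radius == 0:
                break
            radius -= 1          # continue training with a smaller neighbourhood
    return weights

# Toy aligned family (hypothetical sequences).
family = ["ACDEF", "ACDEY", "ACDEF", "WWKLM", "WWKLN"]
weights = train_som([one_hot(s) for s in family])
labels = [best_unit(weights, one_hot(s)) for s in family]
```

Training maps with different numbers of units (`n_units`) would give classifications at different resolution levels, and the trained weight vectors can be inspected column by column to see which residues each unit has learned.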

Clustering of protein datasets

We designed a strategy for reducing the computational time of clustering a large set of protein sequences containing several families [2]. The method uses a sequence similarity search and applies several similarity thresholds to build the protein clusters.
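As an illustration of building clusters at several similarity thresholds, the sketch below applies plain single-linkage clustering with a union-find structure to precomputed pairwise scores. Single linkage is a stand-in here, not the published algorithm, and the function name `cluster_at`, the score values and the thresholds are hypothetical.

```python
def cluster_at(threshold, n_items, scores):
    """Single-linkage clusters: link i and j whenever their score reaches the threshold."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in scores.items():
        if score >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical similarity scores between five sequences (index pairs).
scores = {(0, 1): 0.9, (1, 2): 0.6, (2, 3): 0.3, (3, 4): 0.8}

# Stricter thresholds give more, smaller clusters; looser ones merge them.
levels = {t: cluster_at(t, 5, scores) for t in (0.8, 0.5, 0.2)}
```

Because the similarity search is run once and only the threshold changes, each additional clustering level costs little compared with the search itself.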

In 2007, we presented a study of the increasing redundancy in the new protein sequences deposited in the databases, using a pragmatic clustering method that groups sequences that have similar length. We predicted that the total number of different proteins on Earth (the Earth's proteome) would be on the order of 5 million, of which around 30% were already known, and that most of those sequences would be known within the following five years [3]. We proposed how to use this clustering to identify taxonomic ranges that need further genome sequencing, and to direct experimental studies to proteins from large uncharacterized families. We implemented the method as a web tool (FASTA Herder [4]). In 2017, we updated the study and found that by 2020 few proteins entering the databases would have no known homolog, revising our estimate of the number of different proteins on Earth down to 3.75 million [5].

We developed the clustering methods further to account for fragments. We made available various applications for browsing and querying these clusters for various properties (FastaHerder2 [6]) and implemented a web service that clusters and annotates the results of a BLAST search (CABRA [7]).

We applied sequence clustering to the study of the large family of insect Odorant Receptors (ORs), which has expanded greatly in some insect species [8]. We collected and curated a set of 3,902 sequences from 21 species. Using a machine learning approach on the multiple sequence alignment of the family and three public datasets profiling the response of ORs from three species to odorants, we predicted positions relevant for selectivity to odorants. We observed that OR subfamilies that expanded greatly in social insects are highly conserved at these sites, suggesting that they are finely tuned to very similar odorants. The dataset of sequences and clusters is publicly available ([iOrME]).

Computing orthologs and paralogs

Part of the interpretation of the results of a protein sequence similarity search is the evaluation of the homologs of the query protein in terms of orthologs and paralogs. orthoFind facilitates this analysis by evaluating the homologs according to the distribution of the hits in species, and complementing this with reports on domains and functions [9].
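One simple way to read a hit list by species is sketched below. It is a heuristic assumed for illustration, not the criterion used by orthoFind: the best-scoring hit in each other species is flagged as a candidate ortholog, and hits in the query's own species as candidate paralogs. The function name `label_hits` and the hit data are hypothetical.

```python
from collections import defaultdict

def label_hits(hits, query_species):
    """Label hits given as (protein_id, species, score) by their distribution in species."""
    by_species = defaultdict(list)
    for protein_id, species, score in hits:
        by_species[species].append((score, protein_id))

    labels = {}
    for species, entries in by_species.items():
        entries.sort(reverse=True)  # best score first
        for rank, (_, protein_id) in enumerate(entries):
            if species == query_species:
                labels[protein_id] = "candidate paralog"
            elif rank == 0:
                labels[protein_id] = "candidate ortholog"
            else:
                labels[protein_id] = "additional homolog"
    return labels

# Hypothetical similarity-search hits for a human query protein.
hits = [
    ("h2", "human", 900),
    ("m1", "mouse", 800),
    ("m2", "mouse", 300),
    ("f1", "fly", 400),
]
labels = label_hits(hits, query_species="human")
```

A species with several strong hits would deserve a closer look, since the extra hits may be lineage-specific duplications rather than unrelated family members; domain and function reports help to decide.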

In a different approach, we developed a tool (ProteinPathTracker) that aids the investigation of the evolutionary path of a protein [10]. It is based on the retrieval of homologs (if possible, orthologs) in a series of species at increasing taxonomic distances from a central species of interest. These homologs should inform about the ancestral sequences leading to the sequence of interest. The tool offers a series of evolutionary paths to test, for example, from human as the central species to bacteria. The user can start a step-wise search for orthologs from species to species from a protein in the central species (e.g. human), or, if a sequence is provided, from the closest homolog in the series of species used in the path.
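The step-wise search along a path of species can be sketched as follows. This is a toy illustration under stated assumptions, not the ProteinPathTracker implementation: the position-wise `identity` measure stands in for a real similarity search, and the function names, the species path, the sequences and the `min_identity` cut-off are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions over the shorter of two sequences (toy measure)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def trace_path(start_seq, path, proteomes, min_identity=0.3):
    """Walk the species path; at each step search with the hit found in the previous step."""
    current, trail = start_seq, []
    for species in path:
        candidates = proteomes.get(species, {})
        best = max(candidates.items(),
                   key=lambda item: identity(current, item[1]),
                   default=None)
        if best is None or identity(current, best[1]) < min_identity:
            trail.append((species, None))  # no acceptable homolog; keep the previous query
            continue
        trail.append((species, best[0]))
        current = best[1]                  # the hit becomes the query for the next species
    return trail

# Hypothetical proteomes: species -> {protein_id: sequence}.
proteomes = {
    "mouse": {"m1": "MKTAYIAK", "m2": "GGGGGGGG"},
    "fly": {"f1": "MKTSYLAK"},
    "yeast": {"y1": "PPPPPPPP"},
}
trail = trace_path("MKTAYIAR", ["mouse", "fly", "yeast"], proteomes)
```

Searching from the previous hit rather than from the original sequence keeps each step short in evolutionary terms, which is the point of ordering the species by increasing taxonomic distance.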

Experimental verification of orthologs and paralogs

m6A (N6-methyladenosine) is an mRNA modification that occurs in vertebrates. To study the control and functional relevance of m6A in Drosophila melanogaster, we characterized CG7818 and Ime4, the Drosophila orthologs of the two (paralogous) human components of the RNA methyltransferase complex. CG7818 and Ime4 are also paralogous. A third paralog does not seem to form part of the complex in either human or fly. Spenito was another protein identified as part of the complex by mass spectrometry. YT521-B (one of the two fly paralogs of the YTH family of m6A readers) was identified as a nuclear reader of this modification, whereas the other paralog has a cytoplasmic location [11]. m6A was then found to regulate sex specification and neural development in the fly. In further work, we characterized the protein Zc3h13 (in mouse) or Flacc (the Drosophila ortholog), a structural component that connects spenito to Wtap (in mouse) or Fl(2)d (in Drosophila), respectively [12]. In this way, the RNA binding capability of spenito is connected to the m6A machinery in a manner that is conserved across a wide evolutionary distance. Of note, the zinc finger in Zc3h13 (which gives the protein its name) is absent in the Drosophila ortholog Flacc.

To find further genes responsible for m6A modification of RNA, we performed an siRNA screen of Drosophila homologs of human methyltransferase-like proteins (METTLs) and identified CG9666, the ortholog of METTL5 [13]. Mettl5 is an ancient protein family with orthologs in most eukaryotic species (except fungi), in all currently completely sequenced archaebacteria, and in a few bacteria (likely due to horizontal transfer from archaea). This protein has an N-terminal methyltransferase domain and a C-terminal domain specific to the family. Functional characterization demonstrated that it places m6A on 18S ribosomal RNA while interacting with Drosophila CG12975, the ortholog of human TRMT112.

References

[1] Andrade, M.A., G. Casari, C. Sander and A. Valencia. 1997. Classification of protein families and detection of the determinant residues with an improved self-organizing map. Biol. Cybern. 76, 441-450

[2] Trelles, O., M.A. Andrade, A. Valencia, E.L. Zapata and J.M. Carazo. 1998. Computational space reduction and parallelization of a new clustering approach for large groups of sequences. Bioinformatics. 14, 439-451

[3] Perez-Iratxeta, C., G. Palidwor and M.A. Andrade-Navarro. 2007. Towards completion of the Earth's proteome. EMBO reports 8, 1135-1141

[4] Louis-Jeune, C., M.A. Andrade-Navarro and C. Perez-Iratxeta. 2015. FASTA Herder: a web application to trim protein sequence sets. ScienceOpen Research. 7, 1-4 [FASTA Herder]

[5] Mier, P. and M.A. Andrade-Navarro. 2019. Toward completion of the Earth’s proteome: an update a decade later. Briefings in Bioinformatics. 20, 463-470.

[6] Mier, P. and M.A. Andrade-Navarro. 2016. FastaHerder2: four ways to research protein function and evolution with clustering and clustered databases. J. Comp. Biol. 23, 270-278 [FastaHerder2]

[7] Mier, P. and M.A. Andrade-Navarro. 2016. CABRA: cluster and annotate BLAST results algorithm. BMC Research Notes. 9, 253 [CABRA]

[8] Mier, P., J.F. Fontaine, M. Stoldt, R. Libbrecht, C. Martelli, S. Foitzik and M.A. Andrade-Navarro. 2022. Annotation and analysis of 3902 odorant receptor protein sequences from 21 insect species provides insights into the evolution of odorant receptor gene families in solitary and social insects. Genes. 13, 919. [iOrME]

[9] Mier, P., M.A. Andrade-Navarro and A.J. Pérez-Pulido. 2015. OrthoFind facilitates the discovery of homologous and orthologous proteins. PLoS ONE. 10, e0143906 [orthoFind]

[10] Mier, P., A.J. Pérez-Pulido and M.A. Andrade-Navarro. 2018. Automated selection of homologs to track the evolutionary history of proteins. BMC Bioinformatics. 19, 431. [ProteinPathTracker]

[11] Lence, T., J. Akhtar, M. Bayer, K. Schmid, L. Spindler, C. Hei Ho, N. Kreim, M.A. Andrade-Navarro, B. Poeck, M. Helm and J.Y. Roignant. 2016. m6A modulates neuronal functions and sex determination in Drosophila. Nature. 540, 242-247.

[12] Knuckles, P., T. Lence, I. Haussman, D. Jacob, N. Kreim, S.H. Carl, I. Masiello, T. Hares, R. Villaseñor, D. Hess, M.A. Andrade-Navarro, M. Biggiogera, M. Helm, M. Soller, M. Bühler and J.Y. Roignant. 2018. Zc3h13/Flacc is required for adenosine methylation by bridging the mRNA-binding factor Rbm15/Spenito to the m6A machinery component Wtap/Fl(2)d. Genes and Development. 32, 415-429.

[13] Leismann, J., M. Spagnuolo, L. Wacheul, M. Pradhan, M.A. Vu, M. Musheev, P. Mier, M.A. Andrade-Navarro, M. Graille, C. Niehrs, D.L.J. Lafontaine and J.Y. Roignant. 2020. The 18S ribosomal RNA m6A methyltransferase Mettl5 is required for normal walking behavior in Drosophila. EMBO Reports. 21, e49443.