Association of genes, proteins and chemicals with biological function

The public availability of the draft sequence of the human genome enables new strategies to map molecular functional features of gene products to complex phenotypic descriptions such as those of genetically inherited diseases. We have developed a scoring system for the possible functional relations of human genes to genetically inherited diseases that have been mapped onto chromosomal regions without assignment of a particular gene [1]. Our methodology can be divided into two parts: the association of genes to phenotypic features, and the identification of candidate genes on a chromosomal region by homology. The results of applying this methodology to a set of 455 human diseases, together with additional information, can be accessed through the G2D web server. The method was made fully available, including updates of the databases used and a new set of analyses of 552 human diseases; we illustrated the use of the server to find candidate genes for a monogenic disease and for asthma, a complex disease [2].

We participated in a collaborative search for candidate genes for type 2 diabetes and the related trait obesity, which summarized the results of seven data mining methods (including G2D) [3]. Besides offering a comprehensive set of genes for the study of the genetics of diabetes and obesity, this manuscript describes and compares the seven methods, which should help researchers interested in using them. A similar approach was applied to select 19 genes possibly associated with metabolic syndrome [4], with a focus on protein interactions between the products of the selected genes and on the selection of phenotypic features used to prioritize the genes. The resulting set of genes is predominantly involved in chylomicron processing, transmembrane receptor activity and signal transduction pathways.

In 2007 we published an update of G2D with two new ways of using evidence that points to candidate genes: starting from a gene known to be involved in a similar disease, and starting from two genomic regions known to bear genes involved in the disease [5]. The second method uses protein-protein interactions as described by the STRING server at EMBL-Heidelberg.

We have used G2D to prioritize genes for association with asthma in a Canadian cohort in two suggestive loci shown to be linked with asthma (6q26) and atopy (10q26.3); the best candidates were genotyped, and PTPRE was found to have a protective association with allergic asthma (p = 0.000463; corrected p = 0.0478) [6].

In [7] we review different computational approaches that use database and genomics information to suggest associations of genes to human disease.

The availability of more than 100 complete bacterial genomes in 2003 already allowed the application of several methods that make functional predictions for genes of unknown function based on gene order, fusion events, and phylogenetic profiles (implemented for public usage in the STRING server at EMBL-Heidelberg). We applied these methods to genes predicted to be transcription factors but lacking functional evidence (36 orthologous groups) [8]. We were able to assign a regulatory function to 18 of them. For some of the rest, it is possible to propose groups of related genes, likely constituting yet undiscovered pathways or signaling cascades.
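
As an illustration of the phylogenetic profile idea mentioned above, the following minimal Python sketch ranks genes by how similar their presence/absence patterns across genomes are to that of a query gene. It is only a sketch under stated assumptions: the gene names, species labels, the Jaccard similarity and the `min_similarity` threshold are hypothetical choices for illustration, and STRING uses its own scoring scheme, which is not reproduced here.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of genomes (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_partners(query, profiles, min_similarity=0.5):
    """Rank the other genes by similarity of their phylogenetic profile to `query`.

    `profiles` maps a gene to the set of genomes in which it is present.
    The threshold is an arbitrary illustrative value.
    """
    q = profiles[query]
    scored = [(gene, jaccard(q, p)) for gene, p in profiles.items() if gene != query]
    kept = [(gene, s) for gene, s in scored if s >= min_similarity]
    # Highest similarity first; ties broken alphabetically for reproducibility.
    return sorted(kept, key=lambda gs: (-gs[1], gs[0]))


# Hypothetical toy data: four genes and six genomes.
profiles = {
    "tf_unknown": {"sp1", "sp2", "sp3", "sp5"},
    "transporter": {"sp1", "sp2", "sp3", "sp5"},
    "ribosomal": {"sp1", "sp2", "sp3", "sp4", "sp5", "sp6"},
    "toxin": {"sp4", "sp6"},
}

partners = best_partners("tf_unknown", profiles)
```

In this toy case the uncharacterized transcription factor shares its exact distribution with the transporter gene, which would make the transporter a first candidate for a functionally related gene.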

Genes and proteins are annotated in databases by means of keywords chosen from systems of keywords (such as Gene Ontology, or the SwissProt keywords). Ultimately, these keywords must be derived from the scientific literature associated with the sequence. Two main problems arise when doing such an annotation: the keyword system can change, and the literature associated with an entry can expand. This implies frequent revisions of the annotations, and the derivation of keywords from the literature is not a trivial process. To help with this task, we developed a way to produce mappings between systems of keywords that can be used to suggest keywords from a given system upon selection of a number of scientific references [9]. The system was implemented for public usage as the KAT (Keyword Annotation Tool) web server (hosted at EMBL from 2003 until 2011). The server suggested both SwissProt keywords and Gene Ontology terms using a set of references to MEDLINE as input.
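
A mapping between keyword systems of the kind described above could be sketched as follows in Python. This is a hypothetical illustration, not the KAT implementation: it assumes that each database entry carries keywords from a source system (for example indexing terms of its references) and from a target system (for example GO terms), and it uses simple co-occurrence frequencies as mapping weights. All keyword strings are made up for the example.

```python
from collections import Counter, defaultdict


def build_mapping(entries):
    """Estimate weights P(target keyword | source keyword) from co-annotated entries.

    `entries` is an iterable of (source_keywords, target_keywords) pairs.
    """
    source_counts = Counter()
    pair_counts = defaultdict(Counter)
    for source_kws, target_kws in entries:
        for s in set(source_kws):
            source_counts[s] += 1
            for t in set(target_kws):
                pair_counts[s][t] += 1
    return {
        s: {t: n / source_counts[s] for t, n in targets.items()}
        for s, targets in pair_counts.items()
    }


def suggest(mapping, reference_keywords, top=3):
    """Suggest target keywords for the source keywords found in a set of references."""
    scores = Counter()
    for s in reference_keywords:
        for t, weight in mapping.get(s, {}).items():
            scores[t] += weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top]]


# Hypothetical training entries annotated in both keyword systems.
entries = [
    ({"Kinases", "Phosphorylation"}, {"GO:kinase activity"}),
    ({"Kinases"}, {"GO:kinase activity", "GO:ATP binding"}),
    ({"DNA Repair"}, {"GO:DNA repair"}),
]
mapping = build_mapping(entries)
suggestions = suggest(mapping, ["Kinases", "Phosphorylation"])
```

Because the mapping is derived from the annotated entries themselves, it can be recomputed whenever the keyword system or the literature attached to the entries changes.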

We have applied similar methods to the annotation of Affymetrix DNA microarray probe sets with GO terms [10]. Those GO terms were transferred from entries linked to the probe sets in the NetAffx database distributed by Affymetrix (Sptrembl, Interpro), or inferred from mappings between keyword-systems using links between databases (Sptrembl, MEDLINE). The extracted GO terms are available from the Probe2GO web server.

We developed a method to suggest functions for prokaryotic genes according to their phylogenetic distribution [11]. The method analyzes the text of MEDLINE abstracts associated with the phylogenetic pattern. Words (nouns) over-represented in the set of abstracts are reported as possibly associated with the corresponding genes. The method is illustrated with predictions of known and novel properties. Studies like this are complicated by the lack of resources collecting species traits. To facilitate such analyses, we started Traitpedia, a database to collect species-associated traits using a simple format and encouraging contributions from the research community [12].

We developed a method to prioritize the complete set of genes from a species according to a topic defined by the user (Génie [13]). The topic is defined by a query to MEDLINE, from which 1000 abstracts are used to train a classifier that then scores the complete MEDLINE. Genes are then scored according to their associations to scored MEDLINE records. Optionally, one can score genes of an organism (e.g. zebrafish) according to their orthologs in other species (e.g. human and mouse). This is advantageous when the organism is used as a model (e.g. zebrafish to model human heart development) and there is no bibliography attached to those genes in the topic of interest. Using a similar method we prioritize chemicals according to user-defined biomedical topics (Alkemio [14]).
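
The three steps described above (train on topic abstracts, score all abstracts, score genes through their linked abstracts) can be sketched in Python as follows. This is a toy illustration under explicit assumptions: a naive Bayes-style log-odds word scorer stands in for the classifier, a gene is scored by counting its positively scored abstracts, and the abstracts, record identifiers and gene names are invented. The text above does not specify these details, so none of them should be read as the actual Génie procedure.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train(topic_abstracts, background_abstracts):
    """Per-word log-odds (topic vs. background) with add-one smoothing."""
    pos = Counter(w for a in topic_abstracts for w in tokenize(a))
    neg = Counter(w for a in background_abstracts for w in tokenize(a))
    n_pos, n_neg = len(topic_abstracts), len(background_abstracts)
    return {
        w: math.log((pos[w] + 1) / (n_pos + 2)) - math.log((neg[w] + 1) / (n_neg + 2))
        for w in set(pos) | set(neg)
    }


def score_abstract(weights, text):
    """Sum the weights of the known words in an abstract (unknown words count 0)."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


def rank_genes(weights, abstracts, gene_to_records):
    """Score each gene by its number of linked abstracts with a positive score."""
    scores = {rid: score_abstract(weights, text) for rid, text in abstracts.items()}
    gene_scores = {
        gene: sum(1 for rid in rids if scores.get(rid, 0.0) > 0)
        for gene, rids in gene_to_records.items()
    }
    return sorted(gene_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical training data for the topic "heart development".
topic = ["heart development in zebrafish embryo", "cardiac valve heart formation"]
background = ["liver metabolism of glucose", "neuron signaling in brain"]
weights = train(topic, background)

# Hypothetical literature records and gene-to-record links.
abstracts = {
    1: "gene a regulates heart valve formation",
    2: "gene b controls liver glucose metabolism",
    3: "cardiac development requires gene a",
}
gene_to_records = {"geneA": [1, 3], "geneB": [2]}
ranking = rank_genes(weights, abstracts, gene_to_records)
```

Scoring a gene through its orthologs, as mentioned above, would amount to adding the records linked to the orthologous genes to that gene's list before ranking.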

The reverse approach, finding the topics overrepresented in sets of genes, has already been well exploited for function (GO terms) or pathways, but not so much for diseases. However, this is a typical question in the interpretation of experimental results: for example, given a set of differentially expressed genes, are some of these genes associated with a particular human disease, such as an immune disease? The annotations of PubMed records by NCBI already allow a very complete association of diseases to genes following their links to PubMed records. We developed a method and web tool (GeneSet2Diseases) that uses these links to evaluate sets of human genes very fast [15]. Links to the bibliography are provided, which make it easy to understand the source of the associations between genes and diseases that the enrichment analysis finds. The method uses precomputed associations and thus works in seconds, allowing users to try many sets and parameters. We extended this approach to associate sets of lipids with diseases [16]. The method is available via a web tool (LipiDisease), which includes the possibility to compute lipid-set associated disease enrichment from lists of lipids with fold changes comparing concentrations between two conditions. To investigate associations between lipids and cardiovascular disease (CVD) in the biomedical literature in more detail, we collected publications dealing with CVD in human and mouse (as the major model organism) with a focus on physiological research (mentioning plasma, heart or myocardium) and collected the lipids associated with them in PubMed [17]. We highlight the shift of the field in recent years from lipids as markers towards more mechanistic insights into their role in CVD. Using known connections between (human plasma) proteins and the lipids they metabolize, we built a network that we used to identify the proteins most connected to the lipids collected from the bibliography. We conclude that lipid-focused research on proteins such as Prostaglandin G/H synthase 2 (PTGS2, a.k.a. COX2) and Acylglycerol kinase (AGK) would bring insight into the mechanisms of CVD.
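
To make the notion of disease enrichment in a gene set concrete, here is a minimal Python sketch based on a one-sided hypergeometric test over precomputed gene-disease associations. The choice of test is an assumption made for illustration (the text above does not name the statistic used by GeneSet2Diseases), the identifiers and associations are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K are linked to the disease."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def disease_enrichment(gene_set, disease_to_genes, background):
    """Return (disease, shared genes, p-value) tuples sorted by increasing p-value."""
    background = set(background)
    gene_set = set(gene_set) & background
    results = []
    for disease, genes in disease_to_genes.items():
        genes = set(genes) & background
        hits = gene_set & genes
        if hits:
            p = hypergeom_pvalue(len(hits), len(gene_set), len(genes), len(background))
            results.append((disease, sorted(hits), p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical background of 20 genes and two precomputed disease associations.
background = {f"g{i}" for i in range(1, 21)}
disease_to_genes = {
    "disease X": {"g1", "g2", "g3", "g4"},
    "disease Y": {f"g{i}" for i in range(5, 15)},
}
results = disease_enrichment({"g1", "g2", "g3", "g5"}, disease_to_genes, background)
```

Since the associations are plain lookup tables, repeating the analysis for another gene set or another background only requires set intersections and a few binomial coefficients, which is in line with the fast response of a tool working on precomputed data. The same scheme would apply to lipid sets by replacing genes with lipid identifiers.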

We showed that a clinician's annotation of diseases with words describing their phenotypes can be supported by a text mining procedure. We then showed that its application to a small number of neuropsychiatric disorders can be successfully expanded to the automated annotation of a larger set of disorders, finding novel associations of genes and drugs to disorders and a correlation between gene functional class and how specific the association of a gene to the set of disorders is [18].

References

[1] Perez-Iratxeta, C., P. Bork and M.A. Andrade. 2002. Association of genes to genetically inherited diseases using data mining. Nature Genetics. 31, 316-319. [G2D server]

[2] Perez-Iratxeta, C., M. Wjst, P. Bork and M.A. Andrade. 2005. G2D: A Tool for Mining Genes Associated to Disease. BMC Genetics. 6, 45. [G2D server]

[3] Tiffin, N., E. Adie, F. Turner, H.G. Brunner, M.A. van Driel, M. Oti, N. Lopez-Bigas, C. Ouzounis, C. Perez-Iratxeta, M.A. Andrade-Navarro, A. Adeyemo, M.E. Patti, C. Semple and W. Hide. 2006. Computational disease gene identification: a concert of methods prioritises type 2 diabetes and obesity candidate genes. Nucleic Acids Research. 34, 3067-3081.

[4] Tiffin, N., I. Okpechi, C. Perez-Iratxeta, M.A. Andrade-Navarro and R. Ramesar. 2008. Prioritisation of candidate disease genes for metabolic syndrome by computational analysis of its defining phenotypes. Physiological Genomics. 35, 55-64.

[5] Perez-Iratxeta, C., P. Bork and M.A. Andrade-Navarro. 2007. Update of the G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids Research. 35, W212-W216. [G2D server]

[6] Tremblay, K., M. Lemire, C. Potvin, A. Tremblay, G.M. Hunninghake, B.A. Raby, T.J. Hudson, C. Perez-Iratxeta, M.A. Andrade-Navarro and C. Laprise. 2008. Genes to Diseases (G2D) computational method to identify asthma candidate genes. PLoS ONE. 3, e2907.

[7] Tiffin, N., M.A. Andrade-Navarro and C. Perez-Iratxeta. 2009. Linking genes to diseases: it's all in the data. Genome Medicine. 1, 77.

[8] Doerks, T., M.A. Andrade, W. Lathe 3rd, C. von Mering and P. Bork. 2004. Global analysis of bacterial transcription factors to predict cellular target processes. Trends in Genetics. 20, 126-131.

[9] Pérez, A.J., C. Perez-Iratxeta, P. Bork, G. Thode and M.A. Andrade. 2004. Gene annotation from scientific literature using mappings between keyword systems. Bioinformatics. 20, 2084-2091.

[10] Muro, E.M., C. Perez-Iratxeta, and M.A. Andrade-Navarro. 2006. Amplification of the Gene Ontology annotation of Affymetrix probe sets. BMC Bioinformatics. 7, 159. [Probe2GO]

[11] Korbel, J.O., T. Doerks, L.J. Jensen, C. Perez-Iratxeta, S. Kaczanowski, S.D. Hooper, M.A. Andrade and P. Bork. 2005. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biology. 3, e134.

[12] Mier, P. and M.A. Andrade-Navarro. 2019. Traitpedia: a collaborative effort to gather species traits. Bioinformatics. 35, 1079-1081. [Traitpedia]

[13] Fontaine, J.F., F. Priller, A. Barbosa-Silva and M.A. Andrade-Navarro. 2011. Génie: literature-based gene prioritization at multi genomic scale. Nucleic Acids Research. 39, W455-W461. [Génie]

[14] Gijón-Correas, J.A., M.A. Andrade-Navarro and J.F. Fontaine. 2014. Alkemio: association of chemicals with biomedical topics by text and data mining. Nucleic Acids Research. 42, W422-W429. [Alkemio]

[15] Andrade-Navarro, M.A. and J.F. Fontaine. 2016. Gene set to Diseases (GS2D): disease enrichment analysis on human gene sets with literature data. Genomics and Computational Biology. 2, 33. [GeneSet2Diseases]

[16] More, P., L. Bindila, P. Wild, M.A. Andrade-Navarro and J.F. Fontaine. 2021. LipiDisease: associate lipids to diseases using literature mining. Bioinformatics. 6, btab559. [LipiDisease]

[17] Anyaegbunam, U.A., P. More, J.F. Fontaine, V. ten Cate, K. Bauer, U. Distler, E. Araldi, L. Bindila, P. Wild and M.A. Andrade-Navarro. 2023. A systematic review of lipid-focused cardiovascular disease research: trends and opportunities. Curr. Issues Mol. Biol. 45, 9904-9916.

[18] Fontaine, J.F., J. Priller, E. Spruth, C. Perez-Iratxeta and M.A. Andrade-Navarro. 2015. Assessment of curated phenotype mining in neuropsychiatric disorder literature. Methods. 74, 90-96.