Information extraction of biological features from text

We developed a prototype for the automatic annotation of functional characteristics in protein families [1]. The system extracted information from MEDLINE abstracts. Relevant keywords were selected by comparing their frequency in the family under analysis with their frequency in other, unrelated protein families [2]. The system was available as the AbXtract web tool at EMBL from 1998 until 2004.
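
The keyword selection step can be illustrated with a minimal sketch. It assumes that a word's frequency is the fraction of a family's abstracts containing it and that the score is the plain difference between the target family's frequency and the mean frequency in the other families. The function names and the scoring formula are illustrative assumptions, not the published implementation.

```python
from collections import Counter
from statistics import mean


def word_fraction(abstracts):
    """Fraction of abstracts in one family that contain each word."""
    counts = Counter()
    for text in abstracts:
        # Count each word once per abstract.
        counts.update(set(text.lower().split()))
    return {word: n / len(abstracts) for word, n in counts.items()}


def keyword_scores(target, background):
    """Score every word of the target family against background families.

    target: list of abstracts for the family under analysis.
    background: list of families, each a list of abstracts.
    The score (an assumption of this sketch) is the target frequency
    minus the mean frequency across the background families.
    """
    target_freq = word_fraction(target)
    background_freqs = [word_fraction(family) for family in background]
    scores = {}
    for word, freq in target_freq.items():
        others = [bf.get(word, 0.0) for bf in background_freqs]
        scores[word] = freq - mean(others)
    return scores


target = ["kinase phosphorylates the substrate", "the kinase domain binds ATP"]
background = [
    ["the globin binds oxygen", "the heme group"],
    ["the protease cleaves the peptide"],
]
scores = keyword_scores(target, background)
```

In this toy input, a word that appears in every family (such as "the") scores zero, whereas a word confined to the target family scores high.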

We tested the automatic extraction of higher-level protein functional information from abstracts, namely protein-protein interactions [3]. The system looks for sentences containing protein names and verbs indicative of interaction in a given order. Later, we developed LAITOR for this task; it uses dictionaries of genes/proteins and bio-interactions, together with methods for recognizing protein and gene names in text [4]. LAITOR classifies gene/protein co-occurrences in sentences into several levels, depending on the relative position of the detected terms. We applied this approach in combination with MedlineRanker to develop a gene chart for the embryonic preimplantation stage [5]. A graphical implementation of LAITOR is available in PESCADOR [6], which lets users select interactions related to user-defined terms.
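
As a rough illustration of sentence-level co-occurrence classification, the sketch below uses two tiny hypothetical dictionaries and three made-up level names. It is not LAITOR's actual rule set or dictionaries; it only shows how the relative position of names and verbs could determine a level.

```python
import re

# Hypothetical dictionaries, for illustration only.
PROTEINS = {"cdc2", "cyclin", "wee1"}
INTERACTION_VERBS = {"binds", "activates", "phosphorylates", "inhibits"}


def classify_sentence(sentence):
    """Return (name_a, name_b, level) for each pair of distinct names.

    Levels (assumed for this sketch):
      verb_between   - an interaction verb lies between the two names
      verb_elsewhere - the sentence has an interaction verb, but not between them
      names_only     - the names co-occur without any interaction verb
    """
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    name_pos = [i for i, t in enumerate(tokens) if t in PROTEINS]
    verb_pos = [i for i, t in enumerate(tokens) if t in INTERACTION_VERBS]
    results = []
    for k, i in enumerate(name_pos):
        for j in name_pos[k + 1:]:
            if tokens[i] == tokens[j]:
                continue  # skip repeated mentions of the same name
            if any(i < v < j for v in verb_pos):
                level = "verb_between"
            elif verb_pos:
                level = "verb_elsewhere"
            else:
                level = "names_only"
            results.append((tokens[i], tokens[j], level))
    return results


hits = classify_sentence("Wee1 phosphorylates Cdc2 in vitro.")
```

A real system would also need name normalization and synonym handling, which this sketch leaves out.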

I have developed an algorithm for deriving position-specific protein functional annotations [7]. The input is based on the results of a sequence similarity search of a query sequence against a sequence database. Strings of words are extracted from the descriptions of the proteins, and the correlation between proteins having the same descriptors and amino acid conservation is used to compute a score that indicates which descriptor is likely to best describe the function of each particular residue. Immediate applications of this algorithm are support for (automated) methods of protein functional annotation and database coherency checking.
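
The idea of linking descriptors to residue conservation can be sketched as follows. This is a simplified stand-in for the published scoring and rests on several assumptions: the homologs are already aligned to the query without gaps, each homolog carries a set of descriptor strings, and the score is the difference in conservation of the query residue between homologs with and without a descriptor. All names are hypothetical.

```python
def best_descriptor_per_position(query, homologs):
    """Pick, for each query position, the descriptor most tied to conservation.

    query: query sequence (string).
    homologs: list of (aligned_sequence, set_of_descriptors), with each
        aligned_sequence the same length as the query.
    Returns one descriptor (or None) per query position.
    """
    descriptors = sorted(set().union(*(d for _, d in homologs)))
    best = []
    for pos, residue in enumerate(query):
        top_desc, top_score = None, 0.0
        for desc in descriptors:
            with_d = [seq[pos] == residue for seq, d in homologs if desc in d]
            without = [seq[pos] == residue for seq, d in homologs if desc not in d]
            if not with_d or not without:
                continue  # no contrast possible for this descriptor
            # Conservation among homologs with the descriptor minus
            # conservation among homologs without it.
            score = sum(with_d) / len(with_d) - sum(without) / len(without)
            if score > top_score:
                top_desc, top_score = desc, score
        best.append(top_desc)
    return best


homologs = [
    ("HAD", {"kinase", "atp-binding"}),
    ("HAE", {"kinase"}),
    ("QAD", {"atp-binding"}),
    ("QAE", {"transporter"}),
]
annotation = best_descriptor_per_position("HAD", homologs)
```

Positions conserved in all homologs receive no descriptor here, because conservation at those positions does not distinguish between descriptors.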

We wrote an invited review about data mining techniques applied to molecular biology, especially those that extract information from MEDLINE abstracts [8]. We illustrated the possibilities of these techniques with the application of a keyword extraction and abstract selection procedure to a database of human diseases (OMIM, On-line Mendelian Inheritance in Man).

We extracted keywords from full-text scientific articles and analysed the distribution of keywords (by density and subject) [9]. We found that although the abstract of a publication contains a high ratio of keywords to total words, many keywords not present in the abstract can be found in the rest of the paper. We also found that the context of keywords can change, in particular in the Methods section, where, for example, gene names usually relate more to methodologies than to biological phenomena.
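
The two measurements described above, keyword density per section and keywords missing from the abstract, can be sketched as below. The sketch assumes that the article is already split into named sections and that a keyword set is given; the section names, the example text and the whitespace tokenization are illustrative assumptions.

```python
def keyword_density(sections, keywords):
    """Ratio of keyword occurrences to total words, per section."""
    density = {}
    for name, text in sections.items():
        words = text.lower().split()
        hits = sum(1 for w in words if w in keywords)
        density[name] = hits / len(words) if words else 0.0
    return density


def missing_from_abstract(sections, keywords):
    """Keywords found in the body of the paper but not in the abstract."""
    abstract = set(sections.get("abstract", "").lower().split())
    rest = set()
    for name, text in sections.items():
        if name != "abstract":
            rest.update(text.lower().split())
    return (keywords & rest) - abstract


sections = {
    "abstract": "p53 regulates apoptosis",
    "methods": "cells were stained and p53 was detected by western blot",
    "results": "mdm2 binds p53 and blocks apoptosis in these cells",
}
keywords = {"p53", "apoptosis", "mdm2"}
density = keyword_density(sections, keywords)
missing = missing_from_abstract(sections, keywords)
```

In this toy example the abstract has the highest keyword density, yet one keyword appears only in the Results section.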

We explored the use of social media (Twitter) to extract information about side effects of drugs (medicines) [10]. We used crowdsourcing (Amazon Mechanical Turk) to obtain labeled data: human curators evaluated whether or not each sentence described the side effect of a medicine, generating a high-quality dataset that we used for training machine learning algorithms. Using this approach, we found unreported potential side effects for Naproxen (depression, anxiety and sleep disturbances), a drug used to alleviate pain and fever.


[1] Andrade, M.A. and A. Valencia. 1997. Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. ISMB 97. 5, 25-32.

[2] Andrade, M.A. and A. Valencia. 1998. Automatic extraction of keywords from scientific text: Application to the knowledge domain of protein families. Bioinformatics. 14, 600-607.

[3] Blaschke, C., M.A. Andrade, C. Ouzounis and A. Valencia. 1999. Automatic extraction of biological information from scientific text: protein-protein interactions. ISMB 99. 7, 60-67.

[4] Barbosa-Silva, A., T.G. Soldatos, I.L.F. Magalhães, G.A. Pavlopoulos, J.F. Fontaine, M.A. Andrade-Navarro, R. Schneider and J.M. Ortega. 2010. LAITOR – literature assistant for identification of terms co-occurrences and relationships. BMC Bioinformatics. 11, 70.

[5] Donnard, E., A. Barbosa-Silva, R.M. Guedes, G. Fernandes, H. Velloso, M.J. Kohn, M.A. Andrade-Navarro and J.M. Ortega. 2011. Preimplantation development regulatory pathway construction through a text-mining approach. BMC Bioinformatics. 12 Suppl 4, S3.

[6] Barbosa-Silva, A., J.F. Fontaine, E.R. Donnard, F. Stussi, J. Miguel-Ortega and M.A. Andrade-Navarro. 2011. PESCADOR, a web-based tool to assist text-mining of biointeractions extracted from PubMed queries. BMC Bioinformatics. 12, 435.

[7] Andrade, M.A. 1999. Position-specific annotation of protein function based on multiple homologs. ISMB 99. 7, 28-33.

[8] Andrade, M.A. and P. Bork. 2000. Automated extraction of information in molecular biology. FEBS Letters. 476, 12-17.

[9] Shah, P.K., C. Perez-Iratxeta, P. Bork and M.A. Andrade. 2003. Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics. 4, 20.

[10] Burkhardt, S., J. Siekiera, J. Glodde, M.A. Andrade-Navarro and S. Kramer. 2019. Towards identifying drug side effects from social media using active learning and crowd sourcing. Pac. Symp. Biocomput. 25, 319-330.