Analysis of MEDLINE | Computational Biology and Data Mining

XplorMed

The XplorMed server (active 2001-2015) [1] allowed you to explore a set of abstracts derived from a MEDLINE search. The system reported the main associations between the words in groups of abstracts. You could then select a subset of your abstracts based on selected groups of related words and iterate the analysis on them.

We presented the full algorithm, illustrated with usage examples and a benchmark on a set of references associated with biomedical papers, in [2]. A very detailed usage description is given in [3]. New functions added to the server during 2002 and 2003 are described in [4].

XplorMed was recommended for cases in which the user did not know exactly what they were expecting to find. The interests of the user might change with the results obtained, or the user might want to ask new questions as the analysis developed. Also, the results might suggest additional words that should be used to expand the query in MEDLINE (e.g., unexpected abbreviations of a protein name, or synonyms of a disease).

The XplorMed server ran at EMBL (Heidelberg, Germany) from 2001 to 2005 and at OHRI from 2005 to 2015 (last active URL http://xplormed.ogic.ca/). The only input needed was the result of a query in MEDLINE, the query itself, or the database identifier of an entry containing links to MEDLINE.

Analysis of scientific publishing using the MEDLINE database

We have analysed publishing trends by country in the MEDLINE database [5]. The obvious observation is that the number of publications per inhabitant correlates clearly with whether the country belongs to the 1st, 2nd, or 3rd world. A sadder result is that many of the countries at the bottom of the publication list are not even maintaining their already low publishing activity. Scientific funding bodies should try to alleviate this situation, both for the sake of those countries, because it implies that they will remain scientifically and technologically underdeveloped, and because it implies the waste of an intellectual workforce.

We have observed that several grammatical parameters of the abstracts from MEDLINE depend on the mother tongue of the authors of the publication [6]. We illustrate that variation with examples and discuss the consequences for communication between scientists.

We have characterized the topics of research in Bioinformatics and how they relate over time to other major topics of research such as genomics, proteomics, and computational terms [7]. Bioinformatics research and use have been facilitated by the popularization of computers and the internet, and its use now expands more quickly than the use of computation. Using this analysis, we observed that databases and sequence similarity analysis are the most popular Bioinformatics topics across the community of biomedical researchers.

To facilitate these types of analyses we have created a web tool (MLTrends) that graphs term usage in MEDLINE versus time [8]. Terms can be individual words or quoted phrases, which may be combined using Boolean operators to form a query. The number of MEDLINE records per year matching the query in titles and/or abstracts can be plotted for each term. Prior indexing and local storage at the MLTrends site allow query times of less than a second and, therefore, rapid evaluation and comparison of trends in research or language usage in biomedicine.
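The per-year counting described above can be illustrated with a minimal sketch. This is not the MLTrends implementation: the function name `yearly_counts`, the `(year, title, abstract)` record layout, and the toy records are assumptions made for illustration, and the matching is a plain case-insensitive substring test rather than an indexed Boolean query.

```python
from collections import Counter


def yearly_counts(records, term):
    """Count, per year, the records whose title or abstract contains `term`.

    `records` is an iterable of (year, title, abstract) tuples
    (a hypothetical layout, not the MEDLINE record format).
    """
    needle = term.lower()
    counts = Counter()
    for year, title, abstract in records:
        text = f"{title} {abstract}".lower()
        if needle in text:
            counts[year] += 1
    # Return the years in chronological order, ready for plotting.
    return dict(sorted(counts.items()))


# Toy records invented for this example.
records = [
    (2001, "Sequence similarity search", "We describe a database of protein domains."),
    (2001, "Gene expression in yeast", "Microarray analysis of expression."),
    (2002, "A protein database", "Database update for 2002."),
    (2002, "Database indexing", "Fast lookup."),
]

print(yearly_counts(records, "database"))  # {2001: 1, 2002: 2}
```

A real service would precompute an index from terms to records so that each query avoids scanning every abstract, which is presumably what makes sub-second responses possible.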

Selection of references to MEDLINE

We have developed a protocol that extends a set of references to MEDLINE abstracts and orders it by relevance based on content [9]. Given an existing selection of references, the first step consists of retrieving a number of MEDLINE neighbours (by cosine between word vectors, which is the method used at the NCBI's PubMed server). The second step ranks the abstracts by deriving the keywords within them and then analysing the relations between those keywords. The abstracts containing more closely related keywords are scored as more relevant. We tested the system with the SMART database of protein domains, which has references attached to its entries.
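As an illustration of the first step, the following is a minimal sketch of neighbour retrieval by cosine between word vectors. It uses raw word counts with whitespace tokenization; any term weighting, stop-word removal, or stemming used in the actual protocol is not reproduced here. The names `word_vector`, `cosine`, and `neighbours`, and the identifiers `id1` to `id3`, are hypothetical.

```python
import math
from collections import Counter


def word_vector(text):
    """Represent a text as a bag of lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def neighbours(query, corpus, k):
    """Return the ids of the k abstracts most similar to `query`."""
    query_vec = word_vector(query)
    scored = [(cosine(query_vec, word_vector(text)), doc_id)
              for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy corpus invented for this example.
corpus = {
    "id1": "protein domain database",
    "id2": "stem cell differentiation",
    "id3": "protein domain families in the database",
}

print(neighbours("protein domain database", corpus, k=2))  # ['id1', 'id3']
```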

We developed a scoring system to rank all abstracts in MEDLINE according to a training set that can consist of tens of thousands of abstracts [10]. Algorithms such as neural networks or support vector machines are too computationally intensive for this task. We opted for a simpler solution that could be described as text indexing: we measure the frequency of a word in a set of interesting abstracts (the training set) and contrast it with the frequency of the word in the whole of MEDLINE. We can score each abstract by the ratio of frequencies of the words it contains, which allows us to rank the entire MEDLINE. We found that for the subject of stem cells, nouns were better discriminants than adjectives or verbs. We obtained precision and recall above 60% using nouns. We implemented and evaluated an improved version of this method using a linear naïve Bayesian classifier in a web server (MedlineRanker) [11]. This tool takes as input a set of MEDLINE abstracts and, optionally, a background set to compare against, and outputs discriminant words and scored abstracts. We participated with MedlineRanker in the BioCreative III competition to demonstrate that it can be used to find articles dealing with protein interactions [12].
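The frequency-ratio idea can be sketched as follows. This is an illustration under stated assumptions, not the published method: the add-one smoothing, the use of log ratios summed over the distinct words of an abstract, and the names `doc_counts`, `word_weights`, and `score` are all choices made for this example, and no part-of-speech filtering (such as keeping only nouns) is applied.

```python
import math


def doc_counts(abstracts):
    """Count in how many abstracts each word occurs."""
    counts = {}
    for text in abstracts:
        for word in set(text.lower().split()):
            counts[word] = counts.get(word, 0) + 1
    return counts, len(abstracts)


def word_weights(training, background):
    """Log ratio of a word's frequency in the training set vs. the background.

    Add-one smoothing (an assumption of this sketch) avoids division by zero
    for words seen in only one of the two sets.
    """
    t_counts, t_total = doc_counts(training)
    b_counts, b_total = doc_counts(background)
    weights = {}
    for word in set(t_counts) | set(b_counts):
        p_train = (t_counts.get(word, 0) + 1) / (t_total + 2)
        p_back = (b_counts.get(word, 0) + 1) / (b_total + 2)
        weights[word] = math.log(p_train / p_back)
    return weights


def score(abstract, weights):
    """Sum the weights of the distinct words in an abstract; unseen words add 0."""
    return sum(weights.get(w, 0.0) for w in set(abstract.lower().split()))


# Toy data invented for this example; the background includes the training set.
training = ["stem cell renewal", "stem cell niche"]
background = training + ["protein kinase structure", "gene expression in yeast"]

weights = word_weights(training, background)
candidates = ["stem cell biology", "protein kinase assay"]
ranked = sorted(candidates, key=lambda a: score(a, weights), reverse=True)
print(ranked)  # ['stem cell biology', 'protein kinase assay']
```

Summing log ratios over words is what makes the scorer linear, and it is cheap enough to apply to every abstract in a large collection once the word weights are computed.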

When one has to find references related to a single manuscript, relying on its text might not be enough. We demonstrated that using the whole set of references cited in the manuscript is effective in retrieving related literature [13]. Restricting the selection to references cited in a particular section of the manuscript did not seem to improve performance.

Finding referees for a manuscript

We developed a web tool (peer2ref) that assigns potential referees to a manuscript using as input a short text, for example the abstract of the manuscript [14]. Keywords are extracted from this text and matched to profiles of authors of records in MEDLINE. Authors were disambiguated by coauthorship. The method allows the selection of subjects, which narrows the results to authors publishing in journals related to the selected subjects.

Mining of biomedical literature and copyright issues

With Carol Perez-Iratxeta (OHRI, Ottawa) we edited a special issue on text mining of the biomedical literature. In the introduction, written with substantial input from Christoph Bruch (Helmholtz Open Access Coordination Office, Germany), we discuss current copyright issues that obstruct text mining and how seemingly open access policies from publishers are sometimes not so open [15].

References

[1] Perez-Iratxeta, C., P. Bork and M.A. Andrade. 2001. XplorMed: A tool for exploring MEDLINE abstracts. Trends Biochem Sci. 26, 573-575.

[2] Perez-Iratxeta, C., H.S. Keer, P. Bork and M.A. Andrade. 2002. Computing fuzzy associations for the analysis of biological literature. Biotechniques. 32, 1380-1385.

[3] Perez-Iratxeta, C., P. Bork and M.A. Andrade. 2002. Exploring MEDLINE abstracts with XplorMed. Drugs of Today. 38, 381-389.

[4] Perez-Iratxeta, C., A.J. Pérez, P. Bork and M.A. Andrade. 2003. Update on XplorMed: a web server for exploring scientific literature. Nucleic Acids Research. 31, 3866-3868.

[5] Perez-Iratxeta, C. and M.A. Andrade. 2002. Worldwide scientific publishing activity. Science. 297, 519.

[6] Netzel, R., C. Perez-Iratxeta, P. Bork and M.A. Andrade. 2003. The way we write. Country-specific variations of English in the scientific literature. EMBO Reports. 4, 446-451.

[7] Perez-Iratxeta, C., M.A. Andrade-Navarro and J.D. Wren. 2007. Evolving research trends in bioinformatics. Briefings in Bioinformatics. 8, 88-95.

[8] Palidwor, G. and M.A. Andrade-Navarro. 2010. MLTrends: Graphing MEDLINE term usage over time. Journal of Biomedical Discovery and Collaboration. 5, 1-6. [MLTrends]

[9] Perez-Iratxeta, C., N. Astola, F.D. Ciccarelli, P.K. Sha, P. Bork and M.A. Andrade. 2003. A protocol for the update of references to scientific literature in biological databases. Applied Bioinformatics. 2, 189-191.

[10] Suomela, B.P. and M.A. Andrade. 2005. Ranking the whole MEDLINE database according to a large training set using text indexing. BMC Bioinformatics. 6, 75.

[11] Fontaine, J., A. Barbosa-Silva, M. Schaefer, M.R. Huska, E.M. Muro and M.A. Andrade-Navarro. 2009. MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Research. 37, W141-W146. [MedlineRanker]

[12] Krallinger, M., M. Vazquez, F. Leitner, D. Salgado, A. Chatr-aryamontri, A. Winter, L. Perfetto, L. Briganti, L. Licata, M. Iannuccelli, L. Castagnoli, G. Cesareni, M. Tyers, G. Schneider, F. Rinaldi, R. Leaman, G. Gonzalez, S. Matos, S. Kim, W.J. Wilbur, L. Rocha, H. Shatkay, A.V. Tendulkar, S. Agarwal, F. Liu, X. Wang, R. Rak, K. Noto, C. Elkan, Z. Lu, R.I. Dogan, J.F. Fontaine, M.A. Andrade-Navarro and A. Valencia. 2011. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics. 12 Suppl 8, S3.

[13] Ortuño, F., I. Rojas, M.A. Andrade-Navarro and J.F. Fontaine. 2013. Using cited references to improve the retrieval of related biomedical documents. BMC Bioinformatics. 14, 113.

[14] Andrade-Navarro, M.A., G.A. Palidwor and C. Perez-Iratxeta. 2012. Peer2ref: a peer-reviewer finder web-tool that uses author disambiguation. BioData Mining. 5, 14. [peer2ref]

[15] Andrade-Navarro, M. and C. Perez-Iratxeta. 2015. Text mining of biomedical literature: doing well, but we could be doing better. Methods. 74, 1-2.