Transcript prediction | Computational Biology and Data Mining

We observed that the databases of expressed sequence tags (ESTs) contain abundant evidence for alternative 3'UTR ends that are currently absent from the transcript databases for many genes, and that this evidence invalidates many of the transcript ends recorded in those databases. We proposed and verified a method to predict transcript ends using EST data and analysis of polyadenylation signals [1]. We made the results of our analysis of the complete human and murine genomes available through the Transcriptome Sailor web tool, which allowed examination of particular genomic regions for predictions and evidence.
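The general idea of combining EST evidence with polyadenylation signals can be sketched as follows. This is only an illustrative Python example, not the published pipeline of [1]: the function names, the two signal hexamers, the 40-nt window and the minimum of two supporting ESTs are all assumptions chosen for the illustration.

```python
# Illustrative assumptions: only the canonical hexamer and its most common
# variant are checked; a real analysis would likely use a longer signal list.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def has_polya_signal(sequence, end, window=40):
    """Return True if a signal hexamer lies within `window` nt upstream of `end`.

    `end` is the 0-based index just past the last transcribed base.
    """
    start = max(0, end - window)
    upstream = sequence[start:end].upper()
    return any(signal in upstream for signal in POLYA_SIGNALS)


def supported_ends(sequence, est_end_positions, min_ests=2, window=40):
    """Keep candidate ends backed by >= `min_ests` ESTs and a nearby signal."""
    counts = {}
    for pos in est_end_positions:
        counts[pos] = counts.get(pos, 0) + 1
    return sorted(
        pos
        for pos, n in counts.items()
        if n >= min_ests and has_polya_signal(sequence, pos, window)
    )
```

Here a candidate end is kept only when several ESTs terminate at the same position and a signal hexamer is found shortly upstream; strand handling and clustering of nearby EST ends are deliberately left out.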

A number of non-coding RNAs have been observed that are antisense to other, normally longer, transcripts (natural antisense transcripts, NATs). The function and biological relevance of NATs are under debate. Some NATs overlap the transcript to which they are antisense (cis-NATs) but some do not (trans-NATs). The origin of trans-NATs is puzzling, since their generation requires the formation of a large region of complementarity to a distant gene. We tested the hypothesis that a gene duplicate evolving into a pseudogene could give rise to a transcript expressed antisense to the parental gene, that is, a trans-NAT [2]. To test this hypothesis we analysed the EST data originating from human pseudogenes and observed a significant number of transcripts in pseudogenes that had a region highly conserved with respect to the parental gene near their putative 3' ends.

By comparing Serial Analysis of Gene Expression (SAGE) libraries with equivalent cDNA microarray gene expression profiles from mouse stem cells and their derivatives, we could quantify 48 SAGE tags that provide evidence of possible antisense transcripts, differentially expressed during stem cell differentiation, that overlapped 40 sense genes [3]. The expression patterns of the sense and antisense transcripts suggest that these antisense transcripts modulate the expression of the sense genes.

We developed a novel approach to detect non-coding RNAs (ncRNAs) that combines data from the large set of over 60 million cDNA-based Expressed Sequence Tags (ESTs) available in the NCBI dbEST database with the current model of miRNA biogenesis [4]. Significant evidence of processed miRNAs resides in cDNA-derived EST data, with traces of EST termini near the 3' ends of miRNAs in the mouse and human genomes. We used this property to filter imprecise results from both computational and empirical (i.e. Next-Generation Sequencing) methods of ncRNA prediction and obtained major gains in miRNA prediction accuracy and recall. In addition, since EST libraries describe the tissues, cell types, and conditions in which the ESTs were observed, we used such data to predict tissue-specific miRNAs. To illustrate the application of this approach, we tested hundreds of predicted ncRNAs in differentiating myoblasts and mES cells using a tiling microarray, both identifying expected miRNAs and discovering novel ones, including potential repressors of the myogenic factors Pax7 and Myf5, which are key regulators of muscle differentiation.
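The filtering step described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the method of [4]: coordinates are taken to be on a single chromosome and strand, and the names `est_termini_support` and `filter_candidates`, the 5-nt window and the support threshold are hypothetical.

```python
def est_termini_support(candidate_end, est_termini, window=5):
    """Count EST termini within `window` nt of a candidate miRNA 3' end."""
    return sum(1 for pos in est_termini if abs(pos - candidate_end) <= window)


def filter_candidates(candidates, est_termini, min_support=1, window=5):
    """Keep candidates (name -> 3' end coordinate) with enough nearby EST termini.

    Assumes all coordinates refer to the same chromosome and strand.
    """
    return {
        name: end
        for name, end in candidates.items()
        if est_termini_support(end, est_termini, window) >= min_support
    }
```

Raising `min_support` or narrowing `window` trades recall for precision, which is the kind of tuning such a filter would need on real predictions.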

In a review [5] we considered the historical evolution of the research on pseudogene function, with a focus on the role of pseudogenes as post-transcriptional regulators of the parental genes from which they originate, either directly, by means of the siRNAs they encode, or indirectly, as decoys for miRNAs that target the parental gene.

In a review [6] we discussed the evolution and current status of the computational methods that are used to predict ncRNAs or to evaluate experimental results identifying ncRNAs.

We developed an alignment method designed to detect regions in non-coding DNA with similarity to protein-coding genes [7]. The method uses a substitution matrix to compare three-frame translations of non-coding DNA against proteins. The similarity score is modelled for random mutations. The significance of the alignments is decided using the Rost curve, and the analysis is supported by visualization scripts [8]. Applying this method to human lincRNAs detected 203 transcripts with significant similarity to protein-coding genes, suggesting regulatory functions for these lincRNAs. Taking advantage of these associations, we created an online tool (DiseaseLinc [9]) that associates diseases with lincRNAs. The tool is based on data mining of significant gene-disease associations derived from PubMed, which are assigned to lincRNAs as proxy functions through their parental genes. The associations can be used to obtain functional enrichment for lists of differentially expressed genes, or simply for selected sets. The data can be browsed to associate individual lincRNAs with diseases and vice versa. Links to the PubMed records and genes that support the inferred associations provide further insight into them.
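The three-frame translation that precedes the alignment can be sketched in Python as below. This covers only the translation step using the standard genetic code; the substitution matrix, the scoring model and the Rost-curve significance test of [7] are not reproduced, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def three_frame_translations(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna[offset:]) for offset in range(3)]
```

For example, `three_frame_translations("ATGGCCTGA")` returns `["MA*", "WP", "GL"]`. Stop codons are kept as `*` rather than ending the translation, since remnants of coding sequence are expected to contain them.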

We also contributed to a method and associated web tool (AnABlast) that detects potential coding regions in DNA by running the standard BLASTX algorithm to compare all translated frames of the query DNA sequence against a protein database [10]. A graphical display of the accumulated hits can then be used to point to regions that are potentially coding, but that could also be remnants of protein-coding genes. While using AnABlast, we discovered an accumulation of hits at a CRISPR sequence. CRISPR-Cas loci occur in prokaryotic genomes and include a series of interspaced short (20-60 bp) palindromic repeats. Erroneous translation of open reading frames at these loci results in spurious protein sequences in protein databases [11]. We attempted to characterize these errors (proposing the removal of 1,341 proteins from the database) and provided a protocol that involves comparing new protein sequences with a database of CRISPR repeats, complemented with a search for cas genes in the genomic neighbourhood.
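The notion of accumulated hits along the query can be illustrated with a short Python sketch. It assumes the BLASTX hits have already been parsed into (start, end) intervals on the query, and it ignores reading frame and strand, so it should be read as a simplified illustration and not as the AnABlast implementation; the function names and the threshold are hypothetical.

```python
def hit_profile(query_length, hits):
    """Per-position hit count; hits are (start, end), 0-based and end-exclusive."""
    diff = [0] * (query_length + 1)
    for start, end in hits:
        start, end = max(0, start), min(query_length, end)
        if start < end:
            diff[start] += 1  # coverage rises where a hit begins
            diff[end] -= 1    # and falls just past where it ends
    profile, running = [], 0
    for delta in diff[:query_length]:
        running += delta
        profile.append(running)
    return profile


def peaks(profile, threshold):
    """Return end-exclusive (start, end) regions where the count >= threshold."""
    regions, start = [], None
    for i, depth in enumerate(profile):
        if depth >= threshold and start is None:
            start = i
        elif depth < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    return regions
```

Regions returned by `peaks` correspond to the accumulations one would inspect in the graphical display; whether such a region is coding, a gene remnant or an artefact such as a CRISPR repeat array still requires follow-up analysis.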

References

[1] Muro, E.M., R. Herrington, S. Janmohamed, C. Frelin, M.A. Andrade-Navarro, and N.N. Iscove. 2008. Targeting probes to gene 3'-ends by automated EST cluster analysis. Proc. Natl. Acad. Sci. 105, 20286-20290.

[2] Muro, E.M. and M.A. Andrade-Navarro. 2010. Pseudogenes as an alternative source of natural antisense transcripts. BMC Evolutionary Biology. 10, 338.

[3] Sandie, R., C.J. Porter, G.A. Palidwor, F. Price, P.M. Krzyzanowski, E.M. Muro, S. Hoersch, M. Smith, P.A. Campbell, C. Perez-Iratxeta, M.A. Rudnicki and M.A. Andrade-Navarro. 2012. Paired SAGE-microarray expression data sets reveal antisense transcripts differentially expressed in embryonic stem cell differentiation. [Book chapter]. In Computational Biology of Embryonic Stem Cells. M. Zhang (Ed.). Bentham Scientific Publishers. pp. 193-215.

[4] Krzyzanowski, P.M., F.D. Price, E.M. Muro, M.A. Rudnicki and M.A. Andrade-Navarro. 2011. Integration of expressed sequence tag data flanking predicted RNA secondary structures facilitates novel non-coding RNA discovery. PLoS One. 6, e20561.

[5] Muro, E.M., N. Mah and M.A. Andrade-Navarro. 2011. Functional evidence of post-transcriptional regulation by pseudogenes. Biochimie. 93, 1916-1921.

[6] Krzyzanowski, P.M., E.M. Muro and M.A. Andrade-Navarro. 2012. Computational approaches to discovering non-coding RNA. Wiley Interdisciplinary Reviews: RNA. 3, 567-579.

[7] Talyan, S., M.A. Andrade-Navarro and E.M. Muro. 2018. Identification of transcribed protein coding sequence remnants within lincRNAs. Nucleic Acids Research. 46, 8720-8729.

[8] Talyan, S., M.A. Andrade-Navarro and E.M. Muro. 2021. A methodology to study pseudogenized lincRNAs. Methods Mol. Biol. 2324, 49-63.

[9] More, P., S. Talyan, J.F. Fontaine, E.M. Muro and M.A. Andrade-Navarro. 2021. DiseaseLinc: disease enrichment analysis of sets of differentially expressed lincRNAs. Cells. 10, 751. [DiseaseLinc]

[10] Rubio, A., C.S. Casimiro-Soriguer, P. Mier, M.A. Andrade-Navarro, A. Garzón, J. Jiménez and A.J. Pérez-Pulido. 2018. AnABlast, re-searching for protein-coding sequences in genomic regions. Methods Mol. Biol. 1962, 207-214. [AnABlast]

[11] Rubio, A., P. Mier, M.A. Andrade-Navarro, A. Garzón, J. Jiménez and A.J. Pérez-Pulido. 2020. CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats. Database. 2020, baaa088.