Identification of protein tandem repeats | Computational Biology and Data Mining

Tandem repeats (TRs) in protein sequences encode structural units that assemble into open, elongated, flexible structures or into closed structures (barrels and propellers) [1]. They can evolve quickly because duplication or deletion of a repeat unit is likely to result in a structure that can still fold. The resulting structures are flexible and have large surfaces, which suits them to interactions with other proteins, and thus they are often found in large protein complexes. The evolutionary constraints on the sequences of TRs may be lower than for globular proteins because the structure depends on fewer interacting residues; this can cause large divergence in the sequence of the repeats, which complicates their detection by sequence similarity.

The discovery of HEAT repeats

We found the first homology of the Huntington's disease protein to another protein sequence [2]. This protein contains a repeat of around 40 amino acids which, at the time, had already been described for the alpha subunit of protein phosphatase 2A. We found and characterised this repeat in a number of eukaryotic cytoplasmic proteins, mainly involved in cytoplasmic transport processes and most of them known to be part of protein complexes.

Methods for identification of repeats

We developed a method [3] for identifying short protein repeats (between 20 and 40 amino acids long). These repeats are usually very divergent, and recognising them is difficult even with a good profile of the repeat. We observed that the scores of optimal and sub-optimal non-overlapping alignments of a repeat profile against a large database of randomized sequences follow Extreme Value Distributions (EVDs). From the analysis of those EVDs we can associate E-values with multiple non-overlapping hits of a repeat profile against a query sequence. We tested the method for eleven repeat families in the whole SwissProt database and in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens proteins. We could detect new unrecognised repeats and unify some repeat families. The method is available as the web server REP. An update of the server introduced the possibility to analyse proteins in multiple sequence alignments [4], which helps to support weak hits by comparison to orthologs. Using such comparisons across multiple organisms, we were able to assess the evolutionary trends of structural repeats in eukarya.
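As an illustration of how an EVD fitted to null scores can yield an E-value, the following minimal sketch fits a Gumbel distribution by the method of moments and converts a hit score into an expected number of chance hits. It is not the REP implementation: the function names, the method-of-moments fit, the toy null scores and the database size are all assumptions made for the example, and a real analysis would fit one distribution per hit rank (best hit, second-best non-overlapping hit, and so on).

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) by the method of moments."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_tail(score, mu, beta):
    """Return P(S >= score) under a Gumbel(mu, beta) null distribution."""
    z = (score - mu) / beta
    if z < -30:  # tail probability is 1 to machine precision; avoids overflow
        return 1.0
    return -math.expm1(-math.exp(-z))


def e_value(score, mu, beta, n_sequences):
    """Expected number of null sequences scoring at least `score`."""
    return n_sequences * gumbel_tail(score, mu, beta)


# Toy null distribution (assumption): each "alignment score" is the maximum
# of 50 random draws, standing in for profile scores on randomized sequences.
rng = random.Random(0)
null_scores = [max(rng.gauss(0, 1) for _ in range(50)) for _ in range(2000)]
mu, beta = fit_gumbel_moments(null_scores)

# Hypothetical hit score and database size, for illustration only.
print(e_value(mu + 8 * beta, mu, beta, n_sequences=100_000))
```

The method of moments is used here only because it needs no external library; a maximum-likelihood fit would be the more usual choice when estimating tail probabilities.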

The previous work showed the difficulty of classifying ARM and HEAT repeats (which occur in at least 1 in 500 eukaryotic protein sequences). They are similar in sequence and structure but we could not account for both of them with a single profile. We have reviewed these repeats [5] correlating sequence similarity between repeats to functional and structural properties. Several profiles were built that improved their detection. They can be used for scanning protein sequences through the REP server.

We developed a neural network based method [ARD] to detect repeats such as HEAT, Armadillo, and PBS, which form similar structures composed of alpha-helices (which we termed alpha-rods) [6]. This method allowed us to detect novel instances of this structure, for example in human proteins STAG1-3, SERAC1, and PSMD1-2 & 5. Application of the method to human huntingtin and comparison to orthologs allowed us to delimit three alpha-rods in huntingtin, whose intra-molecular interactions we characterized experimentally using yeast two-hybrid and co-immunoprecipitation of protein fragments encoding the domains. We updated the method by allowing the detection of repeats with an internal linker of variable length ([ARD2], [7]). Using ARD2 we evaluated novel structures and the phylogenetic distribution of these repeats, pointing to multiple likely events of independent emergence of these repeats in distant taxa and to their increased frequency in organisms of high cellular complexity such as eukarya in general, and cyanobacteria and planctomycetes within prokarya.

We developed a method and a web tool to identify duplications of protein short tandem repeats (pSTRs) from protein pairwise alignments ([pSTR], [8]). Study of orthologs from 12 complete metazoan proteomes suggests that at least 3% of amino acids in sequences are covered by pSTRs. We identified protein families with higher frequency of pSTR variation, particularly of proteins involved in liquid-liquid phase separation, suggesting that evolutionary pressure for repeat unit variation could be associated with particular protein functions.
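The idea of reading repeat unit duplications off a pairwise alignment can be illustrated with a small sketch. This is not the pSTR algorithm: the function name, the exact-match criterion, and the unit length limits are assumptions chosen for illustration. The sketch scans for gap runs in the first sequence and reports those where the residues inserted in the second sequence exactly repeat the residues that flank them.

```python
def find_unit_variations(aligned_a, aligned_b, min_unit=2, max_unit=10):
    """Return (alignment_column, unit) for each gap run in aligned_a whose
    inserted segment in aligned_b is an exact copy of the adjacent residues
    of aligned_b (to the left or to the right of the insertion)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    n = len(aligned_a)
    i = 0
    while i < n:
        if aligned_a[i] == "-" and aligned_b[i] != "-":
            # Extend over the whole gap run in aligned_a.
            j = i
            while j < n and aligned_a[j] == "-" and aligned_b[j] != "-":
                j += 1
            inserted = aligned_b[i:j]
            k = len(inserted)
            if min_unit <= k <= max_unit:
                # Ungapped residues of aligned_b flanking the insertion.
                left = aligned_b[:i].replace("-", "")[-k:]
                right = aligned_b[j:].replace("-", "")[:k]
                if inserted == left or inserted == right:
                    hits.append((i, inserted))
            i = j
        else:
            i += 1
    return hits


# Toy alignment (assumption): the second sequence carries one extra "PQA" unit.
print(find_unit_variations("MKPQAPQA---GTL", "MKPQAPQAPQAGTL"))
```

Calling the function a second time with the arguments swapped would find unit gains in the first sequence relative to the second; tolerating mismatches between units would be a natural extension for divergent repeats.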

Identification of novel repeats

We have analysed a large protein family of the Arabidopsis thaliana plant genome [9]. This family contains at least 48 proteins of as yet unknown function. We identified Kelch repeats (implicated in protein-protein interactions) and an F-box domain (which targets proteins for degradation). The demonstration of the in vivo interaction of one of the members of the family with ASK1 (homolog of yeast Skp1p, a subunit of the SCF complex, which is involved in the ubiquitination of proteins prior to degradation by the 26S proteasome) via the F-box domain gave some insights into the functionality of this family.

Protein repeats that form structural repeating units that assemble together are quite common in many protein families and organisms. In an invited review we discuss the analysis of such repeats (including computational characterization) and how we think that repetition in protein sequence relates to evolution and function [10].

We identified a protein domain that appears with variable copy number in genes that are usually in the vicinity of a putative Fe3+ siderophore transporter [11]. We denoted this new domain NEAT for NEAr Transporter. Given that this domain seems to be specific to pathogenic bacteria, we suggest that it is a potential target for therapy against disease.

We participated in the characterization of microtubule associated AIR9, a protein that in plants associates with the microtubules of the cortical cells during preprophase and again when the plant cortex is contacted by the cell plate (a plant-specific cellular structure that forms during cell division) [12]. This protein has homologs in trypanosomatid parasites featuring a region with leucine-rich repeats and a number of protein TRs. We termed these repeats A9, characterized them in plant, trypanosome, and bacterial sequences, and predicted them to adopt an immunoglobulin fold. We discussed the phylogeny of the AIR9 proteins with novel sequence evidence, as well as the particular amino acid bias in the plant members of this family [13].

Periostin is a protein of the extracellular matrix. Despite its proven association with bone and heart development and with cancer, its function currently remains elusive. By sequence and database analyses we characterized the variability of Periostin's C-terminal region in terms of exon count, length, and alternative splicing, and the existence of a 13-amino acid repeat that we predict to form consecutive beta strands [14]. These findings are put in the context of functional and structural predictions.

In some situations, even after resolution of a protein's 3D structure, the definition of protein repeats may be under debate. For example, we clarified the presence of armadillo repeats in p115, a structural component of the Golgi apparatus that facilitates the tethering of transport vesicles inbound from the endoplasmic reticulum to the cis-Golgi membrane, following conflicting interpretations of its structure [15].

We characterized a region of 15 repeats of around 10 amino acids in the human mineralocorticoid receptor (MR) [16]. The MR is part of the renin angiotensin aldosterone system (RAAS). This protein has an inhibitory domain of unknown structure. We predict that the repeat region adopts a beta-solenoid structure and propose how this could be involved in phosphorylation-dependent inter- and intra-molecular interactions.

Using sequence similarity analyses, we identified a region of TRs covering the C-terminal two thirds of the TPX2 protein [17]. TPX2, conserved in plants and chordata, is essential for spindle pole formation and controls the nucleation of microtubules on chromosomes during mitosis. Until then there was no structural information about this protein. Structure predictions suggest that the repeat region forms an alpha-helical solenoid, which is supported by CD spectra that indicate high alpha-helical content in Xenopus (frog) and Arabidopsis (plant) TPX2.

Tandem repeats in protein structures and interactions

RepeatsDB is a database of protein TRs derived from known protein 3D structures. The RepeatsDB 2.0 update included annotations from more than 5400 structures, 60% of them manually curated [18]. Repeats are classified in five categories according to their length and general arrangement, with subclasses that depend on secondary structure content. The RepeatsDB 3.0 update extended the classification scheme, separating three hierarchical levels based on structural similarity (class, topology and fold) from two lower levels that consider sequence similarity: clan, for repeat motifs, and family, which considers homology [19].

While in theory encoding repeat units in separate exons would make them easier to duplicate, most TRs are not encoded in this way. On the other hand, observing the cases where this happens can help to define the structural units that form a TR and their phase. To approach these questions we characterized the correspondences between exon boundaries and structures for a number of TR proteins [20]. Different types of repeats behave differently, with some being more prone to a high correspondence; the encoding of two consecutive TRs in one exon was a frequently observed feature. Such observations should help the detection and classification of TRs.
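One simple way to quantify such a correspondence is to ask what fraction of repeat unit boundaries lie close to an exon boundary once both are expressed in protein coordinates. The sketch below is only an illustration of that idea and not the procedure used in [20]; the function name, the tolerance of two residues, and the toy coordinates are assumptions.

```python
from bisect import bisect_left


def boundary_correspondence(unit_starts, exon_starts, tolerance=2):
    """Fraction of repeat unit start positions that have an exon start
    within `tolerance` residues (both given in protein coordinates)."""
    if not unit_starts:
        return 0.0
    exons = sorted(exon_starts)
    matched = 0
    for pos in unit_starts:
        idx = bisect_left(exons, pos)
        # Only the nearest exon start on each side can be within tolerance.
        neighbours = exons[max(idx - 1, 0): idx + 1]
        if any(abs(pos - e) <= tolerance for e in neighbours):
            matched += 1
    return matched / len(unit_starts)


# Toy coordinates (assumption): four repeat units, three exon starts.
print(boundary_correspondence([10, 50, 90, 130], [11, 90, 200]))
```

A pattern in which every second unit boundary matches an exon boundary, as in this toy case, is what one would expect when two consecutive repeat units are encoded per exon.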

Because the definition of Tandem Repeats in Proteins (TRPs) is too general, we proposed defining a sub-category of Structural TRPs (STRPs) as those that have a structure that can be solved experimentally or predicted [21]. We point to other properties in which STRPs differ from general TRPs: for example, they have complex sequence composition and low sequence similarity between repeats.

TRs that adopt structures can form open or closed ensembles. Open ensembles (solenoids) are often used as flexible protein scaffolds for interaction with many proteins. We reviewed different aspects of the sequence, structural and evolutionary properties of TRs that make them advantageous for a function in protein interaction [22]. Solenoids of TRs behave like globular domains in their interactions because they are highly structured, which makes them very different from disordered regions that interact via motifs, while remaining flexible over long ranges. We illustrate two examples of possible TR duplication events in Plasmodium, but the scarcity of such events suggests that the ensemble of TRs reaches an optimal length in an ancestral sequence and that this length then becomes rather fixed in evolution; however, partners can evolve to adapt to and twist the flexible solenoid.

Repeats in giant viruses

Considering that the mechanisms of TR duplication (and deletion) must result from the replication machinery, we wondered whether viruses, which use the replication machinery of the host, would be able to use it to gain repeats in their proteins, and whether this would bias the types of repeats found in viruses. For this analysis, we studied the proteomes of nucleocytoplasmic large DNA viruses (NCLDVs or giant viruses) because they encode hundreds of proteins [23]. We found all repeat lengths in their proteins, from homorepeats, to short TRs (leading to composition bias and unlikely to gain structure), to larger structured repeats. All repeats were found to evolve dynamically within the viral lineages; both homorepeats and short TRs emerge in viruses, while structured repeats were acquired by horizontal transfer. We conclude that giant viruses are good models for the study of the evolution of TRs.
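Homorepeats, the shortest repeat type mentioned above, are runs of a single amino acid and are straightforward to locate. The following minimal sketch shows one way to do so with a regular expression; the function name and the minimum run length of four residues are assumptions for illustration and are not taken from [23].

```python
import re


def find_homorepeats(sequence, min_length=4):
    """Return (start, residue, length) for every run of one amino acid
    that is at least `min_length` residues long (0-based start)."""
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    # One residue followed by at least min_length - 1 copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(sequence)
    ]


# Toy sequence (assumption) with one poly-Q and one poly-S run.
print(find_homorepeats("MAQQQQQQLKSSSSTA"))
```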

References

[1] Andrade, M.A., C. Perez-Iratxeta, and C.P. Ponting. 2001. Protein repeats: structures, functions and evolution. Journal of Structural Biology. 134, 117-131.

[2] Andrade, M.A. and P. Bork. 1995. HEAT repeats in the Huntington's disease protein. Nature Genetics, 11, 115-116.

[3] Andrade, M.A., C.P. Ponting, T.J. Gibson and P. Bork. 2000. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. 298, 521-537. [REP]

[4] Kamel, M., K. Kastano, P. Mier and M.A. Andrade-Navarro. 2021. REP2: a web server to detect common tandem repeats in protein sequences. J. Mol. Biol. 433, 166895. [REP]

[5] Andrade, M.A., C. Petosa, S.I. O'Donoghue, C.W. Müller and P. Bork. 2001. Comparison of ARM and HEAT repeat proteins. J. Mol. Biol. 309, 1-18.

[6] Palidwor, G.A., S. Shcherbinin, M.R. Huska, T. Rasko, U. Stelzl, A. Arumughan, R. Foulle, P. Porras, L. Sanchez-Pulido, E.E. Wanker, M.A. Andrade-Navarro. 2009. Detection of alpha-rod repeats using a neural network and application to huntingtin. PLoS Comp. Biol. 5, e1000304. [ARD].

[7] Fournier, D., G.A. Palidwor, S. Shcherbinin, A. Szengel, M.H. Schaefer, C. Perez-Iratxeta and M.A. Andrade-Navarro. 2013. Functional and genomic analyses of alpha-solenoid proteins. PLoS One. 8, e79894. [ARD2].

[8] Mier, P. and M.A. Andrade-Navarro. 2023. Evolutionary study of protein short tandem repeats in protein families. Biomolecules. 13, 1116. [pSTR]

[9] Andrade, M.A., M. González-Guzmán, R. Serrano and P.L. Rodríguez. 2001. A combination of the F-box motif and kelch repeats defines a large Arabidopsis family of F-box proteins. Plant Mol. Biol. 46, 603-614.

[10] Andrade, M.A., C. Perez-Iratxeta, and C.P. Ponting. 2001. Protein repeats: structures, functions and evolution. Journal of Structural Biology. 134, 117-131.

[11] Andrade, M.A., F.D. Ciccarelli, C. Perez-Iratxeta and P. Bork. 2002. NEAT: A domain duplicated in genes near the components of a putative Fe3+ siderophore transporter from Gram-positive pathogenic bacteria. Genome Biology. 3, research0047.1-0047.5.

[12] Buschmann, H., J. Chan, L. Sanchez-Pulido, M.A. Andrade-Navarro, J.H. Doonan and C.W. Lloyd. 2006. Microtubule associated AIR9 recognizes the cortical division site at preprophase and again when the cell plate inserts. Current Biology. 16, 1938-1943.

[13] Buschmann, H., L. Sanchez-Pulido, M.A. Andrade-Navarro and C.W. Lloyd. 2007. Homologues of Arabidopsis microtubule-associated AIR9 in trypanosomatid parasites: hints on evolution and function. Plant Signaling & Behavior. 2, 296-299.

[14] Hoersch, S. and M.A. Andrade-Navarro. 2010. Periostin shows increased evolutionary plasticity in its alternatively spliced region. BMC Evolutionary Biology. 10, 30.

[15] Striegl, H., M.A. Andrade-Navarro, U. Heinemann. 2010. Armadillo motifs involved in vesicular transport. PLoS ONE. 5, e8991.

[16] Vlassi, M., K. Brauns and M.A. Andrade-Navarro. 2013. Short tandem repeats in the inhibitory domain of the mineralocorticoid receptor: prediction of a ß-solenoid structure. BMC Structural Biology. 13, 17.

[17] Sanchez-Pulido, L., L.H. Perez, S. Kuhn, I. Vernos and M.A. Andrade-Navarro. 2016. The C-terminal domain of TPX2 is made of alpha-helical tandem repeats. BMC Structural Biology. 16, 17.

[18] Paladin, L., L. Hirsh, D. Piovesan, M.A. Andrade-Navarro, A.V. Kajava and S.C.E. Tosatto. 2016. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures. Nucleic Acids Research. 45, 3613. [RepeatsDB]

[19] Paladin, L., M. Bevilacqua, S. Errigo, D. Piovesan, I. Mičetić, M. Necci, A.M. Monzon, M.L. Fabre, J.L. Lopez, J.F. Nilsson, J. Rios, P. Lorenzano Menna, M. Cabrera, M. Gonzalez Buitron, M. Gonçalves Kulik, S. Fernandez-Alberti, M.S. Fornasari, G. Parisi, A. Lagares, L. Hirsh, M.A. Andrade-Navarro, A.V. Kajava, and S.C.E. Tosatto. 2021. RepeatsDB in 2021: improved data and extended classification for protein tandem repeat structures. Nucleic Acids Research. 49, D452-D457. [RepeatsDB]

[20] Paladin, L., M. Necci, D. Piovesan, P. Mier, M.A. Andrade-Navarro and S.C.E. Tosatto. 2020. A novel approach to investigate the evolution of structured tandem repeat protein families by exon duplication. J. Struct. Biol. 212, 107608.

[21] Monzon, A.M., P.N. Arrías, A. Elofsson, P. Mier, M.A. Andrade-Navarro, M. Bevilacqua, D. Clementel, A. Bateman, L. Hirsh, M.S. Fornasari, G. Parisi, D. Piovesan, A.V. Kajava, S.C.E. Tosatto. 2023. A STRP-ed definition of Structured Tandem Repeats in Proteins. J. Struct. Biol. 215, 108023.

[22] Mac Donagh, J., A. Marchesini, A. Spiga, M.J. Fallico, P.N. Arrías, A.M. Monzon, A.C. Vagiona, M. Gonçalves-Kulik, P. Mier and M.A. Andrade-Navarro. 2024. Structured tandem repeats in protein interactions. Int. J. Mol. Sci. 25, 2994.

[23] Erdozain, S., E. Barrionuevo, L. Ripoll, P. Mier and M.A. Andrade-Navarro. 2023. Protein repeats evolve and emerge in giant viruses. J. Struct. Biol. 215, 107962.