Identification of protein tandem repeats

Tandem repeats in protein sequences generate structural folding units that assemble into open, elongated, flexible structures or into closed structures (barrels and propellers) [1]. They can evolve quickly because duplication or deletion of a repeat unit is likely to result in a structure that can still fold. They generate structures that are flexible and have large surfaces, well suited to interactions with other proteins, and thus are often found in large protein complexes. The evolutionary constraints on the sequences of tandem repeats may be lower than for globular proteins, as the structure depends on fewer interacting residues; this can cause large divergence in the sequence of the repeats, which complicates their detection by sequence similarity.

The discovery of HEAT repeats

We found the first homology of the Huntington's disease protein to other protein sequences [2]. This protein contains a repeat of around 40 amino acids which, at the time, had already been described for the alpha subunit of protein phosphatase 2A. We found and characterised this repeat in a number of eukaryotic cytoplasmic proteins, mainly involved in cytoplasmic transport processes and most of them known to be part of protein complexes.

Methods for identification of repeats

We developed a method [3] for identification of short protein repeats (between 20 and 40 amino acids long). These repeats are usually very divergent and their recognition is difficult even with a good profile of the repeat. We observed that the scores of optimal and sub-optimal non-overlapping alignments of a repeat profile against a large database of randomized sequences follow Extreme Value Distributions (EVDs). From the analysis of those EVDs we can associate E-values to multiple non-overlapping hits of a repeat profile against a query sequence. We tested the method for eleven repeat families in the whole SwissProt database and in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens proteins. We could detect new unrecognised repeats and unify some repeat families. The method is available as the web server REP. An update of the server introduced the possibility to analyse proteins in multiple sequence alignments [4], which helps to add support for weak hits by comparison to orthologs. Using such comparisons across multiple organisms, we were able to assess the evolutionary trends of structural repeats in eukarya.
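
As an illustration of how an EVD can turn a raw profile score into an E-value, the following minimal sketch fits a Gumbel (type I extreme value) distribution to null scores by the method of moments and evaluates its upper tail. It is not the REP implementation: the function names, the method-of-moments fit, and the use of a single distribution (rather than one per hit rank) are assumptions made for the example.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(null_scores):
    """Method-of-moments fit of a Gumbel distribution (assumed fitting choice).

    null_scores: best alignment scores of a profile against randomized sequences.
    Returns (mu, beta): location and scale of the fitted distribution.
    """
    mean = statistics.fmean(null_scores)
    sd = statistics.stdev(null_scores)
    beta = sd * math.sqrt(6) / math.pi   # Gumbel variance is (pi * beta)^2 / 6
    mu = mean - EULER_GAMMA * beta       # Gumbel mean is mu + gamma * beta
    return mu, beta


def tail_probability(score, mu, beta):
    """P(S >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    return -math.expm1(-math.exp(-(score - mu) / beta))


def e_value(score, mu, beta, n_sequences):
    """Expected number of random sequences scoring at least `score`."""
    return n_sequences * tail_probability(score, mu, beta)
```

In this sketch, a hit would be called significant when `e_value(score, mu, beta, n_sequences)` falls below a chosen threshold; handling several non-overlapping hits would require one fitted distribution per hit rank, as described above.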

The previous work showed the difficulty of classifying ARM and HEAT repeats (which occur in at least 1 in 500 eukaryotic protein sequences). They are similar in sequence and structure but we could not account for both of them with a single profile. We have reviewed these repeats [5] correlating sequence similarity between repeats to functional and structural properties. Several profiles were built that improved their detection. They can be used for scanning protein sequences through the REP server.

We developed a neural network based method [ARD] to detect repeats like HEAT, Armadillo, and PBS, which form similar structures composed of alpha-helices (which we termed alpha-rods) [6]. This method allowed us to detect novel instances of this structure, for example in human proteins STAG1-3, SERAC1, and PSMD1-2 & 5. Application of the method to human huntingtin and comparison to orthologs allowed us to delimit three alpha-rods in huntingtin, whose intra-molecular interactions we characterized experimentally using yeast two-hybrid and co-immunoprecipitation of protein fragments encoding the domains. We updated the method by allowing the detection of repeats with an internal linker of variable length ([ARD2][7]). Using ARD2 we evaluated novel structures and the phylogenetic distribution of these repeats, pointing to multiple likely events of independent emergence of these repeats in distant taxa and to their increased frequency in organisms of high cellular complexity, such as eukarya in general, and cyanobacteria and planctomycetes within prokarya.

We developed a method and a web tool to identify duplications of protein short tandem repeats (pSTRs) from protein pairwise alignments ([pSTR], [8]). Study of orthologs from 12 complete metazoan proteomes suggests that at least 3% of amino acids in sequences are covered by pSTRs. We identified protein families with higher frequency of pSTR variation, particularly of proteins involved in liquid-liquid phase separation, suggesting that evolutionary pressure for repeat unit variation could be associated with particular protein functions.
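
The idea of reading repeat unit duplications from a pairwise alignment can be sketched as follows. This is a simplified illustration, not the pSTR algorithm: the function name, the requirement of an exact match between the inserted segment and its neighbouring unit, and the `min_unit` threshold are all assumptions made for the example.

```python
def find_unit_duplications(aln_a, aln_b, min_unit=2):
    """Find segments of aln_a that face a gap in aln_b and copy an adjacent unit.

    aln_a, aln_b: two rows of a pairwise alignment (same length, '-' for gaps).
    Returns a list of (alignment_column, unit) tuples.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    n = len(aln_a)
    i = 0
    while i < n:
        if aln_b[i] == "-" and aln_a[i] != "-":
            # Extend over the whole run of residues in A aligned to gaps in B.
            j = i
            while j < n and aln_b[j] == "-" and aln_a[j] != "-":
                j += 1
            unit = aln_a[i:j]
            k = len(unit)
            # Ungapped residues of A immediately before and after the insertion.
            left = aln_a[:i].replace("-", "")[-k:]
            right = aln_a[j:].replace("-", "")[:k]
            if k >= min_unit and unit in (left, right):
                hits.append((i, unit))
            i = j
        else:
            i += 1
    return hits
```

For example, aligning `MKPQAPQAPQALT` against `MKPQA---PQALT` reports the inserted `PQA` unit at alignment column 5, because it repeats the three residues that precede it.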

Identification of novel repeats

We have analysed a large protein family of the Arabidopsis thaliana plant genome [9]. This family contains at least 48 proteins of as yet unknown function. We identified Kelch repeats (implicated in protein-protein interactions) and an F-box domain (which targets proteins for degradation). The demonstration of the in vivo interaction of one of the members of the family with ASK1 (homolog of yeast Skp1p, a subunit of the SCF complex, which is involved in the ubiquitination of proteins prior to degradation by the 26S proteasome) via the F-box domain gave some insights into the functionality of this family.

Protein repeats that form structural repeating units that assemble together are quite common in many protein families and organisms. In an invited review we discuss the analysis of such repeats (including computational characterization) and how we think that repetition in protein sequence relates to evolution and function [10].

We identified a protein domain that appears with variable copy number in genes that are usually in the vicinity of a putative Fe3+ siderophore transporter [11]. We denoted this new domain NEAT for NEAr Transporter. Given that this domain seems to be specific to pathogenic bacteria, we suggest that it is a potential target for therapy against disease.

We participated in the characterization of microtubule-associated AIR9, a protein that in plants associates with the microtubules of the cortical cells during preprophase and when the plant cortex is contacted by the cell plate (a plant-specific cellular structure that forms during cell division) [12]. This protein has homologs in trypanosomatid parasites featuring a region with leucine-rich repeats and a number of protein tandem repeats. We termed these repeats A9, characterized them in plant, trypanosome, and bacterial sequences, and predicted them to adopt an immunoglobulin fold. We discussed the phylogeny of the AIR9 proteins with novel sequence evidence, as well as the particular amino acid bias in the plant members of this family [13].

Periostin is a protein of the extracellular matrix. Despite its proven association with bone and heart development and with cancer, its function remains elusive. By sequence and database analyses we characterized the variability of Periostin's C-terminal region in terms of exon count, length, and alternative splicing, and the existence of a 13-amino acid repeat that we predict to form consecutive beta strands [14]. These findings are put in the context of functional and structural predictions.

In some situations, even after resolution of a protein's 3D structure, the definition of protein repeats may be under debate. For example, we clarified the presence of armadillo repeats in p115, a structural component of the Golgi apparatus that facilitates the tethering of transport vesicles inbound from the endoplasmic reticulum to the cis-Golgi membrane, following conflicting interpretations of its structure [15].

We characterized a region of 15 repeats of around 10 amino acids in the human mineralocorticoid receptor (MR) [16]. The MR is part of the renin angiotensin aldosterone system (RAAS). This protein has an inhibitory domain of unknown structure. We predict that the repeat region adopts a beta-solenoid structure and propose how this could be involved in phosphorylation-dependent inter- and intra-molecular interactions.

Using sequence similarity analyses, we identified a region of tandem repeats covering the C-terminal two thirds of the TPX2 protein [17]. TPX2, conserved in plants and chordata, is essential for spindle pole formation and controls the nucleation of microtubules on chromosomes during mitosis. There was so far no structural information about this protein. Structure predictions suggest that the repeat region forms an alpha-helical solenoid, which we support with CD spectra that indicate high alpha-helical content in Xenopus (frog) and Arabidopsis (plant) TPX2.

Tandem repeats and protein structures

RepeatsDB is a database of protein tandem repeats of known structure, derived from protein 3D structures. The RepeatsDB 2.0 update included annotations from more than 5400 structures, 60% of them manually curated [18]. Repeats are classified in five categories according to their length and general arrangement, with subclasses that depend on secondary structure content. The RepeatsDB 3.0 update extended the classification scheme, separating three hierarchical levels based on structural similarity (class, topology and fold) from two lower levels based on sequence similarity: clan, for repeat motifs, and family, which considers homology [19].

While in theory encoding repeat units in separate exons would make them easier to duplicate, most tandem repeats are not encoded in this way. On the other hand, observing the cases where this happens can help define the structural units forming a tandem repeat and their phase. To approach these questions we characterized the correspondences between exon boundaries and structures for a number of tandem repeat proteins [20]. Different types of repeats behave differently, with some more prone to a high correspondence; encoding of two consecutive tandem repeats was a frequently observed feature. Such observations should help the detection and classification of tandem repeats.
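
One simple way to quantify the correspondence between exon boundaries and repeat units is sketched below. It is only an illustration under stated assumptions, not the measure used in [20]: the function name, the use of protein coordinates for both sets of boundaries, and the `tolerance` window are hypothetical.

```python
def boundary_correspondence(repeat_starts, exon_starts, tolerance=2):
    """Fraction of repeat unit starts lying within `tolerance` residues of an exon boundary.

    repeat_starts: start positions of repeat units, in protein coordinates.
    exon_starts: exon boundary positions mapped onto the protein sequence.
    """
    if not repeat_starts:
        return 0.0
    matched = sum(
        1
        for r in repeat_starts
        if any(abs(r - e) <= tolerance for e in exon_starts)
    )
    return matched / len(repeat_starts)
```

With repeat units starting at residues 10, 50, 90 and 130 and exon boundaries at 11, 49, 100 and 131, three of the four unit starts have a nearby exon boundary, giving a correspondence of 0.75.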

Because the definition of Tandem Repeats in Proteins (TRPs) is too general, we proposed a sub-category of Structural TRPs (STRPs), defined as those that have a structure that can be solved experimentally or predicted [21]. We point to other properties in which STRPs differ from general TRPs: for example, having complex sequence composition and low sequence similarity between repeats.

Repeats in giant viruses

Considering that the mechanisms of tandem repeat duplication (and deletion) must result from the replication machinery, we wondered whether viruses, which use the replication machinery of the host, would be able to use it to gain repeats in their proteins, and whether this would bias the types of repeats found in viruses. For this analysis, we studied the proteomes of nucleocytoplasmic large DNA viruses (NCLDVs or giant viruses) because they encode hundreds of proteins [22]. We found all repeat lengths in their proteins, from homorepeats, to short tandem repeats (leading to composition bias and unlikely to gain structure), to larger structured repeats. All repeats were found to evolve dynamically within the viral lineages; both homorepeats and short tandem repeats emerge in viruses, while structured repeats were acquired by horizontal transfer. We conclude that giant viruses are good models for the study of the evolution of tandem repeats.
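
Homorepeats, the shortest repeat type mentioned above, are runs of a single amino acid and can be located with a regular expression. The sketch below is only an illustration: the function name and the `min_length` threshold are assumptions, not the criteria used in [22].

```python
import re


def find_homorepeats(sequence, min_length=5):
    """Return (start, residue, length) for each run of one residue of at least min_length.

    sequence: protein sequence in upper-case one-letter code.
    min_length: minimal run length (assumed threshold for the example).
    """
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    # One residue followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(sequence)
    ]
```

For the sequence `MAQQQQQQLSSSSK`, the default threshold reports only the run of six glutamines starting at position 2 (0-based); lowering `min_length` to 4 also reports the run of four serines.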



[1] Andrade, M.A., C. Perez-Iratxeta, and C.P. Ponting. 2001. Protein repeats: structures, functions and evolution. Journal of Structural Biology. 134, 117-131.

[2] Andrade, M.A. and P. Bork. 1995. HEAT repeats in the Huntington's disease protein. Nature Genetics, 11, 115-116.

[3] Andrade, M.A., C.P. Ponting, T.J. Gibson and P. Bork. 2000. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. 298, 521-537. [REP]

[4] Kamel, M., K. Kastano, P. Mier and M.A. Andrade-Navarro. 2021. REP2: a web server to detect common tandem repeats in protein sequences. J. Mol. Biol.  433, 166895. [REP]

[5] Andrade, M.A., C. Petosa, S.I. O'Donoghue, C.W. Müller and P. Bork. 2001. Comparison of ARM and HEAT repeat proteins. J. Mol. Biol. 309, 1-18.

[6] Palidwor, G.A., S. Shcherbinin, M.R. Huska, T. Rasko, U. Stelzl, A. Arumughan, R. Foulle, P. Porras, L. Sanchez-Pulido, E.E. Wanker, M.A. Andrade-Navarro. 2009. Detection of alpha-rod repeats using a neural network and application to huntingtin. PLoS Comp. Biol. 5, e1000304. [ARD]

[7] Fournier, D., G.A. Palidwor, S. Shcherbinin, A. Szengel, M.H. Schaefer, C. Perez-Iratxeta and M.A. Andrade-Navarro. 2013. Functional and genomic analyses of alpha-solenoid proteins. PLoS One. 8, e79894. [ARD2].

[8] Mier, P. and M.A. Andrade-Navarro. 2023. pSTR: detection of protein Short Tandem Repeats in protein families. Biomolecules. 13, 1116. [pSTR]

[9] Andrade, M.A., M. González-Guzmán, R. Serrano and P.L. Rodríguez. 2001. A combination of the F-box motif and kelch repeats defines a large Arabidopsis family of F-box proteins. Plant Mol. Biol. 46, 603-614.

[10] Andrade, M.A., C. Perez-Iratxeta, and C.P. Ponting. 2001. Protein repeats: structures, functions and evolution. Journal of Structural Biology. 84, 445-451.

[11] Andrade, M.A., F.D. Ciccarelli, C. Perez-Iratxeta and P. Bork. 2002. NEAT: A domain duplicated in genes near the components of a putative Fe3+ siderophore transporter from Gram-positive pathogenic bacteria. Genome Biology. 3, research0047.1-0047.5.

[12] Buschmann, H., J. Chan, L. Sanchez-Pulido, M.A. Andrade-Navarro, J.H. Doonan and C.W. Lloyd. 2006. Microtubule associated AIR9 recognizes the cortical division site at preprophase and again when the cell plate inserts. Current Biology. 16, 1938-1943.

[13] Buschmann, H., L. Sanchez-Pulido, M.A. Andrade-Navarro and C.W. Lloyd. 2007. Homologues of Arabidopsis microtubule-associated AIR9 in trypanosomatid parasites: hints on evolution and function. Plant Signaling & Behavior. 2, 296-299.

[14] Hoersch, S. and M.A. Andrade-Navarro. 2010. Periostin shows increased evolutionary plasticity in its alternatively spliced region. BMC Evolutionary Biology. 10, 30.

[15] Striegl, H., M.A. Andrade-Navarro, U. Heinemann. 2010. Armadillo motifs involved in vesicular transport. PLoS ONE. 5, e8991.

[16] Vlassi, M., K. Brauns and M.A. Andrade-Navarro. 2013. Short tandem repeats in the inhibitory domain of the mineralocorticoid receptor: prediction of a ß-solenoid structure. BMC Structural Biology. 13, 17.

[17] Sanchez-Pulido, L., L.H. Perez, S. Kuhn, I. Vernos and M.A. Andrade-Navarro. 2016. The C-terminal domain of TPX2 is made of alpha-helical tandem repeats. BMC Structural Biology. 16, 17.

[18] Paladin, L., L. Hirsh, D. Piovesan, M.A. Andrade-Navarro, A.V. Kajava and S.C.E. Tosatto. 2016. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures. Nucleic Acids Research. 45, 3613. [RepeatsDB]

[19] Paladin, L., M. Bevilacqua, S. Errigo, D. Piovesan, I. Mičetić, M. Necci, A.M. Monzon, M.L. Fabre, J.L. Lopez, J.F. Nilsson, J. Rios, P. Lorenzano Menna, M. Cabrera, M. Gonzalez Buitron, M. Gonçalves Kulik, S. Fernandez-Alberti, M.S. Fornasari, G. Parisi, A. Lagares, L. Hirsh, M.A. Andrade-Navarro, A.V. Kajava, and S.C.E. Tosatto. 2021. RepeatsDB in 2021: improved data and extended classification for protein tandem repeat structures. Nucleic Acids Research. 49, D452-D457. [RepeatsDB]

[20] Paladin, L., M. Necci, D. Piovesan, P. Mier, M.A. Andrade-Navarro and S.C.E. Tosatto. 2020. A novel approach to investigate the evolution of structured tandem repeat protein families by exon duplication. J. Struct. Biol. 212, 107608.

[21] Monzon, A.M., P.N. Arrías, A. Elofsson, P. Mier, M.A. Andrade-Navarro, M. Bevilacqua, D. Clementel, A. Bateman, L. Hirsh, M.S. Fornasari, G. Parisi, D. Piovesan, A.V. Kajava, S.C.E. Tosatto. 2023. A STRP-ed definition of Structured Tandem Repeats in Proteins. J. Struct. Biol. 215, 108023.

[22] Erdozain, S., E. Barrionuevo, L. Ripoll, P. Mier and M.A. Andrade-Navarro. 2023. Protein repeats evolve and emerge in giant viruses. J. Struct. Biol. 215, 107962.