Identification of protein repeats

HEAT repeats

We found the first homology of the Huntington's disease protein to other protein sequences [1]. This protein contains a repeat of around 40 amino acids which, at the time, had already been described for the alpha subunit of protein phosphatase 2A. We found and characterised this repeat in a number of eukaryotic cytoplasmic proteins, mainly involved in cytoplasmic transport processes and most of them known to be part of protein complexes.

We developed a method [2] for the identification of short protein repeats (between 20 and 40 amino acids long). These repeats are usually very divergent and their recognition is difficult even with a good profile of the repeat. We observed that the scores of optimal and sub-optimal non-overlapping alignments of a repeat profile against a large database of randomized sequences follow Extreme Value Distributions (EVDs). From the analysis of those EVDs we can associate E-values to multiple non-overlapping hits of a repeat profile against a query sequence. We tested the method for eleven repeat families in the whole SwissProt database and in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens proteins. We could detect new unrecognised repeats and unify some repeat families. The method is available as the web server REP. An update of the server introduced the possibility to analyse proteins in multiple sequence alignments [3], which helps to add support for weak hits by comparison to orthologs. Using such comparisons across multiple organisms, we were able to assess the evolutionary trends of structural repeats in eukarya.
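The idea of turning alignment scores into E-values through an EVD can be illustrated with a minimal sketch. It assumes a Gumbel (type I) distribution fitted by the method of moments to the scores obtained for one hit rank against randomized sequences; the function names, the fitting procedure and the example scores are illustrative assumptions and not the procedure implemented in REP. In practice one such fit would be needed for each rank of non-overlapping hit (best, second best, and so on).

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel (type I EVD) to a list of scores.

    Returns (mu, beta): the location and scale parameters.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def tail_probability(score, mu, beta):
    """P(S >= score) under the fitted Gumbel distribution."""
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the far tail
    return -math.expm1(-math.exp(-(score - mu) / beta))


def e_value(score, mu, beta, n_sequences):
    """Expected number of random sequences scoring at least `score`."""
    return n_sequences * tail_probability(score, mu, beta)


if __name__ == "__main__":
    # Hypothetical scores of the best hit of a profile against shuffled sequences
    random_scores = [8.1, 9.4, 7.7, 10.2, 8.8, 9.9, 7.2, 11.5, 8.3, 9.0]
    mu, beta = fit_gumbel(random_scores)
    print(e_value(15.0, mu, beta, n_sequences=100000))
```

A maximum-likelihood fit would usually be preferred over the method of moments for real data; the moment estimator is used here only because it needs nothing beyond the standard library.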

The previous work showed the difficulty of classifying ARM and HEAT repeats (which occur in at least 1 in 500 eukaryotic protein sequences). They are similar in sequence and structure but we could not account for both of them with a single profile. We have reviewed these repeats [4] correlating sequence similarity between repeats to functional and structural properties. Several profiles were built that improved their detection. They can be used for scanning protein sequences through the REP server.

We developed a neural network based method [ARD] to detect repeats like HEAT, Armadillo, and PBS, which form similar structures composed of alpha-helices (which we termed alpha-rods) [5]. This method allowed the detection of novel instances of this structure, for example in human proteins STAG1-3, SERAC1, and PSMD1-2 & 5. Application of the method to human huntingtin and comparison to orthologs allowed us to delimit three alpha-rods in huntingtin, whose intra-molecular interactions we characterized experimentally using yeast two-hybrid and co-immunoprecipitation of protein fragments encoding the domains. We updated the method by allowing the detection of repeats with an internal linker of variable length ([ARD2]; [6]). Using ARD2 we evaluated novel structures and the phylogenetic distribution of these repeats, pointing to multiple likely events of independent emergence of these repeats in distant taxa and to their increased frequency in organisms of high cellular complexity such as eukarya in general, and cyanobacteria and planctomycetes within prokarya.

Homorepeats

Homorepeats (or polyX) in protein sequences are stretches of consecutive repetitions of a single amino acid residue. There is no consensus on the minimum number of repeated residues that is relevant. Examination of conservation in multiple sequence alignments pinpoints conserved homorepeats. To facilitate the study and detection of homorepeats we created a web server that annotates homorepeats in multiple sequence alignments ([dAPE]; [7]).
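As a minimal illustration of the definition above, the following sketch finds runs of a single residue in a protein sequence. Because there is no agreed minimum length, the threshold is left as a parameter; the function name and the default of 4 are assumptions for this example and do not describe how dAPE works.

```python
import re


def find_homorepeats(sequence, min_length=4):
    """Return (residue, start, end) for each run of one residue.

    Coordinates are 0-based and half-open. `min_length` is the minimum
    number of consecutive identical residues (an arbitrary choice).
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One residue followed by at least (min_length - 1) copies of itself
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), m.end())
            for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    for residue, start, end in find_homorepeats("MAQQQQQPLAAAAK"):
        print(residue, start, end)
```

Running the same scan with several values of `min_length` shows directly how the set of reported homorepeats depends on the chosen definition.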

We studied the differential use of homorepeats across taxa to evaluate their evolution and function [8]. Our results suggest that homorepeats have a biological function in the creation or modulation of protein-protein interactions in a context-dependent manner, with a tendency to occur outside protein domains and at the protein termini.

Many homorepeats appear relatively quickly by being inserted at a location during evolution. This is evidenced by some orthologs having homorepeats of variable length or lacking them altogether. We used this property to analyse the structural context of sites where polyQ was inserted during evolution [9]. We observed that polyQ has a bias to be inserted in disordered regions, with some tendency to occur C-terminal of regions with alpha-helical content. This supports its role as a C-terminal modulator of coiled coil interactions, which also have alpha-helical structure.

Regarding the codon usage of polyQ, we observed that, in primates, the use of either CAG (the codon associated with the extension of the repeat by slippage) or CAA is strongly dependent on the length of the repeat. Lengths between 4 and 8 have higher CAG usage. We also found that human polyQ stretches known for pathological expansion have less CAA than similarly long polyQ. We take these results as an indication that CAG repeats of length 4 trigger expansion that can be functionally selected, and that in order to avoid uncontrolled expansion, CAG codons are replaced by CAA in relatively large polyQ; failure to do so results in unstable polyQ prone to expansion and causing disease [10]. Along these lines, we observed that the threshold used to define a polyQ results in the selection of homorepeats with different properties [11]. For example, in human proteins, the sequence context near polyQ displays a leucine residue at position -1 from the polyQ, which is more frequent if the polyQ are selected with a small window. In contrast, prolines in positions +1 to +5 are found more often if the polyQ are selected with a longer window. These properties differ in other species. Codon usage (CAG, CAA or non-polyQ codons) also differs widely, with CAG above 50% for vertebrates and D. melanogaster, CAA around 50% for C. elegans, S. cerevisiae and A. thaliana, and CAA above 75% for D. discoideum. These results indicate the need to scan the polyQ of an organism with various thresholds prior to polyQ analyses. We provide the [sQanner] web tool to scan files with protein sequences for polyQ.
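The effect of the threshold can be explored with a small sketch that defines a polyQ as a region containing at least `min_q` glutamines within a window of `window` residues, and then counts the codons used in that region. The window definition, the merging and trimming rules, the function names and the example sequences are assumptions made for illustration; they are not the rules implemented in sQanner.

```python
def find_polyq(protein, min_q=8, window=10):
    """Return half-open (start, end) regions with >= min_q Q in a window.

    Overlapping windows are merged and each region is trimmed so that it
    starts and ends with Q. Sequences shorter than `window` give no hits.
    """
    hits = []
    for i in range(max(len(protein) - window + 1, 0)):
        if protein[i:i + window].count("Q") >= min_q:
            hits.append((i, i + window))
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], end)  # extend the previous region
        else:
            merged.append((start, end))
    regions = []
    for start, end in merged:
        while protein[start] != "Q":
            start += 1
        while protein[end - 1] != "Q":
            end -= 1
        regions.append((start, end))
    return regions


def glutamine_codon_usage(cds, region):
    """Count CAG, CAA and other codons for the residues in `region`.

    `cds` is the coding sequence in frame with the protein (codon i
    encodes residue i).
    """
    counts = {"CAG": 0, "CAA": 0, "other": 0}
    start, end = region
    for pos in range(start, end):
        codon = cds[3 * pos:3 * pos + 3].upper()
        counts[codon if codon in ("CAG", "CAA") else "other"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical protein and coding sequence
    protein = "MAL" + "Q" * 10 + "PPPA"
    cds = "ATGGCTCTG" + "CAG" * 7 + "CAA" * 3 + "CCTCCTCCTGCT"
    for min_q, window in [(4, 4), (8, 10)]:
        for region in find_polyq(protein, min_q, window):
            print(min_q, window, region, glutamine_codon_usage(cds, region))
```

Comparing the output for a strict definition (for example 4 Q in a window of 4) and a looser one (8 Q in a window of 10) over a whole proteome is one simple way to check how sensitive downstream statistics are to the definition.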

We observed that many of the properties of polyQ depend on their evolutionary stability, that is, how variable the regions appear to be when compared among homologs from related species [12]. We define three types of polyQ: inserted (if the region appears to be missing in homologs), mutated (if the region is present in homologs but sometimes has fewer Qs), or else stable. We found that stable polyQ occur in a structural context that we found to promote their structural function, and that the proteins having them have more known protein interactions. We also found that inserted polyQ in Sauria and Mammalia have much higher CAG codon usage (evidence of expansion by CAG-slippage) than this type of polyQ in other taxa analysed. We conclude that the evolutionary analysis of polyQ is necessary to evaluate polyQ function and requires taxonomy-specific studies.
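The three categories can be expressed as a simple decision rule over the aligned homolog segments that correspond to a polyQ region. The sketch below is only one possible reading of the definitions: it treats a region as missing when a homolog has nothing but gaps there, and as mutated when any homolog has fewer Qs than the reference. These criteria, the function name and the inputs are assumptions, and the cited study may use different ones.

```python
def classify_polyq(reference_region, homolog_regions, gap="-"):
    """Classify a polyQ as 'inserted', 'mutated' or 'stable'.

    `reference_region` is the polyQ segment of the query protein and
    `homolog_regions` are the aligned segments of its homologs, taken from
    the same columns of a multiple sequence alignment.
    """
    reference_q = reference_region.count("Q")
    homolog_q = []
    for region in homolog_regions:
        residues = region.replace(gap, "")
        if not residues:
            # Assumption: an all-gap segment means the region is missing
            return "inserted"
        homolog_q.append(residues.count("Q"))
    if any(q < reference_q for q in homolog_q):
        return "mutated"
    return "stable"


if __name__ == "__main__":
    print(classify_polyq("QQQQQQ", ["QQQQQQ", "------"]))
    print(classify_polyq("QQQQQQ", ["QQQQQQ", "QQHQQQ"]))
    print(classify_polyq("QQQQQQ", ["QQQQQQ", "QQQQQQ"]))
```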

We supported a collaboration to examine the conformational variability of secondary structure near the polyQ of huntingtin [13]. Our study used NMR with site-specific labelling to probe structure at individual residues of the polyQ. We observed the extension of the N-terminal alpha helix into the polyQ region as a gradient of frequency that decreases towards the C-terminus of the polyQ and that is favoured by hydrogen bonds of the helix to the most N-terminal glutamines of the polyQ. The polyP region that follows the polyQ in huntingtin promotes beta-strand structure in the C-terminal part of the polyQ. By sequence analysis we observed that longer polyQ in human proteins have these surrounding structure-inducing elements more often than shorter polyQ.

Other protein repeats

We have analysed a large protein family encoded in the genome of the plant Arabidopsis thaliana [14]. This family contains at least 48 proteins of as yet unknown function. We identified kelch repeats (implicated in protein-protein interactions) and an F-box domain (which targets proteins for degradation). The demonstration of the in vivo interaction of one of the members of the family with ASK1 (homolog of yeast Skp1p, a subunit of the SCF complex, which is involved in the ubiquitination of proteins prior to degradation by the 26S proteasome) via the F-box domain gave some insights into the functionality of this family.

Protein repeats that form structural repeating units that assemble together are quite common in many protein families and organisms. In an invited review we discuss the analysis of such repeats (including computational characterization) and how we think that repetition in protein sequence relates to evolution and function [15].

We identified a protein domain that appears with variable copy number in genes that are usually in the vicinity of a putative Fe3+ siderophore transporter [16]. We denoted this new domain NEAT for NEAr Transporter. Given that this domain seems to be specific to pathogenic bacteria, we suggest that it is a potential target for therapy against disease.

We participated in the characterization of the microtubule-associated protein AIR9, which in plants associates with the microtubules of the cortical cells during preprophase and when the plant cortex is contacted by the cell plate (a plant-specific cellular structure that forms during cell division) [17]. This protein has homologs in trypanosomatid parasites featuring a region with leucine-rich repeats and a number of protein tandem repeats. We termed these repeats A9, characterized them in plant, trypanosome, and bacterial sequences, and predicted them to adopt an immunoglobulin fold. We discussed the phylogeny of the AIR9 proteins with novel sequence evidence, as well as the special amino acid bias in the plant members of this family [18].

Periostin is a protein of the extracellular matrix. Despite its proven association with bone and heart development and with cancer, its function currently remains elusive. By sequence and database analyses we characterized the variability of Periostin's C-terminal region in terms of exon count, length, and alternative splicing, and the existence of a 13-amino acid repeat that we predict to form consecutive beta strands [19]. These findings are put in the context of functional and structural predictions.

In some situations, even after resolution of a protein's 3D structure, the definition of protein repeats may be under debate. For example, we clarified the presence of armadillo repeats in p115, a structural component of the Golgi apparatus that facilitates the tethering of transport vesicles inbound from the endoplasmic reticulum to the cis-Golgi membrane, following conflicting interpretations of its structure [20].

We characterized a region of 15 repeats of around 10 amino acids in the human mineralocorticoid receptor (MR) [21]. The MR is part of the renin angiotensin aldosterone system (RAAS). This protein has an inhibitory domain of unknown structure. We predict that the repeat region adopts a beta-solenoid structure and propose how this could be involved in phosphorylation-dependent inter- and intra-molecular interactions.

Using sequence similarity analyses, we identified a region of tandem repeats covering the C-terminal two thirds of the TPX2 protein [22]. TPX2, conserved in plants and chordata, is essential for spindle pole formation and controls the nucleation of microtubules on chromosomes during mitosis. At the time there was no structural information about this protein. Structure predictions suggest that the repeat region forms an alpha-helical solenoid, which we support with CD spectra that indicate high alpha-helical content in Xenopus (frog) and Arabidopsis (plant) TPX2.

Tandem repeats and structures

RepeatsDB is a database of protein tandem repeats of known structure, derived from protein 3D structures. The RepeatsDB 2.0 update included annotations from more than 5400 structures, 60% of them manually curated [23]. Repeats are classified in five categories according to their length and general arrangement, with subclasses that depend on secondary structure content. The RepeatsDB 3.0 update extended the classification scheme, separating three hierarchical levels based on structural similarity (class, topology and fold) from two lower levels based on sequence similarity: clan, for repeat motifs, and family, which considers homology [24].

While in theory encoding repeat units in separate exons would make them easier to duplicate, most tandem repeats are not encoded in this way. On the other hand, observing the cases where this happens can help define the structural units forming a tandem repeat and their phase. To approach these questions we characterized the correspondences between exon boundaries and structures for a number of tandem repeat proteins [25]. Different types of repeats behave differently, with some being more prone to have a high correspondence; encoding of two consecutive tandem repeats was a frequently observed feature. Such observations should help the detection and classification of tandem repeats.



[1] Andrade, M.A. and P. Bork. 1995. HEAT repeats in the Huntington's disease protein. Nature Genetics, 11, 115-116.

[2] Andrade, M.A., C.P. Ponting, T.J. Gibson and P. Bork. 2000. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. 298, 521-537. [REP]

[3] Kamel, M., K. Kastano, P. Mier and M.A. Andrade-Navarro. 2021. REP2: a web server to detect common tandem repeats in protein sequences. J. Mol. Biol. 433, 166895. [REP]

[4] Andrade, M.A., C. Petosa, S.I. O'Donoghue, C.W. Müller and P. Bork. 2001. Comparison of ARM and HEAT repeat proteins. J. Mol. Biol. 309, 1-18.

[5] Palidwor, G.A., S. Shcherbinin, M.R. Huska, T. Rasko, U. Stelzl, A. Arumughan, R. Foulle, P. Porras, L. Sanchez-Pulido, E.E. Wanker and M.A. Andrade-Navarro. 2009. Detection of alpha-rod repeats using a neural network and application to huntingtin. PLoS Comp. Biol. 5, e1000304. [ARD]

[6] Fournier, D., G.A. Palidwor, S. Shcherbinin, A. Szengel, M.H. Schaefer, C. Perez-Iratxeta and M.A. Andrade-Navarro. 2013. Functional and genomic analyses of alpha-solenoid proteins. PLoS One. 8, e79894. [ARD2].

[7] Mier, P. and M.A. Andrade-Navarro. 2017. dAPE: a web server to detect homorepeats and follow their evolution. Bioinformatics. 33, 1221-1223. [dAPE]

[8] Mier, P., G. Alanis-Lobato and M.A. Andrade-Navarro. 2017. Context characterization of amino acid homorepeats using evolution, position and order. Proteins. 85, 709-719.

[9] Totzeck, F., M.A. Andrade-Navarro and P. Mier. 2017. The protein structure context of polyQ regions. PLoS One. 12, e0170801.

[10] Mier, P. and M.A. Andrade. 2018. Glutamine codon usage and polyQ evolution in primates depend on the Q stretch length. Genome Biology and Evolution. 10, 816-825.

[11] Mier, P., C. Elena-Real, A. Urbanek, P. Bernadó and M.A. Andrade-Navarro. 2020. The importance of definitions in the study of polyQ regions: a tale of thresholds, impurities and sequence context. Comput. Struct. Biotechnol. Journal. 18, 306-313. [sQanner]

[12] Mier, P. and M.A. Andrade-Navarro. 2020. The features of polyglutamine regions depend on their evolutionary stability. BMC Evol. Biol. 20, 59.

[13] Urbanek, A., M. Popovic, A. Morató, A. Estaña, C.A. Elena-Real, P. Mier, A. Fournet, F. Allemand, S. Delbecq, M.A. Andrade-Navarro, J. Cortés, N. Sibille and P. Bernadó. 2020. Flanking regions determine the structure of the poly-glutamine homo-repeat in huntingtin through mechanisms common amongst glutamine-rich human proteins. Structure. 28, 733-746.e5.

[14] Andrade, M.A., M. González-Guzmán, R. Serrano and P.L. Rodríguez. 2001. A combination of the F-box motif and kelch repeats defines a large Arabidopsis family of F-box proteins. Plant Mol. Biol. 46, 603-614.

[15] Andrade, M.A., C. Perez-Iratxeta, and C.P. Ponting. 2001. Protein repeats: structures, functions and evolution. Journal of Structural Biology. 84, 445-451.

[16] Andrade, M.A., F.D. Ciccarelli, C. Perez-Iratxeta and P. Bork. 2002. NEAT: A domain duplicated in genes near the components of a putative Fe3+ siderophore transporter from Gram-positive pathogenic bacteria. Genome Biology. 3, research0047.1-0047.5.

[17] Buschmann, H., J. Chan, L. Sanchez-Pulido, M.A. Andrade-Navarro, J.H. Doonan and C.W. Lloyd. 2006. Microtubule associated AIR9 recognizes the cortical division site at preprophase and again when the cell plate inserts. Current Biology. 16, 1938-1943.

[18] Buschmann, H., L. Sanchez-Pulido, M.A. Andrade-Navarro and C.W. Lloyd. 2007. Homologues of Arabidopsis microtubule-associated AIR9 in trypanosomatid parasites: hints on evolution and function. Plant Signaling & Behavior. 2, 296-299.

[19] Hoersch, S. and M.A. Andrade-Navarro. 2010. Periostin shows increased evolutionary plasticity in its alternatively spliced region. BMC Evolutionary Biology. 10, 30.

[20] Striegl, H., M.A. Andrade-Navarro and U. Heinemann. 2010. Armadillo motifs involved in vesicular transport. PLoS ONE. 5, e8991.

[21] Vlassi, M., K. Brauns and M.A. Andrade-Navarro. 2013. Short tandem repeats in the inhibitory domain of the mineralocorticoid receptor: prediction of a β-solenoid structure. BMC Structural Biology. 13, 17.

[22] Sanchez-Pulido, L., L.H. Perez, S. Kuhn, I. Vernos and M.A. Andrade-Navarro. 2016. The C-terminal domain of TPX2 is made of alpha-helical tandem repeats. BMC Structural Biology. 16, 17.

[23] Paladin, L., L. Hirsh, D. Piovesan, M.A. Andrade-Navarro, A.V. Kajava and S.C.E. Tosatto. 2016. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures. Nucleic Acids Research. 45, 3613. [RepeatsDB]

[24] Paladin, L., M. Bevilacqua, S. Errigo, D. Piovesan, I. Mičetić, M. Necci, A.M. Monzon, M.L. Fabre, J.L. Lopez, J.F. Nilsson, J. Rios, P. Lorenzano Menna, M. Cabrera, M. Gonzalez Buitron, M. Gonçalves Kulik, S. Fernandez-Alberti, M.S. Fornasari, G. Parisi, A. Lagares, L. Hirsh, M.A. Andrade-Navarro, A.V. Kajava, and S.C.E. Tosatto. 2021. RepeatsDB in 2021: improved data and extended classification for protein tandem repeat structures. Nucleic Acids Research. 49, D452-D457. [RepeatsDB]

[25] Paladin, L., M. Necci, D. Piovesan, P. Mier, M.A. Andrade-Navarro and S.C.E. Tosatto. 2020. A novel approach to investigate the evolution of structured tandem repeat protein families by exon duplication. J. Struct. Biol. 212, 107608.