Identification of homorepeats | Computational Biology and Data Mining

Homorepeats (or polyX) in protein sequences are stretches of consecutive repetitions of a single amino acid residue. There is no consensus on the minimum number of repeats that is relevant. Examining conservation in multiple sequence alignments pinpoints conserved homorepeats. To facilitate the detection and study of homorepeats, we created a web server that annotates homorepeats in multiple sequence alignments ([dAPE]; [1]).
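The definition above can be illustrated with a minimal Python sketch that reports runs of a single residue in one sequence. It is not the dAPE implementation; the function name `find_homorepeats` and the default threshold of 4 are assumptions chosen for illustration, since there is no agreed minimum length.

```python
import re

def find_homorepeats(sequence, min_length=4):
    """Return (residue, start, end) for every run of one residue that is
    at least min_length long. start is 0-based and end is exclusive.
    The default min_length is an arbitrary, illustrative choice."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # (.) captures one residue; \1{n,} requires n or more further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(sequence)]

# Toy sequence with a run of six Q and a run of four A.
print(find_homorepeats("MAQQQQQQPLAAAAK", min_length=4))
# → [('Q', 2, 8), ('A', 10, 14)]
```

Raising `min_length` to 5 in this toy example keeps only the polyQ run, which shows how strongly the reported set depends on the chosen threshold.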

We studied the differential use of homorepeats across taxa to evaluate their evolution and function [2]. Our results suggest that homorepeats have a biological function in the creation or modulation of protein-protein interactions in a context-dependent manner, with a tendency to occur outside protein domains and at the protein termini. We developed a web tool to query the set of homorepeats of 23,150 completely sequenced proteomes ([Xsurvey]; [3]). The tool provides a graphical display to compare the relative use of polyX types across selected species.

Many homorepeats appear relatively quickly during evolution by insertion at a given location. This is evidenced by some orthologs having homorepeats of variable length or lacking them altogether. We used this property to analyse the structural context of sites where polyQ was inserted during evolution [4]. We observed that polyQ has a bias to be inserted in disordered regions, with some tendency to occur C-terminal to regions with alpha-helical content. This supports its role as a C-terminal modulator of coiled-coil interactions, which also have alpha-helical structure.

Regarding the codon usage of polyQ, we observed that, in primates, the use of either CAG (the codon associated with the extension of the repeat by slippage) or CAA is strongly dependent on the length of the repeat. Repeats of length 4 to 8 have higher CAG usage. We also found that human polyQ stretches known for pathological expansion have less CAA than similarly long polyQ. We take these results as an indication that CAG repeats of length 4 trigger expansion that can be functionally selected, and that, in order to avoid uncontrolled expansion, CAG codons are replaced by CAA in relatively long polyQ; failure to do so results in unstable polyQ that is prone to expansion and causes disease [5]. Along these lines, we observed that the threshold used to define a polyQ results in the selection of homorepeats with different properties [6]. For example, in human proteins, the sequence context near polyQ displays a leucine residue at position -1 from the polyQ, which is more frequent if the polyQ are selected with a small window. In contrast, prolines in positions +1 to +5 are found more often if the polyQ are selected with a longer window. These properties differ in other species. Codon usage (CAG, CAA or non-polyQ codons) is also very different, with CAG above 50% for vertebrates and D. melanogaster, CAA around 50% for C. elegans, S. cerevisiae and A. thaliana, and CAA above 75% for D. discoideum. These results indicate the need to scan the polyQ of an organism with various thresholds prior to polyQ analyses. We provide the [sQanner] web tool to scan files with protein sequences for polyQ.
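The CAG versus CAA comparison described above can be sketched as a simple codon count over the part of a coding sequence that encodes a polyQ. This is only an illustration, not the pipeline used in [5] or [6]: the function name `glutamine_codon_usage` is hypothetical, and the sketch assumes an in-frame coding sequence in which codon i encodes protein position i.

```python
def glutamine_codon_usage(cds, start, end):
    """Count CAG, CAA and other codons for protein positions start..end
    (0-based, end exclusive). Assumes cds is in frame, so that codon i
    of cds encodes residue i of the protein."""
    codons = [cds[3 * i:3 * i + 3].upper() for i in range(start, end)]
    counts = {"CAG": codons.count("CAG"), "CAA": codons.count("CAA")}
    # Anything that is not a glutamine codon (for example an interruption).
    counts["other"] = len(codons) - counts["CAG"] - counts["CAA"]
    return counts

# Toy coding sequence: M, then six Q (five CAG and one CAA), then P.
cds = "ATG" + "CAG" * 4 + "CAA" + "CAG" + "CCG"
usage = glutamine_codon_usage(cds, start=1, end=7)
print(usage)  # → {'CAG': 5, 'CAA': 1, 'other': 0}
print(usage["CAG"] / (usage["CAG"] + usage["CAA"]))  # fraction of CAG among Q codons
```

Grouping such counts by repeat length, and repeating the scan with several polyQ definitions, would be one way to explore the length and threshold dependence discussed above.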

We observed that many properties of polyQ depend on their evolutionary stability, that is, how variable the regions appear to be when compared among homologs from related species [7]. We define three types of polyQ: inserted (if the region appears to be missing in homologs), mutated (if the region is present in homologs but sometimes has fewer Qs), or otherwise stable. We found that stable polyQ occur in the structural context that we had found to promote their structural function, and that the proteins containing them have more known protein interactions. We also found that inserted polyQ in Sauria and Mammalia have much higher CAG codon usage (evidence of expansion by CAG slippage) than this type of polyQ in the other taxa analysed. We conclude that the evolutionary analysis of polyQ is necessary to evaluate polyQ function and requires taxonomy-specific studies.
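The three categories above can be sketched as a small decision rule applied to the alignment columns that cover a polyQ. The exact criteria used in [7] are not given here, so the rules below are assumptions made for illustration: a homolog is taken to lack the region when its aligned segment consists only of gaps, and the function name `classify_polyq` is hypothetical.

```python
def classify_polyq(reference_segment, homolog_segments, gap="-"):
    """Classify a polyQ as 'inserted', 'mutated' or 'stable'.

    reference_segment: the polyQ region of the protein of interest.
    homolog_segments: the aligned segments of homologs over the same columns.
    Assumed rules (illustrative only):
      inserted: at least one homolog has only gaps in the region;
      mutated:  the region is present everywhere, but some homolog has fewer Qs;
      stable:   otherwise.
    """
    reference_q = reference_segment.count("Q")
    if any(all(residue == gap for residue in seg) for seg in homolog_segments):
        return "inserted"
    if any(seg.count("Q") < reference_q for seg in homolog_segments):
        return "mutated"
    return "stable"

print(classify_polyq("QQQQQ", ["QQQQQ", "-----"]))  # → inserted
print(classify_polyq("QQQQQ", ["QQQQQ", "QQPQQ"]))  # → mutated
print(classify_polyq("QQQQQ", ["QQQQQ", "QQQQQ"]))  # → stable
```

A real analysis would also need to decide which homologs and which taxonomic range to compare, which is where the taxonomy-specific aspect mentioned above comes in.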

We supported a collaboration to examine the conformational variability of secondary structure near the polyQ of huntingtin [8]. Our study used NMR with site-specific labeling to probe the structure of individual residues of the polyQ. We observed the extension of the N-terminal alpha helix into the polyQ region as a gradient of frequency that decreases towards the C-terminus of the polyQ and that is favoured by hydrogen bonds between the helix and the most N-terminal glutamines of the polyQ. The polyP region that follows the polyQ in huntingtin promotes beta-strand structure in the C-terminal part of the polyQ. By sequence analysis, we observed that longer polyQ in human proteins have these surrounding structure-inducing elements more often than shorter polyQ.

We reviewed the current knowledge on polyQ, focusing mostly on protein aspects and discussing its definition, sequence and structure context, and functional relevance, with an emphasis on the idea that it is a motif that can modulate protein-protein interactions [9]. In this respect, disease aspects of polyQ, related to aggregates that may form if a polyQ is pathologically expanded, should not distract researchers from the functional aspects of polyQ. While polyQ is generally abundant in eukaryotes, its highest frequency in a few unicellular species (well above that in human) is intriguing and hints at species-specific functions.

To support the detection of homorepeats in protein sequences, we provide a method to scan sequences for them that allows the user to modify the minimum length and the presence of other "guest" amino acids in the homorepeat. The method can be run via a web tool or using the computer code provided ([PolyX2]; [10]).
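A scan that tolerates "guest" residues can be sketched with a sliding window that requires a minimum count of the target residue. The code below is an illustrative sketch, not the PolyX2 code: the function name `scan_polyx`, its parameter names and the default values (8 residues in a window of 10) are assumptions.

```python
def scan_polyx(sequence, residue, min_count=8, window=10):
    """Return (start, end) regions (0-based, end exclusive) where a window
    of the given size holds at least min_count copies of residue.
    Overlapping or touching windows are merged, and each merged region is
    trimmed so that it starts and ends with the target residue.
    Assumes 1 <= min_count <= window."""
    regions = []
    for i in range(len(sequence) - window + 1):
        if sequence[i:i + window].count(residue) >= min_count:
            if regions and i <= regions[-1][1]:
                regions[-1][1] = i + window      # extend the previous region
            else:
                regions.append([i, i + window])  # open a new region
    trimmed = []
    for start, end in regions:
        while sequence[start] != residue:
            start += 1
        while sequence[end - 1] != residue:
            end -= 1
        trimmed.append((start, end))
    return trimmed

seq = "MKQQQQAQQQQPLS"
print(scan_polyx(seq, "Q", min_count=8, window=10))   # → [(2, 11)]
print(scan_polyx(seq, "Q", min_count=10, window=10))  # → []
```

In this toy sequence the region QQQQAQQQQ is reported when one guest residue is tolerated, and nothing is reported when a pure stretch of 10 Q is required.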

We studied the sequence context of polyA in 18 eukaryotic species (including human) to understand how it influences the structure and function of polyA [11]. We found that glycine and proline are the most frequent amino acids within polyA and in surrounding positions; we hypothesize that these amino acids have a role in reducing the propensity of long polyA to aggregate. We found that short polyA can evolve from alanine substitutions in alpha-helices, but evolution by insertion is also observed for longer polyA. PolyA is frequently observed in protein N-termini: right after the initial methionine in mitochondrial transit peptides, or downstream in signal peptides.

PolyQ and polyA are known to be involved in protein-protein interactions (PPIs), but it is unlikely that all of these regions do so. We hypothesized that polyQ and polyA sequence properties could be predictive of their participation in PPIs. To test this, we trained machine learning (ML) methods using 157 polyQ and 745 polyA regions situated at interacting surfaces of human proteins (from a background of 2085 polyQ and 7263 polyA regions) [12]. Homorepeat length and the composition of the amino acids surrounding the homorepeat were predictive features, with polyA prediction benefiting from a longer range (ten amino acids on each side) than polyQ (six amino acids on each side). Considering the presence of coiled coils improved the predictions for polyA but not for polyQ. Short homorepeats are predicted to bind more often, and having a proline or a glycine in the third position after the polyQ and polyA, respectively, was also indicative of interaction.
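The kind of features described above (homorepeat length and the composition of flanking residues) could be derived along the following lines. This is a sketch only and not the feature set of [12]: the function name `context_features`, the feature names and the encoding as raw counts are assumptions; the `flank` parameter corresponds to the range on each side (six for polyQ, ten for polyA in the text).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def context_features(sequence, start, end, flank=6):
    """Build a feature dictionary for the homorepeat at sequence[start:end]:
    its length plus the count of each amino acid within `flank` residues
    on either side (flanks are shorter near the protein termini)."""
    left = sequence[max(0, start - flank):start]
    right = sequence[end:end + flank]
    counts = Counter(left + right)
    features = {"length": end - start}
    for aa in AMINO_ACIDS:
        features["flank_" + aa] = counts.get(aa, 0)
    return features

# Toy example: a polyQ of length 5 at positions 4..9 with flanks of 3 residues.
feats = context_features("MKLAQQQQQPPGSTV", start=4, end=9, flank=3)
print(feats["length"], feats["flank_P"], feats["flank_G"])  # → 5 2 1
```

Dictionaries of this form could then be turned into a numeric matrix and passed to a classifier of choice; position-specific features, such as the residue at the third position after the repeat, would need to be added separately.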

We studied the variation of homorepeats in the human population using 125,748 exomes and 15,708 whole genomes available from the Genome Aggregation Database (gnomAD) [13]. Homorepeats have higher rates of variation than non-repeated sequences. Regarding polyQ, shorter ones have more variation than longer ones, which mostly receive deletions. The ones more conserved within primates are also less variable in humans. Much of this variation is likely to have no impact.

References

[1] Mier, P. and M.A. Andrade-Navarro. 2017. dAPE: a web server to detect homorepeats and follow their evolution. Bioinformatics. 33, 1221-1223. [dAPE]

[2] Mier, P., G. Alanis-Lobato and M.A. Andrade-Navarro. 2017. Context characterization of amino acid homorepeats using evolution, position and order. Proteins. 85, 709-719.

[3] Andrade-Navarro, M.A. and P. Mier. 2025. Xsurvey: web tool to query the set of homorepeats of all reference proteomes. IEEE Trans. Comput. Biol. Bioinfo. In press. [Xsurvey]

[4] Totzeck, F., M.A. Andrade-Navarro and P. Mier. 2017. The protein structure context of polyQ regions. PLoS One. 12, e0170801.

[5] Mier, P. and M.A. Andrade-Navarro. 2018. Glutamine codon usage and polyQ evolution in primates depend on the Q stretch length. Genome Biology and Evolution. 10, 816-825.

[6] Mier, P., C. Elena-Real, A. Urbanek, P. Bernadó and M.A. Andrade-Navarro. 2020. The importance of definitions in the study of polyQ regions: a tale of thresholds, impurities and sequence context. Comput. Struct. Biotechnol. Journal. 18, 306-313. [sQanner]

[7] Mier, P. and M.A. Andrade-Navarro. 2020. The features of polyglutamine regions depend on their evolutionary stability. BMC Evol. Biol. 20, 59.

[8] Urbanek, A., M. Popovic, A. Morató, A. Estaña, C.A. Elena-Real, P. Mier, A. Fournet, F. Allemand, S. Delbecq, M.A. Andrade-Navarro, J. Cortés, N. Sibille and P. Bernadó. 2020. Flanking regions determine the structure of the poly-glutamine homo-repeat in huntingtin through mechanisms common amongst glutamine-rich human proteins. Structure. 28, 733-746.e5.

[9] Mier, P. and M.A. Andrade-Navarro. 2021. Between interactions and aggregates: the polyQ balance. Genome Biology and Evolution. 13, evab246.

[10] Mier, P. and M.A. Andrade-Navarro. 2022. PolyX2: fast detection of homorepeats in large protein datasets. Genes. 13, 758. [PolyX2]

[11] Mier, P., C. Elena-Real, J. Cortés, P. Bernadó and M.A. Andrade-Navarro. 2022. The sequence context in poly-Alanine regions: Structure, function and conservation. Bioinformatics. 38, 4851-4858.

[12] Mier, P. and M.A. Andrade-Navarro. 2024. Predicting the involvement of polyQ and polyA in protein-protein interactions by their amino acid context. Heliyon. 10, e37861.

[13] Mier, P., M.A. Andrade-Navarro and E. Morett. 2024. Homorepeat variability within the human population. NAR Genom. Bioinform. 6, lqae053.