Protein motif and composition analysis | Computational Biology and Data Mining

Conservation of tyrosine phosphorylation sites

In an approach to evaluate human protein kinases in a high-throughput fashion, we used yeast, which lacks a system for the phosphorylation of protein tyrosines, as an in vivo model system [1]. Expressing individual non-receptor tyrosine kinases (NRTKs) in yeast leads to the tyrosine phosphorylation of yeast proteins. This reproduces NRTK activity in human cells on the corresponding conserved orthologs. We used conservation analysis of putative sites of tyrosine phosphorylation (contrasting orthologs in organisms with NRTKs versus those without) to evaluate and select candidate sites. Network analysis shows that individual NRTKs phosphorylate proteins that interact with each other, suggesting that motifs for the recognition of protein phosphorylation sites play a lesser role than previously thought. We predict relations between NRTKs and more than 3500 human target proteins.
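
The contrast between orthologs from organisms with and without NRTKs can be illustrated with a minimal Python sketch. It is not the scoring used in [1]: the function names, the species labels and the simple difference of tyrosine fractions are assumptions made only for illustration, and the input is assumed to be one aligned column (one residue per species, "-" for a gap).

```python
def tyrosine_fraction(residues):
    """Fraction of non-gap residues that are tyrosine (Y)."""
    observed = [r for r in residues if r != "-"]  # ignore alignment gaps
    if not observed:
        return 0.0
    return sum(r == "Y" for r in observed) / len(observed)


def conservation_contrast(column, species_with_nrtk):
    """Tyrosine fraction in species with NRTKs minus the fraction in species without.

    column: dict mapping a species label to the residue at one aligned position.
    species_with_nrtk: set of species labels assumed to encode NRTKs.
    """
    with_nrtk = [res for sp, res in column.items() if sp in species_with_nrtk]
    without_nrtk = [res for sp, res in column.items() if sp not in species_with_nrtk]
    return tyrosine_fraction(with_nrtk) - tyrosine_fraction(without_nrtk)


# Hypothetical aligned column for one putative phosphorylation site.
column = {"sp_with_1": "Y", "sp_with_2": "Y", "sp_without_1": "F", "sp_without_2": "-"}
print(conservation_contrast(column, {"sp_with_1", "sp_with_2"}))  # 1.0
```

A value close to 1 would mark a tyrosine kept mostly in organisms with NRTKs, while a value close to 0 would mark a site that is equally conserved (or equally absent) in both groups.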

Comparison of amino acid composition between protein domains and linkers

While it is well known that amino acid composition in proteins varies across taxa, how these differences are distributed between protein domains and linkers has been less studied. We studied this using 38 proteomes [2]. As expected, we observed the usage of polar residues in linkers and of hydrophobic residues in globular domains. Focusing on particular types of domains can be more insightful. For example, while Arg usage in DNA-binding domains is high, their surrounding linkers are enriched in Ser, which is often a target of phosphorylation in disordered regions. We created an R script to facilitate and visualize these analyses (RACCOON).
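
The kind of comparison described above can be sketched in a few lines of Python. This is not the RACCOON script: the function names and the toy sequence are hypothetical, and domain coordinates are assumed to be given as non-overlapping, 0-based, half-open (start, end) intervals, with everything outside a domain treated as linker.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_domains_linkers(sequence, domains):
    """Split a sequence into domain segments and the linker segments around them."""
    domain_segments, linker_segments = [], []
    position = 0
    for start, end in sorted(domains):
        if start > position:
            linker_segments.append(sequence[position:start])
        domain_segments.append(sequence[start:end])
        position = end
    if position < len(sequence):
        linker_segments.append(sequence[position:])
    return domain_segments, linker_segments


def composition(segments):
    """Fraction of each of the 20 standard amino acids over all segments."""
    counts = Counter(aa for seg in segments for aa in seg if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Toy example: an Arg-rich "domain" flanked by Ser-rich "linkers".
domains, linkers = split_domains_linkers("SSSSRRKRRSSSS", [(4, 9)])
print(composition(domains)["R"], composition(linkers)["S"])  # 0.8 1.0
```

Pooling the segments of many proteins (for example, all DNA-binding domains of a proteome) before calling `composition` would give the per-category frequencies that are compared in the text.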

Intrinsically disordered regions

Intrinsically disordered regions (IDRs) in protein sequences are defined by their lack of structure. They might correspond to compositionally biased regions (CBRs), but this is not necessarily so. We provided a protocol to study them from an evolutionary point of view in two steps [3]. First, we profiled IDR content for more than 10,695 available complete proteomes. We observed a certain increase of disorder content with the number of cell types within vertebrates, but also that particular single-celled species have a remarkably high content of disorder, suggesting functional and environmental reasons for the presence of disorder in proteins. In a second, focused approach, we compared disorder across orthologs of human, mouse, fish, fly and yeast to find signatures of disorder evolution. We found a tendency in subunits from complexes to share gain of IDRs, stressing the function of disorder in the modulation of protein interactions. Compositional analysis of these orthologs suggested that widely conserved CBRs are E- and K-rich, while Q- and A-rich regions are more species-specific.

We studied the types of CBRs specifically found overlapping or within IDRs as a means to categorize IDRs. CBRs of residues R, H, N, D, P and G tend to be fully included in long IDRs, while CBRs of residues S, E, K and T overlap shorter IDRs and CBRs of Q and A overlap the termini of short IDRs [4]. More frequent overlaps of CBRs in IDRs occur in proteins involved in liquid-liquid phase separation (LLPS). Enrichment of motifs in networks of interacting proteins suggests that CBRs might result from the repetition of short linear motifs rich in S or P within IDRs, and from E-rich polar regions supporting protein interactions.

We contributed to the 2023 version of DisProt, a database of IDRs annotated with experimental information [5]. The annotations use standard terms for the description of experiments assessing disorder and Gene Ontology terms for functions of IDRs. A strong correlation of the annotated regions with the pLDDT score from the structure predictor AlphaFold supports the use of low values of this score for IDR prediction.
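
As an illustration of how low pLDDT values could be turned into candidate IDRs, the following Python sketch reports runs of consecutive low-scoring residues. It is not the procedure used in [5]: the function name, the cutoff of 50 and the minimum region length of 10 residues are assumptions chosen only for this example.

```python
def low_plddt_regions(plddt, threshold=50.0, min_length=10):
    """Return (start, end) intervals, 0-based and half-open, of consecutive
    residues whose pLDDT is below `threshold` and that span at least
    `min_length` residues. Both parameters are illustrative assumptions."""
    regions = []
    start = None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i  # a low-confidence run begins here
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the protein.
    if start is not None and len(plddt) - start >= min_length:
        regions.append((start, len(plddt)))
    return regions


# Toy per-residue scores: a long low-confidence stretch and a short one.
scores = [90.0] * 5 + [30.0] * 12 + [85.0] * 5 + [40.0] * 3
print(low_plddt_regions(scores))  # [(5, 17)]
```

The short low-scoring stretch at the end is discarded by the length filter; whether such short stretches should count as disorder is a choice that depends on the application.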

Low complexity regions

We have reviewed the definition of low complexity regions (LCRs) in sequences [6]. We focused particularly on the methods to measure LCRs, and on the relation of LCRs to composition bias, repeats, disorder and structure. For example, while compositional bias tends to be associated with LCRs and disorder, short repeats, which are compositionally biased, can induce structure. We used a series of examples to illustrate these overlapping aspects. In this respect, we developed a method and an associated web tool to provide a visualization of the "repeatability" of a protein sequence (RES, [7]). We defined this for a window as the fraction of residues that do not need to be changed for the sequence to be composed of perfect repeats. Application of the method to complete proteomes suggests intriguing differences between species regarding the repeatability of their sequences, e.g. a depletion in repeats of odd lengths in Saccharomyces cerevisiae and a few other species, and a large number of repeats of length 2 and 7 in Danio rerio and Arabidopsis thaliana, respectively. We further developed the concept of the "low complexity triangle" as a means to represent the balance between repeatability and bias of LCRs in proteins [8]. We implemented the possibility to generate such a visualization in a web tool (LCT).
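
One possible reading of the repeatability definition above is sketched below in Python. It is not the RES implementation: the function names are hypothetical, and the sketch assumes that, for a given repeat unit length, the residues to keep are the most common residue at each position of the unit.

```python
from collections import Counter


def repeatability(window, unit_length):
    """Fraction of residues in `window` that can stay unchanged if the window
    is turned into perfect repeats of a unit of `unit_length` residues."""
    if not window or unit_length < 1:
        raise ValueError("window must be non-empty and unit_length must be >= 1")
    kept = 0
    for offset in range(unit_length):
        # Residues that would occupy the same position of the repeated unit.
        column = window[offset::unit_length]
        if column:
            kept += Counter(column).most_common(1)[0][1]
    return kept / len(window)


def repeatability_profile(sequence, window_size, unit_length):
    """Repeatability of every window of `window_size` residues along a sequence."""
    return [
        repeatability(sequence[i:i + window_size], unit_length)
        for i in range(len(sequence) - window_size + 1)
    ]


print(repeatability("QAQAQAQA", 2))  # 1.0
print(repeatability("QAQAQTQA", 2))  # 0.875
```

Under this reading, a unit as long as the window trivially gives a value of 1, so unit lengths are only informative when they are clearly shorter than the window.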

We contributed to the creation of the first meta web-server for the analysis of low complexity regions in protein sequences (PlaToLoCo, [9]).

We applied an evolutionary approach to study the involvement of LCRs in protein interactions using Huntingtin (HTT) [10]. We collected the LCRs from HTT interactors: coiled coils, IDRs, homorepeats (polyX) and CBRs, and analysed their conservation across orthologs and their co-occurrence to identify possible LCR-mediated modes of protein interaction. We studied the only currently known structure of HTT, which is in complex with HAP40, from this perspective and inferred that proteins RASA1, SYN2 and KAT2B may bind to HTT using LCRs as HAP40 does.

LCRs are less frequent and less conserved in prokaryotes compared to eukaryotes, and their function in those organisms is under debate. Since bacteria can have very fast rates of evolution, we wanted to examine how LCRs behave in this short evolutionary range to detect possible LCR conservation indicative of function. For this, we compared two types of amino acid enriched LCRs (compositionally biased regions and homorepeats) across orthologs of bacterial strains [11]. We found that Q-rich CBRs are the most conserved and that A-rich CBRs and polyA are extremely variable. The abundance of LCRs is higher in extracellular and outer-membrane proteins. Regarding conservation of LCRs in the proteins of pathogens, these appear to be more conserved in extracellular proteins, but polyX in particular seem to be very variable in outer-membrane proteins.

While homorepeats (polyX) have been characterized and defined extensively, regions composed of two types of residues (polyXY) are another type of LCR that also exists in many proteins but has not been characterized and defined. We provided definitions and assessment of the usage of polyXY in proteins, considering separately cases we name direpeats polyXY (e.g. "XYXYXY") and joined polyXY (short polyX followed by short polyY, e.g. "XXXYYY") as opposed to shuffled polyXY (anything else) [12]. We examined the presence and types of polyXY in 20,343 reference proteomes. We found that polyXY in Eukaryota are mainly located in IDRs.
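
The three polyXY categories named above can be illustrated with a small Python sketch. It only mirrors the examples given in the text ("XYXYXY", "XXXYYY", anything else) and does not reproduce the detection criteria of [12]: the function name and the minimum length of four residues are assumptions, and the input is assumed to be a region already known to consist of exactly two residue types.

```python
from itertools import groupby


def classify_polyxy(region, min_length=4):
    """Label a two-residue region as 'direpeat', 'joined' or 'shuffled'.

    min_length is an illustrative assumption: shorter regions such as 'XY'
    would fit more than one category.
    """
    if len(set(region)) != 2:
        raise ValueError("expected a region made of exactly two residue types")
    if len(region) < min_length:
        raise ValueError("region is too short to classify")
    # Lengths of the runs of identical residues, e.g. 'QQQAAA' -> [3, 3].
    runs = [len(list(group)) for _, group in groupby(region)]
    if all(length == 1 for length in runs):
        return "direpeat"  # strict alternation, e.g. XYXYXY
    if len(runs) == 2:
        return "joined"    # one polyX followed by one polyY, e.g. XXXYYY
    return "shuffled"      # anything else


for region in ("QAQAQA", "QQQAAA", "QQAQAA"):
    print(region, classify_polyxy(region))
```

Working on run lengths keeps the three cases mutually exclusive: a region cannot both alternate strictly and consist of two long runs once it has at least four residues.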

To investigate the origin and evolution of polyXY we studied their codon usage in human proteins [13]. We observed a higher codon bias within regions encoding polyXY and that the similarity between the codons for the X and the Y in polyXY is higher than for codons for those amino acids in the background of all proteins. These biases are higher in polyXY composed of repeats (joined and direpeat). We interpret these results as supporting the emergence of polyXY from mutations of single-codon polyX, which can further evolve by expansion and contraction of nucleotide repeats by replication slippage (observed for single-codon polyX), which in turn can be disrupted by conversion to shuffled polyXY.

IDRs are predicted as disordered but they often contain LCRs, which we hypothesized could provide them with structural propensity. To test this, we obtained the experimental structures of human proteins (or homologs) and observed that LCRs (formed by one or two amino acids, polyX and polyXY, respectively) within IDRs are more often represented in structures than other parts of IDRs [14]. Glu and Gly were the amino acids most often observed in these LCRs and polyEK was found to induce alpha-helical conformation, with coils as the most frequent structure, although beta-strands were also observed. We exploited the wide set of protein structure predictions by AlphaFold to expand this analysis to more human sequences [15]. In addition to confirming the helical propensity of polyE and polyEK, this analysis also identified polyQ and polyER. Generally, these LCRs had charged residues and functional enrichment analysis of proteins containing them indicated an association with functions requiring interactions with DNA and RNA. This is in agreement with evidence of LCRs having mechanisms to provide structure to IDRs upon interactions. Extending the analysis of the context of these LCRs, we found their accumulation at the ends of IDRs, more often following an alpha-helix starting before the IDR which is then extended by the polyXY into the IDR [16]. Using molecular dynamics simulations, we evaluated the sequence specificity and dynamic nature of these helical extensions. We propose LCRs such as polyXY as motifs that can expand helical conformation, possibly in situations triggered by intra- or inter-molecular interactions (fold-upon-binding).

Avoided motifs

We detected and categorized the shortest strings of amino acids absent from large protein datasets such as the human proteome, considering proteins in particular subcellular locations, and in all the proteins from bacteria [17]. Our hypothesis is that some of these strings (which are four amino acids long) represent functional motifs that are negatively selected in particular contexts. As an example, we show that the sequence "WEWW", avoided in human and generally in eukaryotic organisms, corresponds to part of the signature of bacterial thiol-activated cytolysins, which are secreted by pathogenic bacteria, suggesting that there is a defense mechanism targeting this motif.
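
The search for the shortest absent strings can be sketched in Python as an exhaustive enumeration. This is not the code behind [17]: the function names and the `max_k` limit are assumptions, and the sketch simply tries increasing lengths until some string over the alphabet is missing from the dataset.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def observed_kmers(sequences, k):
    """Set of all substrings of length k found in the sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def shortest_absent_strings(sequences, max_k=5, alphabet=AMINO_ACIDS):
    """Return (k, absent) for the smallest k at which some string over
    `alphabet` does not occur in `sequences`; (None, []) if none up to max_k."""
    for k in range(1, max_k + 1):
        seen = observed_kmers(sequences, k)
        absent = [
            "".join(letters)
            for letters in product(alphabet, repeat=k)
            if "".join(letters) not in seen
        ]
        if absent:
            return k, absent
    return None, []


# Toy example with a two-letter alphabet to keep the output small.
print(shortest_absent_strings(["AACCA"], alphabet="AC"))
# (3, ['AAA', 'ACA', 'CAA', 'CAC', 'CCC'])
```

With the 20 standard amino acids there are 160,000 possible strings of length four, so this brute-force enumeration remains practical at the lengths discussed in the text.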

References

[1] Corwin, T., J. Woodsmith, F. Apelt, J.F. Fontaine, D. Meierhofer, J. Helmuth, A. Grossmann, M.A. Andrade-Navarro, B.A. Ballif and U. Stelzl. 2017. Defining human tyrosine kinase phosphorylation networks using yeast as an in vivo model substrate. Cell Systems. 5, 128-139.

[2] Brüne, D., M.A. Andrade-Navarro and P. Mier. 2018. Proteome-wide comparison between the amino acid composition of domains and linkers. BMC Research Notes. 11, 117.

[3] Kastano, K., G. Erdős, P. Mier, G. Alanis-Lobato, V.J. Promponas, Z. Dosztányi and M.A. Andrade-Navarro. 2020. Evolutionary study of disorder in protein sequences. Biomolecules. 10, 1413.

[4] Kastano, K., P. Mier, Z. Dosztányi, V.J. Promponas and M.A. Andrade-Navarro. 2022. Functional tuning of intrinsically disordered regions in human proteins by composition bias. Biomolecules. 12, 1486.

[5] Aspromonte, M.C., M.V. Nugnes, F. Quaglia, A. Bouharoua, DisProt Consortium*, S.C.E. Tosatto and D. Piovesan. 2023. DisProt in 2024: improving function annotation of intrinsically disordered proteins. Nucl. Acids Res. gkad928. *DisProt Consortium = V. Sagris, V. Promponas, A. Chasapi, E. Fichó, G.E. Balatti, G. Parisi, M. González Buitrón, G. Erdős, M. Pajkos, Z. Dosztányi, L. Dobson, A. Del Conte, D. Clementel, E. Salladini, E. Leonardi, F. Kordevani, H. Ghafouri, L.G. Tenorio Ku, A.M. Monzon, C. Ferrari, Z. Kálmán, J. Nilsson, J. Santos, C. Pintado-Grima, S. Ventura, V. Ács, R. Pancsa, M.G. Kulik, M.A. Andrade-Navarro, P.J.B. Pereira, S. Longhi, P. Le Mercier, J. Bergier, P. Tompa, T. Lazar.

[6] Mier, P., L. Paladin, S. Tamana, S. Petrosian, B. Hajdu-Soltész, A. Urbanek, A. Gruca, D. Plewczynski, M. Grynberg, P. Bernadó, Z. Gáspári, C. Ouzounis, V.J. Promponas, A.V. Kajava, J.M. Hancock, S. Tosatto, Z. Dosztányi and M.A. Andrade-Navarro. 2020. Disentangling the complexity of low complexity proteins. Briefings in Bioinformatics. 21, 458-472.

[7] Kamel, M., P. Mier, A. Tari and M.A. Andrade-Navarro. 2019. Repeatability in protein sequences. Journal of Structural Biology. 208, 86-91. [RES]

[8] Mier, P. and M.A. Andrade-Navarro. 2020. Assessing the low complexity of protein sequences via the low complexity triangle. PLoS One. 15, e0239154. [LCT]

[9] Jarnot, P., J. Ziemska-Legiecka, L. Dobson, M. Merski, P. Mier, M.A. Andrade-Navarro, J.M. Hancock, Z. Dosztányi, L. Paladin, M. Necci, D. Piovesan, S.C.E. Tosatto, V.J. Promponas, M. Grynberg and A. Gruca. 2020. PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins. Nucleic Acids Research. 48, W77-W84. [PlaToLoCo]

[10] Kastano, K., P. Mier and M.A. Andrade-Navarro. 2021. The role of low complexity regions in protein interaction modes: an illustration in Huntingtin. Int. J. Mol. Sci. 22, 1727.

[11] Mier, P. and M.A. Andrade-Navarro. 2021. The conservation of low complexity regions in bacterial proteins depends on the pathogenicity of the strain and subcellular location of the protein. Genes. 12, 451.

[12] Mier, P. and M.A. Andrade-Navarro. 2022. Regions with two amino acids in protein sequences: a step forward from homorepeats into the low complexity landscape. Comput. Struct. Biotechnol. J. 20, 5516-5523.

[13] Mier, P. and M.A. Andrade-Navarro. 2023. The nucleotide landscape of polyXY regions. Comput. Struct. Biotechnol. J. 21, 5408-5412.

[14] Gonçalves-Kulik, M., P. Mier, K. Kastano, J. Cortés, P. Bernadó, F. Schmid and M.A. Andrade-Navarro. 2022. Low complexity induces structure in protein regions predicted as intrinsically disordered. Biomolecules. 12, 1098.

[15] Gonçalves-Kulik, M., F. Schmid and M.A. Andrade-Navarro. 2023. One step closer to the understanding of the relationship IDR-LCR-structure. Genes. 14, 1711.

[16] Gonçalves-Kulik, M., L.A. Baptista, F. Schmid and M.A. Andrade-Navarro. 2025. Assessing the helical stability of polyXYs at the boundaries of intrinsically disordered regions with MD simulations. Comput. Struct. Biotech. Reports. 2, 100054.

[17] Mier, P. and M.A. Andrade-Navarro. 2021. Avoided Motifs: short amino acid strings missing from protein datasets. Biol. Chem. 402, 945-951.