Protein motif and composition analysis

Conservation of tyrosine phosphorylation sites

In an approach to evaluate human protein kinases in a high-throughput fashion, we used yeast, which lacks a system for the phosphorylation of protein tyrosines, as an in vivo model system [1]. Expressing individual non-receptor tyrosine kinases (NRTKs) in yeast leads to the tyrosine phosphorylation of yeast proteins. This reproduces NRTK activity in human cells on the corresponding conserved orthologs. We used conservation analysis of putative sites of tyrosine phosphorylation (contrasting orthologs in organisms with NRTKs versus those without) to evaluate and select candidate sites. Network analysis shows that individual NRTKs phosphorylate proteins that interact with each other, suggesting that motifs for the recognition of protein phosphorylation sites play a lesser role than previously thought. We predict relations between NRTKs and more than 3500 human target proteins.

Comparison of amino acid composition between protein domains and linkers

While it is well known that amino acid composition in proteins varies across taxa, how these differences apply to protein domains and linkers has been less studied. We studied this using 38 proteomes [2]. As expected, we observed the usage of polar residues in linkers and of hydrophobic residues in globular domains. Focusing on particular types of domains can be more insightful. For example, while Arg usage in DNA-binding domains is high, their surrounding linkers are enriched in Ser, which is often a target of phosphorylation in disordered regions. We created an R script to facilitate and visualize these analyses (RACCOON).
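The domain versus linker comparison can be illustrated with a minimal Python sketch. This is not RACCOON (which is an R script); the function names and the coordinate convention (0-based, end-exclusive, non-overlapping domain ranges) are assumptions made for the example only.

```python
from collections import Counter


def composition(seq):
    """Return the fraction of each amino acid in seq (empty dict for an empty sequence)."""
    total = len(seq)
    if total == 0:
        return {}
    return {aa: n / total for aa, n in Counter(seq).items()}


def split_domains_linkers(seq, domains):
    """Split seq into its domain residues and its linker residues.

    domains: list of (start, end) pairs, 0-based and end-exclusive
    (a hypothetical convention chosen for this sketch).
    """
    in_domain = [False] * len(seq)
    for start, end in domains:
        for i in range(max(start, 0), min(end, len(seq))):
            in_domain[i] = True
    domain_seq = "".join(aa for aa, d in zip(seq, in_domain) if d)
    linker_seq = "".join(aa for aa, d in zip(seq, in_domain) if not d)
    return domain_seq, linker_seq


# Toy example: an Arg-rich "domain" followed by a Ser-rich "linker".
domain_seq, linker_seq = split_domains_linkers("RRKRSSSS", [(0, 4)])
domain_comp = composition(domain_seq)
linker_comp = composition(linker_seq)
```

Comparing `domain_comp` and `linker_comp` per residue, pooled over all proteins of a proteome, would give the kind of contrast described above.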

Intrinsically disordered regions

Intrinsically disordered regions (IDRs) in protein sequences are defined by their lack of structure. They might correspond to compositionally biased regions (CBRs) but this is not necessarily so. We provided a protocol to study them from an evolutionary point of view in two steps [3]. First, we profiled IDR content for more than 10,695 available complete proteomes. We observed a certain increase of disorder content with the number of cell types within vertebrates, but also that particular single-celled species have a remarkably high content of disorder, suggesting functional and environmental reasons for the presence of disorder in proteins. In a second, focused approach, we compared disorder across orthologs of human, mouse, fish, fly and yeast to find signatures of disorder evolution. We found a tendency in subunits from complexes to share gain of IDRs, stressing the function of disorder in the modulation of protein interactions. Compositional analysis of these orthologs suggested that widely conserved CBRs are E- and K-rich while Q- and A-rich regions are more species-specific.

Low complexity regions

We have reviewed the definition of low complexity regions (LCRs) in sequences [4]. We focused particularly on the methods to measure LCRs, and on the relation of LCRs to compositional bias, repeats, disorder and structure. For example, while compositional bias tends to be associated with LCRs and disorder, short repeats, which are compositionally biased, can induce structure. We used a series of examples to illustrate these overlapping aspects. In this respect, we developed a method and an associated web tool to provide a visualization of the "repeatability" of a protein sequence (RES, [5]). We defined this for a window as the fraction of residues that do not need to be changed for the sequence to be composed of perfect repeats. Application of the method to complete proteomes suggests intriguing differences between species regarding the repeatability of their sequences, e.g. a depletion in repeats of odd lengths in Saccharomyces cerevisiae and a few other species, and a large number of repeats of length 2 and 7 in Danio rerio and Arabidopsis thaliana, respectively. We further developed the concept of the "low complexity triangle" as a means to represent the balance between repeatability and bias of LCRs in proteins [6]. We implemented the possibility to generate such a visualization in a web tool (LCT).
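The window-based definition of repeatability can be sketched in Python as follows. This is one possible reading of the definition, not the RES implementation: it assumes the repeat unit length (`period`) is given, groups window positions by their offset within the unit, and keeps the most frequent residue at each offset. The function name and parameters are hypothetical.

```python
from collections import Counter


def repeatability(window, period):
    """Fraction of residues in `window` that can stay unchanged if the window
    is to become a perfect repeat with unit length `period`.

    For each offset within the repeat unit, the most frequent residue is kept
    and every other residue at that offset would have to be changed.
    """
    if not 1 <= period <= len(window):
        raise ValueError("period must be between 1 and the window length")
    kept = 0
    for offset in range(period):
        column = window[offset::period]  # residues sharing this offset
        kept += Counter(column).most_common(1)[0][1]
    return kept / len(window)


perfect = repeatability("ABABAB", 2)    # already a perfect repeat of unit "AB"
imperfect = repeatability("ABABAA", 2)  # one residue would need to change
```

Under this reading, a period equal to the window length trivially gives 1.0, so in practice only periods well below the window length are informative.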

We contributed to the creation of the first meta web-server for the analysis of low complexity regions in protein sequences (PlaToLoCo, [7]).

We applied an evolutionary approach to study the involvement of LCRs in protein interactions using Huntingtin (HTT) [8]. We collected the LCRs from HTT interactors (coiled coils, IDRs, homorepeats (polyX) and CBRs) and analysed their conservation across orthologs and their co-occurrence to identify possible LCR-mediated modes of protein interaction. We studied the only currently known structure of HTT, which is in complex with HAP40, from this perspective and inferred that the proteins RASA1, SYN2 and KAT2B may bind to HTT using LCRs as HAP40 does.

LCRs are less frequent and less conserved in prokaryotes than in eukaryotes, and their function in those organisms is under debate. Since bacteria can have very fast rates of evolution, we wanted to examine how LCRs behave over this short evolutionary range to detect possible LCR conservation indicative of function. For this we compared two types of amino acid enriched LCRs (compositionally biased regions and homorepeats) across orthologs of bacterial strains [9]. We found that Q-rich CBRs are the most conserved and that A-rich CBRs and polyA are extremely variable. The abundance of LCRs is higher in extracellular and outer-membrane proteins. In the proteins of pathogens, LCRs appear to be more conserved in extracellular proteins, but polyX in particular seems to be very variable in outer-membrane proteins.
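As an illustration of one of the two LCR types compared here, homorepeats can be located with a short Python sketch. The definition used (an uninterrupted run of at least `min_len` identical residues) and the threshold of 4 are assumptions for the example; the cited study may use a different definition, for instance one that tolerates mismatches.

```python
import re


def find_homorepeats(seq, min_len=4):
    """Return (residue, start, end) for each run of at least `min_len`
    identical residues; coordinates are 0-based and end-exclusive."""
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # (.) captures one residue, \1{n,} requires n or more further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(seq)]


hits = find_homorepeats("MAQQQQQQAAAK", min_len=4)
```

Running the same function on aligned orthologs and comparing the presence and length of the reported runs is one simple way to approach conservation of polyX.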

Avoided motifs

We detected and categorized the shortest strings of amino acids absent from large protein datasets such as the human proteome, considering proteins in particular subcellular locations, and in all the proteins from bacteria [10]. Our hypothesis is that some of these strings (which are four amino acids long) represent functional motifs that are negatively selected in particular contexts. As an example, we show that the sequence "WEWW", avoided in human and generally in eukaryotic organisms, corresponds to part of the signature of bacterial thiol-activated cytolysins, which are secreted by pathogenic bacteria, suggesting that there is a defense mechanism targeting this motif.
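The search for the shortest absent strings can be sketched in Python as an exhaustive enumeration by increasing length. This is a minimal illustration, not the method of [10]; the function names, the restriction to the 20 standard amino acids and the `max_k` cutoff are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def observed_kmers(sequences, k):
    """Collect every substring of length k present in any of the sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def shortest_absent_strings(sequences, alphabet=AMINO_ACIDS, max_k=5):
    """Return (k, absent) for the smallest k at which some string over
    `alphabet` does not occur in the dataset; (None, []) if none up to max_k."""
    for k in range(1, max_k + 1):
        seen = observed_kmers(sequences, k)
        absent = [
            "".join(p) for p in product(alphabet, repeat=k)
            if "".join(p) not in seen
        ]
        if absent:
            return k, absent
    return None, []


# Toy example with a two-letter alphabet so the result is easy to check by hand.
k, absent = shortest_absent_strings(["ABBA"], alphabet="AB")
```

With the full amino acid alphabet there are 160,000 possible strings at length 4, so this direct enumeration stays cheap at the lengths discussed above.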


[1] Corwin, T., J. Woodsmith, F. Apelt, J.F. Fontaine, D. Meierhofer, J. Helmuth, A. Grossmann, M.A. Andrade-Navarro, B.A. Ballif and U. Stelzl. 2017. Defining human tyrosine kinase phosphorylation networks using yeast as an in vivo model substrate. Cell Systems. 5, 128-139.

[2] Brüne, D., M.A. Andrade-Navarro and P. Mier. 2018. Proteome-wide comparison between the amino acid composition of domains and linkers. BMC Research Notes. 11, 117.

[3] Kastano, K., G. Erdős, P. Mier, G. Alanis-Lobato, V.J. Promponas, Z. Dosztányi and M.A. Andrade-Navarro. 2020. Evolutionary study of disorder in protein sequences. Biomolecules. 10, 1413.

[4] Mier, P., L. Paladin, S. Tamana, S. Petrosian, B. Hajdu-Soltész, A. Urbanek, A. Gruca, D. Plewczynski, M. Grynberg, P. Bernadó, Z. Gáspári, C. Ouzounis, V.J. Promponas, A.V. Kajava, J.M. Hancock, S. Tosatto, Z. Dosztányi and M.A. Andrade-Navarro. 2020. Disentangling the complexity of low complexity proteins. Briefings in Bioinformatics. 21, 458-472.

[5] Kamel, M., P. Mier, A. Tari and M.A. Andrade-Navarro. 2019. Repeatability in protein sequences. Journal of Structural Biology. 208, 86-91. [RES]

[6] Mier, P. and M.A. Andrade-Navarro. 2020. Assessing the repeatability of protein sequences via the low complexity triangle. PLoS One. 15, e0239154. [LCT]

[7] Jarnot, P., J. Ziemska-Legiecka, L. Dobson, M. Merski, P. Mier, M.A. Andrade-Navarro, J.M. Hancock, Z. Dosztányi, L. Paladin, M. Necci, D. Piovesan, S.C.E. Tosatto, V.J. Promponas, M. Grynberg and A. Gruca. 2020. PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins. Nucleic Acids Research. 48, W77-W84. [PlaToLoCo]

[8] Kastano, K., P. Mier and M.A. Andrade-Navarro. 2021. The role of low complexity regions in protein interaction modes: an illustration in Huntingtin. International Journal of Molecular Sciences. 22, 1727.

[9] Mier, P. and M.A. Andrade-Navarro. 2021. The conservation of low complexity regions in bacterial proteins depends on the pathogenicity of the strain and subcellular location of the protein. Genes. 12, 451.

[10] Mier, P. and M.A. Andrade-Navarro. 2021. Avoided motifs: short amino acid strings missing from protein datasets. Biological Chemistry. 402, 945-951.