Protein motif and composition analysis

Conservation of tyrosine phosphorylation sites

To evaluate human protein kinases in a high-throughput fashion, we used yeast, which lacks a system for the phosphorylation of protein tyrosines, as an in vivo model system [1]. Expressing individual non-receptor tyrosine kinases (NRTKs) in yeast leads to the tyrosine phosphorylation of yeast proteins, which reproduces NRTK activity in human cells on the corresponding conserved orthologs. We used conservation analysis of putative tyrosine phosphorylation sites (contrasting orthologs in organisms with NRTKs versus those without) to evaluate and select these candidates. Network analysis shows that individual NRTKs phosphorylate proteins that interact with each other, suggesting that motifs for the recognition of protein phosphorylation sites play a lesser role than previously thought. We predict relations between NRTKs and more than 3500 human target proteins.
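As an illustration only, and not the pipeline used in [1], the contrast between orthologs from organisms with and without NRTKs could be sketched as follows for a single aligned position. The function name, the input layout and the toy species labels are assumptions made for this example.

```python
def tyrosine_conservation(column, has_nrtk):
    """Fraction of orthologs carrying Tyr at one aligned position.

    column   : dict mapping species -> residue at that position ('-' for a gap)
    has_nrtk : dict mapping species -> True if the organism has NRTKs
    Returns (fraction in organisms with NRTKs, fraction in organisms without).
    Gaps and other residues simply count as "not Tyr".
    """
    def frac_y(residues):
        return sum(r == "Y" for r in residues) / len(residues) if residues else 0.0

    with_kinase = [res for sp, res in column.items() if has_nrtk[sp]]
    without_kinase = [res for sp, res in column.items() if not has_nrtk[sp]]
    return frac_y(with_kinase), frac_y(without_kinase)


# Toy data: "toy_species" is a hypothetical organism without NRTKs.
column = {"human": "Y", "mouse": "Y", "yeast": "F", "toy_species": "Y"}
has_nrtk = {"human": True, "mouse": True, "yeast": False, "toy_species": False}
with_frac, without_frac = tyrosine_conservation(column, has_nrtk)
print(with_frac, without_frac)
```

A site whose Tyr is kept mostly in organisms with NRTKs gives a large difference between the two fractions; how such a difference is turned into a selection criterion is left open here.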

Comparison of amino acid composition between protein domains and linkers

While it is well known that amino acid composition in proteins varies across taxa, how these differences are distributed between protein domains and linkers has been less studied. We studied this using 38 proteomes [2]. As expected, we observed the use of polar residues in linkers and of hydrophobic residues in globular domains. Focusing on particular types of domains can be more insightful. For example, while Arg usage in DNA-binding domains is high, their surrounding linkers are enriched in Ser, which is often a target of phosphorylation in disordered regions. We created an R script to facilitate and visualize these analyses (RACCOON).
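The kind of comparison described above can be sketched in a few lines of Python. This is not RACCOON (which is an R script) but a minimal example under stated assumptions: domain boundaries are given as 0-based, half-open, non-overlapping intervals, and everything outside an annotated domain is treated as linker. All names are hypothetical.

```python
from collections import Counter


def split_domains_linkers(seq, domains):
    """Split a sequence into domain and linker segments.

    domains: list of (start, end) intervals, 0-based, half-open, non-overlapping.
    Assumption: any residue outside an annotated domain is a linker residue.
    """
    domain_segments, linker_segments = [], []
    pos = 0
    for start, end in sorted(domains):
        if start > pos:
            linker_segments.append(seq[pos:start])
        domain_segments.append(seq[start:end])
        pos = end
    if pos < len(seq):
        linker_segments.append(seq[pos:])
    return domain_segments, linker_segments


def composition(segments):
    """Relative amino acid frequencies over a list of segments."""
    counts = Counter("".join(segments))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


# Toy sequence with one annotated domain at positions 2..5.
dom, link = split_domains_linkers("SSRRKRSS", [(2, 6)])
print(composition(dom))
print(composition(link))
```

Pooling the segments over a whole proteome before calling `composition` would give proteome-wide domain and linker profiles that can then be compared residue by residue.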

Low complexity regions

We have reviewed the definition of low complexity regions (LCRs) in sequences [3]. We focused particularly on the methods to measure LCRs, and on the relation of LCRs to compositional bias, repeats, disorder and structure. For example, while compositional bias tends to be associated with LCRs and disorder, short repeats, which are compositionally biased, can induce structure. We use a series of examples to illustrate these overlapping aspects. In this respect, we developed a method and an associated web tool to visualize the "repeatability" of a protein sequence (RES, [4]). We defined this for a window as the fraction of residues that do not need to be changed for the sequence to be composed of perfect repeats. Application of the method to complete proteomes suggests intriguing differences between species regarding the repeatability of their sequences, e.g. a depletion in repeats of odd lengths in Saccharomyces cerevisiae and a few other species, and a large number of repeats of length 2 and 7 in Danio rerio and Arabidopsis thaliana, respectively. We further developed the concept of the "low complexity triangle" as a means to represent the balance between repeatability and bias of LCRs in proteins [5]. We implemented the possibility to generate such a visualization in a web tool (LCT).
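One possible reading of the repeatability definition, for a fixed repeat-unit length, is sketched below. It is an illustration and not the RES implementation, which may differ in how windows and unit lengths are handled: for each position within the repeat unit we keep the most frequent residue and count how many residues in the window already agree with it.

```python
from collections import Counter


def repeatability(window, unit_len):
    """Fraction of residues in `window` that need no change for the window
    to consist of perfect repeats of length `unit_len`.

    Assumption: the window may end with a truncated copy of the repeat unit.
    """
    if not window or unit_len < 1:
        raise ValueError("window must be non-empty and unit_len >= 1")
    kept = 0
    for phase in range(min(unit_len, len(window))):
        # Residues that occupy the same position within successive repeat units.
        column = window[phase::unit_len]
        kept += Counter(column).most_common(1)[0][1]
    return kept / len(window)


print(repeatability("QAQAQAQA", 2))  # perfect repeat of "QA" -> 1.0
print(repeatability("QAQAQTQA", 2))  # one mismatch in 8 residues -> 0.875
```

Note that the value is trivially 1.0 when the unit length is equal to or larger than the window, so unit lengths are only informative when they are well below the window size.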

We contributed to the creation of the first meta web-server for the analysis of low complexity regions in protein sequences (PlaToLoCo, [6]).

Intrinsically disordered regions

Intrinsically disordered regions (IDRs) in protein sequences are defined by their lack of structure. They might correspond to compositionally biased regions (CBRs), but this is not necessarily so. We provided a protocol to study them from an evolutionary point of view in two steps [7]. First, we profiled IDR content for more than 10,695 available complete proteomes. We observed a certain increase of disorder content with the number of cell types within vertebrates, but also that particular single-celled species have a remarkably high content of disorder, suggesting functional and environmental reasons for the presence of disorder in proteins. In a second, focused approach, we compared disorder across orthologs of human, mouse, fish, fly and yeast to find signatures of disorder evolution. We found a tendency for subunits of complexes to share gains of IDRs, stressing the function of disorder in the modulation of protein interactions. Compositional analysis of these orthologs suggested that widely conserved CBRs are E- and K-rich, while Q- and A-rich regions are more species specific.
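The first step, profiling disorder content per proteome, could be summarized along the following lines. This sketch assumes that per-residue disorder scores between 0 and 1 have already been produced by some predictor, and that a residue counts as disordered at a score of 0.5 or higher; both the input layout and the threshold are assumptions for illustration, not details taken from [7].

```python
def proteome_disorder_content(per_protein_scores, threshold=0.5):
    """Fraction of residues in a proteome predicted as disordered.

    per_protein_scores: dict mapping protein id -> list of per-residue scores.
    threshold: score at or above which a residue is called disordered (assumed).
    """
    total = sum(len(scores) for scores in per_protein_scores.values())
    if total == 0:
        return 0.0
    disordered = sum(
        sum(score >= threshold for score in scores)
        for scores in per_protein_scores.values()
    )
    return disordered / total


# Toy proteome with two hypothetical proteins.
toy = {"p1": [0.9, 0.8, 0.1, 0.2], "p2": [0.6, 0.4]}
print(proteome_disorder_content(toy))  # 3 of 6 residues -> 0.5
```

Computing this value for each proteome gives one number per species, which can then be compared against properties such as the number of cell types.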



[1] Corwin, T., J. Woodsmith, F. Apelt, J.F. Fontaine, D. Meierhofer, J. Helmuth, A. Grossmann, M.A. Andrade-Navarro, B.A. Ballif and U. Stelzl. 2017. Defining human tyrosine kinase phosphorylation networks using yeast as an in vivo model substrate. Cell Systems. 5, 128-139.

[2] Brüne, D., M.A. Andrade-Navarro and P. Mier. 2018. Proteome-wide comparison between the amino acid composition of protein domains and linkers. BMC Research Notes. 11, 117.

[3] Mier, P., L. Paladin, S. Tamana, S. Petrosian, B. Hajdu-Soltész, A. Urbanek, A. Gruca, D. Plewczynski, M. Grynberg, P. Bernadó, Z. Gáspári, C. Ouzounis, V.J. Promponas, A.V. Kajava, J.M. Hancock, S. Tosatto, Z. Dosztányi and M.A. Andrade-Navarro. 2020. Disentangling the complexity of low complexity proteins. Briefings in Bioinformatics. 21, 458-472.

[4] Kamel, M., P. Mier, A. Tari and M.A. Andrade-Navarro. 2019. Repeatability in protein sequences. Journal of Structural Biology. 208, 86-91. [RES]

[5] Mier, P. and M.A. Andrade-Navarro. 2020. Assessing the repeatability of protein sequences via the low complexity triangle. PLoS One. In press. [LCT]

[6] Jarnot, P., J. Ziemska-Legiecka, L. Dobson, M. Merski, P. Mier, M.A. Andrade-Navarro, J.M. Hancock, Z. Dosztányi, L. Paladin, M. Necci, D. Piovesan, S.C.E. Tosatto, V.J. Promponas, M. Grynberg and A. Gruca. 2020. PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins. Nucleic Acids Research. 48, W77-W84. [PlaToLoCo]

[7] Kastano, K., G. Erdős, P. Mier, G. Alanis-Lobato, V.J. Promponas, Z. Dosztányi and M.A. Andrade-Navarro. 2020. Evolutionary study of disorder in protein sequences. Biomolecules. 10, 1413.