by anonymous master student
As part of our routine work we often obtain results from the analysis of biological data that need to be assessed for significance with a statistical test. For example, say that we observe that a percentage of human genes related to a particular function, e.g. kinases, have a DNA motif in their promoters, e.g. GATTACA. If this DNA motif occurs very often by chance in the genome then the result may not be significant: it could be due to chance. We might test this by randomizing the genome and counting again how many times the DNA motif GATTACA appears in front of kinase-encoding genes: this simulates our null hypothesis. If we repeat this many times, the fraction of times we observe that the motif is equally or more frequent becomes a P-value: an estimate of the probability of seeing a result at least this strong by chance alone. That's why we need to simulate the null hypothesis in the computer and wait. And if the P-value is > 0.005 we are unhappy.
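The randomization test described above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: instead of randomizing a whole genome, it shuffles the letters of each promoter sequence (which preserves nucleotide composition), and the function names, the default of 1000 iterations and the fixed seed are all hypothetical choices made for the example.

```python
import random


def count_with_motif(promoters, motif):
    """Number of promoter sequences that contain the motif at least once."""
    return sum(motif in seq for seq in promoters)


def shuffle_sequence(seq, rng):
    """Return the letters of seq in random order (same composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def empirical_p_value(promoters, motif, n_iter=1000, seed=0):
    """Estimate a P-value by simulating the null hypothesis.

    Assumption for this sketch: the null model is "each promoter is a
    random permutation of its own nucleotides".
    Returns (observed count, fraction of randomizations in which the
    motif is equally or more frequent than observed).
    """
    rng = random.Random(seed)
    observed = count_with_motif(promoters, motif)
    as_extreme = 0
    for _ in range(n_iter):
        shuffled = [shuffle_sequence(seq, rng) for seq in promoters]
        if count_with_motif(shuffled, motif) >= observed:
            as_extreme += 1
    return observed, as_extreme / n_iter


if __name__ == "__main__":
    # Toy promoters invented for the example, not real data.
    toy_promoters = ["CCGATTACAGG", "TTGATTACATT", "ACGTACGTACG"]
    observed, p_value = empirical_p_value(toy_promoters, "GATTACA")
    print(f"promoters with motif: {observed}, empirical P-value: {p_value}")
```

The number of iterations sets the resolution of the estimate: with 1000 randomizations the smallest non-zero P-value that can be reported is 0.001, which is one reason these simulations take a while to run.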
In this diagram each dot or plus sign represents a sample of cells: hES for human embryonic stem cells, iPS for induced pluripotent stem cells, Fib for fibroblasts. The position of the samples in the graph indicates in a simplified way how similar they are regarding their gene expression: close points have similar gene expression. Here, we would see that the iPS samples have gene expression similar to hES, which would fit a reprogramming goal. However, the three samples in the box with the arrow would be the result of fibroblasts that reprogrammed only halfway to iPS. Further efforts would be needed to understand what went wrong and what can be optimized. We are looking for genes and chemicals allowing us to reprogram cells from one type to another in a test-tube. You can read more about this in a related publication:
Cheng, X., H. Yoshida, D. Raoofi, S. Saleh, H. Alborzinia, F. Wenke, A. Göhring, S. Reuter, N. Mah, H. Fuchs, M.A. Andrade-Navarro, J. Adjaye, S. Gul, J. Utikal, R. Mrowka and S. Wölfl. 2015. Ethyl 2-((4-chlorophenyl)amino)thiazole-4-carboxylate and derivatives are potent inducers of Oct3/4. J. Med. Chem. 58, 5742-5750. PubMed: 26143659
This is a draft of a flow chart for the selection of proteins to train a Support Vector Machine. The appropriate choice of examples for training and benchmarking is a very important initial step in the development of prediction methods in computational biology. In this case, we were working on a method to detect protein subcellular location based on amino acid composition and exposure. Therefore, we decided to start with all human protein sequences associated with genes in the Entrez NCBI database, then to take one per human gene for which we can obtain functional (GO) information, including subcellular location, and then to take those whose structure is deposited in the PDB database, provided that the solved structure had more than 150 amino acids. You can read the rest of the story in the related publication:
The diagram displays simplified representations of protein domains with tandem repeats that were detected using a neural network. Sequence similarity analyses of these domains suggest a complex story of duplications and rearrangements of protein fragments. These mechanisms are used to increase protein variability and function. Read more about this in the related publication:
Palidwor, G.A., S. Shcherbinin, M.R. Huska, T. Rasko, U. Stelzl, A. Arumughan, R. Foulle, P. Porras, L. Sanchez-Pulido, E.E. Wanker, M.A. Andrade-Navarro. 2009. Detection of alpha-rod repeats using a neural network and application to huntingtin. PLoS Comp. Biol. 5, e1000304. [ARD] PubMed: 19282972