## Proteins in bioinformatics

- Introduction to my classes
- Differences between Bioinformatics and Computational Biology
- AI from Deep Blue to AlphaFold
- The protein folding problem

---

### Protein coding genes in genomic sequences and annotation of proteins

- DNA, proteins: length scale
- Protein length
- Protein mass
- Bioinformatics works with life, humans and computers
- Storing information:
  - strings of bits (computers)
  - DNA sequences
  - protein sequences
  - applications in biology:
    - a DNA storage device
- FASTA file format, MultiFASTA file format
- UniProt
- UCSC Genome Browser
- BLAT
- Three simple tasks:
  - taxonomy ID / Lineage (NCBI/UniProt)
  - spike glycoprotein (SARS-CoV-2): see it in its reference genome
  - compare the sequence (SARS and SARS-CoV-2) of a protein
- How to (Python):
  - protein length / proteome length
  - amino acid mass
  - proteome mass / average protein mass

---

### Homology

- Homology intro: homology, similarity, analogy (examples)
- Protein: relation between sequence, structure and function
- The Rost curve
- Gene duplication, speciation
- Paralogy (gene duplication), orthology (speciation)
- Convergent evolution
- Horizontal gene transfer
- BLAST, BLASTP
- First protein sequence against a database search (Russell F. Doolittle)

---

### Multiple sequence alignment (MSA)

- Introduction to alignment
- Classical alignment representation: `*` (identity), `:` (conserved), `.` (semi-conserved), blank ` ` (mismatch)
- An application of sequence alignment to epigenetics: bisulphite sequencing
- Introduction to sequence alignment
- Pairwise alignment vs. multiple sequence alignment: why is it good to perform an MSA?
- Pairwise alignment: a brute-force algorithm
- Basic metrics: Hamming, Levenshtein
- Scoring schemes, substitution matrices (Dayhoff PAM, BLOSUM)
- Gaps (indels)
- Classification of algorithms: global alignments (Needleman–Wunsch) vs. local alignments (Smith–Waterman)
- Multiple alignment implies pairwise alignment
- Pairwise alignment does not imply multiple alignment
- Pairwise alignment: different methods and applications
- Dynamic programming: from the Manhattan graph problem to sequence alignment
- Combinatorial optimization:
  - Seven bridges of Königsberg (graphs)
  - Travelling Salesman Problem (TSP)
  - Manhattan tourist problem
- Dynamic programming computational complexity and the necessity of using heuristics. What is a heuristic?
- Word methods for pairwise alignment: BLAST, FASTA
- Sequence alignment profiles, sequence logos
- Can we align a profile against a sequence? Can we align a profile against a profile?
- Multiple sequence alignment: the algorithm

---

### Phylogeny

- Taxonomy, evolution and phylogeny: definitions
- Homology, similarity, clustering and hierarchical clustering
- Phylogenetic tree (data structure)
- Phylogenetic trees (unrooted and rooted, examples)
- Earlier phylogenetic studies from the molecular evolution perspective:
  - phylogenetic study on fishes (Reichert and Brown 1909)
  - divergence between chimp and humans
  - the 3rd kingdom of life: Archaea (Woese)
  - modern mitochondrial DNA in forensic studies
- What is a tree? (from graphs)
- Building phylogenetic trees: phenetic and cladistic approach
- UPGMA method
- Maximum parsimony method
- Varying rates of evolution (lizards, crocodiles and birds)
- TimeTree: the timescale of life

---

### Protein structure

- Central dogma of Molecular Biology
- Amino acids, polypeptides
- Chemical properties of the amino acids, classification
- Protein structure (different chemical interactions, dissociation energies)
- Protein structure: primary, secondary, tertiary, quaternary structures
- Protein folding problem:
  1. Levinthal's paradox
  2. Anfinsen's dogma
- Relation between protein sequence, structure and function

---

### PDB (Protein Data Bank)

- Intro to the Data Bank
- Evolution of the database
- Current state, different experimental method contributions
- Computed structural models (AlphaFold)
- X-ray crystallography
- Nuclear magnetic resonance
- Electron microscopy
- RCSB PDB: what can I retrieve from a PDB ID?
- PDB text file description
- Graphical tools for visualisation: Chimera
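
---

The "How to (Python)" items in the protein coding genes section (protein length, proteome length, amino acid mass, proteome mass, average protein mass) could be approached along the following lines. This is only a minimal sketch, not the course's own solution. The function names, the toy MultiFASTA text and its headers are hypothetical, and the mass calculation assumes average residue masses in daltons plus one water molecule per chain, with no post-translational modifications.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # added once per chain for the free N- and C-termini


def parse_multifasta(text):
    """Return {header: sequence} from FASTA or MultiFASTA text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def protein_mass(sequence):
    """Average mass of one chain in Da; non-standard letters raise KeyError."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def proteome_summary(records):
    """Length and mass totals for a non-empty {header: sequence} mapping."""
    lengths = [len(seq) for seq in records.values()]
    masses = [protein_mass(seq) for seq in records.values()]
    return {
        "proteins": len(records),
        "proteome_length": sum(lengths),
        "proteome_mass": sum(masses),
        "average_protein_mass": sum(masses) / len(masses),
    }


# Hypothetical two-entry MultiFASTA text, used only for illustration.
EXAMPLE = """>toy_protein_1 hypothetical example
MKT
AG
>toy_protein_2 hypothetical example
GG
"""

records = parse_multifasta(EXAMPLE)
summary = proteome_summary(records)
print(summary)
```

For a real proteome, the same functions could be fed the text of a MultiFASTA file downloaded from UniProt. Sequences containing ambiguous letters such as `X` or `B` would need to be filtered or handled separately, since this sketch deliberately fails on them rather than guessing a mass.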