Make sure that R and RStudio are installed on your computer. Create a new R script in RStudio to copy the example source code from this tutorial and to write code for the exercises (when you open RStudio only the console is open; go to File > New File > R Script and work from there). Save the script in your home directory (or a sub-directory of your choice). To run the code on the current line (or the selected lines), click the Run button or press CTRL+ENTER. Results are displayed either in RStudio's console or in the Viewer panel (graphics). To read the documentation of a function (Help panel), place the cursor on the function's name in the script and press F1. Use this feature extensively, as many answers to possible questions are hidden there :)
The packages we'll need can be installed with the following code. Note the comment lines, which tell the R interpreter not to read the text as code: they will tell you a bit more about what you are installing! :)
Rlib = "~/Rlib"
.libPaths(Rlib)
# Now R should install into your custom lib and also read from there. If it does
# not work, specify the lib for installing and loading like this, and do not update
# the standard packages if possible (choose n when asked "Update a/s/n?").
# Otherwise it is possible to install the standard R libraries into your custom lib,
# but I hope that is not necessary.
install.packages("tidyverse", lib=Rlib)
library(tidyverse, lib=Rlib)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager", lib = "~/Rlib")
library(BiocManager, lib.loc = "~/Rlib")
BiocManager::install(c(
"limma",
"EnhancedVolcano",
"ComplexHeatmap",
"fgsea",
"sva",
"pheatmap",
"biomaRt",
"msigdbr"
), lib = "~/Rlib")
BiocManager::install("sva", lib = "~/Rlib", dependencies = TRUE, force=TRUE)
BiocManager::install("GenomeInfoDbData", lib = "~/Rlib")
BiocManager::install("genefilter", lib = "~/Rlib")
install.packages("tidyverse", lib=Rlib)
library(tidyverse, lib.loc = "~/Rlib")
library(limma, lib.loc = "~/Rlib") # differential analysis
library(stats, lib.loc = "~/Rlib") # standard R pkg for statistics, we'll use its PCA implementation
library(sva, lib.loc = "~/Rlib") # batch effect stuff
library(pheatmap, lib.loc = "~/Rlib")
library(msigdbr, lib.loc = "~/Rlib")
# library(biomaRt) #if u load it earlier there is a conflict with select()
library(fgsea, lib.loc = "~/Rlib")
## Warning: package 'tidyverse' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: mgcv
##
## Loading required package: nlme
##
##
## Attaching package: 'nlme'
##
##
## The following object is masked from 'package:dplyr':
##
## collapse
##
##
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
##
## Loading required package: genefilter
##
##
## Attaching package: 'genefilter'
##
##
## The following object is masked from 'package:readr':
##
## spec
##
##
## Loading required package: BiocParallel
## Warning: package 'pheatmap' was built under R version 4.4.2
## Warning: package 'msigdbr' was built under R version 4.4.2
Because of the ever-growing number of publicly available and large datasets in the life sciences, computational and statistical skills are becoming a key part of the life scientist's curriculum. This tutorial aims to give you an overview of how to load and convert data in R and how to do exploratory analysis, before delving deeper into downstream analyses of differential expression and gene set enrichment.
This tutorial uses the backbone of Dr. Jean-Fred Fontaine's R data analysis tutorial, taught in our other modules.
While microarrays are still in use and can be especially useful when you are interested in a specific set of genes (custom arrays), RNA-seq is the main source of expression data nowadays. Nevertheless, many techniques used in this course for RNA-seq count data are directly, or with minor changes, applicable to microarray data too.
The basic workflow for RNA-seq and computational analysis can be seen in the figures below:
Figure 1: Rough scheme of a wet-lab workflow for RNA-seq. RNA is isolated from the source, fragmented and subsequently converted to cDNA with the help of reverse transcriptase. From here standard massively parallel sequencing techniques can be used: adapters are added to the DNA strands, which are used internally by the sequencer and can differ depending on the sequencing chemistry. The result is provided in the form of a FASTQ file that contains millions of short sequences, termed "fragments" or "reads". The color code shows to which steps the illustration on the right roughly belongs. Note that RNA extraction and selection are critical points in the process that can alter the results significantly by changing the amount and kind of input RNA. Adapted from: https://microbenotes.com/rna-sequencing-principle-steps-types-uses/
Figure 2: Quality control ensures that the wet-lab experiments worked as intended and can give the analyst insights on how to handle the data. If there are problematic samples, they can be removed to stop them from biasing the results; certain types of artifacts, namely batch effects, can also sometimes be detected during this phase and taken into account downstream. The subsequent mapping step holds much information about the quality of the data and should be considered if the compute demand is not too high. Trimming can be used to increase the coverage of reads on the reference genome and to filter out faulty reads, increasing the overall quality and usability of a sample. For RNA-seq it is often not as important as for other functional genomics assays, though. The mapping step is one of the most important in the pipeline: the reads saved in the FASTQ file are mapped, or aligned, to a reference genome. That means an algorithm searches the whole genome for the best fit for the sequence of a given read. This step is computationally demanding, as millions of reads need to be aligned to a sequence of millions of nucleotides. The resulting BAM files can be annotated, and the reads that fall on the locus of a gene can be counted, resulting in a gene count matrix. We will use such a matrix in this course, for which the preprocessing has already been done. Source: https://biocorecrg.github.io/RNAseq_course_2019/alignment.html
Automated Quality Control with ML: If you are interested in the quality control of FASTQ files, in RNA-seq as well as other functional genomics data, have a look at our work on predicting sample quality with machine learning. Quality prediction: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02294-2 Batch effect detection with the same software: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04775-y
The R language comes with many functions out of the box, such as arithmetic and logical operators and built-in functions like dim(), str(), min(), max(), mean() and many more. As a programming language it also features control structures like loops and allows you to write your own functions. It is an object-oriented language, which allows packing information from multiple sources into one object; this is often utilized by the bioinformatics packages we will encounter.
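To make this a bit more concrete, here is a tiny, self-contained sketch (a toy vector, nothing from our datasets yet) showing a few of the built-ins mentioned above and how you define your own function:
# a toy numeric vector
x <- c(4, 8, 15, 16, 23, 42)
# some of the built-in functions mentioned above
length(x)   # number of elements
mean(x)     # arithmetic mean
min(x); max(x)
# defining and calling your own function
half <- function(v) {
  v / 2     # the last evaluated expression is returned
}
half(x)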
The base language of R was meant for statistical programming and data visualization, but it has been expanded by its community with packages for many scientific and programmatic purposes, especially for data science, machine learning and bioinformatics. Most bioinformatics packages are available through Bioconductor, a project dedicated to building and maintaining bioinformatic software:
The mission of the Bioconductor project is to develop, support, and disseminate free open source software that facilitates rigorous and reproducible analysis of data from current and emerging biological assays. We are dedicated to building a diverse, collaborative, and welcoming community of developers and data scientists.
To use the packages mentioned above, we’ll install them first. You should have tried it already in the prerequisites section. If it worked out, we can load the packages - or libraries - by calling the library() function.
To run any analysis, we first need our data, of course. We'll get our data from an already preprocessed and curated dataset source called BARRA:CuRDa - weird name, I know - but we'll not take the complete collection, only some of the datasets. The database, with its downloadable datasets, and the paper can be found here: https://sbcb.inf.ufrgs.br/barracurda. The authors of the database and paper used the following workflow; you'll notice some overlap with what was discussed in the Introduction.
The dataset was picked and curated to be optimal for machine learning and benchmarked for multiple algorithms. We'll explore some of that in the second week of the course and compare our differential genes with what the ML algorithms deem important in our data. First, though, we'll focus on exploratory and differential analysis. We'll start out with the head and neck cancer datasets GSE64912 and GSE68799. You can get them from the site: https://sbcb.inf.ufrgs.br/barracurda Or, in case the site is down (which has happened before…), you can download them from my Google Drive: https://drive.google.com/file/d/1GWWB_8F_pbXzEPX4p6YDo-Mm3KjlFu39/view?usp=sharing https://drive.google.com/file/d/1-04BZvr_tFaiQNM-QqO6nLlOMbJTmgIc/view?usp=drive_link
To separate the content of each cell in a file, people use special characters like commas, tabs, pipes, etc. The extension of plain-text files is often a good hint of what delimiter has been used. For example, csv files should use “commas” as the delimiter, whereas tsv files should use “tabs”. Nevertheless, you can find csv or tsv files that are delimited with different characters.
The first step in data import is locating the file in your computer or in a remote location (e.g. by URL). The most important parameter for readr’s functions is the path to the file of interest.
In this part, you are going to import files located on your machine that you either downloaded from the link above or from your course files. The tidyverse library must be loaded.
Before actually importing the data, we need to check where our current working directory is and whether our files are reachable via a relative path. If they are not, we can either include their whole path or set up our working directory so that we work from a dedicated project folder and keep our data and output as sub-directories of it. Once the working directory is set up, we can download the data files and deposit them in the data/input folder.
# returns the current working dir
getwd()
setwd("~/")
# sets the current working dir to its input
setwd(getwd())
# setwd() needs an input, otherwise throws an error! :)
# setwd(getwd()) sets the current working directory as the new working directory, so it does nothing
# file.path() is a helper function, that glues together multiple input strings to a file path.
# to set a completely new path:
# either
# setwd(file.path("C:", "Some", "random", "path"))
# or
# setwd("C:/some/random/path")
# you can use the file browser in RStudio as shown below as well :)
# set a path relative to the current working directory, and call it whatever makes sense to you
# for example:
dir.create("bioinfo-module/IamForcedToLearnR/IamInRjail/SendHelp", recursive = TRUE)
setwd(file.path(getwd(), "bioinfo-module"))
To set up your working dir, you can either work in your respective OS and create a folder called "Bioinfo-module", setting it as the working dir with the setwd() function.
Or you can create it from RStudio directly, with the dir.create() function.
Now we can set up our input and output directories, which we’ll use to have a bit of structure in our working directory.
# assign variables to names of out folders (these could also be whole paths, pointing somewhere else entirely!)
out.dir <- "output"
in.dir <- "data"
# Create the directories only if they don't exist
if (!dir.exists(out.dir)) {
dir.create(out.dir)
}
if (!dir.exists(in.dir)) {
dir.create(in.dir)
}
Now you need to deposit the data in the dedicated folder. Note that these files are zipped and need to be unzipped in their destination folder (smaller files can be read in zipped, but files of this size take their time and I don't want to risk it on the NatFak machines :D).
First, we’ll import a single file
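The import itself is a one-liner with readr's read_csv(). A minimal sketch, assuming the unzipped file in the data folder is called GSE68799.csv (adjust the name to whatever file you actually deposited):
# read a single dataset from the input folder; the file name is an assumption,
# use the name of the file you actually placed in in.dir
single_data <- read_csv(file.path(in.dir, "GSE68799.csv"))
single_data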
## Rows: 45 Columns: 58737
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): ID, class
## dbl (58735): ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 45 × 58,737
## ID class ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 SRR201… Tumor 9.34 1.72 9.41 9.33
## 2 SRR201… Tumor 9.02 1.72 9.48 9.95
## 3 SRR201… Norm… 7.12 2.52 10.1 9.01
## 4 SRR201… Tumor 9.15 2.57 9.53 9.61
## 5 SRR201… Tumor 10.1 2.81 9.70 9.44
## 6 SRR201… Tumor 9.58 2.40 9.41 9.64
## 7 SRR201… Tumor 10.1 1.72 9.62 9.26
## 8 SRR201… Tumor 10.7 2.56 10.6 9.77
## 9 SRR201… Tumor 9.62 1.72 10.2 9.48
## 10 SRR201… Tumor 9.82 1.72 10.4 9.65
## # ℹ 35 more rows
## # ℹ 58,731 more variables: ENSG00000000460 <dbl>, ENSG00000000938 <dbl>,
## # ENSG00000000971 <dbl>, ENSG00000001036 <dbl>, ENSG00000001084 <dbl>,
## # ENSG00000001167 <dbl>, ENSG00000001460 <dbl>, ENSG00000001461 <dbl>,
## # ENSG00000001497 <dbl>, ENSG00000001561 <dbl>, ENSG00000001617 <dbl>,
## # ENSG00000001626 <dbl>, ENSG00000001629 <dbl>, ENSG00000001630 <dbl>,
## # ENSG00000001631 <dbl>, ENSG00000002016 <dbl>, ENSG00000002079 <dbl>, …
Well, that worked out well! Note how simple the syntax for data import is: you choose the name of a variable to store your table, use the <- operator (written with the two characters < and -), call the appropriate readr function and specify the location of your file. Now we'll get all files in the input folder and read in each of them. Additionally, we'll save the names of the respective files so that we can keep that information for later.
# get the file names in the directory with the .csv ending; note that this will only return the names of the files, not the whole paths
# (which we could do, though! Check out the function's documentation by pressing F1 while the cursor is on its name.)
csv_files <- list.files(path = file.path(in.dir), pattern = "\\.csv$")
# now we could write a for loop to iterate over the file names and read each file.
# We'd need to declare a list object of the right size beforehand and
# then put each table into that list.
# However, loops are not very efficient (especially in R).
# The tidyverse has a solution for us:
# the purrr package provides the map() function, which allows us to vectorize operations.
data_list <- map(file.path(in.dir, csv_files), read_csv)
## Rows: 22 Columns: 58737
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): ID, class
## dbl (58735): ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 45 Columns: 58737
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): ID, class
## dbl (58735): ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# assign names to the list (strip the .csv ending; the dot is escaped so it is matched literally)
names(data_list) <- str_remove_all(csv_files, "\\.csv$")
# loop over the list and add a column containing the dataset name!
# we'll again use a function instead of writing a loop :)
# this time it's imap(), which also passes the name of a given element, not only its contents.
data_list <- imap(data_list, ~ .x %>% mutate(dataset=.y))
# now we can put them together!
combined_data <- bind_rows(data_list)
combined_data
## # A tibble: 67 × 58,738
## ID class ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 SRR175… Tumor 11.1 2.97 9.99 9.45
## 2 SRR175… Tumor 10.9 3.07 9.80 9.49
## 3 SRR175… Tumor 11.5 2.69 9.81 9.85
## 4 SRR175… Tumor 10.8 4.29 9.85 9.29
## 5 SRR175… Tumor 11.6 3.28 10.3 10.1
## 6 SRR175… Tumor 10.7 3.16 9.70 9.60
## 7 SRR175… Tumor 10.5 3.32 9.76 9.40
## 8 SRR175… Tumor 10.8 2.72 9.91 9.68
## 9 SRR175… Norm… 11.3 3.47 10.2 9.76
## 10 SRR175… Norm… 11.9 3.67 10.7 10.3
## # ℹ 57 more rows
## # ℹ 58,732 more variables: ENSG00000000460 <dbl>, ENSG00000000938 <dbl>,
## # ENSG00000000971 <dbl>, ENSG00000001036 <dbl>, ENSG00000001084 <dbl>,
## # ENSG00000001167 <dbl>, ENSG00000001460 <dbl>, ENSG00000001461 <dbl>,
## # ENSG00000001497 <dbl>, ENSG00000001561 <dbl>, ENSG00000001617 <dbl>,
## # ENSG00000001626 <dbl>, ENSG00000001629 <dbl>, ENSG00000001630 <dbl>,
## # ENSG00000001631 <dbl>, ENSG00000002016 <dbl>, ENSG00000002079 <dbl>, …
Now that we have our data at hand, we'll need to inspect it. Use the head(), str(), summary() and dim() functions to get an overview of the data. Does anything strike you as unexpected? Try to also inspect other objects with the str() function: What kind of objects are the paths to your in- and output folders? What classes of other objects have we assigned to variables until now?
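One way to start the inspection could look like this (just a sketch; feel free to poke around with other columns and objects):
# overall size of the table: samples x (ID + class + genes + dataset)
dim(combined_data)
# first rows of the first few columns
head(combined_data[, 1:6])
# structure of some of the objects we created so far
str(out.dir)        # what kind of object is the output path?
str(csv_files)      # and the vector of file names?
summary(combined_data$class)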
What we have at our hands was formatted to be used with ML tools: it shows the variables (features) in the columns and the observations (samples) in the rows (samples x features). This data format is particularly useful for plotting, and most machine learning software will also expect it. However, most other applications in R, like statistical functions and also the bioinformatics packages we'll be using, expect the opposite: features x samples.
Assign a new variable called metadata that holds the general information about the samples. The metadata should be a data frame, not a tibble, and it should have the ID assigned as row.names(). Assign a new variable called counts that holds the numerical information about the samples. It should be gene names x sample names (so gene names as the rows and sample names as the columns), with the gene names assigned as row names. Use the select() function for both tasks. Tip: you can drop a column by using - in front of the column name. Tip: use as.data.frame() and column_to_rownames() to accomplish this. Feel free to use the pipe operator %>% to select the data.
# Packages for the exercise: dplyr, functions: select(), column_to_rownames(), filter(), as.data.frame()
# The path to the solution can look as followed:
# 1 - we need the combined_data
# 2 - from the combined data we need to select the columns we need (ID, class, dataset)
# 3 - we want the ID's to be row name instead of a column
# 4 - the whole object should be transformed into a data frame
# Exercise 3.1: build a metadata table
metadata_combined <- ??? %>%
select(...) %>%
column_to_rownames(...) %>%
as.data.frame()
metadata_combined
# for the counts_combined data frame you need to select all columns but class and dataset,
# you can also use select(-x, -y) to select all columns but x and y.
# Additionally transpose the data using t(dataframe)
counts_combined <- ??? %>%
select(...) %>%
column_to_rownames(...) %>%
t() %>%
as.data.frame()
Build similar tables for the two datasets, so in the end two more counts data frames and two more metadata data frames. To only use the rows that have a specific entry in a column, you can use filter(column == "Entry").
Exploratory analysis is a critical first step in the interpretation of RNA-seq data, and when working with similar data organized in abundance/count/intensity matrices. This process is akin to an initial survey of the data landscape, aimed at understanding its basic characteristics, identifying potential outliers, and uncovering early patterns or trends. It can also include checking for batch effects, which are technical variations that can introduce unwanted noise into the data. In this respect it can also be used as the “last line” of quality control.
Visualization tools, such as heatmaps, box plots, and scatter plots, are commonly used in this stage to provide a graphical summary of the data. These can help researchers gain insights into the expression levels of different genes, the similarity between samples, and the overall structure of the data.
In summary, exploratory analysis of RNA-seq count data serves as a foundational step in the data analysis pipeline, setting the stage for more complex downstream analysis, such as differential expression analysis and pathway enrichment analysis. It helps ensure that the data is of high quality and that any subsequent conclusions drawn from the data are reliable and robust.
We’ll first get an overview of the data by plotting the first two components of a PCA. This will give us an overview of how similar the samples are to each other and if we have any batch effects to expect from the differing datasets.
A Principal Component Analysis, short PCA, is a (linear) dimension reduction technique and can therefore be counted among classical unsupervised machine learning methods. It has been around since 1901 and essentially computes linear combinations that best represent the variance in the data. To do that, it starts with the longest axis in the data, minimizing the distance of each data point to the first linear combination it computes: the first principal component (PC). This GIF gives a good intuition of what's happening under the hood:
Figure 4: Minimization of distances to a line in 2D space.
The process is repeated incrementally, but every following PC must be orthogonal to all its predecessors, to ensure the PCs do not overlap in information about the underlying variance of the data. The image below shows three PCs fitted to an ellipsoid.
Figure 5: Three PCs fitted to an ellipsoid in 3D space. Notice how the first PC is parallel to the longest axis of the ellipsoid and the others are orthogonal to it and to each other.
The thread I found these images in is from Cross Validated, a sub-site of Stack Exchange where one can often find answers to questions about statistics and machine learning. I recommend checking out the thread; it has a number of answers, ranging from very mathematical to much more intuitive explanations (as hopefully given above :D). Link: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
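Before applying it to our expression matrix, you can get a feeling for what prcomp() returns by running it on a small built-in dataset (a toy sketch, not our data):
# PCA on the four numeric iris measurements, scaled to unit variance
toy_pca <- prcomp(iris[, 1:4], scale. = TRUE)
str(toy_pca)             # sdev, rotation, center, scale and x (the PC coordinates)
head(toy_pca$x[, 1:2])   # coordinates of each observation on PC1 and PC2
summary(toy_pca)         # proportion of variance explained per PC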
A scatterplot is the simplest plot you can come up with: it plots two numerical variables against each other, leading to a 2D point cloud. We'll plot the first two principal components against each other, as we'll see in a minute.
Use the prcomp() function to compute the principal components of our expression data and save the result in a variable. Assign the PCs to a new variable by accessing the PCA object you just saved, then add the metadata to it with the left_join() function. Tip: as mentioned, PCA is an ML technique and most of the time the respective functions will expect the data to be samples x features; use t() to easily transpose the counts object. Tip: there is also rownames_to_column().
# a) compute PCs
pca_obj <- prcomp(...)
df_plot <- pca_obj$x %>% as_tibble(rownames="ID")
# b) add metadata
# to add the metadata by a shared variable, we first add the metadata, then have to make the rownames into the
# 'ID' column again, and then merge the df_plot with left_join by their 'ID'
df_plot <- meta.data %>%
...
Now we'll want to plot the first two principal components as a scatterplot. For this we'll need functions from ggplot2, which is part of the tidyverse.
ggplot2 has its own pipe-like operator, +, to chain together the functions that build up the final plot.
To illustrate how ggplot works, we'll plot the density of the first gene in our count matrix. For this we'll use our original combined_data variable, as its format allows us to include metadata without joining. We'll load a library called scales that allows us to easily change the fill or color scales of our plots.
# we'll need this for easy different color scales :)
library(scales)
# Filter for the gene of interest
plt <- combined_data %>%
select(ID, class, dataset, ENSG00000000003) %>%
ggplot(aes(x = ENSG00000000003, fill = class, color = class)) +
geom_density(alpha = 0.5) +
labs(title = paste("Distribution of reads for", "ENSG00000000003"),
x = "Count",
y = "Density",
fill = "Sample",
color = "Sample") +
theme_bw()
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
Funny that we directly see a clear difference; that was actually not planned! Let's have a look at what changes when we set either fill or color to dataset instead of class. You can add scale_fill_brewer(palette = "Dark2") + scale_color_brewer(palette = "Set1") to change the colors of fill and color, respectively.
Now use what you saw above with your dataframe containing the PCs and metadata. Use the geom_point() function to plot a scatterplot! Save the plot with the ggsave() function in the out.dir.
# Exercise 2.2
# you can access the columns of df_plot for x, y, class and shape like in the example for exercise 6
# then, to create a scatterplot, add geom_point()
plt <- df_plot %>%
ggplot(aes(...)) +
...
plt
What you can see, if you made it work, is a classic batch effect! The main variance in the data (first PC) is explained by the difference between the datasets. Feel free to experiment a bit and check how the PCA scatter plot looks for each dataset alone!
Batch effects and other artifacts: Artifact is a word for underlying factors that affect the data we are working with but are not associated with the biological observation. By distorting the gene counts or introducing biases of another nature, they will impact the downstream analysis and lead to false results. The most abundant and impactful are batch effects, which stem from technical differences in the experimental phase of an assay: different handlers, reagents and machines are just a few of their sources. Sometimes they correlate well with differences in sample quality, as we showed in one of the works in the Automated Quality Control box. We'll see how to handle them in later sections.
Principal components can tell us a lot about the data: the more of them you need to adequately describe its variance, the more diverse it is. It is interesting for us to see how much variance is actually covered by the PCs. First we'll get the standard deviations, which are stored in our PCA object; inspect the object to see in which part they could be stored. Don't forget that we'll need to square the standard deviations to get the variances!
# first we'll get the standard deviations, which are stored in our PCA object
# inspect the object to see in which part of it they could be stored.
# don't forget that we'll need to square the standard deviation to get the variance!
# Then we Create a data frame for the scree plot
# the $ operator lets us use the contents of pca_obj in R
# 2nd tipp: the variance is the standard deviation^2
scree_data <- data.frame(
PC = paste("PC", 1:length(pca_obj$...)),
Variance = pca_obj$... ...
)
# we'll just take the first 20
scree_data <- scree_data[...,]
# now we can create a scree plot which basically plots the PCs in their order
# and the respective variance explained by them.
plt <- ggplot(scree_data, aes(x = reorder(PC, Variance, decreasing=T), y = Variance)) +
geom_bar(stat = "identity") +
labs(title = "Scree Plot",
x = "Principal Component",
y = "Variance Explained") +
theme_bw()
plt
Additionally we can compute the percentage of variance that each of the PCs explains as follows:
#hint: perhaps you already encountered these numbers when inspecting the pca_obj with summary?
percentage <- round(pca_obj$sdev^2 / sum(pca_obj$sdev^2) * 100, 2)
percentage
## [1] 42.12 5.76 4.14 3.15 2.47 2.29 2.06 1.74 1.64 1.45 1.39 1.25
## [13] 1.20 1.10 1.06 1.01 0.94 0.91 0.87 0.86 0.83 0.82 0.80 0.76
## [25] 0.75 0.72 0.71 0.69 0.68 0.66 0.63 0.61 0.60 0.58 0.56 0.56
## [37] 0.55 0.54 0.53 0.52 0.50 0.49 0.47 0.46 0.45 0.44 0.43 0.43
## [49] 0.42 0.41 0.40 0.39 0.38 0.36 0.36 0.34 0.33 0.31 0.31 0.30
## [61] 0.29 0.28 0.27 0.26 0.24 0.16 0.00
So the batch effect we observed quite drastically on the PCA plot actually generates 42% of the variance in our data! Can you tell me what it could mean that a principal component explains variance in the data? What does this scree plot tell us?
Differential expression analysis is a key step in interpreting RNA-seq data, aiming to identify genes whose expression levels differ significantly between conditions. This process can reveal important insights into biological processes and diseases.
Classical statistical methods, such as t-tests, may not work well for RNA-seq data due to its unique characteristics. RNA-seq data is typically overdispersed, meaning the variance exceeds the mean, violating the assumptions of many classical statistical methods. Additionally, RNA-seq data is count-based and often has a large number of zero counts, which can lead to false positives or negatives when using classical methods.
To address these challenges, specialized statistical methods and tools have been developed for differential expression analysis of RNA-seq data, such as DESeq2, edgeR, and limma-voom. These methods account for the overdispersion and count-based nature of RNA-seq data, providing more accurate and reliable results.
The process typically involves data normalization, dispersion estimation, and hypothesis testing. The result is a list of genes ranked by significance, along with statistics such as fold change and false discovery rate (FDR).
In summary, differential expression analysis is a powerful tool for identifying genes that are differentially expressed between conditions, providing valuable insights into the underlying biology. However, it’s important to use appropriate statistical methods and tools that account for the unique characteristics of RNA-seq data.
Limma (Linear Models for Microarray Data) is a widely used R package for the analysis of gene expression data, including both microarray and RNA-seq data. Limma was originally developed for the analysis of microarray data, but has since been adapted for the analysis of RNA-seq data using the voom transformation.
The voom (variance modeling at the observation level) transformation is used to transform RNA-seq count data into a format that can be analyzed using linear models. The voom transformation estimates the mean-variance relationship of the count data and generates a set of weights that are used to account for the variability in the data. These weights are then used in the linear modeling step to improve the accuracy and power of the analysis. We will not use voom, as our data has already been normalized with a method that serves a similar purpose (DESeq2's VST).
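For completeness, and purely as a hedged sketch: if we did have raw counts (we don't, since our matrices are already VST-normalized), a typical limma-voom route could look roughly like this; raw_counts and design are hypothetical placeholders here.
library(limma)
# NOT run in this course: sketch of the voom route for raw count data
# 'raw_counts' (genes x samples) and 'design' are assumed to exist
v    <- voom(raw_counts, design, plot = TRUE)  # estimate the mean-variance trend and per-observation weights
vfit <- lmFit(v, design)                       # fit gene-wise linear models using those weights
vfit <- eBayes(vfit)                           # moderate the variance estimates
topTable(vfit, coef = 2, adjust = "BH")        # ranked results; which coef to use depends on your design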
Limma uses linear models to test for differential expression between conditions, while accounting for various sources of variability, such as batch effects and other covariates. The linear models include design matrices that specify the experimental design and allow for the estimation of log2 fold changes between conditions.
One of the key features of limma is its empirical Bayes method for estimating gene-wise variances. This method borrows information across genes to improve the stability and accuracy of the variance estimates, especially for genes with low counts.
Reference: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi:10.1093/nar/gkv007.
Let's start with the basic analysis of our datasets! We'll see how to handle batch effects in the next chapter. We'll start with the combined data, and can then also look at the difference between the datasets.
Since it is 2024 and everyone (me included) would probably just ask their favorite AI for at least some information anyway, let's look at how ChatGPT would handle this (feel free, and encouraged, to look through guides for limma and other resources). I gave it the text from the last passage and asked whether it could help me with the code for this:
ChatGPT answer: Steps: 1: Prepare the Data: Load your normalized data (already processed with DESeq2’s VST).
2: Set Up Design Matrix: Define the experimental design.
3: Fit the Model: Use limma functions to fit a linear model.
4: Extract Differentially Expressed Genes (DEGs): Filter genes based on adjusted p-values.
5: Visualize Results: Create volcano plots and summaries.
Here is its implementation:
# Load necessary libraries
library(limma)
library(ggplot2)
# Step 1: Load normalized data (assumed to be VST-transformed counts)
# Replace 'normalized_counts' with your actual data object
# Rows: genes, Columns: samples
normalized_counts <- read.csv("GSE68799_vst_normalized.csv", row.names = 1)
# Load metadata
# Ensure metadata includes a column 'Condition' (e.g., "Normal" or "Tumor") for design
metadata <- read.csv("GSE68799_metadata.csv")
# Ensure row names of metadata match column names of normalized_counts
metadata <- metadata[match(colnames(normalized_counts), rownames(metadata)), ]
# Step 2: Set up the design matrix
# Convert condition to a factor
metadata$Condition <- factor(metadata$Condition, levels = c("Normal", "Tumor"))
design <- model.matrix(~0 + Condition, data = metadata)
colnames(design) <- levels(metadata$Condition)
# Step 3: Fit the linear model
fit <- lmFit(normalized_counts, design)
# Define contrasts (e.g., Tumor vs Normal)
contrast_matrix <- makeContrasts(Tumor_vs_Normal = Tumor - Normal, levels = design)
# Apply contrasts to the fit
fit2 <- contrasts.fit(fit, contrast_matrix)
# Apply empirical Bayes moderation
fit2 <- eBayes(fit2)
# Step 4: Extract DEGs
# Set a cutoff for adjusted p-value and log2 fold change
adj_pval_cutoff <- 0.05
logfc_cutoff <- 1
# TopTable gives a ranked table of results
deg_results <- topTable(fit2, coef = "Tumor_vs_Normal", adjust = "BH", number = Inf)
# Filter DEGs by adjusted p-value and logFC
deg_filtered <- subset(deg_results, adj.P.Val < adj_pval_cutoff & abs(logFC) > logfc_cutoff)
# Save DEGs to a file
write.csv(deg_filtered, "DEGs_GSE68799.csv")
# Step 5: Visualization
# Volcano plot
volcano <- ggplot(deg_results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = -log10(adj_pval_cutoff), color = "red", linetype = "dashed") +
geom_vline(xintercept = c(-logfc_cutoff, logfc_cutoff), color = "blue", linetype = "dashed") +
labs(title = "Volcano Plot: Tumor vs Normal",
x = "Log2 Fold Change",
y = "-Log10 Adjusted P-Value") +
theme_minimal()
print(volcano)
# Summary of DEGs
cat("Number of DEGs:", nrow(deg_filtered), "\n")
This almost works out of the box! It just needs a few changes, so go ahead and make them! Just remember, when you work with tools like ChatGPT, to use them as a tool and not like a fool. Make sure the steps make sense and read about what these steps actually do. What does it actually mean to fit a linear model to something? What does eBayes do? Is there some false discovery rate detection or correction included? What is the input data (and in which form does it have to be), what happens to it on the way, and what is the output? It can easily happen that you end up with a plot and don't even understand what is in it, because we skipped too many steps of the normal process. So it is important to go back and look at them!
Now we do have a number of differentially expressed genes (DEGs), which we can plot, subset and further categorize. This is how such a plot could look.
Questions: Why would we need to do a differential expression analysis instead of using the counts from the RNA sequencing directly? Why is this analysis necessary for downstream analysis on the gene level?
P-Values: The p-value is a statistical measure that helps determine the significance of a hypothesis test. In the context of a moderated t-test, such as the one used in the limma package in R, the p-value represents the probability of observing the test statistic (or a more extreme value) assuming the null hypothesis is true. The null hypothesis typically states that there is no difference between the groups being compared.
The calculation of the p-value in a moderated t-test involves first computing the moderated t-statistic, which takes into account both the observed differences between the groups and the associated uncertainties. This moderated t-statistic is then compared to a t-distribution with appropriate degrees of freedom to obtain the p-value. The degrees of freedom are calculated based on the number of samples and the estimates of gene-wise variances. The p-value obtained from this process indicates the level of evidence against the null hypothesis, with smaller p-values suggesting stronger evidence in favor of the alternative hypothesis (i.e., there is a significant difference between the groups).
Why do we use the adjusted P-value? Which adjustment method is used by limma?
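As a small illustration of what the Benjamini-Hochberg adjustment (the method used by default in topTable(..., adjust = "BH")) does to a handful of p-values, you can try base R's p.adjust() on made-up numbers:
# five made-up p-values, just for illustration
pvals <- c(0.001, 0.01, 0.02, 0.04, 0.30)
# Benjamini-Hochberg adjusted p-values control the false discovery rate:
# each raw p-value is scaled by (number of tests / rank) and monotonicity is enforced
p.adjust(pvals, method = "BH")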
As previously mentioned, batch effects can have a huge impact on downstream analysis. Modern solutions for differential expression analysis that use linear models to model gene expression can be fed additional information about the batches, if these are known (as in our case). However, it is important to note that you can also have artifacts that are not directly correlated with the known batches! For this there are packages such as SVA - surrogate variable analysis - that provide algorithms for detecting and correcting batch effects.
We'll first use the information we have and give it to limma's model, then compare our DEGs; after that we'll run SVA to see what it detects and how well it correlates with our batches.
For this we'll need to consider a few things: a. To be able to compare our datasets/batches, we'll need a vector containing numbers that represent the batches. You can use %>% mutate() in combination with the ifelse() function to generate a new column in the metadata, called Batch (or whatever you like).
SVA will take the count data and a design matrix, so the inputs are the same as above. SVA suggests adding a second design matrix, mod0, for best results.
Now we have both SVs and a numerical batch representation and can use cor() to compute pairwise correlations of the SVs with the batch.
This is really difficult. Don't be demotivated if you have no idea what to do. Try your best and we will talk about it at the end of today's course :)
You can also run limma and sva on just one of the datasets, but then you can only use the biological group (normal/tumor) instead of batch information.
Other questions on the batch correction: How different can a batch effect look in a PCA or in other dimension reduction strategies? Is it always correct to use a statistical approach to remove unknown "noise" when the biological effects themselves are the subject of the experiment? How troublesome or problem-free can batch correction be, especially if there is only information about the biological group (i.e. tumor/normal) instead of actual batches? Would you say that in this case the batch information leads to an over-correction? What can an over-correction look like? Can I look at the single datasets instead of the combined set to make an easier decision? What would it mean for downstream analysis? Please discuss!
We use the differentially expressed gene (DEG) results from limma for further analysis.
library(sva)
metadata_sva <- metadata_combined[match(colnames(counts_combined), rownames(metadata_combined)), ]
# needs a column with the biological "condition" or group:
metadata_sva$Condition <- ...
# introduce a new column that gives the different datasets a "1" and "2" depending on the dataset they are from
metadata_sva$Batch <- ifelse(...)
# for the function they need to be factors instead of strings or numbers
metadata_sva$Batch <- factor(...)
#insert correct columns:
design_sva <- model.matrix(~0 + ... + ..., data = metadata_sva)
mod0 <- model.matrix(~..., data = metadata_sva)
# Estimate surrogate variables
sva_obj <- ...
# Create valid names for surrogate variables
colnames(sva_obj$sv) <- paste0("SV", seq_len(ncol(sva_obj$sv)))
# Add surrogate variables to the design matrix
design_with_sv <- ...)
# Create valid names for surrogate variables
# colnames(sva_obj$sv) <- paste0("SV", seq_len(ncol(sva_obj$sv)))
# continue with the differential gene expression analysis and plot
contrast_matrix_sva <- ...
fit_sva <- ...
...
## Number of significant surrogate variables is: 16
## Iteration (out of 5 ):1 2 3 4 5
## Number of significant surrogate variables is: 8
## Iteration (out of 5 ):1 2 3 4 5
## [,1]
## SV1 -0.988932035
## SV2 0.084020894
## SV3 -0.008615526
## SV4 0.015085951
## SV5 -0.050238748
## SV6 -0.022877994
## SV7 0.010047065
## SV8 -0.016680526
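In case you are wondering how the correlations above could be obtained: a hedged sketch (assuming sva_obj and the Batch column exist as in the scaffold above) is to turn the batch labels into numbers and hand both to cor():
# correlate every surrogate variable with the numeric batch labels;
# a correlation close to |1| means the SV essentially re-discovered the batch
batch_numeric <- as.numeric(metadata_sva$Batch)
cor(sva_obj$sv, batch_numeric)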
Heatmaps visually represent gene expression levels across samples or genes. They provide an intuitive way to identify patterns and trends within large datasets and can also be used to cluster genes based on their expression profiles. Clustering can reveal co-regulated genes or similar expression patterns between biological groups (i.e. tumor/non-tumor cells). Heatmaps help make differential expression data accessible to the human eye, so that differentially expressed targets/candidates can be selected for further research questions. The clusters we observe here can, in their own right, represent expression patterns that are typical for the disease we are dealing with. The first step before we start with the heatmap is to install and load the necessary libraries; we will be using pheatmap for this task today.
We will take the data from the differential expression analysis to create our heatmap. We'll use the significant results that we got from the differential expression by filtering with the adjusted P-value. Depending on your research question, it can make sense to take the most different genes for a selected feature (for example the logFC) or a specific list of genes. We cannot look at the much more intuitive gene symbols yet (only the Ensembl IDs), since we have not yet mapped the IDs to annotated gene names, but we will get to that later.
#order the results by their abs(logFC):
# start by getting the significant genes that are over the defined cutoffs:
# Filter the genes based on thresholds
significant_genes <- subset(deg_results, ...)
logFC_ordered <- significant_genes[...]
# filter by the row names and create a filtered count matrix (match the rownames of logFC_ordered with the counts for a new df):
counts_filtered <-
#there are many ways to do this, make sure you end up with the top_100_genes by logFC
top_100_genes <- counts_filtered[1:100, ]
# start by getting the significant genes that are over the defined cutoffs:
# Filter the genes based on thresholds
significant_genes <- subset(deg_results, adj.P.Val < adj_pval_cutoff & abs(logFC) > logfc_cutoff)
sg <- as.data.frame(significant_genes)
dim(sg)
## [1] 1650 6
# order significant_genes by their absolute logFC, largest first
logFC_ordered <- significant_genes[order(abs(significant_genes$logFC), decreasing = TRUE), ]
filtered_row_names <- rownames(logFC_ordered)
# Filter df_unfiltered to have the same rows as df_filtered
counts_filtered_combined <- counts_combined[rownames(counts_combined) %in% filtered_row_names, ]
counts_filtered <- counts_combined[rownames(counts_combined) %in% filtered_row_names, ]
tdf <- t(counts_filtered)
merged <- merge(metadata_combined, tdf, by = "row.names")
merged_df <- column_to_rownames(merged, "Row.names")
# top_50_genes <- counts_filtered[1:50, ]
top_100_genes <- counts_filtered[1:100, ]
library(pheatmap)
#to still see if the cells were tumor/non-tumor cells, we need to create a df for the annotation, for that sake, we also need the metadata again:
class_annotation <- merged_df$class
class_annotation_df <- data.frame(class = class_annotation)
# top_100_genes_t <- t(top_100_genes)
# rownames(class_annotation_df) <- rownames((top_100_genes_t))
# No need to transpose twice
rownames(class_annotation_df) <- colnames(top_100_genes)
top_100_genes_t <- t(top_100_genes)
# rownames(class_annotation_df100) <- rownames((top_100_genes_t))
all.equal(rownames(class_annotation_df), colnames(top_100_genes))
## [1] TRUE
# metadata as coldata
heatmap <- pheatmap(top_100_genes, clustering_distance_rows = 'euclidean',
clustering_distance_cols = 'euclidean', scale = 'row',
clustering_method = 'complete', show_rownames = F,
display_numbers = FALSE, main = "Heatmap for top 100 genes from combined data",
show_colnames = F, annotation_col = class_annotation_df)
print(heatmap)
As you can see, the heatmap also includes a hierarchical clustering on the left side. We can use these clusters for further analysis by cutting the tree at a desired depth or into a desired number k of clusters.
From here we can use the cutree() function from the stats library to get two clusters (look at the cutree() help in RStudio), then get the gene IDs from the top 100 genes and create two new data frames.
row_dendrogram <- ...
# Cut the dendrogram into 2 clusters
clusters <- cutree(...)
# Create a new data frame for Cluster 1, remember that you can do this in many possible ways and in different orders if you want to.
cluster1_data <- top_100_genes[...]
# Create a new data frame for Cluster 2
cluster2_data <- top_100_genes[...]
I also plotted the mean expression of the two clusters to show the difference in expression between them.
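A sketch of how that comparison could be made, assuming cluster1_data and cluster2_data hold the expression rows of the two clusters (genes x samples):
# per-sample mean expression over the genes of each cluster
cluster_means <- data.frame(
  sample          = rep(colnames(top_100_genes), 2),
  cluster         = rep(c("cluster 1", "cluster 2"), each = ncol(top_100_genes)),
  mean_expression = c(colMeans(cluster1_data), colMeans(cluster2_data))
)
ggplot(cluster_means, aes(x = cluster, y = mean_expression, fill = cluster)) +
  geom_boxplot() +
  labs(title = "Mean expression per sample in the two gene clusters",
       x = NULL, y = "Mean expression") +
  theme_bw()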
For the heatmap I told you before, that we cannot yet assign gene symbols to the heatmap but only Ensembl IDs. Let’s talk about what is going on with these “ENSG00000000003” and other IDs you have seen throughout the exercises.
Ensembl IDs and other Gene Ids: Ensembl gene IDs are unique identifiers assigned by the Ensembl project to represent individual genes in various organisms, including humans. These IDs provide a standardized and stable way to reference genes across different databases and analysis platforms. Ensembl gene IDs typically follow a format like “ENSG00000123456,” where the numbers uniquely identify a specific gene. In RNA sequencing experiments, you often start with raw sequence data, which is then processed through various steps such as alignment, quantification, and differential expression analysis. When mapping Ensembl gene IDs to other gene identifiers, tools like BioMart (available through the biomaRt R package) are commonly used. BioMart allows you to retrieve information about genes, transcripts, and other genomic features from various databases, including Ensembl. There exist several other identifiers for genes like “Gene Symbol”, “Entrez Gene ID”, “UniProt ID”, and so on. While gene symbols are the only really intuitive IDs, they are often ambiguous to isoforms or even homologs, which is why we often use other IDs for work on databases and annotations.
Now we would like to map the Ensembl IDs to gene symbols, so that we can continue with our analysis and make sense of the genes we are dealing with. We will be using biomaRt for this task and will use the "hsapiens_gene_ensembl" dataset from the "ensembl" BioMart.
The biomaRt package in R provides an interface to the BioMart web service, which allows you to access biological data resources provided by various databases, including Ensembl, UniProt, HGNC, and more.
library(biomaRt)
library(fgsea)
# Set up the biomart object to map the genes
# for some reasons, sometimes this does not work and you can try to use different mirrors like asia or uswest:
#ensembl_mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl", host = "https://www.ensembl.org")
ensembl_mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", mirror = "asia")
## Ensembl site unresponsive, trying useast mirror
We can use getBM() from the biomaRt library. Look at the help for the function in RStudio, and let's construct the call together.
#possible solutions
# getBM(attributes, filters = "", values = "", mart, curl = NULL,
# checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE,
# quote = "\"", useCache = TRUE)
#attributes wants a list of attributes, which in this case means the gene id, and the corresponding name, since we want to replace one with the other, so:
attributes = c("ensembl_gene_id", "external_gene_name")
# attributes <- c("ensembl_gene_id", "external_gene_name", "description")
# the filter (that we want to map to the corresponding values) would be the gene id again:
filter = "ensembl_gene_id"
# the values would be the rownames of cluster1_data in form of a vector. They are used to map to the ensembl_gene_id:
values = rownames(cluster1_data)
# the mart-object is the mart object we just created:
mart = ensembl_mart
# if the mirror is down for some reason:
# ensembl_mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# so for the first cluster we get:
gene_symbol_mapping_cluster_1 <- getBM(attributes = c("ensembl_gene_id",
"external_gene_name"),
filters = "ensembl_gene_id", values = rownames(cluster1_data),
mart = ensembl_mart)
I'm sure you can now do this for the second cluster by yourself :) Also, there are definitely a lot of other ways to get this done; this is just how I ended up doing it after struggling for a while.
Alright, now we have two data frames that map the Ensembl IDs to gene symbols. We now want to join these back to the cluster data, so that we have the gene symbols available alongside the expression values. You have done something similar before, and there are again multiple ways to do it. I, for example, used merge(X, Y, by.x = "something", by.y = "some-other-thing") for the task.
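For orientation, here is a hedged sketch of such a merge for cluster 1 (assuming the mapping table from getBM() above and cluster1_data with Ensembl IDs as row names):
# bring the row names into a column so that merge() can match on it
cluster1_annotated <- cluster1_data %>%
  as.data.frame() %>%
  rownames_to_column("ensembl_gene_id") %>%
  merge(gene_symbol_mapping_cluster_1,
        by.x = "ensembl_gene_id", by.y = "ensembl_gene_id")
head(cluster1_annotated[, c("ensembl_gene_id", "external_gene_name")])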
Now we can also get the actual gene information:
# Get gene information for Cluster 1
cluster1_gene_info <- getBM(attributes = attributes,
filters = "ensembl_gene_id",
values = gene_ids_cluster1,
mart = ensembl_mart)
# Get gene information for Cluster 2
cluster2_gene_info <- getBM(attributes = attributes,
filters = "ensembl_gene_id",
values = gene_ids_cluster2,
mart = ensembl_mart)
Gene Set Enrichment Analysis (GSEA) is a statistical method used to determine whether predefined sets of genes (gene sets) exhibit statistically significant differences between two biological states. GSEA is widely used in genomics to interpret high-throughput data, such as gene expression profiles obtained from RNA sequencing or microarray experiments. These gene sets are, for example, biological pathways, but can also be molecular functions or cellular components in which the genes play a part. You could, for example, look at whether genes of a certain biological pathway are overexpressed in cancer cells in contrast to healthy cells. This can then be used as a target or as an explanation in further research questions. There are multiple databases providing such gene lists, such as KEGG, GO and the GSEA database that originally came with the papers (Subramanian et al. 2005: https://www.pnas.org/doi/10.1073/pnas.0506580102, Mootha et al.: https://www.nature.com/articles/ng1180, DB: GSEA (gsea-msigdb.org)).
We already have half of the parts needed for the GSEA (the annotated gene scores); what we need now is a collection of gene sets, for example biological pathways. For this there is a new file in LMS (hopefully), otherwise you'll have to search for it yourself :) There are different files that can be used as pathways for fgsea; in this case, the collection contains 50 typical pathways that you can run your gene scores against. In addition to these pathways, we also need a named vector for our gene scores: we basically want to attach the gene names to a stat/feature that we want to run fgsea on (for example the logFC). Named vectors are a special data structure that can be created like this:
library(msigdbr)
msig_hallmark <- msigdbr(species = "Homo sapiens", category = "H")
# This will give you a data frame of Hallmark gene sets.
pathways.hallmark <- msig_hallmark %>%
group_by(gs_name) %>%
summarise(genes = list(gene_symbol)) %>%
deframe()
logFC_named <- setNames(...)
fgsea_re <- fgsea(
pathways = pathways.hallmark,
stats = logFC_named,
scoreType = "pos"
)
logFC_named2 <- setNames(...)
fgsea_re2 <- fgsea(pathways = pathways.hallmark,
stats = logFC_named2,
)
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : All values in the stats vector are greater than zero and scoreType
## is "std", maybe you should switch to scoreType = "pos".
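Since the setNames(...) calls above are left for you to fill in, here is one hedged way such a named vector could be built for cluster 1 (assuming deg_results from limma and the mapping table from getBM(); note that fgsea expects the names of the vector to match the identifiers used in the gene sets, here gene symbols):
# logFC values of the cluster-1 genes, looked up by their Ensembl IDs,
# then named with the corresponding gene symbols
logfc_c1 <- deg_results[gene_symbol_mapping_cluster_1$ensembl_gene_id, "logFC"]
logFC_named_c1 <- setNames(logfc_c1, gene_symbol_mapping_cluster_1$external_gene_name)
head(sort(logFC_named_c1, decreasing = TRUE))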
First, we want to combine the results again, since that makes it easier to plot the clusters against each other. Again, there are many ways to do this; I used rbind() to combine them. Just remember that you'll want to keep the information about the clusters in the new data frame, as shown in the sketch below. Get familiar with some different kinds of plots for gene set enrichments and fgsea, do some Google work, and hopefully you can manage to generate some graphs too :)
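As a starting point, a sketch of the combination and one possible plot (assuming both fgsea result tables exist; columns like pathway and NES are part of fgsea's output):
# keep track of which cluster each result came from, then stack the tables
fgsea_re$cluster  <- "cluster 1"
fgsea_re2$cluster <- "cluster 2"
fgsea_combined    <- rbind(fgsea_re, fgsea_re2)
# normalized enrichment scores per hallmark pathway, side by side
ggplot(fgsea_combined, aes(x = NES, y = reorder(pathway, NES), fill = cluster)) +
  geom_col(position = "dodge") +
  labs(title = "Hallmark gene set enrichment per cluster",
       x = "Normalized enrichment score (NES)", y = NULL) +
  theme_bw()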
Final questions: What did fgsea do with the input data? If you are looking for something more specific (a pathway, gene, target…), at which points would you change something in the workflow? What can we say in the end about the difference between tumor cells and normal cells in the data?
. . . . . . . . . . . . . . . . . . . . . . . . . .
If you have been working diligently, here is the link to the solutions in case you need them right away: https://drive.google.com/file/d/1wZ4DchzsCK_yPPfcA23FUVZRV8tVZoC5/view?usp=drive_link