This is a step by step tutorial for differential gene expression (DE) analysis in the statistical environment R. It will focus on expression data obtained by microarrays, a high-throughput technology for genome-wide expression profiling. The analysis will be done on an example data set resulting from a larger multi-factor experimental design that involves estrogen receptor positive breast cancer cell lines with the aim to identify early- and late-response targets of estrogen treatment. For literature and software references see section References.
Requirements: Installed recent versions of R or RStudio on your computers.
For this tutorial, you need to install several packages from Bioconductor, an open-source software development project that provides tools for the analysis and comprehension of high-throughput genomic data. Other useful R packages and documentation are available from the Comprehensive R Archive Network CRAN.
# set two lists with the names of the Biocodnuctor and the CRAN packages
bioc.libs <- c("Biobase", "affy", "annotate", "hgu95av2.db", "hgu95av2cdf", "estrogen",
"limma", "genefilter", "vsn", "GEOquery", "annaffy", "GO.db","KEGG.db")
cran.libs <- c("RColorBrewer", "R.utils", "gplots")
Before we continue with installing, lets check for available updates on the already available packages, using the function biocLite
from the Bioconductor package BiocInstaller
. If invoked with no arguments it installs core packages (when Bioconductor is installed for the first time) or updates already installed packages.
source("http://bioconductor.org/biocLite.R") # get BiocInstaller
biocLite() # if they are outdated packages, choose the option to update all the packages
After the update, restart the R session (in Rstudio go to menu tab Session -> Restart R). Now load the BiocInstaller
package and finally install the packages saved in the lists above.
library(BiocInstaller)
biocLite(bioc.libs)
biocLite(cran.libs)
After successful installation, restart the R session again, and check the status of the available packages: in your case, everything should be up to date.
library(BiocInstaller)
biocValid() # should return TRUE
A short reminder on how to get help for various topics (e.g., packages, functions, operators, concepts) as follows
– help()
or ?
is primary help for functions and features,
– help.search()
or ??
finds help pages on a vague topic,
– library()
lists available packages and the help objects they contain,
– data()
lists available data sets,
– ls
lists names of the objects in a specified environment.
help(package="affy") # get help even when package is not loaded
?? affy
help.search("affy")
? sapply #help(sapply)
#' some topics, such as operators or control-flow reserved words, need to be quoted
? "if"
? "+"
## open the vignette for the Biobase package (the package has to be already attached)
library(Biobase) # base functions for Bioconductor
openVignette("Biobase")
# list the content of the package (the package has to be already attached)
ls("package:Biobase")[1:12] # the list is long, so display the first 12 entries only
## [1] "abstract" "addVigs2WinMenu" "aggenv" "aggfun"
## [5] "Aggregate" "AnnotatedDataFrame" "annotatedDataFrameFrom" "annotation"
## [9] "annotation<-" "anyMissing" "as.data.frame.ExpressionSet" "assayData"
# shows the user-defined datasets and functions in the current enviorment
ls()
## [1] "bioc.libs" "cran.libs"
It is quite handy to have a project-specific working directory, where the R process will save by default plots or data of the project-related analysis. This can be done by using the function setwd
.
main.dir <- file.path(getwd(), "tutorial_project") # make file path in a platform-independent way
dir.create(main.dir) # create a subdirectory for the current project
setwd(main.dir) # set the working directory
getwd() # display the current working directory
## [1] "/home/taskova/R/teaching/tutorial_project"
You can always attach an available (installed) package later in the script, just before using its functions and/or data. For simplicity, we will include most of the packages at this point.
library(affy) # data analysis and exploration of Affymetrix array probe level data
library(annotate) # annotation functionality
library(hgu95av2.db) # gene chip specific metadata
library(hgu95av2cdf) # info relating probe pair sets to locations on the array
library(estrogen) # data for our analysis
library(genefilter) # gene filtering functionality
library(limma) # linear modeling and statistical testing for differential expression
library(R.utils) # additional useful functions
library(RColorBrewer) # nice looking color palettes
library(gplots) # heatmap function
sessionInfo
is an important function for reproducibility of results. It prints version information about R, the OS and attached or loaded packages. Best practice is to include it as the last command in the R script and save the result in a file.
sessionInfo() # output on the screen
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 15.10
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8
## [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gplots_2.17.0 RColorBrewer_1.1-2 R.utils_2.1.0 R.oo_1.19.0 R.methodsS3_1.7.0 limma_3.24.14
## [7] genefilter_1.50.0 estrogen_1.14.0 hgu95av2cdf_2.16.0 hgu95av2.db_3.1.3 org.Hs.eg.db_3.1.2 RSQLite_1.0.0
## [13] DBI_0.3.1 annotate_1.46.1 XML_3.98-1.3 AnnotationDbi_1.30.1 GenomeInfoDb_1.4.1 IRanges_2.2.5
## [19] S4Vectors_0.6.2 affy_1.46.1 Biobase_2.28.0 BiocGenerics_0.14.0
##
## loaded via a namespace (and not attached):
## [1] BiocInstaller_1.18.5 formatR_1.2 bitops_1.0-6 tools_3.2.2 zlibbioc_1.14.0 digest_0.6.8
## [7] evaluate_0.8 preprocessCore_1.30.0 yaml_2.1.13 stringr_1.0.0 knitr_1.12.3 caTools_1.17.1
## [13] gtools_3.5.0 survival_2.38-3 rmarkdown_0.9.2 gdata_2.17.0 magrittr_1.5 codetools_0.2-14
## [19] htmltools_0.3 splines_3.2.2 xtable_1.7-4 KernSmooth_2.23-15 stringi_1.0-1 affyio_1.36.0
capture.output(sessionInfo(), file="session.log") # output to a file in the working directory
Estrogen dataset
In this case, the data are provided by the package estrogen
, which includes 8 samples from an estrogen receptor positive breast cancer human cell lines. After serum starvation, 4 samples were exposed to estrogen and 4 samples were left untreated. Each group of samples was profiled with Affymetrix HGU95Av2 gene chips after 10 hours (2 samples) and 48 hours (2 samples). The complete description of the data, that involves in total 32 samples from a experiment with 4 factors (estrogen treatment, time, drug treatment, translational inhibitor treatment), is given by Scholtens et al. (2004).
We can access the raw data, i.e., the probe intensity level data in .CEL format, as outputted by the Affymetrix software. These files are located in a subdirectory called extdata
belonging to the corresponding package on your computer’s hard disk. The function system.file
will help you find the exact location. For your own data, put all of the .CEL files in the same directory and specify the full path to it.
data.dir <- system.file("extdata", package="estrogen") # or provide the path
data.dir
## [1] "/home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata"
dir(data.dir) # lists the content of the directory
## [1] "bad.cel" "estrogen.txt" "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel" "low10-2.cel"
## [9] "low48-1.cel" "low48-2.cel" "phenoData.txt" "targLimma.txt"
GEO datasets
If you use data from the publicly available repositories of microarray data, such as the NCBI Gene Expression Omnibus (GEO), then the GEOquery
package is quite handy. A single function, getGEOSuppFiles
, can take as an argument a GEO accession and will download all the raw data associate with that accession. By default, the function will create a directory in the current working directory to store the raw data for the chosen GEO accession. The following code chunk is illustrates how to get the data with accession number GSE20986.
library(GEOquery) # load the package
geo <- getGEOSuppFiles("GSE20986") # directory 'GSE20986' was created in the working directory
# data are compressed and archived so we have to extract them before using them
geo.dir <- file.path("GSE20986","data")
untar(file.path("GSE20986","GSE20986_RAW.tar"),
exdir=geo.dir # unarchive into subdirectory 'data' of directory 'GSE20986'
)
cel.files <- list.files(geo.dir, pattern=".gz$", full.names=TRUE) # list files with file extension '.gz'
sapply(cel.files, gunzip) # uncompress each file with the function 'gunzip'
dir(geo.dir)
## [1] "GSM524662.CEL" "GSM524663.CEL" "GSM524664.CEL" "GSM524665.CEL" "GSM524666.CEL" "GSM524667.CEL" "GSM524668.CEL" "GSM524669.CEL"
## [9] "GSM524670.CEL" "GSM524671.CEL" "GSM524672.CEL" "GSM524673.CEL"
# now we can set data.dir, but we won't do it as our focus is on the estrogen data
# data.dir<-file.path(getwd(), "data_geo")
Sample-description data
The Biobase::AnnotatedDataFrame
object is designed to standardize the storage and manipulation of phenotypic data and its metadata.
– Phenotypic data summarizes information about the samples (e.g., sex, age, and treatment status; referred to as covariates). The information can be represented as a table with S rows and V columns, where V is the number of covariates and S is the number of samples. These data can be accessed/set with the pData
function.
– Metadata about the covariates, providing for each longer description, possible values, units etc.). These data can be accessed/set with the function varMetadata
.
See the vignette of the Biobase
package for more information.
The file estrogen.txt contains information on the samples that were hybridized onto the chip arrays. It can be used to create an AnnotatedDataFrame instance by calling the function Biobase::read.AnnotatedDataFrame
. The header option tells the function that the first line in the file contains column headings, and the row.names option indicates that the first column of the file contains the row names. Note that you can always create similar text files in text editors of your preference, and then load them in R.
pd <- read.AnnotatedDataFrame(file.path(data.dir,"estrogen.txt"),
header=TRUE, sep="", row.names=1)
pd
## An object of class 'AnnotatedDataFrame'
## rowNames: low10-1.cel low10-2.cel ... high48-2.cel (8 total)
## varLabels: estrogen time.h
## varMetadata: labelDescription
sampleNames(pd) # the names of the samples, file names in this case
## [1] "low10-1.cel" "low10-2.cel" "high10-1.cel" "high10-2.cel" "low48-1.cel" "low48-2.cel" "high48-1.cel" "high48-2.cel"
pData(pd) # returns a data frame with samples as rows and covariates as columns
## estrogen time.h
## low10-1.cel absent 10
## low10-2.cel absent 10
## high10-1.cel present 10
## high10-2.cel present 10
## low48-1.cel absent 48
## low48-2.cel absent 48
## high48-1.cel present 48
## high48-2.cel present 48
varMetadata(pd) # returns description of the covariates (empty at the moment)
## labelDescription
## estrogen
## time.h
You can also create the data frame object from scratch (when no text file is provided). The minimal information needed is a data frame of phenotypic data. For better readability and no confusion, it is always good to provide description of the covariates as metadata.
pd.new <- AnnotatedDataFrame()
df <- data.frame(estrogen=rep(rep(c("absent", "present"), each=2), times=2),
time_h=rep(c(10, 48), each=4),
replicate=rep(c(1,2), times=2),
estrogen_tmp=rep(rep(c("low", "high"), each=2), times=2)) # if you prepare the data frame
rownames(df) <- apply(df, 1,
function(item) # makes the sample name, i.e. file name
{paste(item["estrogen_tmp"],
item["time_h"],"-",
item["replicate"],
".cel", sep="")
}
)
pData(pd.new) <- df[, 1:3] # remove the last column, not needed
varMetadata(pd.new) <- data.frame(labelDescription=
c("Estrogen treatment status with 2 values, absent and present",
"Time of measurement after treatment with 2 values, 10 and 48 hours",
"Biological replicate id within exper. group with 2 values, 1 and 2"),
row.names=colnames(pd.new))
pData(pd.new)
varMetadata(pd.new)
## estrogen time_h replicate
## low10-1.cel absent 10 1
## low10-2.cel absent 10 2
## high10-1.cel present 10 1
## high10-2.cel present 10 2
## low48-1.cel absent 48 1
## low48-2.cel absent 48 2
## high48-1.cel present 48 1
## high48-2.cel present 48 2
## labelDescription
## estrogen Estrogen treatment status with 2 values, absent and present
## time_h Time of measurement after treatment with 2 values, 10 and 48 hours
## replicate Biological replicate id within exper. group with 2 values, 1 and 2
Probe level data
Importing probe intensities from .CEL files is done with the function affy::ReadAffy
. This function returns an affy::AffyBatch
object that stores both probe-level intensities and metadata for the samples. This object can store also other data such as:
– experimentData
summarizes experiment-level data, e.g. the investigator or lab where the experiment was done,
– featureData
keeps metadata about the features (probes) on the gene chip,
– protocolData
keeps protocol data for sample processing.
Note that if the function is called with no arguments all the CEL files (if any) in the working directory are read. Since our files are not in the working directory, we have to specify a character vector with the file paths.
files <- list.celfiles(path=data.dir, full.names=TRUE) # list of absolute paths to all .CEL files
files <- files[!grepl(pattern="bad.cel", files)] # grepl returns a logical vector (pattern match or not for each element of files), operator ! means logical negation
a <- ReadAffy(filenames=files, # pass the list of .CEL files
verbose=TRUE)
## 1 reading /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-1.cel ...instantiating an AffyBatch (intensity a 409600x8 matrix)...done.
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high48-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high48-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low10-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low10-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low48-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low48-2.cel
a
## AffyBatch object
## size of arrays=640x640 features (20 kb)
## cdf=HG_U95Av2 (12625 affyids)
## number of samples=8
## number of genes=12625
## annotation=hgu95av2
## notes=
pData(protocolData(a)) # scan date for each array (important, as can be potential source of batch effect)
## ScanDate
## high10-1.cel 06/28/02 13:30:01
## high10-2.cel 06/27/02 13:01:01
## high48-1.cel 01/11/02 13:32:10
## high48-2.cel 01/14/02 12:59:04
## low10-1.cel 06/27/02 13:34:19
## low10-2.cel 06/28/02 14:10:26
## low48-1.cel 01/11/02 13:20:16
## low48-2.cel 01/14/02 12:47:24
When passing the phenotypic data frame, be careful about two things. First, number of samples has to be equal to the number of .CEL files, and second the order has to match as well.
all(rownames(pd.new)==colnames(a))
rownames(pd.new)
colnames(a)
## [1] FALSE
## [1] "low10-1.cel" "low10-2.cel" "high10-1.cel" "high10-2.cel" "low48-1.cel" "low48-2.cel" "high48-1.cel" "high48-2.cel"
## [1] "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel" "low10-2.cel" "low48-1.cel" "low48-2.cel"
a <- a[,rownames(pd.new)] # change the order of the samples to fit to pd.new
phenoData(a) <- pd.new
all(rownames(pData(a))==colnames(a))
## [1] TRUE
We can always customize the information in a format suitable for us. For example, lets replace the sample names (the original file names) with shorter ones.
sample.names <- sub(pattern=".cel", replacement="", colnames(a))
sample.names <- sub(pattern="low", replacement="NEG", sample.names)
sample.names <- sub(pattern="high", replacement="ES", sample.names)
print(sampleNames(a) <- sample.names)
## [1] "NEG10-1" "NEG10-2" "ES10-1" "ES10-2" "NEG48-1" "NEG48-2" "ES48-1" "ES48-2"
Obtaining gene expression level data involves three steps:
– Background adjustment, to correct for non-specific hybridization and optical detection noise.
– Normalization, to make the different arrays comparable.
– Summarization, to obtain summary expression values (per probeset) from the probe intensities. One gene can be represented with more than one probeset.
affy
has many methods for each step, that can be selected by using the two general functions expresso
and threestep
, or by special functions to produce specific expression measures such as rma
or mas5
.
bgcorrect.methods() # list available background adjustment methods
## [1] "bg.correct" "mas" "none" "rma"
normalize.AffyBatch.methods() # lists available normalization methods for AffyBatch objects
## [1] "constant" "contrasts" "invariantset" "loess" "methods" "qspline" "quantiles"
## [8] "quantiles.robust"
express.summary.stat.methods() # lists available expression summarization methods
## [1] "avgdiff" "liwong" "mas" "medianpolish" "playerout"
For example a call to expresso
with mas background adjustment (the same is used in the Affymetrix MAS 5.0 software), quantile normalization and summarization by averaging probe intensities (only the PM probes)
a.exp <- expresso(a, bgcorrect.method="mas", normalize.method="quantiles", pmcorrect.method="pmonly", summary.method="medianpolish")
We will use the rma
function, which implements the Robust Multi-array Average (RMA) algorithm that combines convolution background correction, quantile normalization and median-polish-based multi-array summarization. Note that the function returns expression values on a log2 scale.
a.rma <- rma(a)
## Background correcting
## Normalizing
## Calculating Expression
a.rma
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12625 features, 8 samples
## element names: exprs
## protocolData
## sampleNames: NEG10-1 NEG10-2 ... ES48-2 (8 total)
## varLabels: ScanDate
## varMetadata: labelDescription
## phenoData
## sampleNames: NEG10-1 NEG10-2 ... ES48-2 (8 total)
## varLabels: estrogen time_h replicate
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2
The obtained data is an instance of the Biobase::ExpressionSet
object that has standardized structure similar to the one of affy::AffyBatch
. It stores probeset-summarized expression values instead of probe intensities. Explore it with the help of the package documentation.
class(a.rma)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
class(exprs(a.rma))
## [1] "matrix"
head(exprs(a.rma))
## NEG10-1 NEG10-2 ES10-1 ES10-2 NEG48-1 NEG48-2 ES48-1 ES48-2
## 1000_at 10.3982 10.2544 10.0040 9.9035 10.3749 10.0335 10.3451 9.8633
## 1001_at 5.7176 5.8810 5.8596 5.9540 5.9605 6.0209 5.9811 6.2852
## 1002_f_at 5.5126 5.8018 5.5711 5.6081 5.3901 5.4945 5.5081 5.6301
## 1003_s_at 7.7839 8.0080 8.0380 7.8351 7.9265 8.1389 7.9949 8.2333
## 1004_at 7.2892 7.6037 7.4885 7.7715 7.5218 7.5995 7.4561 7.6752
## 1005_at 9.2067 8.9938 8.2379 8.3380 9.1732 9.0405 7.9261 8.0697
Before we proceed with differential expression analysis, we need to perform quality control (QC) assessment and make sure there are no issues with the data. There are many tools that can produce informative reports, including different QC criteria such as expression distribution, RNA degradation. You can have a look at the Bioconductor packages, arrayQualityMetrics
and affyExpress
. Here, we will show several diagnostic plots for graphical exploration of the data quality before and after normalization.
Array surface images
The affy::image
function outputs the spatial distribution of the intensities on a chip, and allows to look for spatial artifacts. In our case, none of the array has problems, but for illustration purposes we include one bad array (bad.cel). We will plot all arrays, including the bad one, on one multiplot figure. The options in the par
function (from the base package graphics
) can be used to set the layout, as follows. You can uncomment the second and the last line if you want the figure to be saved in .jpeg file.
a.bad = ReadAffy(filenames = file.path(data.dir,"bad.cel"), verbose=TRUE)
# jpeg("estrogen_array_images.jpg", quality=100)
par.old = par(no.readonly = TRUE) # save the default par options
par(mfcol=c(3,3), mar=c(0,3,3,1))
sapply(1 : (length(a) + 1),
function(x){
if(x==9){
image(a.bad, main="bad.cel")
}
else{
image(a[, x], main=colnames(a[,x]))
}
})
par(par.old) #return to the default setting
# dev.off()
Boxplots and histograms
The distribution of the unprocessed probe intensity as well as the post-normalized expression values can reveal problems with single arrays as well as issues with the normalization procedure. This can be displayed with boxplots or histograms (density estimators). For example, bimodal distribution in the probe intensity distribution can indicate spatial artifact as that seen for the bad array. Lets plot these for our data. Are the distribution all the same? Do you see any batch effects? Did the normalization reduced the differences in the raw data?
par(mfrow =c(2,2), mar=c(2,3,3,1))
color.set <- brewer.pal(8, "Dark2") # select a color palette
#set the plot limits
lim=c(min(exprs(a.rma)), # min y-axis value
max(max(log2(intensity(a))), max(exprs(a.rma))) # max y-axis value
)
boxplot(a, col=color.set, ylim=lim, las=2, main="Before normalization")
boxplot(exprs(a.rma), col=color.set, ylim=lim, las=2, main="After normalization with RMA")
hist(a, col=color.set, xlim=lim, main="")
plotDensities.default(exprs(a.rma), col=color.set, main="", legend=FALSE) # from limma package
par(par.old)
Unsupervised sample clustering
We can also look at the relationships between the samples using hierarchical clustering (e.g., the dendrogram obtained by the function stat::hclust
and visualized with plot
) or multidimensional scaling (MDS) as obtained with the functionlimma::plotMDS
. Both plots show that the samples from the same group cluster together, with the replicates taken 48h after estrogen treatment showing the largest within-group dissimilarty after normalization.
par(mfrow =c(1,2), mar=c(0,3,3,1))
# dendrogram based on probe intensities
distance <- as.dist(1-cor(exprs(a), method="pearson")) # convert the correlation matrix into a distance matrix
clusters <- hclust(distance) # perform hierarchical clustering based on the given sample distance
plot(clusters, main="Before normalization", xlab="", sub="")
# dendrogram based on summarized expression
distance.rma <- as.dist(1-cor(exprs(a.rma), method="pearson")) # convert the correlation into a distance matrix
clusters.rma <- hclust(distance.rma)
plot(clusters.rma, main="After normalization with RMA", xlab="",sub="")
par(par.old)
# MDS plot
scan.date <- sapply(pData(protocolData(a))[,1], # gets the ScanDate column
function(x) unlist(strsplit(x, split=" "))[1], # filter the time, keep only the date
USE.NAMES=FALSE)
color.set2 <- c("red", "violet", "green","blue")
names(color.set2) <- unique(scan.date)
plotMDS(a.rma, col=color.set2[scan.date], labels=sampleNames(a.rma),
main="Multidimensional scaling plot of RMA-processed data \n(we have a good separation between the groups)")
legend("topright", title="Scan date", legend=names(color.set2),
col=color.set2, pch=15, cex=0.8, bty="n")
RNA degradation
The function affy::AffyRNAdeg
returns information to detect RNA degradation that can be displayed with affy::plotAffyRNAdeg
.
Within each probeset, probes are numbered directionally from the 5’ end to the 3’ end. Probe intensities (scaled and shifted) are averaged by probe number, across all probesets. Linear regression models are fit to these average values per sample, and the resulting slope estimates can be used as a degradation measure. High slopes indicate degradation, however more important than the slope is the agreement between arrays.
deg <- AffyRNAdeg(a, log.it=FALSE)
data.frame(sample=deg$sample.names, slope=deg$slope)
## sample slope
## 1 NEG10-1 0.146641
## 2 NEG10-2 0.132972
## 3 ES10-1 0.150990
## 4 ES10-2 0.058095
## 5 NEG48-1 -0.126269
## 6 NEG48-2 -0.113679
## 7 ES48-1 -0.007458
## 8 ES48-2 -0.186991
plotAffyRNAdeg(deg, col=color.set, asp=1)
legend("topright", title="", legend=deg$sample.names,
col=color.set, lty=1, lwd=2, cex=0.8, bty="n")
The Bioconductor package limma
provides an integrative solution for data analysis of gene expression data. It offers tools for handling complex experimental designs, small sample sizes, data from different platform (microarray, PCR, protein arrays, RNA-seq etc.) as well as nice diagnostic plots for quality assessment and higher-level analysis tools such as the one for gene set testing (the overall significance of a set of co-regulated genes).
The core element of DE withlimma
is the fit of gene-wise linear model to the gene expression data. This is done by estimating the log-intensities (for single-channel data, as in our case) or the log-ratios (for two-channel data) between two or more RNA samples simultaneously. The lmFit
function allows us to fit a model robustly or by least square estimation (default). Once the model is fitted, the makeContrasts
function can be used to form the contrast (comparison of interest) matrix, which together with the model is used by the contrasts.fit
function to calculate the mean log2-fold-changes (logFC) and t-statistics for the contrasts of interest. In the following we will see two ways of specification of the model parameters, i.e., model design matrix. Both can be used to answer the same question, i.e., which are the early- and late-response targets of estrogen treatment.
In the case of estrogen data, we have two factors, time and estrogen treatment. The model is designed to reflect the effect of time in the absence of estrogen (T48), the effect of estrogen (ES) and the time-estrogen interaction effect (ES:T48), assuming the intercept represents the average log2 expression at 10 hours in absence of treatment.
pheno <- pData(a.rma)
design.1 <- model.matrix(~factor(estrogen) * factor(time_h), pheno)
colnames(design.1) <- c("Intercept", "ES","T48","ES:T48") # one name for each model coefficient
design.1
## Intercept ES T48 ES:T48
## NEG10-1 1 0 0 0
## NEG10-2 1 0 0 0
## ES10-1 1 1 0 0
## ES10-2 1 1 0 0
## NEG48-1 1 0 1 0
## NEG48-2 1 0 1 0
## ES48-1 1 1 1 1
## ES48-2 1 1 1 1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$`factor(estrogen)`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`factor(time_h)`
## [1] "contr.treatment"
fit.1 <- lmFit(a.rma, design.1) # returns an object of class MArrayLM
fit.1
## An object of class "MArrayLM"
## $coefficients
## Intercept ES T48 ES:T48
## 1000_at 10.3263 -0.372516 -0.12207 0.272517
## 1001_at 5.7993 0.107485 0.19140 0.034936
## 1002_f_at 5.6572 -0.067603 -0.21491 0.194421
## 1003_s_at 7.8960 0.040608 0.13673 0.040851
## 1004_at 7.4464 0.183606 0.11425 -0.178612
## 12620 more rows ...
##
## $rank
## [1] 4
##
## $assign
## [1] 0 1 2 3
##
## $qr
## $qr
## Intercept ES T48 ES:T48
## NEG10-1 -2.82843 -1.4142 -1.4142e+00 -0.70711
## NEG10-2 0.35355 1.4142 1.1102e-16 0.70711
## ES10-1 0.35355 -0.4459 1.4142e+00 0.70711
## ES10-2 0.35355 -0.4459 3.5355e-01 0.70711
## NEG48-1 0.35355 0.2612 -5.0000e-01 0.63060
## NEG48-2 0.35355 0.2612 -5.0000e-01 0.63060
## ES48-1 0.35355 -0.4459 -3.5355e-01 -0.26120
## ES48-2 0.35355 -0.4459 -3.5355e-01 -0.26120
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$`factor(estrogen)`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`factor(time_h)`
## [1] "contr.treatment"
##
##
## $qraux
## [1] 1.3536 1.2612 1.3536 1.2612
##
## $pivot
## [1] 1 2 3 4
##
## $tol
## [1] 1e-07
##
## $rank
## [1] 4
##
##
## $df.residual
## [1] 4 4 4 4 4
## 12620 more elements ...
##
## $sigma
## 1000_at 1001_at 1002_f_at 1003_s_at 1004_at
## 0.21776 0.12833 0.11769 0.15544 0.17066
## 12620 more elements ...
##
## $cov.coefficients
## Intercept ES T48 ES:T48
## Intercept 0.5 -0.5 -0.5 0.5
## ES -0.5 1.0 0.5 -1.0
## T48 -0.5 0.5 1.0 -1.0
## ES:T48 0.5 -1.0 -1.0 2.0
##
## $stdev.unscaled
## Intercept ES T48 ES:T48
## 1000_at 0.70711 1 1 1.4142
## 1001_at 0.70711 1 1 1.4142
## 1002_f_at 0.70711 1 1 1.4142
## 1003_s_at 0.70711 1 1 1.4142
## 1004_at 0.70711 1 1 1.4142
## 12620 more rows ...
##
## $pivot
## [1] 1 2 3 4
##
## $Amean
## 1000_at 1001_at 1002_f_at 1003_s_at 1004_at
## 10.1471 5.9575 5.5645 7.9948 7.5507
## 12620 more elements ...
##
## $method
## [1] "ls"
##
## $design
## Intercept ES T48 ES:T48
## NEG10-1 1 0 0 0
## NEG10-2 1 0 0 0
## ES10-1 1 1 0 0
## ES10-2 1 1 0 0
## NEG48-1 1 0 1 0
## NEG48-2 1 0 1 0
## ES48-1 1 1 1 1
## ES48-2 1 1 1 1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$`factor(estrogen)`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`factor(time_h)`
## [1] "contr.treatment"
cont.1 <- cbind(es10=c(0,1,0,0), # contrast for the early-response targets
es48=c(0,1,0,1)) # contrast for the late-response targets
fitC.1 <- contrasts.fit(fit.1, cont.1) # returns an object of class MArrayLM
dim(fitC.1)
## [1] 12625 2
This much simpler design specifies one coefficient for each group of samples that will represent the group-mean expression. The tested contrasts are formed explicitly and therefore it is more comprehensible and less confusing for most users.
pheno <- pData(a.rma)
temp <- sub(pattern="absent", replacement="NEG", pheno[,"estrogen"])
temp <- sub(pattern="present", replacement="ES", temp)
group <- paste(temp, pheno[, "time_h"], sep="")
group <- factor(group)
design.2 <- model.matrix(~0 + group, pheno) # '~0' means no intercept coefficient
colnames(design.2) <- levels(group)
design.2
## ES10 ES48 NEG10 NEG48
## NEG10-1 0 0 1 0
## NEG10-2 0 0 1 0
## ES10-1 1 0 0 0
## ES10-2 1 0 0 0
## NEG48-1 0 0 0 1
## NEG48-2 0 0 0 1
## ES48-1 0 1 0 0
## ES48-2 0 1 0 0
## attr(,"assign")
## [1] 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
fit.2 <- lmFit(a.rma, design.2)
cont.2 <- makeContrasts(es10="ES10 - NEG10", # contrast for the early-response targets
es48="ES48 - NEG48", # contrast for the late-response targets
levels=design.2)
fitC.2 <- contrasts.fit(fit.2, cont.2)
If you are interested in the main effect of the treatment or the time-estrogen interaction effect we can define the following contrasts
cont.3 <- makeContrasts(es="(ES10 + ES48 - NEG10 - NEG48)/2",
es_time="(ES48 - NEG48 - ES10 + NEG10)", # the same meaning as the "ES:T48" coefficient in the first design
levels=design.2)
If QC plots (e.g., obtained by limma::plotMDS
) reveal batch effects associated for example with hybridization-/scan-date or lab technician or other covariate, then we can include additively this factor in the model design as follows
# lets assume that samples NEG10-1 and ES10-1 come from a different lab (and we want to take this into acount)
lab = factor(c("B","A", "B", "A", "A", "A", "A", "A"))
design.2b <- model.matrix(~0 + group + lab, pheno)
design.2b
## groupES10 groupES48 groupNEG10 groupNEG48 labB
## NEG10-1 0 0 1 0 1
## NEG10-2 0 0 1 0 0
## ES10-1 1 0 0 0 1
## ES10-2 1 0 0 0 0
## NEG48-1 0 0 0 1 0
## NEG48-2 0 0 0 1 0
## ES48-1 0 1 0 0 0
## ES48-2 0 1 0 0 0
## attr(,"assign")
## [1] 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
##
## attr(,"contrasts")$lab
## [1] "contr.treatment"
nonEstimable(design.2b) # check the rank of the matrix before fitting the model with this design
## NULL
In this case, actually we have treatment nested within the scan date effect. We can use the function limma::duplicateCorrelation
to estimate the correlation between measurements made on the same date, and include this in the model. The idea is to make all comparison between the estrogen-treated and untreated samples within the same date and block for the possible scan date effect.
corfit <- duplicateCorrelation(a.rma, design.2, block=scan.date) # it will take minute or two for calculation
corfit$consensus.correlation
## [1] 0.28131
fit.3 = lmFit(a.rma, design.2, block=scan.date, cor=corfit$consensus.correlation)
fitC.3 <- contrasts.fit(fit.3, cont.2)
The functions limma::eBayes
and limma::treat
are used to rank genes in order of DE evidence. Given a linear model fit such as fitC.1 and fitC.2, these functions compute moderated t-statistics, moderated F-statistics, or log-odds of differential expression by empirical Bayes moderation of the standard errors towards a common value.
The empirical Bayes moderated t-statistics test if each individual contrast equals to zero. For each probeset, the moderated F-statistic tests whether all the contrasts are zero. It is an overall test computed from the set of t-statistics for that probeset.
fitE <- eBayes(fitC.2)
One way to select genes targeted by estrogen at either time is to use the moderated F-statistics, the values of which are stored in fitE$F
and the corresponding p-values stored in the component fitE$F.p.value
. Lets plot the latter as well as the contrast-wise t-statistic p-values. Sharp peaks of small p-values on these plots indicate good DE detection power. What do you observe?
par(mfrow=c(1,3))
for(name_cont in colnames(fitE$p.value))
hist(fitE$p.value[,name_cont], main=name_cont, xlab="P value (moderated t-test)")
hist(fitE$F.p.value, xlab="P value (moderated F-test)", main="")
par(par.old)
However, as we have performed the test multiple times (across genes, and across contrast) we need to adjust these values before we can select a p-value cutoff. The key functions for this purpose are limma::topTable
and limma::decideTests
. The first one adjust for multiple testing across genes, while the second one offers methods for adjustment across genes and contrasts.
limma::topTable
summarizes the linear model fit object produced by eBayes
, by selecting the top-ranked genes for any given contrast. The default setup calls significantly differentially expressed genes (DEGs) by controlling the false discovery rate (FDR), the expected proportion of false discoveries among the genes called significantly differentially expressed, with the Benjamini and Hochberg’s multiple testing adjustment method (option adjust.method='fdr'
). We choose more stringent cutoff value (option p=0.01
, default is 0.05) to output genes with adjusted p-values (here FDR estimates) lower than 0.01, and we keep all genes that pass the cutoff (option n=Inf
, default is 10).
tab.f <- topTable(fitE, coef=NULL, adjust.method="fdr", p=0.01, n=Inf) # when coef=NULL (default) genes are ranked by the moderated F statistics
tab.es10 <- topTable(fitE, coef="es10", adjust.method="fdr", p=0.01, n=Inf) # when coef is specified genes are ranked by its corespondent moderated t-statistics
head(tab.es10) # preview the first couple of rows from the resulting data frame
## logFC AveExpr t P.Value adj.P.Val B
## 39642_at 2.9394 7.8765 23.717 4.7416e-09 3.1283e-05 9.9668
## 910_at 3.1137 9.6602 23.592 4.9557e-09 3.1283e-05 9.9425
## 31798_at 2.8002 12.1158 16.385 1.0257e-07 3.5111e-04 7.9773
## 41400_at 2.3810 10.0416 16.225 1.1124e-07 3.5111e-04 7.9169
## 40117_at 2.5553 9.6766 15.681 1.4729e-07 3.5762e-04 7.7051
## 1854_at 2.5076 8.5321 15.158 1.9455e-07 3.5762e-04 7.4908
nrow(tab.f) # number of DEGs for at least one of the two contrasts
## [1] 345
nrow(tab.es10) # number of DEGs for the contrast es10
## [1] 113
table(tab.es10$logFC>0)
##
## FALSE TRUE
## 23 90
Visualization by volcano plots
Volcano plots are very useful for showing the relationship between logFC and p-value. Lets show this for the contrast es10
, using the unadjusted p-values corresponding to the moderated t-statistics. Furthermore, we will compare these with the ones obtained by the ordinary two-group t-test, as calculated by the genefilter::rowttest
function. What do you observe? Are the genes with higher fold changes also ranked high according to the (moderated) t-test? Which test gives more DEGs and why?
# wrapper function for displaying volcano plots with highlights
volcanoPlot <- function(x, y, main,
xlim=range(x),
ylim=range(y),
xlab="Mean log2 fold change",
ylab="Log10 P value",
top=30)
{
# lets select the top-ranked genes according to x (log2-fold-changes)
order.x = order(abs(x), decreasing=TRUE)[1:top]
# lets select the top-ranked genes according to y
order.y = order(y, decreasing=TRUE)[1:top]
order.all = union(order.x, order.y)
# x vs y
plot(x[-order.all], y[-order.all], xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, main=main)
# highlight the selected genes
points(x[order.x], y[order.x], pch=16, col="blue")
points(x[order.y], y[order.y], pch=1, col="red")
invisible()
}
par(mfcol=c(1,2))
# moderated t-test
tab.es10 <- topTable(fitE, coef="es10", adjust.method="fdr", p=1, n=Inf) # p=1 means no p-value based gene filtering
x <- tab.es10$logFC
y <- -log10(tab.es10$P.Value)
n.deg <- sum(y>=2) # number of DEGs corresponding to p-value cutoff of 0.01
volcanoPlot(x,y, main=paste("Moderated t-test:", n.deg, "DEGs"))
abline(h=2) # add horizontal line at p=0.01
# ordinary two-group t-test
a.rma.sub <- a.rma[, c("ES10-1", "ES10-2", "NEG10-1","NEG10-2")]
tt <- rowttests(exprs(a.rma.sub), factor(a.rma.sub$estrogen, levels=c("present", "absent")))
x <- tt$dm # log2-fold-changes
y <- -log10(tt$p.value)
n.deg <- sum(y>=2) # number of DEGs corresponding to p-value cutoff of 0.01
volcanoPlot(x,y, main=paste("Ordinary t-test:", n.deg, "DEGs"))
abline(h=2) # add horizontal line at p=0.01
par(par.old)
When the sample size (number of replicates per group) are moderate or large, the ordinary t-statistics performs as well as the moderated t-statistics (there is enough data to get a good estimate of the gene-specific variance), but in the case of few replicates (e.g., two in our case) it performs poor. On the other hand, the moderated version provides better results because it borrows information across genes to provide better estimate of the gene-specific variance. Direct comparison of the gene-wise p-values from both tests will show the difference more clear.
x <- -log10(tab.es10$P.Value)
y <- -log10(tt[rownames(tab.es10), ]$p.value)
xy.lim <- c(0, max(c(x,y)))
plot(x,y, main="Test comparison (with sample size = 2)",
xlab="Log10 P value: ordinary t-test",
ylab="Log10 P value: moderated t-test",
xlim=xy.lim, ylim=xy.lim, pch=".")
abline(c(0,1), col="red") # add line x=y
Compare the results above with the same plot below that compares both statistics calculated using larger data set (artificially generated by just coping the existing data 4 times, to produce in total 8 replicates per group).
limma::decideTests
returns an object of class limma::TestResults
, which is essentially a matrix of -1, 0 or 1 elements, of the same dimension as fitE$coefficients
, indicating whether each coefficient is significantly different from zero for the selected multiple testing method.
We will compare two methods, separate (default option, adjustment down the genes per contrast) and nestedF (adjusts across contrasts). The first one is the good choice when the contrasts are not related, as testing a set of contrasts together will give the same results as when each contrast is tested on its own (equal to applying limma::topTable
). The second one aims to give greater weight to probesets which are significant for two or more contrasts. Most multiple testing methods tend to underestimate the number of such probesets.
ebay.res <- list() # empty list, to be filled with the outcome of each hypothesis test
for (p in c(0.05, 0.01)) # repeat for two different cutoff values
{
for (mmethod in c("nestedF", "separate")) # test two different multiple testing methods
{
id <- paste("eBayes_", mmethod, "_fdr", p, sep ="")
ebay.res[[id]] <- decideTests(fitE, method=mmethod, adjust.method="fdr", p.value=p) # save the outcome of the test
print(id)
print(colSums(abs(ebay.res[[id]]))) # preview DEG counts per contrast
}
}
## [1] "eBayes_nestedF_fdr0.05"
## es10 es48
## 492 664
## [1] "eBayes_separate_fdr0.05"
## es10 es48
## 326 638
## [1] "eBayes_nestedF_fdr0.01"
## es10 es48
## 230 295
## [1] "eBayes_separate_fdr0.01"
## es10 es48
## 113 250
Visualization by Venn diagrams and heatmaps
Lets make a summary of the results by Venn diagrams on a 2x2 multiplot figure. The limma::vennDiagram
function plots Venn diagrams, showing numbers of genes significant in each comparison. How many genes are up-/down-regulated in estrogen-treated vs untreated cells after 10 hours and 48 hours, using the different test?
par(mfcol=c(2,2))
sapply(names(ebay.res),
function(name)
{
vennDiagram(ebay.res[[name]],
include=c("up", "down"), # show counts for up- and down-regulated DEGs
mar=c(0,0,0,0), # the width of the margins around the plot
counts.col=c("red", "blue"), # color for the counts
cex=c(1, 0.8, 0.7), # scaling factors for text on the plot
circle.col=c("green", "orange"), # color for the contrast
)
title(name, line=1, cex.main=0.9)
})
par(par.old)
Lets visualize the results by heatmaps (with function gplot::heatmap.2
), a useful tool to identify collection of genes that show similar patterns of up- or down-regulation by the estrogen treatment. We will generate two maps, one for the gene affected at both 10 and 48 hours, and one for the genes affected only at 10h. Can you see patterns in the gene expression? Are these consistent across replicates or conditions?
res <- abs(ebay.res[["eBayes_separate_fdr0.01"]]) # mark gene as DEG or not (ignore the direction)
list.ids <- list()
list.ids[["es10"]] <- names(which(rowSums(res)==1 & res[ ,"es10"]==1))
list.ids[["es10_48"]] <- names(which(rowSums(res)==2))
colors.hm <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)) # blue (low expr.) to red (high espr.) color palette with 256 shades
sel.exp <- exprs(a.rma)[unlist(list.ids, use.names=FALSE), ] # subset the expression matrix
for(name in names(list.ids))
{
title.text <- paste(name, ": log2 expression of", length(list.ids[[name]]), "targets")
mat <- sel.exp[list.ids[[name]], ] # expression for selected probeset ids
mat <- t(mat) # transpose, so the samples are in the rows
heatmap.2(mat, col=colors.hm,
dendrogram="col", # display gene reordering (default clustering method)
Rowv=FALSE, # no sample reordering
main=title.text,
breaks=seq(min(sel.exp), max(sel.exp), length.out=257), # same range of expr. values for both plots
keysize=1, key.ylab=NA, key.xlab=NA, key.title="logFC", # the color-key parameters
density.info="none", trace="none", mar=c(6,6))
}
If we want to prioritize for both biological relevant and statistically significant results then we can use the limma::treat
function. This computes empirical Bayes moderated t p-values relative to a minimum required logFC threshold in absolute value for each gene and contrast. No moderated F-statistics is computed. Gene ranking summaries can be obtained with the limma::topTreat
function, in this case with adjustment for testing down the genes per contrast with the Benjamini and Hochberg’s method.
fitT <- treat(fitC.2, lfc=1) # |logFC| > 1, unlike |logFC| > 0 in eBayes()
treat.res <- list() # empty list, to be filled with the outcome of each hypothesis test
for (p in c(0.05, 0.01)) # test two different cutoff values
{
e10 <- topTreat(fitT, coef="es10", adjust.method="fdr", p.value=p, n=Inf)
e48 <- topTreat(fitT, coef="es48", adjust.method="fdr", p.value=p, n=Inf)
# we have to provide the summary DEG matrix, no decideTest-alike function for this case
mat <- matrix(0, ncol=2, nrow=nrow(fitT),
dimnames=list(rownames(fitT),
colnames(fitT))) # initialize zero-valued matrix
probesets <- list("up"=list("es10"=rownames(e10[e10$logFC>0, ]),
"es48"=rownames(e48[e48$logFC>0, ])),
"down"=list("es10"=rownames(e10[e10$logFC<0, ]),
"es48"=rownames(e48[e48$logFC<0, ]))
) # saves ids of DEGs in a nested list
junk <- sapply(names(probesets),
function(type){
sapply(names(probesets[[type]]),
function(deg){
ids <- probesets[[type]][[deg]]
if(length(ids)>0)
# to change values in global variable, such as 'mat',
# inside a function use operator <<-
mat[ids, deg] <<- ifelse(type=="up", 1, -1)
})
}) # fills the matrix 'mat'
id <- paste("treat_logFC>", fitT$treat.lfc,"_fdr", p, sep ="")
treat.res[[id]] <- mat # save the outcome
print(id)
print(colSums(abs(treat.res[[id]]))) # preview DEG counts per contrast
}
## [1] "treat_logFC>1_fdr0.05"
## es10 es48
## 11 21
## [1] "treat_logFC>1_fdr0.01"
## es10 es48
## 3 6
Lets summarize the results via Venn diagrams. Compare the plot with the one from the eBayes-based results.
par(mfcol=c(1,2))
sapply(names(treat.res),
function(name)
{
vennDiagram(treat.res[[name]],
include=c("up", "down"), # show counts for up- and down-regulated DEGs
mar=c(0,2,0,0), # the width of the margins around the plot
counts.col=c("red", "blue"), # color for the counts
cex=c(1, 0.8, 0.7), # scaling factors for text on the plot
circle.col=c("green", "orange"), # color for the contrast
asp=1
)
title(name, line=-3, cex.main=0.9)
})
par(par.old)
1. Annotation
Now that have selected interesting genes by DE, we need to link the platform-dependent IDs (in this case, Affymetrix probeset IDs) to biologically meaningful information such as gene names, gene functions etc. For this purpose we need the appropriate annotation packages, i.e., in this case hgu95av2
, which is already attached. Lets see what annotation information is provided with this package.
ls("package:hgu95av2.db")
## [1] "hgu95av2" "hgu95av2ACCNUM" "hgu95av2ALIAS2PROBE" "hgu95av2CHR" "hgu95av2CHRLENGTHS"
## [6] "hgu95av2CHRLOC" "hgu95av2CHRLOCEND" "hgu95av2.db" "hgu95av2_dbconn" "hgu95av2_dbfile"
## [11] "hgu95av2_dbInfo" "hgu95av2_dbschema" "hgu95av2ENSEMBL" "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"
## [16] "hgu95av2ENZYME" "hgu95av2ENZYME2PROBE" "hgu95av2GENENAME" "hgu95av2GO" "hgu95av2GO2ALLPROBES"
## [21] "hgu95av2GO2PROBE" "hgu95av2MAP" "hgu95av2MAPCOUNTS" "hgu95av2OMIM" "hgu95av2ORGANISM"
## [26] "hgu95av2ORGPKG" "hgu95av2PATH" "hgu95av2PATH2PROBE" "hgu95av2PFAM" "hgu95av2PMID"
## [31] "hgu95av2PMID2PROBE" "hgu95av2PROSITE" "hgu95av2REFSEQ" "hgu95av2SYMBOL" "hgu95av2UNIGENE"
## [36] "hgu95av2UNIPROT"
Lets obtain the gene symbols and Entrez ID and save it in a data frame for further use. For this we need the following functions from the package annotate
, already attached. Note that all annotations are subjected to changes over time, therefore you should always report the version of the annotation package with the results. You will find in the reference manual of the package the actual version of each database used.
pids <- rownames(fitE)
ann <- annotation(a.rma)
syms <- getSYMBOL(pids, ann) # get the gene symbols
egs <- getEG(pids, ann) # get the Entrez Gene IDs
gene.df <- data.frame(PROBESET=pids, SYMBOL=syms, ENTREZ_ID=egs)
head(gene.df)
## PROBESET SYMBOL ENTREZ_ID
## 1000_at 1000_at MAPK3 5595
## 1001_at 1001_at TIE1 7075
## 1002_f_at 1002_f_at CYP2C19 1557
## 1003_s_at 1003_s_at CXCR5 643
## 1004_at 1004_at CXCR5 643
## 1005_at 1005_at DUSP1 1843
Lets check the gene frequency on the chip, i.e., how many probesets map to each gene, then aggregate this across all genes and see how many genes have been represented with multiple probesets. Some probesets don’t map to any gene (NA value in the table).
paste("Number of genes (with Entrez ID) represented by at least probeset on the chip: ",
length(unique(gene.df$ENTREZ_ID))
)
## [1] "Number of genes (with Entrez ID) represented by at least probeset on the chip: 8595"
paste("Number of probesets that haven't been annotated with Entrez IDs: ",
sum(is.na(gene.df$ENTREZ_ID))
)
## [1] "Number of probesets that haven't been annotated with Entrez IDs: 1155"
head(table(gene.df$SYMBOL)) # probesets per gene
##
## AADAC AAK1 AAMP AANAT AARS AASDHPPT
## 1 5 1 1 1 1
table(table(gene.df$SYMBOL)) # lists the number of genes that have 1,2... X probesets on the chip
##
## 1 2 3 4 5 6 7 8
## 6563 1429 451 102 22 16 6 5
2. Save results in a file
Now we are ready to combine the annotation data with the DE results and save them in a tab-separated file format.
resT.1 <- topTreat(fitT, coef="es10", adjust.method="fdr", genelist=gene.df, n=Inf)
write.table(resT.1, file="toptreat_ES10h_lfc1.tvs", sep="\t", row.names=FALSE, col.names=TRUE)
resT.2 <- topTreat(fitT, coef="es48", adjust.method="fdr", genelist=gene.df, n=Inf)
write.table(resT.2, file="toptreat_ES48H_lfc1.tvs", sep="\t", row.names=FALSE, col.names=TRUE)
head(resT.1)
## PROBESET SYMBOL ENTREZ_ID logFC AveExpr t P.Value adj.P.Val
## 910_at 910_at TK1 7083 3.1137 9.6602 16.0154 6.2137e-08 0.0004741
## 39642_at 39642_at ELOVL2 54898 2.9394 7.8765 15.6485 7.5105e-08 0.0004741
## 31798_at 31798_at TFF1 7031 2.8002 12.1158 10.5337 1.8496e-06 0.0077838
## 40117_at 40117_at MCM6 4175 2.5553 9.6766 9.5441 4.0210e-06 0.0113297
## 41400_at 41400_at TK1 7083 2.3810 10.0416 9.4105 4.4870e-06 0.0113297
## 1854_at 1854_at MYBL2 4605 2.5076 8.5321 9.1135 5.7625e-06 0.0121253
3. HTML reports
We can provide similar reports with more biologically relevant information (with links to the original annotation sources) in a HMTL format by using the package annaffy
as follows.
selected.ids <- rownames(resT.1)[1:10] # top 10 early-response estrogen targets
library(annaffy)
aaf.handler() # which information is available
## [1] "Probe" "Symbol" "Description" "Chromosome" "Chromosome Location" "GenBank"
## [7] "Gene" "Cytoband" "UniGene" "PubMed" "Gene Ontology" "Pathway"
which.info <- c(1:3,7,11:12) # which information we will select
ann.table <- list()
ann.table[[1]] <- aafTableAnn(selected.ids, "hgu95av2.db", aaf.handler()[which.info]) # generates annotation table
# lets add some more information, e.g., logFC and AveExpr (average log2-expression for the probeset over all samples), p-value
ann.table[[2]] <- aafTable(logFC=resT.1[selected.ids, "logFC"], signed=TRUE)
ann.table[[3]] <- aafTable(AvgExpr=resT.1[selected.ids, "AveExpr"])
ann.table[[4]] <- aafTable(adj.P.Value=resT.1[selected.ids,"adj.P.Val"])
merged.table <- ann.table[[1]]
for(i in 2:4)
merged.table <- merge(merged.table, ann.table[[i]])
saveHTML(merged.table, filename="ES10_top10.html",
title <- "Top 10 early-response estrogen targets with statisticaly significant |logFC| > 1 ")
Perform gene expression analysis on the data GSE20986
The goal of the study by Browning et al. (2012) was to find whether the human umbilical vein endothelial cells (huvec) are suitable surrogate when studying pathophysiological mechanisms of ocular disorders. The first step towards the answer is to compare the gene expression profiles of unpassaged, proliferating (huvec) and human iris, retinal and choroidal microvascular endothelial cells.
hgu133plu2.db
and hgu133plu2.cdf
for Affymetrix GeneChip Human Genome U133 Plus 2.0 arraysrma
.Books that cover microarray analysis
Hahne et al. (2008). Bioconductor case studies
Gentleman et al. editors (2005). Bioinformatics and Computational Biology Solutions using R and Bioconductor
This tutorial is based on the material presented in these books.
Articles
Huber et al. (2015). Orchestrating high-throughput genomic analysis with Bioconductor
Ritchie et al. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies
Davis and Meltzer (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor
Scholtens et al. (2004). Analyzing factorial designed microarray experiments
Storey and Tibshirani (2003). Statistical significance for genomewide studies
Other web resources/tools
Bioconductor training materials, courses on (various) genomic data analysis.
RStudio cheat sheets, useful if you plan to use more often Rstudio.
GEO2R, an interactive web tool that allows users to compare two or more groups of samples (from a given GEO accession) in order to identify genes that are differentially expressed across experimental conditions.