This is a step by step tutorial for differential gene expression (DE) analysis in the statistical environment R. It will focus on expression data obtained by microarrays, a high-throughput technology for genome-wide expression profiling. The analysis will be done on an example data set resulting from a larger multi-factor experimental design that involves estrogen receptor positive breast cancer cell lines with the aim to identify early- and late-response targets of estrogen treatment. For literature and software references see section References.

Requirements: Installed recent versions of R or RStudio on your computers.

1 Getting started

1.1 Installing the necessary R packages

For this tutorial, you need to install several packages from Bioconductor, an open-source software development project that provides tools for the analysis and comprehension of high-throughput genomic data. Other useful R packages and documentation are available from the Comprehensive R Archive Network CRAN.

# set two lists with the names of the Biocodnuctor and the CRAN packages
bioc.libs <- c("Biobase", "affy", "annotate", "hgu95av2.db", "hgu95av2cdf", "estrogen",
               "limma", "genefilter", "vsn", "GEOquery", "annaffy", "GO.db","KEGG.db")
cran.libs <- c("RColorBrewer", "R.utils", "gplots")

Before we continue with installing, lets check for available updates on the already available packages, using the function biocLite from the Bioconductor package BiocInstaller. If invoked with no arguments it installs core packages (when Bioconductor is installed for the first time) or updates already installed packages.

source("http://bioconductor.org/biocLite.R") # get BiocInstaller
biocLite() # if they are outdated packages, choose the option to update all the packages

After the update, restart the R session (in Rstudio go to menu tab Session -> Restart R). Now load the BiocInstaller package and finally install the packages saved in the lists above.

library(BiocInstaller)
biocLite(bioc.libs)
biocLite(cran.libs)

After successful installation, restart the R session again, and check the status of the available packages: in your case, everything should be up to date.

library(BiocInstaller)
biocValid() # should return TRUE

1.2 R help

A short reminder on how to get help for various topics (e.g., packages, functions, operators, concepts) as follows
help() or ? is primary help for functions and features,
help.search() or ?? finds help pages on a vague topic,
library() lists available packages and the help objects they contain,
data() lists available data sets,
ls lists names of the objects in a specified environment.

help(package="affy") # get help even when package is not loaded
?? affy
help.search("affy") 
? sapply #help(sapply)  
#' some topics, such as operators or control-flow reserved words, need to be quoted
? "if"
? "+"
## open the vignette for the Biobase package (the package has to be already attached)
library(Biobase) # base functions for Bioconductor
openVignette("Biobase") 
# list the content of the package (the package has to be already attached)
ls("package:Biobase")[1:12] # the list is long, so display the first 12 entries only
##  [1] "abstract"                    "addVigs2WinMenu"             "aggenv"                      "aggfun"                     
##  [5] "Aggregate"                   "AnnotatedDataFrame"          "annotatedDataFrameFrom"      "annotation"                 
##  [9] "annotation<-"                "anyMissing"                  "as.data.frame.ExpressionSet" "assayData"
# shows the user-defined datasets and functions in the current enviorment
ls() 
## [1] "bioc.libs" "cran.libs"

1.3 Set up a working directory

It is quite handy to have a project-specific working directory, where the R process will save by default plots or data of the project-related analysis. This can be done by using the function setwd.

main.dir <- file.path(getwd(), "tutorial_project") # make file path in a platform-independent way
dir.create(main.dir) # create a subdirectory for the current project
setwd(main.dir) # set the working directory
getwd() # display the current working directory
## [1] "/home/taskova/R/teaching/tutorial_project"

1.4 Load and attach packages

You can always attach an available (installed) package later in the script, just before using its functions and/or data. For simplicity, we will include most of the packages at this point.

library(affy) # data analysis and exploration of Affymetrix array probe level data
library(annotate) # annotation functionality  
library(hgu95av2.db) # gene chip specific metadata
library(hgu95av2cdf) # info relating probe pair sets to locations on the array
library(estrogen) # data for our analysis 
library(genefilter) # gene filtering functionality 
library(limma) # linear modeling and statistical testing for differential expression 
library(R.utils) # additional useful functions
library(RColorBrewer) # nice looking color palettes
library(gplots) # heatmap function

sessionInfo is an important function for reproducibility of results. It prints version information about R, the OS and attached or loaded packages. Best practice is to include it as the last command in the R script and save the result in a file.

sessionInfo() # output on the screen
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 15.10
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_2.17.0        RColorBrewer_1.1-2   R.utils_2.1.0        R.oo_1.19.0          R.methodsS3_1.7.0    limma_3.24.14       
##  [7] genefilter_1.50.0    estrogen_1.14.0      hgu95av2cdf_2.16.0   hgu95av2.db_3.1.3    org.Hs.eg.db_3.1.2   RSQLite_1.0.0       
## [13] DBI_0.3.1            annotate_1.46.1      XML_3.98-1.3         AnnotationDbi_1.30.1 GenomeInfoDb_1.4.1   IRanges_2.2.5       
## [19] S4Vectors_0.6.2      affy_1.46.1          Biobase_2.28.0       BiocGenerics_0.14.0 
## 
## loaded via a namespace (and not attached):
##  [1] BiocInstaller_1.18.5  formatR_1.2           bitops_1.0-6          tools_3.2.2           zlibbioc_1.14.0       digest_0.6.8         
##  [7] evaluate_0.8          preprocessCore_1.30.0 yaml_2.1.13           stringr_1.0.0         knitr_1.12.3          caTools_1.17.1       
## [13] gtools_3.5.0          survival_2.38-3       rmarkdown_0.9.2       gdata_2.17.0          magrittr_1.5          codetools_0.2-14     
## [19] htmltools_0.3         splines_3.2.2         xtable_1.7-4          KernSmooth_2.23-15    stringi_1.0-1         affyio_1.36.0
capture.output(sessionInfo(), file="session.log") # output to a file in the working directory

2 Microarray data preprocessing (30 min)

2.1 Data import

Estrogen dataset

In this case, the data are provided by the package estrogen, which includes 8 samples from an estrogen receptor positive breast cancer human cell lines. After serum starvation, 4 samples were exposed to estrogen and 4 samples were left untreated. Each group of samples was profiled with Affymetrix HGU95Av2 gene chips after 10 hours (2 samples) and 48 hours (2 samples). The complete description of the data, that involves in total 32 samples from a experiment with 4 factors (estrogen treatment, time, drug treatment, translational inhibitor treatment), is given by Scholtens et al. (2004).

We can access the raw data, i.e., the probe intensity level data in .CEL format, as outputted by the Affymetrix software. These files are located in a subdirectory called extdata belonging to the corresponding package on your computer’s hard disk. The function system.file will help you find the exact location. For your own data, put all of the .CEL files in the same directory and specify the full path to it.

data.dir <- system.file("extdata", package="estrogen") # or provide the path
data.dir
## [1] "/home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata"
dir(data.dir) # lists the content of the directory 
##  [1] "bad.cel"       "estrogen.txt"  "high10-1.cel"  "high10-2.cel"  "high48-1.cel"  "high48-2.cel"  "low10-1.cel"   "low10-2.cel"  
##  [9] "low48-1.cel"   "low48-2.cel"   "phenoData.txt" "targLimma.txt"

GEO datasets

If you use data from the publicly available repositories of microarray data, such as the NCBI Gene Expression Omnibus (GEO), then the GEOquery package is quite handy. A single function, getGEOSuppFiles, can take as an argument a GEO accession and will download all the raw data associate with that accession. By default, the function will create a directory in the current working directory to store the raw data for the chosen GEO accession. The following code chunk is illustrates how to get the data with accession number GSE20986.

library(GEOquery) # load the package
geo <- getGEOSuppFiles("GSE20986")  # directory 'GSE20986' was created in the working directory

# data are compressed and archived so we have to extract them before using them
geo.dir <- file.path("GSE20986","data")
untar(file.path("GSE20986","GSE20986_RAW.tar"), 
      exdir=geo.dir # unarchive into subdirectory 'data' of directory 'GSE20986'
      ) 
cel.files <- list.files(geo.dir, pattern=".gz$", full.names=TRUE) # list files with file extension '.gz'
sapply(cel.files, gunzip) # uncompress each file with the function 'gunzip'
dir(geo.dir)
##  [1] "GSM524662.CEL" "GSM524663.CEL" "GSM524664.CEL" "GSM524665.CEL" "GSM524666.CEL" "GSM524667.CEL" "GSM524668.CEL" "GSM524669.CEL"
##  [9] "GSM524670.CEL" "GSM524671.CEL" "GSM524672.CEL" "GSM524673.CEL"
# now we can set data.dir, but we won't do it as our focus is on the estrogen data
# data.dir<-file.path(getwd(), "data_geo")

Sample-description data

The Biobase::AnnotatedDataFrame object is designed to standardize the storage and manipulation of phenotypic data and its metadata.
– Phenotypic data summarizes information about the samples (e.g., sex, age, and treatment status; referred to as covariates). The information can be represented as a table with S rows and V columns, where V is the number of covariates and S is the number of samples. These data can be accessed/set with the pData function.
– Metadata about the covariates, providing for each longer description, possible values, units etc.). These data can be accessed/set with the function varMetadata.
See the vignette of the Biobase package for more information.

The file estrogen.txt contains information on the samples that were hybridized onto the chip arrays. It can be used to create an AnnotatedDataFrame instance by calling the function Biobase::read.AnnotatedDataFrame. The header option tells the function that the first line in the file contains column headings, and the row.names option indicates that the first column of the file contains the row names. Note that you can always create similar text files in text editors of your preference, and then load them in R.

pd <- read.AnnotatedDataFrame(file.path(data.dir,"estrogen.txt"), 
                              header=TRUE, sep="", row.names=1)
pd
## An object of class 'AnnotatedDataFrame'
##   rowNames: low10-1.cel low10-2.cel ... high48-2.cel (8 total)
##   varLabels: estrogen time.h
##   varMetadata: labelDescription
sampleNames(pd) # the names of the samples, file names in this case
## [1] "low10-1.cel"  "low10-2.cel"  "high10-1.cel" "high10-2.cel" "low48-1.cel"  "low48-2.cel"  "high48-1.cel" "high48-2.cel"
pData(pd) # returns a data frame with samples as rows and covariates as columns
##              estrogen time.h
## low10-1.cel    absent     10
## low10-2.cel    absent     10
## high10-1.cel  present     10
## high10-2.cel  present     10
## low48-1.cel    absent     48
## low48-2.cel    absent     48
## high48-1.cel  present     48
## high48-2.cel  present     48
varMetadata(pd) # returns description of the covariates (empty at the moment)
##          labelDescription
## estrogen                 
## time.h

You can also create the data frame object from scratch (when no text file is provided). The minimal information needed is a data frame of phenotypic data. For better readability and no confusion, it is always good to provide description of the covariates as metadata.

pd.new <- AnnotatedDataFrame()
df <- data.frame(estrogen=rep(rep(c("absent", "present"), each=2), times=2),
                 time_h=rep(c(10, 48), each=4),
                 replicate=rep(c(1,2), times=2),
                 estrogen_tmp=rep(rep(c("low", "high"), each=2), times=2)) # if you prepare the data frame
rownames(df) <- apply(df, 1, 
                      function(item) # makes the sample name, i.e. file name
                          {paste(item["estrogen_tmp"],
                                 item["time_h"],"-", 
                                 item["replicate"], 
                                 ".cel", sep="")
                           }
                      ) 
pData(pd.new) <- df[, 1:3] # remove the last column, not needed
varMetadata(pd.new) <- data.frame(labelDescription=
                                      c("Estrogen treatment status with 2 values, absent  and present",
                                        "Time of measurement after treatment with 2 values, 10 and 48 hours",
                                        "Biological replicate id within exper. group with 2 values, 1 and 2"),
                                  row.names=colnames(pd.new))
pData(pd.new)
varMetadata(pd.new)
##              estrogen time_h replicate
## low10-1.cel    absent     10         1
## low10-2.cel    absent     10         2
## high10-1.cel  present     10         1
## high10-2.cel  present     10         2
## low48-1.cel    absent     48         1
## low48-2.cel    absent     48         2
## high48-1.cel  present     48         1
## high48-2.cel  present     48         2
##                                                             labelDescription
## estrogen        Estrogen treatment status with 2 values, absent  and present
## time_h    Time of measurement after treatment with 2 values, 10 and 48 hours
## replicate Biological replicate id within exper. group with 2 values, 1 and 2

Probe level data

Importing probe intensities from .CEL files is done with the function affy::ReadAffy. This function returns an affy::AffyBatch object that stores both probe-level intensities and metadata for the samples. This object can store also other data such as:
experimentData summarizes experiment-level data, e.g. the investigator or lab where the experiment was done,
featureData keeps metadata about the features (probes) on the gene chip,
protocolData keeps protocol data for sample processing.
Note that if the function is called with no arguments all the CEL files (if any) in the working directory are read. Since our files are not in the working directory, we have to specify a character vector with the file paths.

files <- list.celfiles(path=data.dir, full.names=TRUE) # list of absolute paths to all .CEL files 
files <- files[!grepl(pattern="bad.cel", files)] # grepl returns a logical vector (pattern match or not for each element of files), operator ! means logical negation
a <- ReadAffy(filenames=files, # pass the list of .CEL files
              verbose=TRUE)
## 1 reading /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-1.cel ...instantiating an AffyBatch (intensity a 409600x8 matrix)...done.
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high48-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high48-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low10-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low10-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low48-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low48-2.cel
a 
## AffyBatch object
## size of arrays=640x640 features (20 kb)
## cdf=HG_U95Av2 (12625 affyids)
## number of samples=8
## number of genes=12625
## annotation=hgu95av2
## notes=
pData(protocolData(a)) # scan date for each array (important, as can be potential source of batch effect)
##                       ScanDate
## high10-1.cel 06/28/02 13:30:01
## high10-2.cel 06/27/02 13:01:01
## high48-1.cel 01/11/02 13:32:10
## high48-2.cel 01/14/02 12:59:04
## low10-1.cel  06/27/02 13:34:19
## low10-2.cel  06/28/02 14:10:26
## low48-1.cel  01/11/02 13:20:16
## low48-2.cel  01/14/02 12:47:24

When passing the phenotypic data frame, be careful about two things. First, number of samples has to be equal to the number of .CEL files, and second the order has to match as well.

all(rownames(pd.new)==colnames(a))
rownames(pd.new)
colnames(a)
## [1] FALSE
## [1] "low10-1.cel"  "low10-2.cel"  "high10-1.cel" "high10-2.cel" "low48-1.cel"  "low48-2.cel"  "high48-1.cel" "high48-2.cel"
## [1] "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel"  "low10-2.cel"  "low48-1.cel"  "low48-2.cel"
a <- a[,rownames(pd.new)] # change the order of the samples to fit to pd.new 
phenoData(a) <- pd.new
all(rownames(pData(a))==colnames(a))
## [1] TRUE

We can always customize the information in a format suitable for us. For example, lets replace the sample names (the original file names) with shorter ones.

sample.names <- sub(pattern=".cel", replacement="", colnames(a))
sample.names <- sub(pattern="low", replacement="NEG", sample.names) 
sample.names <- sub(pattern="high", replacement="ES", sample.names) 
print(sampleNames(a) <- sample.names)
## [1] "NEG10-1" "NEG10-2" "ES10-1"  "ES10-2"  "NEG48-1" "NEG48-2" "ES48-1"  "ES48-2"

2.2 From probe intensity to gene expression values

Obtaining gene expression level data involves three steps:
– Background adjustment, to correct for non-specific hybridization and optical detection noise.
– Normalization, to make the different arrays comparable.
– Summarization, to obtain summary expression values (per probeset) from the probe intensities. One gene can be represented with more than one probeset.

affy has many methods for each step, that can be selected by using the two general functions expresso and threestep, or by special functions to produce specific expression measures such as rma or mas5.

bgcorrect.methods() # list available background adjustment methods
## [1] "bg.correct" "mas"        "none"       "rma"
normalize.AffyBatch.methods() # lists available normalization methods for AffyBatch objects
## [1] "constant"         "contrasts"        "invariantset"     "loess"            "methods"          "qspline"          "quantiles"       
## [8] "quantiles.robust"
express.summary.stat.methods() # lists available expression summarization methods
## [1] "avgdiff"      "liwong"       "mas"          "medianpolish" "playerout"

For example a call to expresso with mas background adjustment (the same is used in the Affymetrix MAS 5.0 software), quantile normalization and summarization by averaging probe intensities (only the PM probes)

a.exp <- expresso(a, bgcorrect.method="mas", normalize.method="quantiles", pmcorrect.method="pmonly", summary.method="medianpolish")

We will use the rma function, which implements the Robust Multi-array Average (RMA) algorithm that combines convolution background correction, quantile normalization and median-polish-based multi-array summarization. Note that the function returns expression values on a log2 scale.

a.rma <- rma(a)
## Background correcting
## Normalizing
## Calculating Expression
a.rma
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12625 features, 8 samples 
##   element names: exprs 
## protocolData
##   sampleNames: NEG10-1 NEG10-2 ... ES48-2 (8 total)
##   varLabels: ScanDate
##   varMetadata: labelDescription
## phenoData
##   sampleNames: NEG10-1 NEG10-2 ... ES48-2 (8 total)
##   varLabels: estrogen time_h replicate
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2

The obtained data is an instance of the Biobase::ExpressionSet object that has standardized structure similar to the one of affy::AffyBatch. It stores probeset-summarized expression values instead of probe intensities. Explore it with the help of the package documentation.

class(a.rma)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
class(exprs(a.rma))
## [1] "matrix"
head(exprs(a.rma))
##           NEG10-1 NEG10-2  ES10-1 ES10-2 NEG48-1 NEG48-2  ES48-1 ES48-2
## 1000_at   10.3982 10.2544 10.0040 9.9035 10.3749 10.0335 10.3451 9.8633
## 1001_at    5.7176  5.8810  5.8596 5.9540  5.9605  6.0209  5.9811 6.2852
## 1002_f_at  5.5126  5.8018  5.5711 5.6081  5.3901  5.4945  5.5081 5.6301
## 1003_s_at  7.7839  8.0080  8.0380 7.8351  7.9265  8.1389  7.9949 8.2333
## 1004_at    7.2892  7.6037  7.4885 7.7715  7.5218  7.5995  7.4561 7.6752
## 1005_at    9.2067  8.9938  8.2379 8.3380  9.1732  9.0405  7.9261 8.0697

2.3 Data quality assessment

Before we proceed with differential expression analysis, we need to perform quality control (QC) assessment and make sure there are no issues with the data. There are many tools that can produce informative reports, including different QC criteria such as expression distribution, RNA degradation. You can have a look at the Bioconductor packages, arrayQualityMetrics and affyExpress. Here, we will show several diagnostic plots for graphical exploration of the data quality before and after normalization.

Array surface images

The affy::image function outputs the spatial distribution of the intensities on a chip, and allows to look for spatial artifacts. In our case, none of the array has problems, but for illustration purposes we include one bad array (bad.cel). We will plot all arrays, including the bad one, on one multiplot figure. The options in the par function (from the base package graphics) can be used to set the layout, as follows. You can uncomment the second and the last line if you want the figure to be saved in .jpeg file.

a.bad = ReadAffy(filenames = file.path(data.dir,"bad.cel"), verbose=TRUE)
# jpeg("estrogen_array_images.jpg", quality=100)
par.old = par(no.readonly = TRUE) # save the default par options
par(mfcol=c(3,3), mar=c(0,3,3,1))
sapply(1 : (length(a) + 1),
       function(x){
           if(x==9){
               image(a.bad, main="bad.cel")
               }
           else{
               image(a[, x], main=colnames(a[,x]))
               }
       })
par(par.old) #return to the default setting
# dev.off() 

Boxplots and histograms

The distribution of the unprocessed probe intensity as well as the post-normalized expression values can reveal problems with single arrays as well as issues with the normalization procedure. This can be displayed with boxplots or histograms (density estimators). For example, bimodal distribution in the probe intensity distribution can indicate spatial artifact as that seen for the bad array. Lets plot these for our data. Are the distribution all the same? Do you see any batch effects? Did the normalization reduced the differences in the raw data?

par(mfrow =c(2,2), mar=c(2,3,3,1))
color.set <- brewer.pal(8, "Dark2") # select a color palette
#set the plot limits
lim=c(min(exprs(a.rma)), # min y-axis value
       max(max(log2(intensity(a))), max(exprs(a.rma))) # max y-axis value
      )
boxplot(a, col=color.set, ylim=lim, las=2, main="Before normalization")
boxplot(exprs(a.rma), col=color.set, ylim=lim, las=2, main="After normalization with RMA")
hist(a, col=color.set, xlim=lim, main="")
plotDensities.default(exprs(a.rma), col=color.set, main="", legend=FALSE) # from limma package 
par(par.old)

Unsupervised sample clustering

We can also look at the relationships between the samples using hierarchical clustering (e.g., the dendrogram obtained by the function stat::hclust and visualized with plot) or multidimensional scaling (MDS) as obtained with the functionlimma::plotMDS. Both plots show that the samples from the same group cluster together, with the replicates taken 48h after estrogen treatment showing the largest within-group dissimilarty after normalization.

par(mfrow =c(1,2), mar=c(0,3,3,1))

# dendrogram based on probe intensities
distance <- as.dist(1-cor(exprs(a), method="pearson")) # convert the correlation matrix into a distance matrix
clusters <- hclust(distance) # perform hierarchical clustering based on the given sample distance 
plot(clusters, main="Before normalization", xlab="", sub="")

# dendrogram based on summarized expression
distance.rma <- as.dist(1-cor(exprs(a.rma), method="pearson")) # convert the correlation into a distance matrix
clusters.rma <- hclust(distance.rma) 
plot(clusters.rma, main="After normalization with RMA", xlab="",sub="")
par(par.old)

# MDS plot
scan.date <- sapply(pData(protocolData(a))[,1], # gets the ScanDate column
                    function(x) unlist(strsplit(x, split=" "))[1], # filter the time, keep only the date
                    USE.NAMES=FALSE)
color.set2 <- c("red", "violet", "green","blue")
names(color.set2) <- unique(scan.date)
plotMDS(a.rma, col=color.set2[scan.date], labels=sampleNames(a.rma), 
        main="Multidimensional scaling plot of RMA-processed data \n(we have a good separation between the groups)")
legend("topright", title="Scan date", legend=names(color.set2), 
       col=color.set2, pch=15, cex=0.8, bty="n") 

RNA degradation

The function affy::AffyRNAdeg returns information to detect RNA degradation that can be displayed with affy::plotAffyRNAdeg.
Within each probeset, probes are numbered directionally from the 5’ end to the 3’ end. Probe intensities (scaled and shifted) are averaged by probe number, across all probesets. Linear regression models are fit to these average values per sample, and the resulting slope estimates can be used as a degradation measure. High slopes indicate degradation, however more important than the slope is the agreement between arrays.

 deg <- AffyRNAdeg(a, log.it=FALSE)
 data.frame(sample=deg$sample.names, slope=deg$slope)
##    sample     slope
## 1 NEG10-1  0.146641
## 2 NEG10-2  0.132972
## 3  ES10-1  0.150990
## 4  ES10-2  0.058095
## 5 NEG48-1 -0.126269
## 6 NEG48-2 -0.113679
## 7  ES48-1 -0.007458
## 8  ES48-2 -0.186991
 plotAffyRNAdeg(deg, col=color.set, asp=1)
 legend("topright", title="", legend=deg$sample.names, 
         col=color.set, lty=1, lwd=2, cex=0.8, bty="n") 


3 Differential expression analysis with limma (30 min)

The Bioconductor package limma provides an integrative solution for data analysis of gene expression data. It offers tools for handling complex experimental designs, small sample sizes, data from different platform (microarray, PCR, protein arrays, RNA-seq etc.) as well as nice diagnostic plots for quality assessment and higher-level analysis tools such as the one for gene set testing (the overall significance of a set of co-regulated genes).

3.1 Gene-wise linear model

The core element of DE withlimma is the fit of gene-wise linear model to the gene expression data. This is done by estimating the log-intensities (for single-channel data, as in our case) or the log-ratios (for two-channel data) between two or more RNA samples simultaneously. The lmFit function allows us to fit a model robustly or by least square estimation (default). Once the model is fitted, the makeContrasts function can be used to form the contrast (comparison of interest) matrix, which together with the model is used by the contrasts.fit function to calculate the mean log2-fold-changes (logFC) and t-statistics for the contrasts of interest. In the following we will see two ways of specification of the model parameters, i.e., model design matrix. Both can be used to answer the same question, i.e., which are the early- and late-response targets of estrogen treatment.

3.1.1 Treatment-contrast parametrization of the linear model

In the case of estrogen data, we have two factors, time and estrogen treatment. The model is designed to reflect the effect of time in the absence of estrogen (T48), the effect of estrogen (ES) and the time-estrogen interaction effect (ES:T48), assuming the intercept represents the average log2 expression at 10 hours in absence of treatment.

pheno <- pData(a.rma)
design.1 <- model.matrix(~factor(estrogen) * factor(time_h), pheno)
colnames(design.1) <- c("Intercept", "ES","T48","ES:T48") # one name for each model coefficient 
design.1
##         Intercept ES T48 ES:T48
## NEG10-1         1  0   0      0
## NEG10-2         1  0   0      0
## ES10-1          1  1   0      0
## ES10-2          1  1   0      0
## NEG48-1         1  0   1      0
## NEG48-2         1  0   1      0
## ES48-1          1  1   1      1
## ES48-2          1  1   1      1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$`factor(estrogen)`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`factor(time_h)`
## [1] "contr.treatment"
fit.1 <- lmFit(a.rma, design.1) # returns an object of class MArrayLM 
fit.1
## An object of class "MArrayLM"
## $coefficients
##           Intercept        ES      T48    ES:T48
## 1000_at     10.3263 -0.372516 -0.12207  0.272517
## 1001_at      5.7993  0.107485  0.19140  0.034936
## 1002_f_at    5.6572 -0.067603 -0.21491  0.194421
## 1003_s_at    7.8960  0.040608  0.13673  0.040851
## 1004_at      7.4464  0.183606  0.11425 -0.178612
## 12620 more rows ...
## 
## $rank
## [1] 4
## 
## $assign
## [1] 0 1 2 3
## 
## $qr
## $qr
##         Intercept      ES         T48   ES:T48
## NEG10-1  -2.82843 -1.4142 -1.4142e+00 -0.70711
## NEG10-2   0.35355  1.4142  1.1102e-16  0.70711
## ES10-1    0.35355 -0.4459  1.4142e+00  0.70711
## ES10-2    0.35355 -0.4459  3.5355e-01  0.70711
## NEG48-1   0.35355  0.2612 -5.0000e-01  0.63060
## NEG48-2   0.35355  0.2612 -5.0000e-01  0.63060
## ES48-1    0.35355 -0.4459 -3.5355e-01 -0.26120
## ES48-2    0.35355 -0.4459 -3.5355e-01 -0.26120
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$`factor(estrogen)`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`factor(time_h)`
## [1] "contr.treatment"
## 
## 
## $qraux
## [1] 1.3536 1.2612 1.3536 1.2612
## 
## $pivot
## [1] 1 2 3 4
## 
## $tol
## [1] 1e-07
## 
## $rank
## [1] 4
## 
## 
## $df.residual
## [1] 4 4 4 4 4
## 12620 more elements ...
## 
## $sigma
##   1000_at   1001_at 1002_f_at 1003_s_at   1004_at 
##   0.21776   0.12833   0.11769   0.15544   0.17066 
## 12620 more elements ...
## 
## $cov.coefficients
##           Intercept   ES  T48 ES:T48
## Intercept       0.5 -0.5 -0.5    0.5
## ES             -0.5  1.0  0.5   -1.0
## T48            -0.5  0.5  1.0   -1.0
## ES:T48          0.5 -1.0 -1.0    2.0
## 
## $stdev.unscaled
##           Intercept ES T48 ES:T48
## 1000_at     0.70711  1   1 1.4142
## 1001_at     0.70711  1   1 1.4142
## 1002_f_at   0.70711  1   1 1.4142
## 1003_s_at   0.70711  1   1 1.4142
## 1004_at     0.70711  1   1 1.4142
## 12620 more rows ...
## 
## $pivot
## [1] 1 2 3 4
## 
## $Amean
##   1000_at   1001_at 1002_f_at 1003_s_at   1004_at 
##   10.1471    5.9575    5.5645    7.9948    7.5507 
## 12620 more elements ...
## 
## $method
## [1] "ls"
## 
## $design
##         Intercept ES T48 ES:T48
## NEG10-1         1  0   0      0
## NEG10-2         1  0   0      0
## ES10-1          1  1   0      0
## ES10-2          1  1   0      0
## NEG48-1         1  0   1      0
## NEG48-2         1  0   1      0
## ES48-1          1  1   1      1
## ES48-2          1  1   1      1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$`factor(estrogen)`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`factor(time_h)`
## [1] "contr.treatment"
cont.1 <- cbind(es10=c(0,1,0,0), # contrast for the early-response targets
                es48=c(0,1,0,1)) # contrast for the late-response targets
fitC.1 <- contrasts.fit(fit.1, cont.1) # returns an object of class MArrayLM 
dim(fitC.1) 
## [1] 12625     2

3.1.2 Group-mean parametrization of the linear model

This much simpler design specifies one coefficient for each group of samples that will represent the group-mean expression. The tested contrasts are formed explicitly and therefore it is more comprehensible and less confusing for most users.

pheno <- pData(a.rma)
temp <- sub(pattern="absent", replacement="NEG", pheno[,"estrogen"])
temp <- sub(pattern="present", replacement="ES", temp)
group <- paste(temp, pheno[, "time_h"], sep="")
group <- factor(group)
design.2 <- model.matrix(~0 + group, pheno) # '~0' means no intercept coefficient
colnames(design.2) <- levels(group)
design.2
##         ES10 ES48 NEG10 NEG48
## NEG10-1    0    0     1     0
## NEG10-2    0    0     1     0
## ES10-1     1    0     0     0
## ES10-2     1    0     0     0
## NEG48-1    0    0     0     1
## NEG48-2    0    0     0     1
## ES48-1     0    1     0     0
## ES48-2     0    1     0     0
## attr(,"assign")
## [1] 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
fit.2 <- lmFit(a.rma, design.2)

cont.2 <- makeContrasts(es10="ES10 - NEG10", # contrast for the early-response targets
                        es48="ES48 - NEG48", # contrast for the late-response targets
                        levels=design.2) 
fitC.2 <- contrasts.fit(fit.2, cont.2)

If you are interested in the main effect of the treatment or the time-estrogen interaction effect we can define the following contrasts

cont.3 <- makeContrasts(es="(ES10 + ES48 - NEG10 - NEG48)/2",
                        es_time="(ES48 - NEG48 - ES10 + NEG10)", # the same meaning as the "ES:T48" coefficient in the first design
                        levels=design.2) 

If QC plots (e.g., obtained by limma::plotMDS) reveal batch effects associated for example with hybridization-/scan-date or lab technician or other covariate, then we can include additively this factor in the model design as follows

# lets assume that samples NEG10-1 and ES10-1 come from a different lab (and we want to take this into acount)
lab = factor(c("B","A", "B", "A", "A", "A", "A", "A"))
design.2b <- model.matrix(~0 + group + lab, pheno)
design.2b 
##         groupES10 groupES48 groupNEG10 groupNEG48 labB
## NEG10-1         0         0          1          0    1
## NEG10-2         0         0          1          0    0
## ES10-1          1         0          0          0    1
## ES10-2          1         0          0          0    0
## NEG48-1         0         0          0          1    0
## NEG48-2         0         0          0          1    0
## ES48-1          0         1          0          0    0
## ES48-2          0         1          0          0    0
## attr(,"assign")
## [1] 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
## 
## attr(,"contrasts")$lab
## [1] "contr.treatment"
nonEstimable(design.2b) # check the rank of the matrix before fitting the model with this design 
## NULL

In this case, actually we have treatment nested within the scan date effect. We can use the function limma::duplicateCorrelation to estimate the correlation between measurements made on the same date, and include this in the model. The idea is to make all comparison between the estrogen-treated and untreated samples within the same date and block for the possible scan date effect.

corfit <- duplicateCorrelation(a.rma, design.2, block=scan.date) # it will take minute or two for calculation
corfit$consensus.correlation
## [1] 0.28131
fit.3 = lmFit(a.rma, design.2, block=scan.date, cor=corfit$consensus.correlation)
fitC.3 <- contrasts.fit(fit.3, cont.2)

3.2 Testing for differential expression

The functions limma::eBayes and limma::treat are used to rank genes in order of DE evidence. Given a linear model fit such as fitC.1 and fitC.2, these functions compute moderated t-statistics, moderated F-statistics, or log-odds of differential expression by empirical Bayes moderation of the standard errors towards a common value.

3.2.1 Gene ranking with eBayes

The empirical Bayes moderated t-statistics test if each individual contrast equals to zero. For each probeset, the moderated F-statistic tests whether all the contrasts are zero. It is an overall test computed from the set of t-statistics for that probeset.

fitE <- eBayes(fitC.2)

One way to select genes targeted by estrogen at either time is to use the moderated F-statistics, the values of which are stored in fitE$F and the corresponding p-values stored in the component fitE$F.p.value. Lets plot the latter as well as the contrast-wise t-statistic p-values. Sharp peaks of small p-values on these plots indicate good DE detection power. What do you observe?

par(mfrow=c(1,3))

for(name_cont in colnames(fitE$p.value))
    hist(fitE$p.value[,name_cont], main=name_cont, xlab="P value (moderated t-test)")

hist(fitE$F.p.value, xlab="P value (moderated F-test)", main="")

par(par.old)

However, as we have performed the test multiple times (across genes, and across contrast) we need to adjust these values before we can select a p-value cutoff. The key functions for this purpose are limma::topTable and limma::decideTests. The first one adjust for multiple testing across genes, while the second one offers methods for adjustment across genes and contrasts.

3.2.1.1 topTable

limma::topTable summarizes the linear model fit object produced by eBayes, by selecting the top-ranked genes for any given contrast. The default setup calls significantly differentially expressed genes (DEGs) by controlling the false discovery rate (FDR), the expected proportion of false discoveries among the genes called significantly differentially expressed, with the Benjamini and Hochberg’s multiple testing adjustment method (option adjust.method='fdr'). We choose more stringent cutoff value (option p=0.01, default is 0.05) to output genes with adjusted p-values (here FDR estimates) lower than 0.01, and we keep all genes that pass the cutoff (option n=Inf, default is 10).

tab.f <- topTable(fitE, coef=NULL, adjust.method="fdr", p=0.01, n=Inf) # when coef=NULL (default) genes are ranked by the moderated F statistics
tab.es10 <- topTable(fitE, coef="es10", adjust.method="fdr", p=0.01, n=Inf) # when coef is specified genes are ranked by its corespondent moderated t-statistics 
head(tab.es10) # preview the first couple of rows from the resulting data frame
##           logFC AveExpr      t    P.Value  adj.P.Val      B
## 39642_at 2.9394  7.8765 23.717 4.7416e-09 3.1283e-05 9.9668
## 910_at   3.1137  9.6602 23.592 4.9557e-09 3.1283e-05 9.9425
## 31798_at 2.8002 12.1158 16.385 1.0257e-07 3.5111e-04 7.9773
## 41400_at 2.3810 10.0416 16.225 1.1124e-07 3.5111e-04 7.9169
## 40117_at 2.5553  9.6766 15.681 1.4729e-07 3.5762e-04 7.7051
## 1854_at  2.5076  8.5321 15.158 1.9455e-07 3.5762e-04 7.4908
nrow(tab.f) # number of DEGs for at least one of the two contrasts
## [1] 345
nrow(tab.es10) # number of DEGs for the contrast es10
## [1] 113
table(tab.es10$logFC>0)
## 
## FALSE  TRUE 
##    23    90

Visualization by volcano plots

Volcano plots are very useful for showing the relationship between logFC and p-value. Lets show this for the contrast es10, using the unadjusted p-values corresponding to the moderated t-statistics. Furthermore, we will compare these with the ones obtained by the ordinary two-group t-test, as calculated by the genefilter::rowttest function. What do you observe? Are the genes with higher fold changes also ranked high according to the (moderated) t-test? Which test gives more DEGs and why?

# wrapper function for displaying volcano plots with highlights
volcanoPlot <- function(x, y, main, 
                        xlim=range(x),
                        ylim=range(y),
                        xlab="Mean log2 fold change", 
                        ylab="Log10 P value", 
                        top=30)
    {
    # lets select the top-ranked genes according to x (log2-fold-changes)
    order.x = order(abs(x), decreasing=TRUE)[1:top]
    # lets select the top-ranked genes according to y
    order.y = order(y, decreasing=TRUE)[1:top]
    order.all = union(order.x, order.y)
    
    # x vs y 
    plot(x[-order.all], y[-order.all], xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, main=main)
    # highlight the selected genes 
    points(x[order.x], y[order.x], pch=16, col="blue")
    points(x[order.y], y[order.y], pch=1, col="red")
    
    invisible()
    }

par(mfcol=c(1,2))

# moderated t-test
tab.es10 <- topTable(fitE, coef="es10", adjust.method="fdr", p=1, n=Inf) # p=1 means no p-value based gene filtering
x <- tab.es10$logFC
y <- -log10(tab.es10$P.Value)
n.deg <- sum(y>=2) # number of DEGs corresponding to p-value cutoff of 0.01
volcanoPlot(x,y, main=paste("Moderated t-test:", n.deg, "DEGs"))
abline(h=2) # add horizontal line at p=0.01

# ordinary two-group t-test
a.rma.sub <- a.rma[, c("ES10-1", "ES10-2", "NEG10-1","NEG10-2")]
tt <- rowttests(exprs(a.rma.sub), factor(a.rma.sub$estrogen, levels=c("present", "absent")))
x <-  tt$dm # log2-fold-changes
y <- -log10(tt$p.value)
n.deg <- sum(y>=2) # number of DEGs corresponding to p-value cutoff of 0.01
volcanoPlot(x,y, main=paste("Ordinary t-test:", n.deg, "DEGs"))
abline(h=2) # add horizontal line at p=0.01

par(par.old)

When the sample size (number of replicates per group) are moderate or large, the ordinary t-statistics performs as well as the moderated t-statistics (there is enough data to get a good estimate of the gene-specific variance), but in the case of few replicates (e.g., two in our case) it performs poor. On the other hand, the moderated version provides better results because it borrows information across genes to provide better estimate of the gene-specific variance. Direct comparison of the gene-wise p-values from both tests will show the difference more clear.

x <- -log10(tab.es10$P.Value)
y <- -log10(tt[rownames(tab.es10), ]$p.value)
xy.lim <- c(0, max(c(x,y)))
plot(x,y, main="Test comparison (with sample size = 2)", 
     xlab="Log10 P value: ordinary t-test", 
     ylab="Log10 P value: moderated t-test", 
     xlim=xy.lim, ylim=xy.lim, pch=".")
abline(c(0,1), col="red") # add line x=y

Compare the results above with the same plot below that compares both statistics calculated using larger data set (artificially generated by just coping the existing data 4 times, to produce in total 8 replicates per group).

3.2.1.2 decideTests

limma::decideTests returns an object of class limma::TestResults, which is essentially a matrix of -1, 0 or 1 elements, of the same dimension as fitE$coefficients, indicating whether each coefficient is significantly different from zero for the selected multiple testing method.

We will compare two methods, separate (default option, adjustment down the genes per contrast) and nestedF (adjusts across contrasts). The first one is the good choice when the contrasts are not related, as testing a set of contrasts together will give the same results as when each contrast is tested on its own (equal to applying limma::topTable). The second one aims to give greater weight to probesets which are significant for two or more contrasts. Most multiple testing methods tend to underestimate the number of such probesets.

ebay.res <- list() # empty list, to be filled with the outcome of each hypothesis test
for (p in c(0.05, 0.01)) # repeat for two different cutoff values
    { 
    for (mmethod in c("nestedF", "separate")) # test two different multiple testing methods
        {
        id <- paste("eBayes_", mmethod, "_fdr", p, sep ="")
        ebay.res[[id]] <- decideTests(fitE, method=mmethod, adjust.method="fdr", p.value=p) # save the outcome of the test
        print(id)
        print(colSums(abs(ebay.res[[id]]))) # preview DEG counts per contrast
        }
    }
## [1] "eBayes_nestedF_fdr0.05"
## es10 es48 
##  492  664 
## [1] "eBayes_separate_fdr0.05"
## es10 es48 
##  326  638 
## [1] "eBayes_nestedF_fdr0.01"
## es10 es48 
##  230  295 
## [1] "eBayes_separate_fdr0.01"
## es10 es48 
##  113  250

Visualization by Venn diagrams and heatmaps

Lets make a summary of the results by Venn diagrams on a 2x2 multiplot figure. The limma::vennDiagram function plots Venn diagrams, showing numbers of genes significant in each comparison. How many genes are up-/down-regulated in estrogen-treated vs untreated cells after 10 hours and 48 hours, using the different test?

par(mfcol=c(2,2))
sapply(names(ebay.res),
       function(name)
           {
           vennDiagram(ebay.res[[name]], 
                       include=c("up", "down"), # show counts for up- and down-regulated DEGs
                       mar=c(0,0,0,0), # the width of the margins around the plot
                       counts.col=c("red", "blue"), # color for the counts
                       cex=c(1, 0.8, 0.7), # scaling factors for text on the plot
                       circle.col=c("green", "orange"), # color for the contrast
                       )
           title(name, line=1, cex.main=0.9) 
           })
par(par.old)

Lets visualize the results by heatmaps (with function gplot::heatmap.2), a useful tool to identify collection of genes that show similar patterns of up- or down-regulation by the estrogen treatment. We will generate two maps, one for the gene affected at both 10 and 48 hours, and one for the genes affected only at 10h. Can you see patterns in the gene expression? Are these consistent across replicates or conditions?

res <- abs(ebay.res[["eBayes_separate_fdr0.01"]]) # mark gene as DEG or not (ignore the direction)

list.ids <- list() 
list.ids[["es10"]] <- names(which(rowSums(res)==1 & res[ ,"es10"]==1))
list.ids[["es10_48"]] <- names(which(rowSums(res)==2))

colors.hm <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)) # blue (low expr.) to red (high espr.) color palette with 256 shades 

sel.exp <- exprs(a.rma)[unlist(list.ids, use.names=FALSE), ] # subset the expression matrix

for(name in names(list.ids))
    {
    title.text <- paste(name, ": log2 expression of", length(list.ids[[name]]), "targets")
    mat <- sel.exp[list.ids[[name]], ] # expression for selected probeset ids
    mat <- t(mat) # transpose, so the samples are in the rows
    heatmap.2(mat, col=colors.hm,
              dendrogram="col", # display gene reordering (default clustering method)
              Rowv=FALSE, # no sample reordering 
              main=title.text,
              breaks=seq(min(sel.exp), max(sel.exp), length.out=257), # same range of expr. values for both plots
              keysize=1, key.ylab=NA, key.xlab=NA, key.title="logFC", # the color-key parameters
              density.info="none", trace="none", mar=c(6,6))
    }

3.2.2 Gene ranking with treat

If we want to prioritize for both biological relevant and statistically significant results then we can use the limma::treat function. This computes empirical Bayes moderated t p-values relative to a minimum required logFC threshold in absolute value for each gene and contrast. No moderated F-statistics is computed. Gene ranking summaries can be obtained with the limma::topTreat function, in this case with adjustment for testing down the genes per contrast with the Benjamini and Hochberg’s method.

fitT <- treat(fitC.2, lfc=1) # |logFC| > 1, unlike |logFC| > 0 in eBayes() 

treat.res <- list() # empty list, to be filled with the outcome of each hypothesis test
for (p in c(0.05, 0.01)) # test two different cutoff values
    {   
    e10 <- topTreat(fitT, coef="es10", adjust.method="fdr", p.value=p, n=Inf)
    e48 <- topTreat(fitT, coef="es48", adjust.method="fdr", p.value=p, n=Inf)
    
    # we have to provide the summary DEG matrix, no decideTest-alike function for this case
    mat <- matrix(0, ncol=2, nrow=nrow(fitT), 
                  dimnames=list(rownames(fitT), 
                                colnames(fitT))) # initialize zero-valued matrix
    
    probesets <- list("up"=list("es10"=rownames(e10[e10$logFC>0, ]),
                                "es48"=rownames(e48[e48$logFC>0, ])),
                      "down"=list("es10"=rownames(e10[e10$logFC<0, ]),
                                  "es48"=rownames(e48[e48$logFC<0, ]))
                      ) # saves ids of DEGs in a nested list
    
    junk <- sapply(names(probesets),
                   function(type){
                       sapply(names(probesets[[type]]),
                              function(deg){
                                  ids <- probesets[[type]][[deg]]
                                  if(length(ids)>0)
                                      # to change values in global variable, such as 'mat',
                                      # inside a function use operator <<-
                                      mat[ids, deg] <<- ifelse(type=="up", 1, -1) 
                                  })
                       
                       }) # fills the matrix 'mat'
    
    id <- paste("treat_logFC>", fitT$treat.lfc,"_fdr", p, sep ="")
    treat.res[[id]] <- mat # save the outcome
    print(id)
    print(colSums(abs(treat.res[[id]]))) # preview DEG counts per contrast
    }
## [1] "treat_logFC>1_fdr0.05"
## es10 es48 
##   11   21 
## [1] "treat_logFC>1_fdr0.01"
## es10 es48 
##    3    6

Lets summarize the results via Venn diagrams. Compare the plot with the one from the eBayes-based results.

par(mfcol=c(1,2))
sapply(names(treat.res),
       function(name)
           {
           vennDiagram(treat.res[[name]], 
                       include=c("up", "down"), # show counts for up- and down-regulated DEGs
                       mar=c(0,2,0,0), # the width of the margins around the plot
                       counts.col=c("red", "blue"), # color for the counts
                       cex=c(1, 0.8, 0.7), # scaling factors for text on the plot
                       circle.col=c("green", "orange"), # color for the contrast
                       asp=1
                       )
           title(name, line=-3, cex.main=0.9) 
           })
par(par.old)


4 Annotation and summary reports (10 min)

1. Annotation

Now that have selected interesting genes by DE, we need to link the platform-dependent IDs (in this case, Affymetrix probeset IDs) to biologically meaningful information such as gene names, gene functions etc. For this purpose we need the appropriate annotation packages, i.e., in this case hgu95av2, which is already attached. Lets see what annotation information is provided with this package.

ls("package:hgu95av2.db")
##  [1] "hgu95av2"              "hgu95av2ACCNUM"        "hgu95av2ALIAS2PROBE"   "hgu95av2CHR"           "hgu95av2CHRLENGTHS"   
##  [6] "hgu95av2CHRLOC"        "hgu95av2CHRLOCEND"     "hgu95av2.db"           "hgu95av2_dbconn"       "hgu95av2_dbfile"      
## [11] "hgu95av2_dbInfo"       "hgu95av2_dbschema"     "hgu95av2ENSEMBL"       "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"     
## [16] "hgu95av2ENZYME"        "hgu95av2ENZYME2PROBE"  "hgu95av2GENENAME"      "hgu95av2GO"            "hgu95av2GO2ALLPROBES" 
## [21] "hgu95av2GO2PROBE"      "hgu95av2MAP"           "hgu95av2MAPCOUNTS"     "hgu95av2OMIM"          "hgu95av2ORGANISM"     
## [26] "hgu95av2ORGPKG"        "hgu95av2PATH"          "hgu95av2PATH2PROBE"    "hgu95av2PFAM"          "hgu95av2PMID"         
## [31] "hgu95av2PMID2PROBE"    "hgu95av2PROSITE"       "hgu95av2REFSEQ"        "hgu95av2SYMBOL"        "hgu95av2UNIGENE"      
## [36] "hgu95av2UNIPROT"

Lets obtain the gene symbols and Entrez ID and save it in a data frame for further use. For this we need the following functions from the package annotate, already attached. Note that all annotations are subjected to changes over time, therefore you should always report the version of the annotation package with the results. You will find in the reference manual of the package the actual version of each database used.

pids <- rownames(fitE) 
ann <- annotation(a.rma) 
syms <- getSYMBOL(pids, ann) # get the gene symbols
egs <- getEG(pids, ann) # get the Entrez Gene IDs
gene.df <- data.frame(PROBESET=pids, SYMBOL=syms, ENTREZ_ID=egs)
head(gene.df)
##            PROBESET  SYMBOL ENTREZ_ID
## 1000_at     1000_at   MAPK3      5595
## 1001_at     1001_at    TIE1      7075
## 1002_f_at 1002_f_at CYP2C19      1557
## 1003_s_at 1003_s_at   CXCR5       643
## 1004_at     1004_at   CXCR5       643
## 1005_at     1005_at   DUSP1      1843

Lets check the gene frequency on the chip, i.e., how many probesets map to each gene, then aggregate this across all genes and see how many genes have been represented with multiple probesets. Some probesets don’t map to any gene (NA value in the table).

paste("Number of genes (with Entrez ID) represented by at least probeset on the chip: ", 
      length(unique(gene.df$ENTREZ_ID))
      )
## [1] "Number of genes (with Entrez ID) represented by at least probeset on the chip:  8595"
paste("Number of probesets that haven't been annotated with Entrez IDs: ", 
            sum(is.na(gene.df$ENTREZ_ID))
      )
## [1] "Number of probesets that haven't been annotated with Entrez IDs:  1155"
head(table(gene.df$SYMBOL)) # probesets per gene 
## 
##    AADAC     AAK1     AAMP    AANAT     AARS AASDHPPT 
##        1        5        1        1        1        1
table(table(gene.df$SYMBOL)) # lists the number of genes that have 1,2... X probesets on the chip
## 
##    1    2    3    4    5    6    7    8 
## 6563 1429  451  102   22   16    6    5

2. Save results in a file

Now we are ready to combine the annotation data with the DE results and save them in a tab-separated file format.

resT.1 <- topTreat(fitT, coef="es10", adjust.method="fdr",  genelist=gene.df, n=Inf)
write.table(resT.1, file="toptreat_ES10h_lfc1.tvs", sep="\t", row.names=FALSE, col.names=TRUE)
resT.2 <- topTreat(fitT, coef="es48", adjust.method="fdr", genelist=gene.df, n=Inf)
write.table(resT.2, file="toptreat_ES48H_lfc1.tvs", sep="\t", row.names=FALSE, col.names=TRUE)
head(resT.1)
##          PROBESET SYMBOL ENTREZ_ID  logFC AveExpr       t    P.Value adj.P.Val
## 910_at     910_at    TK1      7083 3.1137  9.6602 16.0154 6.2137e-08 0.0004741
## 39642_at 39642_at ELOVL2     54898 2.9394  7.8765 15.6485 7.5105e-08 0.0004741
## 31798_at 31798_at   TFF1      7031 2.8002 12.1158 10.5337 1.8496e-06 0.0077838
## 40117_at 40117_at   MCM6      4175 2.5553  9.6766  9.5441 4.0210e-06 0.0113297
## 41400_at 41400_at    TK1      7083 2.3810 10.0416  9.4105 4.4870e-06 0.0113297
## 1854_at   1854_at  MYBL2      4605 2.5076  8.5321  9.1135 5.7625e-06 0.0121253

3. HTML reports

We can provide similar reports with more biologically relevant information (with links to the original annotation sources) in a HMTL format by using the package annaffy as follows.

selected.ids <- rownames(resT.1)[1:10] # top 10 early-response estrogen targets

library(annaffy)
aaf.handler() # which information is available
##  [1] "Probe"               "Symbol"              "Description"         "Chromosome"          "Chromosome Location" "GenBank"            
##  [7] "Gene"                "Cytoband"            "UniGene"             "PubMed"              "Gene Ontology"       "Pathway"
which.info <- c(1:3,7,11:12) # which information we will select

ann.table <- list()
ann.table[[1]] <- aafTableAnn(selected.ids, "hgu95av2.db", aaf.handler()[which.info]) # generates annotation table
# lets add some more information, e.g., logFC and AveExpr (average log2-expression for the probeset over all samples), p-value
ann.table[[2]] <- aafTable(logFC=resT.1[selected.ids, "logFC"], signed=TRUE)
ann.table[[3]] <- aafTable(AvgExpr=resT.1[selected.ids, "AveExpr"])
ann.table[[4]] <- aafTable(adj.P.Value=resT.1[selected.ids,"adj.P.Val"])
merged.table <- ann.table[[1]]
for(i in 2:4)
    merged.table <- merge(merged.table, ann.table[[i]])
saveHTML(merged.table, filename="ES10_top10.html", 
         title <- "Top 10 early-response estrogen targets with statisticaly significant |logFC| > 1 ")

5 Exercises (30 min)

Perform gene expression analysis on the data GSE20986
The goal of the study by Browning et al. (2012) was to find whether the human umbilical vein endothelial cells (huvec) are suitable surrogate when studying pathophysiological mechanisms of ocular disorders. The first step towards the answer is to compare the gene expression profiles of unpassaged, proliferating (huvec) and human iris, retinal and choroidal microvascular endothelial cells.

  1. You need to define the data frame with sample metadata.
  2. You need to install the metadata packages hgu133plu2.db and hgu133plu2.cdf for Affymetrix GeneChip Human Genome U133 Plus 2.0 arrays
  3. Perform quality assessment of the data, e.g., compare the distribution of the data before and after processing with rma.
  4. Fit a linear model with appropriate design matrix.
  5. Find all DEGs at 0.05 FDR for the following contrasts: huvec vs iris cells, huvec vs retinal cells, and huvec vs choroidal cells.
  6. Make a Venn diagram summary of all results.
  7. Make a volcano plot displaying the results of the huvec vs retinal cells. Mark with different colors the genes based on the group they belong, i.e., not significant, significantly up-regulated (logFC>2), significantly down-regulated (logFC<2)

6 References

Books that cover microarray analysis
Hahne et al. (2008). Bioconductor case studies
Gentleman et al. editors (2005). Bioinformatics and Computational Biology Solutions using R and Bioconductor
This tutorial is based on the material presented in these books.

Articles
Huber et al. (2015). Orchestrating high-throughput genomic analysis with Bioconductor
Ritchie et al. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies
Davis and Meltzer (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor
Scholtens et al. (2004). Analyzing factorial designed microarray experiments
Storey and Tibshirani (2003). Statistical significance for genomewide studies

Other web resources/tools
Bioconductor training materials, courses on (various) genomic data analysis.
RStudio cheat sheets, useful if you plan to use more often Rstudio.
GEO2R, an interactive web tool that allows users to compare two or more groups of samples (from a given GEO accession) in order to identify genes that are differentially expressed across experimental conditions.