This is a step by step tutorial for differential gene expression (DE) analysis in the statistical environment R. It will focus on expression data obtained by microarrays, a high-throughput technology for genome-wide expression profiling. The analysis will be done on an example data set resulting from a larger multi-factor experimental design that involves estrogen receptor positive breast cancer cell lines with the aim to identify early- and late-response targets of estrogen treatment. For literature and software references see section References.

Requirements: Installed recent versions of R or RStudio on your computers.

1 Getting started

1.1 Installing the necessary R packages

For this tutorial, you need to install several packages from Bioconductor, an open-source software development project that provides tools for the analysis and comprehension of high-throughput genomic data. Other useful R packages and documentation are available from the Comprehensive R Archive Network CRAN.

# set two lists with the names of the Biocodnuctor and the CRAN packages
bioc.libs <- c("Biobase", "affy", "annotate", "hgu95av2.db", "hgu95av2cdf", "estrogen",
               "limma", "genefilter", "vsn", "GEOquery", "annaffy", "GO.db","KEGG.db")
cran.libs <- c("RColorBrewer", "R.utils", "gplots")

Before we continue with installing, lets check for available updates on the already available packages, using the function biocLite from the Bioconductor package BiocInstaller. If invoked with no arguments it installs core packages (when Bioconductor is installed for the first time) or updates already installed packages.

source("http://bioconductor.org/biocLite.R") # get BiocInstaller
biocLite() # if they are outdated packages, choose the option to update all the packages

After the update, restart the R session (in Rstudio go to menu tab Session -> Restart R). Now load the BiocInstaller package and finally install the packages saved in the lists above.

library(BiocInstaller)
biocLite(bioc.libs)
biocLite(cran.libs)

After successful installation, restart the R session again, and check the status of the available packages: in your case, everything should be up to date.

library(BiocInstaller)
biocValid() # should return TRUE

1.2 R help

A short reminder on how to get help for various topics (e.g., packages, functions, operators, concepts) as follows
help() or ? is primary help for functions and features,
help.search() or ?? finds help pages on a vague topic,
library() lists available packages and the help objects they contain,
data() lists available data sets,
ls lists names of the objects in a specified environment.

help(package="affy") # get help even when package is not loaded
?? affy
help.search("affy") 
? sapply #help(sapply)  
#' some topics, such as operators or control-flow reserved words, need to be quoted
? "if"
? "+"
## open the vignette for the Biobase package (the package has to be already attached)
library(Biobase) # base functions for Bioconductor
openVignette("Biobase") 
# list the content of the package (the package has to be already attached)
ls("package:Biobase")[1:12] # the list is long, so display the first 12 entries only
##  [1] "abstract"                    "addVigs2WinMenu"             "aggenv"                      "aggfun"                     
##  [5] "Aggregate"                   "AnnotatedDataFrame"          "annotatedDataFrameFrom"      "annotation"                 
##  [9] "annotation<-"                "anyMissing"                  "as.data.frame.ExpressionSet" "assayData"
# shows the user-defined datasets and functions in the current enviorment
ls() 
## [1] "bioc.libs" "cran.libs"

1.3 Set up a working directory

It is quite handy to have a project-specific working directory, where the R process will save by default plots or data of the project-related analysis. This can be done by using the function setwd.

main.dir <- file.path(getwd(), "tutorial_project") # make file path in a platform-independent way
dir.create(main.dir) # create a subdirectory for the current project
setwd(main.dir) # set the working directory
getwd() # display the current working directory
## [1] "/home/taskova/R/teaching/tutorial_project"

1.4 Load and attach packages

You can always attach an available (installed) package later in the script, just before using its functions and/or data. For simplicity, we will include most of the packages at this point.

library(affy) # data analysis and exploration of Affymetrix array probe level data
library(annotate) # annotation functionality  
library(hgu95av2.db) # gene chip specific metadata
library(hgu95av2cdf) # info relating probe pair sets to locations on the array
library(estrogen) # data for our analysis 
library(genefilter) # gene filtering functionality 
library(limma) # linear modeling and statistical testing for differential expression 
library(R.utils) # additional useful functions
library(RColorBrewer) # nice looking color palettes
library(gplots) # heatmap function

sessionInfo is an important function for reproducibility of results. It prints version information about R, the OS and attached or loaded packages. Best practice is to include it as the last command in the R script and save the result in a file.

sessionInfo() # output on the screen
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 15.10
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_2.17.0        RColorBrewer_1.1-2   R.utils_2.1.0        R.oo_1.19.0          R.methodsS3_1.7.0    limma_3.24.14       
##  [7] genefilter_1.50.0    estrogen_1.14.0      hgu95av2cdf_2.16.0   hgu95av2.db_3.1.3    org.Hs.eg.db_3.1.2   RSQLite_1.0.0       
## [13] DBI_0.3.1            annotate_1.46.1      XML_3.98-1.3         AnnotationDbi_1.30.1 GenomeInfoDb_1.4.1   IRanges_2.2.5       
## [19] S4Vectors_0.6.2      affy_1.46.1          Biobase_2.28.0       BiocGenerics_0.14.0 
## 
## loaded via a namespace (and not attached):
##  [1] BiocInstaller_1.18.5  formatR_1.2           bitops_1.0-6          tools_3.2.2           zlibbioc_1.14.0       digest_0.6.8         
##  [7] evaluate_0.8          preprocessCore_1.30.0 yaml_2.1.13           stringr_1.0.0         knitr_1.12.3          caTools_1.17.1       
## [13] gtools_3.5.0          survival_2.38-3       rmarkdown_0.9.2       gdata_2.17.0          magrittr_1.5          codetools_0.2-14     
## [19] htmltools_0.3         splines_3.2.2         xtable_1.7-4          KernSmooth_2.23-15    stringi_1.0-1         affyio_1.36.0
capture.output(sessionInfo(), file="session.log") # output to a file in the working directory

2 Microarray data preprocessing (30 min)

2.1 Data import

Estrogen dataset

In this case, the data are provided by the package estrogen, which includes 8 samples from an estrogen receptor positive breast cancer human cell lines. After serum starvation, 4 samples were exposed to estrogen and 4 samples were left untreated. Each group of samples was profiled with Affymetrix HGU95Av2 gene chips after 10 hours (2 samples) and 48 hours (2 samples). The complete description of the data, that involves in total 32 samples from a experiment with 4 factors (estrogen treatment, time, drug treatment, translational inhibitor treatment), is given by Scholtens et al. (2004).

We can access the raw data, i.e., the probe intensity level data in .CEL format, as outputted by the Affymetrix software. These files are located in a subdirectory called extdata belonging to the corresponding package on your computer’s hard disk. The function system.file will help you find the exact location. For your own data, put all of the .CEL files in the same directory and specify the full path to it.

data.dir <- system.file("extdata", package="estrogen") # or provide the path
data.dir
## [1] "/home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata"
dir(data.dir) # lists the content of the directory 
##  [1] "bad.cel"       "estrogen.txt"  "high10-1.cel"  "high10-2.cel"  "high48-1.cel"  "high48-2.cel"  "low10-1.cel"   "low10-2.cel"  
##  [9] "low48-1.cel"   "low48-2.cel"   "phenoData.txt" "targLimma.txt"

GEO datasets

If you use data from the publicly available repositories of microarray data, such as the NCBI Gene Expression Omnibus (GEO), then the GEOquery package is quite handy. A single function, getGEOSuppFiles, can take as an argument a GEO accession and will download all the raw data associate with that accession. By default, the function will create a directory in the current working directory to store the raw data for the chosen GEO accession. The following code chunk is illustrates how to get the data with accession number GSE20986.

library(GEOquery) # load the package
geo <- getGEOSuppFiles("GSE20986")  # directory 'GSE20986' was created in the working directory

# data are compressed and archived so we have to extract them before using them
geo.dir <- file.path("GSE20986","data")
untar(file.path("GSE20986","GSE20986_RAW.tar"), 
      exdir=geo.dir # unarchive into subdirectory 'data' of directory 'GSE20986'
      ) 
cel.files <- list.files(geo.dir, pattern=".gz$", full.names=TRUE) # list files with file extension '.gz'
sapply(cel.files, gunzip) # uncompress each file with the function 'gunzip'
dir(geo.dir)
##  [1] "GSM524662.CEL" "GSM524663.CEL" "GSM524664.CEL" "GSM524665.CEL" "GSM524666.CEL" "GSM524667.CEL" "GSM524668.CEL" "GSM524669.CEL"
##  [9] "GSM524670.CEL" "GSM524671.CEL" "GSM524672.CEL" "GSM524673.CEL"
# now we can set data.dir, but we won't do it as our focus is on the estrogen data
# data.dir<-file.path(getwd(), "data_geo")

Sample-description data

The Biobase::AnnotatedDataFrame object is designed to standardize the storage and manipulation of phenotypic data and its metadata.
– Phenotypic data summarizes information about the samples (e.g., sex, age, and treatment status; referred to as covariates). The information can be represented as a table with S rows and V columns, where V is the number of covariates and S is the number of samples. These data can be accessed/set with the pData function.
– Metadata about the covariates, providing for each longer description, possible values, units etc.). These data can be accessed/set with the function varMetadata.
See the vignette of the Biobase package for more information.

The file estrogen.txt contains information on the samples that were hybridized onto the chip arrays. It can be used to create an AnnotatedDataFrame instance by calling the function Biobase::read.AnnotatedDataFrame. The header option tells the function that the first line in the file contains column headings, and the row.names option indicates that the first column of the file contains the row names. Note that you can always create similar text files in text editors of your preference, and then load them in R.

pd <- read.AnnotatedDataFrame(file.path(data.dir,"estrogen.txt"), 
                              header=TRUE, sep="", row.names=1)
pd
## An object of class 'AnnotatedDataFrame'
##   rowNames: low10-1.cel low10-2.cel ... high48-2.cel (8 total)
##   varLabels: estrogen time.h
##   varMetadata: labelDescription
sampleNames(pd) # the names of the samples, file names in this case
## [1] "low10-1.cel"  "low10-2.cel"  "high10-1.cel" "high10-2.cel" "low48-1.cel"  "low48-2.cel"  "high48-1.cel" "high48-2.cel"
pData(pd) # returns a data frame with samples as rows and covariates as columns
##              estrogen time.h
## low10-1.cel    absent     10
## low10-2.cel    absent     10
## high10-1.cel  present     10
## high10-2.cel  present     10
## low48-1.cel    absent     48
## low48-2.cel    absent     48
## high48-1.cel  present     48
## high48-2.cel  present     48
varMetadata(pd) # returns description of the covariates (empty at the moment)
##          labelDescription
## estrogen                 
## time.h

You can also create the data frame object from scratch (when no text file is provided). The minimal information needed is a data frame of phenotypic data. For better readability and no confusion, it is always good to provide description of the covariates as metadata.

pd.new <- AnnotatedDataFrame()
df <- data.frame(estrogen=rep(rep(c("absent", "present"), each=2), times=2),
                 time_h=rep(c(10, 48), each=4),
                 replicate=rep(c(1,2), times=2),
                 estrogen_tmp=rep(rep(c("low", "high"), each=2), times=2)) # if you prepare the data frame
rownames(df) <- apply(df, 1, 
                      function(item) # makes the sample name, i.e. file name
                          {paste(item["estrogen_tmp"],
                                 item["time_h"],"-", 
                                 item["replicate"], 
                                 ".cel", sep="")
                           }
                      ) 
pData(pd.new) <- df[, 1:3] # remove the last column, not needed
varMetadata(pd.new) <- data.frame(labelDescription=
                                      c("Estrogen treatment status with 2 values, absent  and present",
                                        "Time of measurement after treatment with 2 values, 10 and 48 hours",
                                        "Biological replicate id within exper. group with 2 values, 1 and 2"),
                                  row.names=colnames(pd.new))
pData(pd.new)
varMetadata(pd.new)
##              estrogen time_h replicate
## low10-1.cel    absent     10         1
## low10-2.cel    absent     10         2
## high10-1.cel  present     10         1
## high10-2.cel  present     10         2
## low48-1.cel    absent     48         1
## low48-2.cel    absent     48         2
## high48-1.cel  present     48         1
## high48-2.cel  present     48         2
##                                                             labelDescription
## estrogen        Estrogen treatment status with 2 values, absent  and present
## time_h    Time of measurement after treatment with 2 values, 10 and 48 hours
## replicate Biological replicate id within exper. group with 2 values, 1 and 2

Probe level data

Importing probe intensities from .CEL files is done with the function affy::ReadAffy. This function returns an affy::AffyBatch object that stores both probe-level intensities and metadata for the samples. This object can store also other data such as:
experimentData summarizes experiment-level data, e.g. the investigator or lab where the experiment was done,
featureData keeps metadata about the features (probes) on the gene chip,
protocolData keeps protocol data for sample processing.
Note that if the function is called with no arguments all the CEL files (if any) in the working directory are read. Since our files are not in the working directory, we have to specify a character vector with the file paths.

files <- list.celfiles(path=data.dir, full.names=TRUE) # list of absolute paths to all .CEL files 
files <- files[!grepl(pattern="bad.cel", files)] # grepl returns a logical vector (pattern match or not for each element of files), operator ! means logical negation
a <- ReadAffy(filenames=files, # pass the list of .CEL files
              verbose=TRUE)
## 1 reading /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-1.cel ...instantiating an AffyBatch (intensity a 409600x8 matrix)...done.
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high10-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high48-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/high48-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low10-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low10-2.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low48-1.cel
## Reading in : /home/taskova/R/x86_64-pc-linux-gnu-library/3.2/estrogen/extdata/low48-2.cel
a 
## AffyBatch object
## size of arrays=640x640 features (20 kb)
## cdf=HG_U95Av2 (12625 affyids)
## number of samples=8
## number of genes=12625
## annotation=hgu95av2
## notes=
pData(protocolData(a)) # scan date for each array (important, as can be potential source of batch effect)
##                       ScanDate
## high10-1.cel 06/28/02 13:30:01
## high10-2.cel 06/27/02 13:01:01
## high48-1.cel 01/11/02 13:32:10
## high48-2.cel 01/14/02 12:59:04
## low10-1.cel  06/27/02 13:34:19
## low10-2.cel  06/28/02 14:10:26
## low48-1.cel  01/11/02 13:20:16
## low48-2.cel  01/14/02 12:47:24

When passing the phenotypic data frame, be careful about two things. First, number of samples has to be equal to the number of .CEL files, and second the order has to match as well.

all(rownames(pd.new)==colnames(a))
rownames(pd.new)
colnames(a)
## [1] FALSE
## [1] "low10-1.cel"  "low10-2.cel"  "high10-1.cel" "high10-2.cel" "low48-1.cel"  "low48-2.cel"  "high48-1.cel" "high48-2.cel"
## [1] "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel"  "low10-2.cel"  "low48-1.cel"  "low48-2.cel"
a <- a[,rownames(pd.new)] # change the order of the samples to fit to pd.new 
phenoData(a) <- pd.new
all(rownames(pData(a))==colnames(a))
## [1] TRUE

We can always customize the information in a format suitable for us. For example, lets replace the sample names (the original file names) with shorter ones.

sample.names <- sub(pattern=".cel", replacement="", colnames(a))
sample.names <- sub(pattern="low", replacement="NEG", sample.names) 
sample.names <- sub(pattern="high", replacement="ES", sample.names) 
print(sampleNames(a) <- sample.names)
## [1] "NEG10-1" "NEG10-2" "ES10-1"  "ES10-2"  "NEG48-1" "NEG48-2" "ES48-1"  "ES48-2"

2.2 From probe intensity to gene expression values

Obtaining gene expression level data involves three steps:
– Background adjustment, to correct for non-specific hybridization and optical detection noise.
– Normalization, to make the different arrays comparable.
– Summarization, to obtain summary expression values (per probeset) from the probe intensities. One gene can be represented with more than one probeset.

affy has many methods for each step, that can be selected by using the two general functions expresso and threestep, or by special functions to produce specific expression measures such as rma or mas5.

bgcorrect.methods() # list available background adjustment methods
## [1] "bg.correct" "mas"        "none"       "rma"
normalize.AffyBatch.methods() # lists available normalization methods for AffyBatch objects
## [1] "constant"         "contrasts"        "invariantset"     "loess"            "methods"          "qspline"          "quantiles"       
## [8] "quantiles.robust"
express.summary.stat.methods() # lists available expression summarization methods
## [1] "avgdiff"      "liwong"       "mas"          "medianpolish" "playerout"

For example a call to expresso with mas background adjustment (the same is used in the Affymetrix MAS 5.0 software), quantile normalization and summarization by averaging probe intensities (only the PM probes)

a.exp <- expresso(a, bgcorrect.method="mas", normalize.method="quantiles", pmcorrect.method="pmonly", summary.method="medianpolish")

We will use the rma function, which implements the Robust Multi-array Average (RMA) algorithm that combines convolution background correction, quantile normalization and median-polish-based multi-array summarization. Note that the function returns expression values on a log2 scale.

a.rma <- rma(a)
## Background correcting
## Normalizing
## Calculating Expression
a.rma
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12625 features, 8 samples 
##   element names: exprs 
## protocolData
##   sampleNames: NEG10-1 NEG10-2 ... ES48-2 (8 total)
##   varLabels: ScanDate
##   varMetadata: labelDescription
## phenoData
##   sampleNames: NEG10-1 NEG10-2 ... ES48-2 (8 total)
##   varLabels: estrogen time_h replicate
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2

The obtained data is an instance of the Biobase::ExpressionSet object that has standardized structure similar to the one of affy::AffyBatch. It stores probeset-summarized expression values instead of probe intensities. Explore it with the help of the package documentation.

class(a.rma)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
class(exprs(a.rma))
## [1] "matrix"
head(exprs(a.rma))
##           NEG10-1 NEG10-2  ES10-1 ES10-2 NEG48-1 NEG48-2  ES48-1 ES48-2
## 1000_at   10.3982 10.2544 10.0040 9.9035 10.3749 10.0335 10.3451 9.8633
## 1001_at    5.7176  5.8810  5.8596 5.9540  5.9605  6.0209  5.9811 6.2852
## 1002_f_at  5.5126  5.8018  5.5711 5.6081  5.3901  5.4945  5.5081 5.6301
## 1003_s_at  7.7839  8.0080  8.0380 7.8351  7.9265  8.1389  7.9949 8.2333
## 1004_at    7.2892  7.6037  7.4885 7.7715  7.5218  7.5995  7.4561 7.6752
## 1005_at    9.2067  8.9938  8.2379 8.3380  9.1732  9.0405  7.9261 8.0697

2.3 Data quality assessment

Before we proceed with differential expression analysis, we need to perform quality control (QC) assessment and make sure there are no issues with the data. There are many tools that can produce informative reports, including different QC criteria such as expression distribution, RNA degradation. You can have a look at the Bioconductor packages, arrayQualityMetrics and affyExpress. Here, we will show several diagnostic plots for graphical exploration of the data quality before and after normalization.

Array surface images

The affy::image function outputs the spatial distribution of the intensities on a chip, and allows to look for spatial artifacts. In our case, none of the array has problems, but for illustration purposes we include one bad array (bad.cel). We will plot all arrays, including the bad one, on one multiplot figure. The options in the par function (from the base package graphics) can be used to set the layout, as follows. You can uncomment the second and the last line if you want the figure to be saved in .jpeg file.

a.bad = ReadAffy(filenames = file.path(data.dir,"bad.cel"), verbose=TRUE)
# jpeg("estrogen_array_images.jpg", quality=100)
par.old = par(no.readonly = TRUE) # save the default par options
par(mfcol=c(3,3), mar=c(0,3,3,1))
sapply(1 : (length(a) + 1),
       function(x){
           if(x==9){
               image(a.bad, main="bad.cel")
               }
           else{
               image(a[, x], main=colnames(a[,x]))
               }
       })
par(par.old) #return to the default setting
# dev.off()