Gene expression methods | Computational Biology and Data Mining

DNA microarrays contain probe sets that are intended to evaluate the expression of a particular gene. However, the changing nature of the annotation of human and mouse genomes implies that probe sets now thought to detect different genes may be found to be detecting parts of the same gene. We have evaluated these changes across two years of versions of the NetAffx database of annotations of the Affymetrix probe sets [1]. Our analysis is a reminder to researchers of the necessity of reporting DNA microarray data in terms of probe set identifiers and not only in terms of genes, since the associations between probe sets and genes are unstable.
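As a rough illustration of this kind of check, the sketch below compares two versions of a probe-set-to-gene annotation and lists the probe sets whose gene assignment changed. The function name, probe set identifiers and gene symbols are hypothetical, and the comparison is a simplification that is not taken from the analysis in [1].

```python
def changed_probe_sets(old, new):
    """Return probe set IDs present in both versions whose gene differs."""
    shared = old.keys() & new.keys()
    return sorted(p for p in shared if old[p] != new[p])


# Hypothetical probe set IDs and gene symbols, for illustration only.
v_old = {"ps_001": "GeneA", "ps_002": "GeneB", "ps_003": "GeneC"}
v_new = {"ps_001": "GeneA", "ps_002": "GeneA", "ps_003": "GeneC"}

changed = changed_probe_sets(v_old, v_new)
# Fraction of shared probe sets that were reassigned to another gene.
fraction_changed = len(changed) / len(v_old.keys() & v_new.keys())
```

Results reported by probe set identifier remain interpretable after such a reassignment, whereas results reported only by gene name do not.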

We have developed a method for the detection of marker genes in large heterogeneous collections of gene expression data [2]. The expression values of individual genes are examined across the whole dataset for demarcations which suggest the presence of groupings, or subsets, of samples in the data. All genes are then evaluated for their ability to support those groupings of samples, identifying gene clusters which can demarcate similar sets of samples by their expression values. This method acts independently of information regarding gene or sample identity. We applied this method to DNA microarray data generated from 83 mouse stem cell related samples from StemBase, analyzed with the Affymetrix MOE430 DNA chip set, which includes approximately 45,000 probe sets. The results are available at the Marker Server. Our method identified 45 of 71 known stem cell markers present on the microarray (63%) at a five-fold level of enrichment. Examination of the 426 gene markers associated with a selected set of six stem cell differentiation experiments pointed to the relevance of genes encoding extracellular matrix proteins and protease inhibitors (including serpins a1b, a3n, b9, g1). Sequence comparison identified superfamilies with multiple stem cell related genes; phylogenetic analysis of examples from four of these families (nuclear receptors, cytochrome P450, Rab GTPases, early B-cell factors) illustrated gene duplication events in the Chordate lineage that generated specialized genes with stem cell related functions before mammalian divergence.
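The core idea can be sketched in a few lines, assuming a much simplified criterion: for each gene, the largest gap between consecutive sorted expression values is taken as the demarcation, and genes whose demarcation separates the same set of samples are grouped together. The function names, the `min_gap` threshold and the toy values are assumptions made for illustration; the scoring used in [2] is not reproduced here.

```python
def largest_gap_split(values):
    """Split sample indices at the largest gap between sorted values.

    Returns (gap, low_samples, high_samples).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[k + 1]] - values[order[k]]
            for k in range(len(order) - 1)]
    k_best = max(range(len(gaps)), key=lambda k: gaps[k])
    low = frozenset(order[:k_best + 1])
    high = frozenset(order[k_best + 1:])
    return gaps[k_best], low, high


def group_genes_by_split(expression, min_gap):
    """Group genes whose largest gap (>= min_gap) marks the same high set."""
    groups = {}
    for gene, values in expression.items():
        gap, _low, high = largest_gap_split(values)
        if gap >= min_gap:
            groups.setdefault(high, []).append(gene)
    return groups


# Toy log-scale expression values for five samples (indices 0 to 4).
expression = {
    "gene_a": [2.0, 2.1, 8.0, 8.3, 2.2],
    "gene_b": [1.0, 1.2, 6.5, 7.0, 0.9],
    "gene_c": [4.0, 4.1, 4.2, 4.3, 4.4],
}
groups = group_genes_by_split(expression, min_gap=2.0)
```

Here `gene_a` and `gene_b` both separate samples 2 and 3 from the rest, so they form one cluster, while `gene_c` shows no clear demarcation and is ignored. No gene or sample labels are used at any point.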

A more recent implementation of the idea behind the Marker Server resulted in MGFM (Marker Gene Finder in Microarray data), which we made available as part of the CellFinder resource. Here, marker genes are identified by their ability to segregate samples of the same type when the samples are sorted by their hybridization values for a given microarray probe [3]. We verified the approach with samples from human brain, heart, kidney, liver and lung; top-ranked genes could be validated by RT-PCR. We adapted this method for the detection of markers in RNA-seq data (MGFR [4]).
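The sorting criterion can be illustrated with a minimal sketch, assuming the simplest possible rule: a probe is a candidate marker for a sample type if, after sorting samples by decreasing value, all samples of that type occupy the top ranks. The function name and the margin score are assumptions for illustration and do not reproduce the MGFM or MGFR implementations.

```python
def marker_for_type(values, labels):
    """Return (sample_type, margin) if one type fills the top ranks, else None.

    margin is the difference between the lowest value of the top type
    and the highest value among all other samples.
    """
    ranked = sorted(zip(values, labels), reverse=True)
    top_type = ranked[0][1]
    n_top = labels.count(top_type)
    if n_top == len(labels):
        return None  # a single sample type: nothing to segregate
    if any(label != top_type for _, label in ranked[:n_top]):
        return None  # samples of the top type are interleaved with others
    margin = ranked[n_top - 1][0] - ranked[n_top][0]
    return top_type, margin


# Toy values for two hypothetical probes across six labelled samples.
labels = ["liver", "liver", "heart", "heart", "lung", "lung"]
probe_x = [9.5, 9.1, 3.0, 3.2, 2.8, 3.1]
probe_y = [5.0, 7.0, 6.0, 4.0, 5.5, 6.5]

result_x = marker_for_type(probe_x, labels)
result_y = marker_for_type(probe_y, labels)
```

In this toy example `probe_x` segregates the liver samples with a wide margin, while `probe_y` does not segregate any sample type.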

We propose a method to detect genes that could be used as cell markers from gene expression profiling of multiple samples of cell mixtures, provided that there is an alternative method to estimate the fraction of the cell type of interest [5]. We hypothesized that, given a collection of samples where the cell type of interest occurs over a wide range of cell fractions, gene expression measured with microarrays can be used to identify genes whose expression correlates with the fraction of the target cell type. We tested this with mixtures of four human cell lines, finding that the correlated genes tend to be specifically expressed in the corresponding cell types.
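A minimal sketch of this idea follows, assuming that Pearson correlation is used and that a fixed cut-off selects candidate genes. The correlation measure, the `min_r` threshold, the function names and the toy data are assumptions for illustration, not details taken from [5].

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no information
    return sxy / math.sqrt(sxx * syy)


def correlated_genes(expression, fractions, min_r):
    """Genes whose expression correlates with the target cell fraction."""
    return sorted(gene for gene, values in expression.items()
                  if pearson(values, fractions) >= min_r)


# Fraction of the target cell type in five mixtures, estimated independently.
fractions = [0.05, 0.20, 0.40, 0.60, 0.90]
# Toy expression values for three hypothetical genes in the same mixtures.
expression = {
    "gene_a": [1.1, 2.0, 4.1, 6.2, 9.0],   # rises with the fraction
    "gene_b": [5.0, 5.2, 4.9, 5.1, 5.0],   # roughly flat
    "gene_c": [9.0, 7.5, 5.0, 3.0, 1.0],   # falls with the fraction
}
candidates = correlated_genes(expression, fractions, min_r=0.9)
```

Only `gene_a` passes the cut-off in this example. A wide range of fractions across samples matters because a narrow range leaves little variation for the correlation to detect.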

DNA microarray data profiling gene expression can be used to detect chromosomal abnormalities given a dataset including measurements from normal and aberrant genomes. Chromosomal regions with gross deletions or duplications can be detected by their abnormally low or high general gene expression. We have implemented a method (CAFE) as an R package to do such analysis and visualize the results [6].
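The principle can be sketched as follows, assuming linear-scale (positive) expression values, a mapping of genes to chromosomes and a simple median log-ratio per chromosome. This is not the statistical procedure implemented in CAFE, which is an R package; the function names, the threshold and the toy data are hypothetical.

```python
import math
from statistics import mean, median


def chromosome_shifts(test, normals, gene_chrom):
    """Median log2(test / mean of normals) over the genes of each chromosome."""
    ratios = {}
    for gene, chrom in gene_chrom.items():
        baseline = mean(sample[gene] for sample in normals)
        ratios.setdefault(chrom, []).append(math.log2(test[gene] / baseline))
    return {chrom: median(vals) for chrom, vals in ratios.items()}


def flag_chromosomes(shifts, threshold):
    """Label chromosomes whose median shift reaches the threshold."""
    return {chrom: ("gain" if shift > 0 else "loss")
            for chrom, shift in shifts.items() if abs(shift) >= threshold}


# Toy data: six hypothetical genes on two chromosomes.
gene_chrom = {"g1": "chr1", "g2": "chr1", "g3": "chr1",
              "g4": "chr2", "g5": "chr2", "g6": "chr2"}
normals = [
    {"g1": 100, "g2": 200, "g3": 50, "g4": 80, "g5": 120, "g6": 60},
    {"g1": 110, "g2": 190, "g3": 50, "g4": 80, "g5": 120, "g6": 60},
]
# Genes on chr2 are expressed about 1.5-fold higher in the test sample.
test = {"g1": 105, "g2": 195, "g3": 50, "g4": 120, "g5": 180, "g6": 90}

flagged = flag_chromosomes(chromosome_shifts(test, normals, gene_chrom), 0.4)
```

Using the median over many genes makes the chromosome-level signal robust to individual genes that are regulated for unrelated reasons.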

Reducing sample size

In some biological set-ups it is advantageous to study gene expression in small cellular samples as opposed to single cells. This is the case for brain tumour associated macrophages (TAMs), which are heterogeneous. The question arises of whether the biology of the sample is affected by the reduction in cell density. We explored this for human monocyte-derived macrophages stimulated to produce different inflammatory responses, obtaining meaningful results for samples from 100,000 cells down to 3,610 cells [7]. This is relevant given the scarcity of material for TAMs and suggests that small samples of TAMs will provide reliable gene expression profiles.

References

[1] Perez-Iratxeta, C. and M.A. Andrade. 2005. Inconsistencies over time in 5% of NetAffx probe-to-gene annotations. BMC Bioinformatics. 6, 183.

[2] Krzyzanowski, P.M. and M.A. Andrade-Navarro. 2007. Identification of novel stem cell markers using gap analysis of gene expression data. Genome Biology. 8, R193. [Marker Server]

[3] El Amrani, K., H. Stachelscheid, F. Lekschas, A. Kurtz and M.A. Andrade-Navarro. 2015. MGFM: a novel tool for detection of tissue and cell specific marker genes from microarray gene expression data. BMC Genomics. 16, 645. [Marker Tool]

[4] El Amrani, K., G. Alanis-Lobato, N. Mah, A. Kurtz and M.A. Andrade-Navarro. 2019. Detection of condition-specific marker genes from RNA-seq data with MGFR. PeerJ. 7, e6970. [MGFR]

[5] Andrade-Navarro, M.A., F. Kanji, C. Palii, M. Brand, H. Atkins and C. Perez-Iratxeta. 2013. A method for cell marker discovery by high-throughput gene expression analysis of mixed cell populations. BMC Biotechnology. 13, 80.

[6] Bollen, S., M. Leddin, M.A. Andrade-Navarro and N. Mah. 2014. CAFE: an R package for the detection of gross chromosomal abnormalities from gene expression microarray data. Bioinformatics. 30, 1484-1485. [CAFE]

[7] Geiß, C., G. Alanis-Lobato, M.A. Andrade-Navarro and A. Régnier-Vigouroux. 2019. Assessing the reliability of gene expression measurements in very-low-numbers of human monocyte-derived macrophages. Sci. Rep. 9, 17908.