Data quality and imputation | Computational Biology and Data Mining

Data imputation

A common problem with very sensitive but technically complex high-throughput techniques, which measure the concentration of hundreds to thousands of proteins, RNA transcripts or genomic features, is that it is not possible to know whether a feature went undetected because of a technical problem (value missing at random) or because the feature was truly absent (below the limit of detection; missing not at random). Imputation methods learn associations between features across sets of samples and then evaluate whether the features detected in a sample imply that a missing feature should or should not have been detected. Applying imputation methods is thus expected to complete the data and increase its quality. The risk is the introduction of unexpected biases.
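The two kinds of missingness described above can be illustrated with a minimal sketch. The function names, the toy values and the limit of detection are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
import random


def mask_below_lod(values, lod):
    """Missing not at random: drop every value below the limit of detection."""
    return [v if v >= lod else None for v in values]


def mask_at_random(values, fraction, seed=0):
    """Missing at random: drop a fixed fraction of values regardless of size."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    missing = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing else v for i, v in enumerate(values)]


def observed_mean(values):
    """Mean of the values that were actually detected."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen)


# Hypothetical concentrations of one feature across eight samples.
values = [0.2, 5.1, 0.9, 3.3, 7.8, 0.4, 2.6, 6.0]

mnar = mask_below_lod(values, lod=1.0)      # low values vanish systematically
mar = mask_at_random(values, fraction=0.25)  # two arbitrary values vanish
```

In the censored case the mean of the detected values is necessarily shifted upward relative to the full data, which is one way an imputation method that ignores the missingness mechanism could introduce bias.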

Missing value imputation in proteomics data

Proximity extension assays (PEA) use paired antibodies that are oligonucleotide-labeled. When both antibodies bind a target protein, the DNA oligos pair and a new sequence can be formed by DNA polymerase extension. Amplifying and quantifying the new sequence by rtPCR then provides an estimate of the concentration of the target protein. Panels detecting hundreds of proteins from a single sample are commonly used.

During a research project, we obtained PEA data measuring the concentration of 458 inflammation-related proteins in 802 plasma samples from individuals with venous thromboembolism. The failure of one PEA chip, which affected measurements of 91 proteins in 86 samples, and the subsequent remeasurement of the corresponding samples with a new chip allowed us to evaluate imputation methods by comparing imputed to remeasured values [1]. In particular, we evaluated two methods, missForest and GSimp, that had been used for proteomics (but not specifically for targeted proteomics methods such as PEA). GSimp performed better than missForest (median Pearson correlation between imputed and remeasured values of 71.6% and 69.0%, respectively).
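The comparison of imputed with remeasured values can be sketched as follows. This is a minimal illustration, not the analysis code of [1]: it assumes that one Pearson correlation is computed per protein across samples and that the median is then taken over proteins, which the text above does not specify. The protein names and values are hypothetical toy data.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def median_correlation(imputed, remeasured):
    """Median over proteins of the per-protein Pearson correlation.

    Both arguments map a protein name to its values, with samples
    listed in the same order in both dictionaries (an assumption).
    """
    return median(pearson(imputed[p], remeasured[p]) for p in imputed)


# Hypothetical toy data: three proteins measured in four samples.
imputed = {
    "protein_A": [1.0, 2.0, 3.0, 4.0],
    "protein_B": [2.0, 1.0, 4.0, 3.0],
    "protein_C": [4.0, 3.0, 2.0, 1.0],
}
remeasured = {
    "protein_A": [1.1, 2.1, 3.1, 4.1],
    "protein_B": [1.0, 2.0, 3.0, 4.0],
    "protein_C": [1.0, 2.0, 3.0, 4.0],
}

score = median_correlation(imputed, remeasured)
```

Using the median rather than the mean keeps a few poorly imputed proteins from dominating the summary.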

Imputation of single-cell ChIP-seq data

To address the sparsity of single-cell ChIP-seq data, we developed SIMPA, a method that extracts predictive information from the bulk ChIP-seq ENCODE dataset to impute single-cell ChIP-seq data [2]. SIMPA can be used for target histone marks or transcription factors, and it can be applied individually to each single-cell sample of a single-cell ChIP-seq dataset.

Evaluation of Next Generation Sequencing (NGS) data quality

Evaluating the quality of NGS data files is necessary for downstream analyses to provide reliable biological or clinical insight. We found current guidelines for assessing NGS file quality to be limited.

To generate more detailed guidelines, we statistically analyzed public datasets using quality features calculated by common computational methods. We confirmed the relevance of genome mapping statistics and showed that some features currently used to assess quality are not always relevant [3].

To offer an automated procedure to evaluate NGS quality, we created seqQscorer, a method that takes as input multiple features characterizing NGS data files and that was trained on high and low quality data from ENCODE [4]. We provide associations between particular characteristics of different types of NGS data (in humans and mice) and their quality. These associations reveal technical details that influence the performance of NGS.

We observed that seqQscorer can identify batches in NGS datasets and can be used to correct such batch effects, with a performance that compares well with a method that uses a priori knowledge of batches [5] and that improves when outliers are removed. Batch effects have deleterious effects that can be observed in published sequencing data. To show this, we used a selection of 40 groups of published samples of clinical relevance [6]. Heterogeneous quality was observed in 14 of the datasets. Interestingly, the expression of a number of genes correlated with quality. Functions associated with cellular stress were enriched among protein-coding genes whose expression increased in low quality samples. In contrast, miRNAs showed the opposite correlation: their expression was lost in low quality samples, which is in accordance with their inhibitory effect. One can expect that stress commonly affecting multiple low quality samples will trigger the expression of stress-related genes (highly expressed), facilitated by the repression of the corresponding miRNAs (lowly expressed) that suppress those stress genes in healthy cells.
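The idea of relating gene expression to sample quality can be sketched as below. This is an illustration under stated assumptions, not the procedure of [6]: it assumes each sample has a numeric low-quality score (higher meaning worse quality), and it uses a Spearman rank correlation, which is a choice made here for simplicity. The gene labels, scores and expression values are hypothetical.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-sample low-quality scores (higher = lower quality).
low_quality = [0.1, 0.2, 0.4, 0.7, 0.9]

# Hypothetical expression values for the same five samples.
expression = {
    "stress_gene": [5.0, 6.0, 6.5, 9.0, 12.0],
    "mirna": [8.0, 7.5, 5.0, 2.0, 1.0],
}

# Positive values: expression rises as quality drops; negative: it falls.
direction = {gene: spearman(v, low_quality) for gene, v in expression.items()}
```

In this toy example the stress-related gene gets a positive coefficient and the miRNA a negative one, mirroring the opposite trends described above.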

References

[1] Lenz, M., A. Schulz, T. Koeck, S. Rapp, M. Nagler, M. Sauer, L. Eggebrecht, V. Ten Cate, M. Panova-Noeva, J.H. Prochaska, K.J. Lackner, T. Münzel, K. Leineweber, P.S. Wild and M.A. Andrade-Navarro. 2020. Missing value imputation in proximity extension assay-based targeted proteomics data. PLOS ONE. 15, e0243487.

[2] Albrecht, S., T. Andreani, M.A. Andrade-Navarro and J.F. Fontaine. 2022. Single-cell specific and interpretable machine learning models for sparse ChIP-seq data imputation. PLOS ONE. 17, e0270043. [SIMPA]

[3] Sprang, M., M. Krüger, M.A. Andrade-Navarro and J.F. Fontaine. 2021. Statistical guidelines for quality control of next-generation sequencing techniques. Life Science Alliance. 4, e202101113.

[4] Albrecht, S., M. Sprang, M.A. Andrade-Navarro and J.F. Fontaine. 2021. seqQscorer: automated quality control of next generation sequencing data using machine learning. Genome Biology. 22, 75. [seqQscorer]

[5] Sprang, M., M.A. Andrade-Navarro and J.F. Fontaine. 2022. Batch effect detection and correction in RNA-Seq data using machine-learning-based automated assessment of quality. BMC Bioinformatics. 23, 279.

[6] Sprang, M., M.A. Andrade-Navarro and J.F. Fontaine. 2024. Overlooked poor-quality patient samples in sequencing data impair reproducibility of published clinically relevant datasets. Genome Biology. In press.