Data quality and imputation

Data imputation

A common problem with highly sensitive but technically complex high-throughput techniques, which measure the concentrations of hundreds to thousands of proteins, RNA transcripts or genomic features, is that it is often impossible to know whether a feature went undetected because of a technical problem (value missing at random) or because it was truly absent or below the limit of detection (missing not at random). Imputation methods learn associations between features across a set of samples and then evaluate whether the distribution of detected features in a sample implies that a missing feature should or should not have been detected. Applying imputation methods is thus expected to complete the data and increase its quality. The risk is the introduction of unexpected biases.
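The two kinds of missingness described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_missingness`, the limit of detection `lod` and the dropout probability `p_dropout` are hypothetical and are not taken from any of the cited methods.

```python
import random


def simulate_missingness(values, lod, p_dropout, seed=0):
    """Mask values to mimic two missingness mechanisms (illustrative only).

    - Values below `lod` are masked: missing not at random (MNAR),
      because missingness depends on the true value.
    - Remaining values are masked with probability `p_dropout`,
      independently of their magnitude: missing at random (MAR).

    Returns (masked_values, reasons), where masked entries are None.
    """
    rng = random.Random(seed)
    masked, reasons = [], []
    for v in values:
        if v < lod:
            masked.append(None)
            reasons.append("MNAR")
        elif rng.random() < p_dropout:
            masked.append(None)
            reasons.append("MAR")
        else:
            masked.append(v)
            reasons.append("observed")
    return masked, reasons


if __name__ == "__main__":
    true_values = [0.1, 5.0, 0.2, 7.0, 3.5]
    masked, reasons = simulate_missingness(true_values, lod=1.0, p_dropout=0.3)
    for t, m, r in zip(true_values, masked, reasons):
        print(t, m, r)
```

In real data only the masked values are available, not the reasons, which is why the two cases cannot be told apart directly.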

Missing value imputation in proteomics data

Proximity extension assays (PEA) use pairs of oligonucleotide-labeled antibodies. When both antibodies bind a target protein, their DNA oligonucleotides hybridize and a new sequence is formed by DNA polymerase extension. Amplification and quantification of this new sequence by real-time PCR provide a measure of the concentration of the target protein. Panels detecting hundreds of proteins from a single sample are commonly used.

During a research project, we obtained PEA data measuring the concentration of 458 inflammation-related proteins in 802 plasma samples from individuals with venous thromboembolism. The failure of one PEA chip, which affected measurements of 91 proteins in 86 samples, and the subsequent remeasurement of the corresponding samples with a new chip allowed us to evaluate imputation methods by comparing imputed to remeasured values [1]. In particular, we evaluated two methods, missForest and GSimp, that had been used for proteomics (but not specifically for targeted proteomics methods such as PEA). GSimp performed better than missForest (median Pearson correlation between imputed and remeasured values of 71.6% and 69.0%, respectively).
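A comparison of imputed and remeasured values by median Pearson correlation could be computed along the following lines. This is a minimal sketch and not the code used in [1]: computing one correlation per protein and then taking the median is an assumption, and the function names and the dictionary layout are hypothetical.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def median_pearson(imputed, remeasured):
    """Median of per-protein correlations (assumed grouping).

    Both arguments map a protein name to a list of values, with the
    samples in the same order in both dictionaries.
    """
    return median(pearson(imputed[p], remeasured[p]) for p in imputed)


if __name__ == "__main__":
    # Toy values for illustration only, not data from the study.
    imputed = {"P1": [1, 2, 3], "P2": [1, 2, 3], "P3": [1, 2, 3]}
    remeasured = {"P1": [2, 4, 6], "P2": [3, 2, 1], "P3": [1, 2, 4]}
    print(round(median_pearson(imputed, remeasured), 3))
```

Using the median rather than the mean keeps a few poorly imputed proteins from dominating the summary.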

Imputation of single-cell ChIP-seq data

To address the sparsity of single-cell ChIP-seq data, we developed SIMPA, a method that extracts predictive information from the bulk ChIP-seq ENCODE dataset to impute single-cell ChIP-seq data [2]. SIMPA can be used with histone marks or transcription factors as targets, and it can be applied individually to each single cell of a single-cell ChIP-seq dataset.

Evaluation of Next Generation Sequencing (NGS) data quality

Evaluating the quality of NGS data files is necessary for downstream analyses to provide biological or clinical insight. We found current guidelines for assessing NGS file quality to be limited.

To generate more detailed guidelines, we performed a statistical analysis of public datasets using quality features calculated by common computational methods. We confirmed the relevance of genome mapping statistics and showed that some features currently used to assess quality are not always relevant [3].

To offer an automated procedure for evaluating NGS quality, we created seqQscorer, a method that takes as input multiple features characterizing NGS data files and that was trained on high- and low-quality data from ENCODE [4]. We provide associations between particular characteristics of different types of NGS data (in humans and mice) and their quality. These associations reveal technical details that influence the performance of NGS.

We observed that seqQscorer can also identify batches in NGS datasets and can be used to correct such batch effects, with a performance that compares well with that of a method that uses a priori knowledge of batches [5] and that improves when outliers are removed.



[1] Lenz, M., A. Schulz, T. Koeck, S. Rapp, M. Nagler, M. Sauer, L. Eggebrecht, V. Ten Cate, M. Panova-Noeva, J.H. Prochaska, K.J. Lackner, T. Münzel, K. Leineweber, P.S. Wild and M.A. Andrade-Navarro. 2020. Missing value imputation in proximity extension assay-based targeted proteomics data. PLOS ONE. 15, e0243487.

[2] Albrecht, S., T. Andreani, M.A. Andrade-Navarro and J.F. Fontaine. 2022. Single-cell specific and interpretable machine learning models for sparse ChIP-seq data imputation. PLOS ONE. 17, e0270043. [SIMPA]

[3] Sprang, M., M. Krüger, M.A. Andrade-Navarro and J.F. Fontaine. 2021. Statistical guidelines for quality control of next-generation sequencing techniques. Life Science Alliance. 4, e202101113.

[4] Albrecht, S., M. Sprang, M.A. Andrade-Navarro and J.F. Fontaine. 2021. seqQscorer: automated quality control of next generation sequencing data using machine learning. Genome Biology. 22, 75. [seqQscorer]

[5] Sprang, M., M.A. Andrade-Navarro and J.F. Fontaine. 2022. Batch effect detection and correction in RNA-Seq data using machine-learning-based automated assessment of quality. BMC Bioinformatics. 23, 279.