MapBatch – A conservative deep learning batch normalisation method for analysis of single-cell RNA-seq data
Many biological processes involve rare cell populations, such as cancer stem cells in cancer or immune cell subsets in various diseases. Identifying such cell subpopulations can aid disease subtype identification, rational therapy design, and prognosis prediction. Single-cell RNA sequencing (scRNA-seq) can identify distinct cell populations across multiple samples, with batch normalisation used to reduce processing-related effects between samples. However, existing batch normalisation tools tend to “over-correct” (i.e. they remove batch effects but, at the same time, also eliminate genuine biological signals and introduce other artifacts), obscuring rare cell populations, which may become merged into other cell types. There is a need for conservative batch normalisation that maintains the biological signal necessary to detect rare cell populations.
MapBatch is a deep learning tool that performs conservative batch normalisation to maintain biological signal for downstream analysis, enabling the discovery of rare and cryptic cell populations that were previously difficult to find.
MapBatch is based on two principles: firstly, an autoencoder trained with a single sample can learn the underlying gene expression structure of cell types in this sample without batch effect; secondly, an ensemble model can combine multiple such autoencoders, allowing the use of multiple samples for training.
MapBatch is an ensemble of autoencoders, each trained with a single sample. Training uses the minimum number of samples needed to cover the different cell populations in the dataset. The autoencoder outputs are concatenated, and the top principal components of the concatenated outputs are used for downstream analysis such as visualisation and clustering.
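The ensemble idea described above can be illustrated with a minimal sketch. It is not the MapBatch implementation: the one-hidden-layer network, the hyperparameters, the function names (`train_autoencoder`, `encode`, `ensemble_embedding`) and the random toy data are all assumptions made for illustration, and the bottleneck encoding is assumed to be the "output" that gets concatenated.

```python
import numpy as np


def train_autoencoder(x, n_hidden=8, n_epochs=200, lr=0.01, seed=0):
    """Fit a tiny one-hidden-layer autoencoder to the cells of ONE sample.

    x: (cells, genes) array, assumed already normalised.
    Architecture and hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_genes = x.shape[1]
    w1 = rng.normal(0.0, 0.1, (n_genes, n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(0.0, 0.1, (n_hidden, n_genes))
    b2 = np.zeros(n_genes)
    for _ in range(n_epochs):
        h = np.tanh(x @ w1 + b1)                 # encode
        g_out = ((h @ w2 + b2) - x) / len(x)     # gradient of mean squared error
        g_h = (g_out @ w2.T) * (1.0 - h ** 2)    # backpropagate through tanh
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (x.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return {"w1": w1, "b1": b1}


def encode(model, x):
    """Bottleneck representation of cells x under one trained autoencoder."""
    return np.tanh(x @ model["w1"] + model["b1"])


def ensemble_embedding(training_samples, all_cells, n_components=2):
    """One autoencoder per sample; concatenate outputs; keep the top PCs."""
    models = [train_autoencoder(s, seed=i) for i, s in enumerate(training_samples)]
    concat = np.hstack([encode(m, all_cells) for m in models])
    centred = concat - concat.mean(axis=0)
    # PCA via SVD: rows of vt are principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


# Toy data standing in for two samples (batches) measured on the same 20 genes.
rng = np.random.default_rng(1)
sample_a = rng.normal(size=(50, 20))
sample_b = rng.normal(size=(60, 20)) + 0.5
cells = np.vstack([sample_a, sample_b])
embedding = ensemble_embedding([sample_a, sample_b], cells)
print(embedding.shape)  # (110, 2)
```

Each autoencoder only ever sees cells from one sample during training, so it has no between-sample differences to learn; every cell is then passed through every autoencoder, and the principal components of the concatenated encodings would feed visualisation and clustering.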
MapBatch allows the discovery of cells that would be “lost” if too much biological signal were removed, such as small cell populations or cells specific to certain batches.
MapBatch recovered all three (3/3) rare cell populations that were simulated in batches of peripheral blood mononuclear cell scRNA-seq data. In comparison, the batch normalisation methods Seurat and Harmony recovered only one out of three (1/3) rare cell populations, while Liger did not recover any of them (0/3).
MapBatch maintains more biological signal for downstream analysis, enabling the discovery of cell populations that were previously difficult to find.
Using an ensemble of autoencoders, as opposed to a single autoencoder trained with multiple samples, allows MapBatch to expand its biological space to cover the diverse cell types without the autoencoders learning batch effects as well.
MapBatch has potential applications in single-cell RNA sequencing and other single-cell omics, where batch effects in the data hinder combining multiple samples for integrative analysis.
We welcome interest from industry for collaboration, co-development, or customisation of the technology into a new product or service. If you have any enquiries or are keen to collaborate, please contact us here.