[Tool] NPM: Latent Batch Effects Correction of Omics Data by Nearest-Pair Matching
Introduction
Abstract
Motivation: Batch effects (BEs) are a predominant source of noise in omics data and often mask real biological signals. BEs remain common in existing datasets. Current methods for BE correction mostly rely on specific assumptions or complex models and may not detect and adjust for BEs adequately, which impacts downstream analysis and discovery power. To address these challenges, we developed NPM, a nearest-neighbor matching-based method that adjusts for BEs and may outperform other methods in a wide range of datasets.
Results: We assessed distinct metrics and graphical readouts, and compared our method to commonly used BE correction methods. NPM corrects for BEs while preserving biological differences, and it may outperform other methods on multiple metrics. Altogether, NPM proves to be a valuable BE correction approach to maximize discovery in biomedical research, with applicability in clinical research, where latent BEs are often dominant.
Code
Principle:
NPM (Nearest-Pair Matching) relies on distance-based matching to deterministically search for nearest neighbors with opposite labels, so-called "nearest pairs", among samples. NPM requires knowledge of the phenotypes but not of the batch assignment.
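The matching idea described above can be illustrated on toy data. This is only a minimal sketch under stated assumptions, not the NPmatch implementation: the helper name `nearest_opposite()` is hypothetical, correlation distance is assumed as the metric, and the way the package turns the matched pairs into a correction is not covered here.

```r
## Illustrative sketch only: NOT the NPmatch implementation.
## nearest_opposite() is a hypothetical helper name.
## mat: features in rows, samples in columns; labels: one phenotype per sample.
nearest_opposite <- function(mat, labels) {
  stopifnot(ncol(mat) == length(labels), length(unique(labels)) >= 2)
  D <- 1 - cor(mat)                  # correlation distance between samples
  same <- outer(labels, labels, "==")  # TRUE for same-label pairs (incl. diagonal)
  D[same] <- Inf                     # forbid same-label matches and self-matches
  idx <- unname(apply(D, 1, which.min))  # nearest opposite-label sample per row
  data.frame(sample = seq_along(labels), partner = idx,
             label = labels, partner_label = labels[idx])
}

## Toy data: 200 random features for 6 samples with two phenotypes.
set.seed(1)
toy <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
toy_labels <- c("A", "A", "A", "B", "B", "B")
toy_pairs <- nearest_opposite(toy, toy_labels)
print(toy_pairs)
```

The search is deterministic for a given distance matrix and uses only the phenotype labels, which matches the statement that no batch assignment is needed.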
## Load NPmatch and limma
library("NPmatch")
library("limma")
## X: raw data matrix, with features in rows and samples in columns.
## Meta: matrix or dataframe with the metadata associated with X.
## We need to ensure that the samples in X and Meta are aligned.
X <- read.table("./data/GSE10846.Expression.txt", sep="\t")
Meta <- read.table("./data/GSE10846.Metadata.txt", sep="\t")
dim(X); class(X)
dim(Meta); class(Meta)
table(rownames(Meta) == colnames(X))
## To correct BEs, NPmatch requires a vector of phenotype labels per sample.
## To assess BE correction, we will also need a vector of batch labels (see below).
## "pheno": phenotype labels.
## "batch": batch labels.
pheno <- Meta[,"dlbcl.type"]
batch <- Meta[,"Chemotherapy"]
## Intra-sample normalization of the raw data.
## We use the normalize.log2CPM.R function provided
nX <- normalize.log2CPM(X)
## Inter-sample normalization by quantile normalization
nX <- limma::normalizeQuantiles(nX)
## Batch correction with NPmatch
cX <- NPmatch(X=nX, y=pheno, dist.method="cor", sdtop=5000)
table(rownames(Meta) == colnames(cX))
## Check BEs in the raw and batch-corrected data by UMAP or t-SNE
LL <- list(X, cX)
names(LL) <- c("Uncorrected", "Batch-corrected")
Var <- c("Batch", "Pheno")
## 2 x 2 layout: one row per dataset, one column per coloring variable
x11(width = 10, height = 10)
par(mfrow = c(2,2))
for(i in seq_along(LL)) {
  ## Neighborhood size: about one fifth of the samples, capped at 30
  nb <- max(1, min(30, round(ncol(LL[[i]]) / 5)))
  ## t-SNE alternative:
  # pos <- Rtsne::Rtsne(t(LL[[i]]), perplexity=nb)$Y
  pos <- uwot::tumap(t(LL[[i]]), n_neighbors = max(2, nb))
  pos <- data.frame(Dim1=pos[,1], Dim2=pos[,2], Pheno=pheno, Batch=batch)
  ## Sample-order check; print() is needed because results are not auto-printed inside a loop
  print(table(rownames(pos) == colnames(cX)))
  pos[,1:2] <- apply(pos[,1:2], 2, function(x) as.numeric(x))
  ## Integer color codes for phenotype and batch
  pos$Col.Pheno <- as.numeric(factor(pos$Pheno))
  pos$Col.Batch <- as.numeric(factor(pos$Batch))
  for(v in seq_along(Var)) {
    Col <- pos[,paste0("Col.",Var[v])]
    plot(pos$Dim1,
         pos$Dim2,
         col = Col,
         xlab = "Dim1",
         ylab = "Dim2",
         pch = 18,
         cex = 0.8,
         cex.lab = 1.3,
         cex.axis = 1.3,
         las = 1,
         tcl = -0.1,
         mgp = c(1.5,0.5,0))
    mtext(names(LL)[i],
          font = 2,
          adj = 0.5,
          cex = 1)
    legend("bottomleft",
           unique(pos[,Var[v]]),
           cex = 1,
           bty = "n",
           fill = unique(Col),
           col = unique(Col))
    grid(lwd = 1.2)
  }
}
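The visual check above can be complemented with a simple number. The sketch below is an assumption on my part and not one of the metrics reported for NPM: the hypothetical helper `batch_r2()` computes, for each of the top principal components, the fraction of variance explained by the batch labels (the R^2 of a linear model). It runs on simulated toy data; on real data one would call it on the uncorrected and corrected matrices with the `batch` vector.

```r
## Hypothetical helper (not part of NPmatch): R^2 of batch on the top PCs.
## mat: features in rows, samples in columns; batch: one label per sample.
batch_r2 <- function(mat, batch, npc = 2) {
  pcs <- prcomp(t(mat), center = TRUE, scale. = FALSE)$x[, seq_len(npc), drop = FALSE]
  apply(pcs, 2, function(pc) summary(lm(pc ~ factor(batch)))$r.squared)
}

## Toy data: 100 features, 20 samples, two simulated batches.
set.seed(42)
toy_clean <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
toy_batch <- rep(c("b1", "b2"), each = 10)
toy_shifted <- toy_clean
## Simulated batch effect: add a constant shift to every feature in batch b2
toy_shifted[, toy_batch == "b2"] <- toy_shifted[, toy_batch == "b2"] + 2

r2_shifted <- batch_r2(toy_shifted, toy_batch)
r2_clean <- batch_r2(toy_clean, toy_batch)
print(round(rbind(shifted = r2_shifted, clean = r2_clean), 3))
```

A large R^2 on the first components suggests that batch dominates the main axes of variation. The same helper applied to the phenotype labels gives a rough view of whether biological differences are preserved.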
References
- NPM: Latent Batch Effects Correction of Omics data by Nearest Pair Matching
Published by: admin. Please credit the source when reposting: http://www.yc00.com/web/1748030524a4721095.html