
scran doubletCluster

Date: 2024-10-22 11:43:41
Tags: source, cluster, scran, query, clusters, doublet, doubletCluster

 

Identify potential clusters of doublet cells based on whether they have intermediate expression profiles, i.e., whether their profiles lie between those of two other “source” clusters. This function is now deprecated; use findDoubletClusters from scDblFinder instead.
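Because the function is deprecated, a minimal sketch of calling the replacement may help. It reuses the mock data from the Examples section and assumes that the scDblFinder package is installed; the `requireNamespace` guard and the fixed seed are additions for illustration only.

```r
# Mock data: two source clusters plus one cluster of simulated doublets.
set.seed(100)
ngenes <- 100
mu1 <- 2^rexp(ngenes)
mu2 <- 2^rnorm(ngenes)
counts <- cbind(
    matrix(rpois(ngenes * 100, mu1), nrow = ngenes),
    matrix(rpois(ngenes * 100, mu2), nrow = ngenes),
    matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes)
)
clusters <- rep(1:3, c(100, 100, 20))

# Assumption: scDblFinder is installed; the call is skipped otherwise.
if (requireNamespace("scDblFinder", quietly = TRUE)) {
    dbl <- scDblFinder::findDoubletClusters(counts, clusters)
    print(dbl)
}
```

The output is read in the same way as described below: clusters with few non-intermediate genes and library-size ratios below 1 are the doublet candidates.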

 

doubletCluster {scran} R Documentation

Detect doublet clusters

Description

Identify potential clusters of doublet cells based on intermediate expression profiles.

Usage

doubletCluster(x, ...)

## S4 method for signature 'ANY'
doubletCluster(x, clusters, subset.row = NULL,
  threshold = 0.05, ...)

## S4 method for signature 'SingleCellExperiment'
doubletCluster(x, ...,
  subset.row = NULL, assay.type = "counts", get.spikes = FALSE)

Arguments

x

A numeric matrix-like object of count values, where each column corresponds to a cell and each row corresponds to an endogenous gene.

Alternatively, a SingleCellExperiment object containing such a matrix.

...

For the generic, additional arguments to pass to specific methods.

For the ANY method, additional arguments to pass to findMarkers.

For the SingleCellExperiment method, additional arguments to pass to the ANY method.

clusters

A vector of cluster identities for all cells.

subset.row

See ?"scran-gene-selection".

threshold

A numeric scalar specifying the FDR threshold with which to identify significant genes.

assay.type

A string specifying which assay values to use, e.g., "counts" or "logcounts".

get.spikes

See ?"scran-gene-selection".

Details

This function detects clusters of doublet cells in a manner similar to the method used by Bach et al. (2017). For each “query” cluster, we examine all possible pairs of “source” clusters, hypothesizing that the query consists of doublets formed from the two sources. If so, gene expression in the query cluster should be strictly intermediate between the two sources after library size normalization.

We apply pairwise t-tests to the normalized log-expression profiles (see logNormCounts) to reject this null hypothesis. This is done by identifying genes that are consistently up- or down-regulated in the query compared to both of the sources. We count the number of genes that reject the null hypothesis at the specified FDR threshold. For each query cluster, the most likely pair of source clusters is that which minimizes the number of significant genes.
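The counting step can be sketched in base R for a single query cluster and a single pair of sources. This is an illustration of the idea, not scran's implementation (which delegates to findMarkers); the helper names `safe_t` and `count_non_intermediate` are hypothetical, and taking the larger of the two p-values for genes that change in the same direction is an assumption about how the two comparisons are combined.

```r
# Welch t-test that returns a neutral result when the data are constant.
safe_t <- function(a, b) {
    tryCatch({
        tt <- t.test(a, b)
        c(stat = unname(tt$statistic), p = tt$p.value)
    }, error = function(e) c(stat = 0, p = 1))
}

# Count genes that are up (or down) in the query relative to BOTH sources.
# logexprs: genes x cells matrix of log-normalized expression values.
count_non_intermediate <- function(logexprs, clusters, query, src1, src2,
                                   threshold = 0.05) {
    q  <- logexprs[, clusters == query, drop = FALSE]
    s1 <- logexprs[, clusters == src1, drop = FALSE]
    s2 <- logexprs[, clusters == src2, drop = FALSE]

    p <- vapply(seq_len(nrow(logexprs)), function(i) {
        r1 <- safe_t(q[i, ], s1[i, ])
        r2 <- safe_t(q[i, ], s2[i, ])
        # Same sign in both comparisons means the query is not intermediate.
        if (r1[["stat"]] * r2[["stat"]] > 0) max(r1[["p"]], r2[["p"]]) else 1
    }, numeric(1))

    # Number of genes rejecting the doublet hypothesis at the FDR threshold.
    sum(p.adjust(p, method = "BH") <= threshold)
}

# Mock data: cluster 3 is simulated as doublets of clusters 1 and 2.
set.seed(42)
ngenes <- 100
mu1 <- 2^rexp(ngenes)
mu2 <- 2^rnorm(ngenes)
counts <- cbind(
    matrix(rpois(ngenes * 100, mu1), nrow = ngenes),
    matrix(rpois(ngenes * 100, mu2), nrow = ngenes),
    matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes)
)
clusters <- rep(1:3, c(100, 100, 20))

# Library size normalization followed by a log2 transformation.
sf <- colSums(counts) / mean(colSums(counts))
logexprs <- log2(sweep(counts, 2, sf, "/") + 1)

n_doublet <- count_non_intermediate(logexprs, clusters, query = 3, src1 = 1, src2 = 2)
n_single  <- count_non_intermediate(logexprs, clusters, query = 1, src1 = 2, src2 = 3)
```

Under this simulation the count for the true doublet cluster (`n_doublet`) should be much smaller than the count obtained when a genuine cluster is treated as the query (`n_single`).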

Potential doublet clusters are identified using the following characteristics:

  • Low number of significant genes, i.e., N in the output DataFrame. The threshold can be identified by looking for small outliers in log(N) across all clusters, under the assumption that most clusters are not doublets (and thus should have high N).

  • A reasonable proportion of cells in the cluster, i.e., prop. This requires some expectation of the doublet rate in the experimental protocol.

  • Library sizes of the source clusters that are below that of the query cluster, i.e., lib.size* values below unity. This assumes that the doublet cluster will contain more RNA and have more counts than either of the two source clusters.
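The second and third characteristics are simple summaries of the count matrix. The sketch below computes them in base R for one query cluster and one pair of sources; `cluster_summaries` is a hypothetical helper, not part of scran.

```r
# Proportion of cells in the query cluster, and ratios of the median
# library size of each source cluster to that of the query cluster.
cluster_summaries <- function(counts, clusters, query, src1, src2) {
    lib <- colSums(counts)
    med <- tapply(lib, clusters, median)
    c(
        lib.size1 = unname(med[as.character(src1)] / med[as.character(query)]),
        lib.size2 = unname(med[as.character(src2)] / med[as.character(query)]),
        prop = mean(clusters == query)
    )
}

# Mock data: cluster 3 is simulated as doublets of clusters 1 and 2.
set.seed(1)
ngenes <- 100
mu1 <- 2^rexp(ngenes)
mu2 <- 2^rnorm(ngenes)
counts <- cbind(
    matrix(rpois(ngenes * 100, mu1), nrow = ngenes),
    matrix(rpois(ngenes * 100, mu2), nrow = ngenes),
    matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes)
)
clusters <- rep(1:3, c(100, 100, 20))

res <- cluster_summaries(counts, clusters, query = 3, src1 = 1, src2 = 2)
res
```

For the simulated doublet cluster, both ratios fall below 1 because its cells carry the combined counts of the two sources.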

For each query cluster, the function will only report the pair of source clusters with the lowest N. It is possible that a source pair with slightly higher (but still low) value of N may have more appropriate lib.size* values. Thus, it may be valuable to examine all.pairs in the output, especially in over-clustered data sets with closely neighbouring clusters.

The reported p.value is of little use in a statistical sense, and is only provided for inspection. Technically, it could be treated as the Simes combined p-value against the doublet hypothesis for the query cluster. However, this does not account for the multiple testing across all pairs of clusters for each chosen cluster, especially as we are choosing the pair that is most concordant with the doublet null hypothesis.
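For reference, the textbook Simes combination can be written in a few lines of base R. The sketch also checks that it equals the smallest Benjamini-Hochberg adjusted p-value, which is why the adjusted p-value of the best gene can be read as a Simes combined p-value. The function name `simes` and the example p-values are illustrative.

```r
# Simes combined p-value: min over i of n * p_(i) / i, capped at 1.
simes <- function(p) {
    p <- sort(p)
    n <- length(p)
    min(1, min(n * p / seq_len(n)))
}

p <- c(0.01, 0.04, 0.5)
combined <- simes(p)

# The smallest BH-adjusted p-value gives the same number.
best_adjusted <- min(p.adjust(p, method = "BH"))
```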

We use library size normalization (via librarySizeFactors) even if existing size factors are present. This is because intermediate expression of the doublet cluster is not guaranteed for arbitrary size factors. For example, expression in the doublet cluster will be higher than that in the source clusters if normalization was performed with spike-in size factors.
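A base R sketch of library size normalization follows. It assumes the usual defaults of librarySizeFactors and logNormCounts (size factors scaled to a mean of 1, log2 transformation with a pseudo-count of 1); the helper names `library_size_factors` and `log_norm` are hypothetical.

```r
# Size factors proportional to each cell's total count, scaled to mean 1.
library_size_factors <- function(counts) {
    lib <- colSums(counts)
    lib / mean(lib)
}

# Divide each cell (column) by its size factor, then log2-transform.
log_norm <- function(counts, sf = library_size_factors(counts), pseudo = 1) {
    log2(sweep(counts, 2, sf, "/") + pseudo)
}

# Two cells with the same composition but a two-fold difference in depth.
counts <- matrix(c(1, 3, 2, 6), nrow = 2)
sf <- library_size_factors(counts)
logexprs <- log_norm(counts)
```

After normalization the two cells have identical profiles, which is the property the doublet test relies on: differences in total RNA content are removed and only differences in composition remain.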

Value

DataFrame containing one row per query cluster with the following fields:

source1:

String specifying the identity of the first source cluster.

source2:

String specifying the identity of the second source cluster.

N:

Integer, number of genes that are significantly non-intermediate in the query cluster compared to the two putative source clusters.

best:

String specifying the identity of the top gene with the lowest p-value against the doublet hypothesis for this combination of query and source clusters.

p.value:

Numeric, containing the adjusted p-value for the best gene.

lib.size1:

Numeric, ratio of the median library sizes for the first source cluster to the query cluster.

lib.size2:

Numeric, ratio of the median library sizes for the second source cluster to the query cluster.

prop:

Numeric, proportion of cells in the query cluster.

all.pairs:

SimpleList object containing the above statistics for every pair of potential source clusters.

Each row is named according to its query cluster.

Author(s)

Aaron Lun

References

Bach K, Pensa S, Grzelak M, Hadfield J, Adams DJ, Marioni JC and Khaled WT (2017). Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing. Nat Commun. 8, 1:2128.

Lun ATL (2018). Detecting clusters of doublet cells in scran. https://ltla.github.io/SingleCellThoughts/software/doublet_detection/bycluster.html

See Also

doubletCells, which provides another approach for doublet detection.

findMarkers, to detect DE genes between clusters.

Examples

# Mocking up an example.
library(scran)
ngenes <- 100
mu1 <- 2^rexp(ngenes)
mu2 <- 2^rnorm(ngenes)

counts.1 <- matrix(rpois(ngenes*100, mu1), nrow=ngenes)
counts.2 <- matrix(rpois(ngenes*100, mu2), nrow=ngenes)
counts.m <- matrix(rpois(ngenes*20, mu1+mu2), nrow=ngenes)

counts <- cbind(counts.1, counts.2, counts.m)
clusters <- rep(1:3, c(ncol(counts.1), ncol(counts.2), ncol(counts.m)))

# Compute doublet-ness of each cluster:
dbl <- doubletCluster(counts, clusters)
dbl

# Narrow this down to clusters with very low 'N':
library(scater)
isOutlier(dbl$N, log=TRUE, type="lower") 

# Get help from "lib.size" below 1.
dbl$lib.size1 < 1 & dbl$lib.size2 < 1 


[Package scran version 1.14.6 Index]

 

=========

Warning messages:
1: 'centreSizeFactors' is deprecated.
See help("Deprecated") 
2: 'clusters=' is deprecated.
Use 'groups=' instead.
See help("Deprecated") 

============

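Details

This function detects clusters of doublet cells in a manner similar to the method used by Bach et al. (2017). For each “query” cluster, we examine all possible pairs of “source” clusters, hypothesizing that the query consists of doublets formed from the two sources. If so, gene expression in the query cluster should be strictly intermediate between the two sources after library size normalization.

We apply pairwise t-tests to the normalized log-expression profiles (see logNormCounts) to reject this null hypothesis. This is done by identifying genes that are consistently up- or down-regulated in the query compared to both sources. We count the number of genes that reject the null hypothesis at the specified FDR threshold. For each query cluster, the most likely pair of source clusters is the one that minimizes the number of significant genes.

Potential doublet clusters are identified using the following characteristics:

  • Low number of significant genes, i.e., N in the output DataFrame. The threshold can be identified by looking for small outliers in log(N) across all clusters, under the assumption that most clusters are not doublets (and thus should have high N).

  • A reasonable proportion of cells in the cluster, i.e., prop. This requires some expectation of the doublet rate in the experimental protocol.

  • Library sizes of the source clusters that are below that of the query cluster, i.e., lib.size* values below unity. This assumes that the doublet cluster will contain more RNA and have more counts than either of the two source clusters.

For each query cluster, the function will only report the pair of source clusters with the lowest N. A source pair with a slightly higher (but still low) value of N may have more appropriate lib.size* values. Thus, it may be valuable to examine all.pairs in the output, especially in over-clustered data sets with closely neighbouring clusters.

The reported p.value is of little use in a statistical sense and is only provided for inspection. Technically, it could be treated as the Simes combined p-value against the doublet hypothesis for the query cluster. However, this does not account for the multiple testing across all pairs of clusters for each chosen cluster, especially as we are choosing the pair that is most concordant with the doublet null hypothesis.

We use library size normalization (via librarySizeFactors) even if existing size factors are present. This is because intermediate expression of the doublet cluster is not guaranteed for arbitrary size factors. For example, expression in the doublet cluster will be higher than that in the source clusters if normalization was performed with spike-in size factors.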


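Value

A DataFrame containing one row per query cluster, with the following fields:

    source1     source2     N         best        p.value   lib.size1 lib.size2 prop      all.pairs
    <character> <character> <integer> <character> <numeric> <numeric> <numeric> <numeric>

source1:
String specifying the identity of the first source cluster.

source2:
String specifying the identity of the second source cluster.

N:
Integer, number of genes that are significantly non-intermediate in the query cluster compared to the two putative source clusters.

best:
String specifying the identity of the top gene with the lowest p-value against the doublet hypothesis for this combination of query and source clusters.

p.value:
Numeric, containing the adjusted p-value for the best gene.

lib.size1:
Numeric, ratio of the median library sizes for the first source cluster to the query cluster.

lib.size2:
Numeric, ratio of the median library sizes for the second source cluster to the query cluster.

prop:
Numeric, proportion of cells in the query cluster.

all.pairs:
SimpleList object containing the above statistics for every pair of potential source clusters.

Each row is named according to its query cluster.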

=========

REF

https://bioconductor.org/books/3.15/OSCA.advanced/doublet-detection.html

https://rdrr.io/bioc/scran/man/doubletCluster.html

From: https://www.cnblogs.com/emanlee/p/18491821
