<生信交流与合作请关注公众~号@生信探索>
python环境
CytoTRACE的iCytoTRACE函数需要调用python去除批次效应,因此需要先设置好python环境
mamba create -n SC && mamba activate SC
mamba install -y -c conda-forge python=3.10 notebook ipywidgets pandas numpy seaborn matplotlib ipykernel openpyxl pyarrow scanpy python-igraph leidenalg pytables jaxlib leidenalg
pip install scanoramaCT -i https://pypi.tuna.tsinghua.edu.cn/simple
which python
# /opt/homebrew/Caskroom/mambaforge/base/envs/SC/bin/python
修改bug
找到intervaltree/intervaltree.py这个脚本修改bug
#25行的
import collections
#替换为
import collections.abc as collections
#我的intervaltree/intervaltree.py文件路径在
/opt/homebrew/Caskroom/mambaforge/base/envs/SC/lib/python3.10/site-packages/intervaltree/intervaltree.py
如果你不知道去哪里找intervaltree/intervaltree.py文件可以在terminal中,倒入python库,报错的时候有文件路径,如果没报错就不需要修改了。
mamba activate SC
ipython
import scanoramaCT
安装CytoTRACE
using函数是我写在$HOME/.Rprofile
中的函数,因此每次打开R就能使用。
using
的功能是一次加载多个包,并且使用了suppressPackageStartupMessages
函数,因此不会显示加载包过程中的信息。
wget https://cytotrace.stanford.edu/CytoTRACE_0.3.3.tar.gz
using(remotes)
remotes::install_local("CytoTRACE_0.3.3.tar.gz")
从anndata导出数据
adata是注释好细胞类型的数据,CellType是细胞类型,library_id是不同样本编号代表批次效应。
这里使用了Arrow格式作为R和Python的中间数据,可以参考。
adata=sc.read("adata.h5ad").raw.to_adata()
# AnnData object with n_obs × n_vars = 14268 × 32285
# obs: 'CellType', 'library_id', 'n_genes', 'doublet_score', 'predicted_doublet', 'Cluster'
# uns: 'Cluster_colors', 'CellType_colors', 'hvg', 'leiden', 'library_id_colors', 'log1p', 'neighbors', 'pca', 'scrublet', 'umap'
# obsm: 'X_harmony', 'X_pca', 'X_umap'
# obsp: 'connectivities', 'distances'
adata.to_df().reset_index().to_feather("matrix.arrow",compression='zstd', compression_level=1)
# Phenotype tables, NO headers
adata.obs.to_csv("pd.csv",index=True,header=False)
# batch info
adata.obs.loc[:,['library_id']].to_csv("batch.csv",index=False,header=True)
R中设置python路径
using(reticulate)
reticulate::use_python("/opt/homebrew/Caskroom/mambaforge/base/envs/SC/bin/python")
reticulate::py_exe()
整理input
- 读入
using(data.table, arrow, CytoTRACE, dplyr, tidyr, purrr)
df <- arrow::read_ipc_file("matrix.arrow")
batch <- data.table::fread("batch.csv")
pd <- data.table::fread("pd.csv")
- 整理
datasets是没有名字的列表,里边是表达矩阵Matrix类型,phe是有名字的向量,名字是细胞barcode,值是细胞类型
a <- split(df, batch$library_id)
datasets <- purrr::map(a, ~ column_to_rownames(.x, "index") %>% t())
names(datasets) <- NULL
phe <- pd$V2
names(phe) <- pd$V1
rm(df, batch, pd) # 删除不需要的变量
iCytoTRACE
results <- CytoTRACE::iCytoTRACE(datasets, enableFast = TRUE, ncores = 8, subsamplesize = 1000)
CytoTRACE::plotCytoTRACE(results, phenotype = phe, gene = "Top2a", outputDir = './')
CytoTRACE::plotCytoGenes(results, numOfGenes = 10, outputDir = './')
Reference
https://mp.weixin.qq.com/s/S1-ClJEtR0ro0sYnIF6iaw
https://mp.weixin.qq.com/s/Al-FqOLMPBlrrT-JNchVhw
https://mp.weixin.qq.com/s/WiqU3nUFUysXf1M519l_Dg
标签:分化,python,CytoTRACE,推测,intervaltree,SC,using,adata
From: https://www.cnblogs.com/BioQuest/p/17322584.html