0. 准备
判断式安装R包:如果该R包存在,可以顺带加载该R包,不需要再次library。
if(!require(stringr))install.packages("stringr") if(!require(ggplotify))install.packages("ggplotify") if(!require(patchwork))install.packages("patchwork") if(!require(cowplot))install.packages("cowplot") if(!require(DESeq2))BiocManager::install("DESeq2") if(!require(edgeR))BiocManager::install("edgeR") if(!require(limma))BiocManager::install("limma") rm(list = ls()) load("TCGA-CHOLgdc.Rdata") table(group_list)
1. 常规R包
2. 热图转换为ggplot的类型
3-4. 拼图R包
5-7. 差异分析R包
9. 需要修改为上一步整理数据后保存的文件名
差异分析输入数据:表达矩阵exp,分组信息group_list。
1. DESeq2
差异分析
library(DESeq2) library(stringr) colData <- data.frame(row.names = colnames(exp), condition = group_list) dds <- DESeqDataSetFromMatrix(countData = exp, colData = colData, design = ~ condition) %>% DESeq()
3-4. 列出每个样品是肿瘤样品还是正常样品
5-8. 将数据框转为DESeq2的数据集类型,然后用DESeq函数做差异分析
提取结果,两两比较
res <- results(dds, contrast = c("condition",rev(levels(group_list)))) DEG <- res[order(res$pvalue),] %>% as.data.frame() head(DEG)
1-2. 按设置的比较水平提取dds的数据(从DESeq分析中提取结果表,给出样本的基本均值,logFC及其标准误,检验统计量,p值和矫正后的p值)。rev表示调换顺序,levels(group_list)是因子水平
3-4. 将res按照P值从小到大排序,并转换回数据框。order函数获取向量每个元素从小到大排序后的第几位
添加change列标记基因上调下调
logFC_cutoff <- with(DEG,mean(abs(log2FoldChange)) + 2*sd(abs(log2FoldChange))) k1 = (DEG$pvalue < 0.05)&(DEG$log2FoldChange < -logFC_cutoff) k2 = (DEG$pvalue < 0.05)&(DEG$log2FoldChange > logFC_cutoff) DEG$change = ifelse(k1,"DOWN",ifelse(k2,"UP","NOT")) table(DEG$change) head(DEG) DESeq2_DEG <- DEG
1. 设置logFC的阈值,可以计算得出,也可以设置固定值。with函数类似于取子集,表示在DEG数据集中提取某一列并进行计算,意义是可以在with的小括号内将数据框的一个列名当作变量名来使用,不需要再写$来取子集。abs函数计算log2FoldChange的绝对值。均数+2倍标准差可以包含log2FoldChange的95%置信区间
2-4. 判断并标记上、下调基因
5. 查看上下调基因的数量
7. 保存DESeq2差异分析的结果
2. edgeR
差异分析
library(edgeR) library(stringr) dge <- DGEList(counts = exp,group=group_list) dge$samples$lib.size <- colSums(dge$counts) dge <- calcNormFactors(dge) design <- model.matrix(~0+group_list) rownames(design)<-colnames(dge) colnames(design)<-levels(group_list) dge <- estimateGLMCommonDisp(dge,design) %>% estimateGLMTrendedDisp(design) %>% estimateGLMTagwiseDisp(design) fit <- glmFit(dge, design) %>% glmLRT( contrast=c(-1,1))
提取结果
DEG = topTags(fit, n=nrow(exp)) %>% as.data.frame() logFC_cutoff <- with(DEG,mean(abs(logFC)) + 2*sd(abs(logFC))) k1 = (DEG$PValue < 0.05)&(DEG$logFC < -logFC_cutoff) k2 = (DEG$PValue < 0.05)&(DEG$logFC > logFC_cutoff) DEG$change = ifelse(k1,"DOWN",ifelse(k2,"UP","NOT")) head(DEG) table(DEG$change) edgeR_DEG <- DEG
1-2. 提取差异分析结果
3. 设置logFC的阈值
4-6. 判断并标记上、下调基因
3. limma
差异分析
library(limma) library(stringr) design <- model.matrix(~0+group_list) colnames(design)=levels(group_list) rownames(design)=colnames(exp) dge <- DGEList(counts=exp) %>% calcNormFactors() fit <- voom(dge, design, normalize="quantile") %>% lmFit(design) constrasts = paste(rev(levels(group_list)), collapse = "-") cont.matrix <- makeContrasts(contrasts = constrasts, levels = design) fit2 <- contrasts.fit(fit, cont.matrix) %>% eBayes()
提取结果
DEG = topTable(fit2, coef=constrasts, n=Inf) %>% na.omit() logFC_cutoff <- with(DEG,mean(abs(logFC)) + 2*sd(abs(logFC))) k1 = (DEG$P.Value < 0.05)&(DEG$logFC < -logFC_cutoff) k2 = (DEG$P.Value < 0.05)&(DEG$logFC > logFC_cutoff) DEG$change = ifelse(k1,"DOWN",ifelse(k2,"UP","NOT")) table(DEG$change) head(DEG) limma_voom_DEG <- DEG
1-2. 提取差异分析结果
3. 设置logFC的阈值
4-6. 判断并标记上、下调基因
4. 数据检查
整合三大R包差异分析的数据:
tj = data.frame(deseq2 = as.integer(table(DESeq2_DEG$change)), edgeR = as.integer(table(edgeR_DEG$change)), limma_voom = as.integer(table(limma_voom_DEG$change)), row.names = c("down","not","up") );tj
验证数据:
验证logFC是否由tumor/normal计算而来:
boxplot(as.integer(exp[rownames(DESeq2_DEG)[1],])~group_list) DESeq2_DEG$log2FoldChange[1]
1. 取exp的第一行基因表达量,以group_list为分组依据,做箱线图。修改“DESeq2_DEG”即可分别分析三种方法。
2.得到DESeq2_DEG的第一个logFC值。如果箱线图中normal大tumor小,则logFC应为负值,反之为正值。
如果检查发现logFC的正负值反了,则修改为负值即可:DESeq2_DEG$log2FoldChange = -DESeq2_DEG$log2FoldChange
保存数据
save(DESeq2_DEG,edgeR_DEG,limma_voom_DEG,group_list,tj,file = paste0(cancer_type,"DEG.Rdata"))
标签:DESeq2,TCGA,list,limma,change,2.1,logFC,三大,DEG From: https://www.cnblogs.com/xiaogaobugao/p/16736353.html