首页 > 其他分享 >[使用整理-1] bamdst:统计覆盖度和测序深度

[使用整理-1] bamdst:统计覆盖度和测序深度

时间:2024-11-06 16:02:15浏览次数:1  
标签:Target 覆盖度 测序 reads regions depth flank bamdst Total

BAMDST

Bamdst是一个轻量级工具,用于统计bam文件中目标区域的测序深度、覆盖度。
必须使用排序后的bam文件,bed文件和输出目录必须优先指定。

1.下载安装

  git clone https://github.com/shiquan/bamdst.git
  cd bamdst/
  make

或者使用conda:

  conda install xdgene::bamdst

安装后会出现可执行程序

2.使用方式

首先准备一个排序后的bam文件,STAR可以直接输出排好序的bam文件,--outSAMtype BAM SortedByCoordinate,也可以自己用samtools进行排序:

  samtools sort {input.bam} --output-fmt BAM > {output.sorted.bam}

接下来准备一个bed文件,bed文件是一种以制表符为分隔的基因组注释文件,这里需要的格式很简单,只要必需的三列,染色体名、起始位点、终止位点(BED3),以该程序所给文件为例:

/example/MT-RNR1.bed

  chrM   649  	1603

这里bed的基础单位可以根据你的需求来选,chromosome、gene、exon等。请注意,间隔符为制表符,而不是连串空格

计算基因组全部数据,包含基因组所有染色体的位置。
可以用genome.fa.fai索引文件转化生成bed文件进行使用:

  samtools faidx genome.fa  ##建立索引
  cat genome.fa.fai |awk '{print $1"\t1\t"$2}' > sample.bed

gene和exon等可以通过gff/gtf文件进行提取,前提是第三列有这些特征

#gff3文件示例
Chr1	MSU_osa1r7	gene	2903	10817	.	+	.	ID=LOC_Os01g01010;Name=LOC_Os01g01010;...
Chr1	MSU_osa1r7	mRNA	2903	10817	.	+	.	ID=LOC_Os01g01010.1;...
Chr1	MSU_osa1r7	exon	2903	3268	.	+	.	ID=LOC_Os01g01010.1:exon_1;...

awk '$3=="exon"' file.gff3 |awk 'BEGIN{FS="\t|=|;";OFS="\t"}{print $1,$4,$5}' > exon.bed

两个文件都准备好后就可以使用了,找一个输出目录(这里我之前想给输出文件加个前缀来着,结果发现没输出(。_。)):

  Normal:
    bamdst -p <probe.bed> -o ./ in1.bam
  Pipeline mode:
    samtools view in1.bam -u | bamdst -p x.bed -o ./ -
(可以通过 bamdst --help 查看使用方法)

必须参数:

  -o, --outdir         output dir 
                       --输出目录
  -p, --bed            probe or target regions file, the region file will be merged before calculate depths 
                       --探测/探针(?)或目标区域文件,区域文件将在计算深度之前合并,也就是bed文件

可选参数:

参数 描述
-f, --flank [200] flank n bp of each region.
--设置侧翼区域的大小,如果要计算侧翼区的覆盖率,请设置此值,默认值为 200。
[以5'侧翼区为例,侧翼区是指与基因5’末端毗连的DNA区域,这些区域通常包含启动子、增强子或者其他蛋白质结合位点,这部分的区域不能被转录为RNA。侧翼区(flanking region)和侧翼区单核苷酸多态性(Flanking SNPs):https://www.cnblogs.com/chenwenyan/p/5802874.html ]
[侧翼序列:指存在于编码区第一个外显子和最末一个外显子的外侧是一段不被翻译的核苷酸序列。侧翼序列含有基因调控顺序,如5端含有的启动子、增强子,3端含有的终止子和多聚腺苷酸信号等,对基因表达起重要的调控作用。(之后总结) ]
-q [20] map quality cutoff value, greater or equal to the value will be count.
--比对质量截止值,大于或等于该值将被计数。
影响结果中 [Total] Map quality cutoff value // Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads.
--maxdepth [0] set the max depth to stat the cumu distribution.
--设置最大深度来统计累积分布。
对于一些项目来说,区域深度非常高,如果你不想在 Cumulation Distrbution 文件(depth_distribution.plot)中显示这些不正常的深度,可以设置 Cutoff 值来过滤它们。默认值为 0 (无过滤)。
--cutoffdepth [0] list the coverage of above depths.
--列出上述深度的覆盖度。
对于一些项目,人们关心指定深度的覆盖率,比如 10000x 等。bamdst 只计算 0x、4x、10x、30x、100x 的覆盖率,所以你可以设置这个值以在 coverage.report 文档中显示指定的 coverage。默认值为 0。
--isize [2000] stat the inferred insert size under this value.
--推断插入尺寸低于此值 对于错误的映射配对读长,推断的插入大小非常大。所以设置一个截止值。默认值为 2000。
--uncover [5] region will included in uncover file if below it.
--如果区域低于 Uncover 文档,则 Region 将包含在 Uncover 文档中。 为 Calculate the bad covered region (计算不良覆盖区域) 设置此截止值。默认值为 <5。
--bamout BAMFILE target reads will be exported to this bam file.
--目标reads将导出到此 BAM 文档。
-1 begin position of bed file is 1-based.
--bed的起始位置从1开始。
-h, --help print this help info.
--打印帮助信息。

还有一个参数已经被更新掉了,我用的版本是1.0.9
--use_rmdup (an invalid parament since v1.0.0 )
Use rmdup depth instead of cover depth to calculate the coverage of target regions and so on.
使用 rmdup depth 而不是 cover depth 来计算目标区域的覆盖率等等。


3.输出结果

主要输出7个文件

#--help中的介绍
* Five essential files would be created in the output dir. 
* region.tsv.gz and depth.tsv.gz are zipped by bgzip, so you can use tabix 
  index these files.

 - coverage.report     a report of the coverage information and reads 
                       information of whole target regions
 - cumu.plot           distribution data of depth values  #(这里应该是指文件:depth_distribution.plot(?可能作者更新没改这里?))
 - insert.plot         distribution data of inferred insert size 
 - chromosome.report   coverage information for each chromosome
 - region.tsv.gz       mean depth, median depth and coverage of each region
 - depth.tsv.gz        raw depth, rmdup depth, coverage depth of each position
 - uncover.bed         the bad covered or uncovered region in the probe file

* About depth.tsv.gz:
* There are five columns in this file, including chromosome, position, raw
* depth, rmdep depth, coverage depth
 - chromosome          the chromosome name
 - position            1-based position of each chromosome
 - raw depth           raw depth of position, not filter
 - rmdup depth         remove duplication, and only calculate the reads which
                       are primary mapped and mapQ >= cutoff_mapQ (default 20)
 - coverage depth      calculate the deletions (CIGAR level) into depths,
                       for coverage use.


输出文件内容详解:

chromosomes.report

从bam文件中获取的每条染色体的深度、覆盖度等信息

#Chromosome	   DATA(%)	 Avg depth	    Median	 Coverage%	  Cov 4x %	 Cov 10x %	 Cov 30x %	Cov 100x %
       Chr1	   13.38	   90.92	      4.0	   60.75	   51.37	   43.06	   31.89	   18.56

coverage.report

总结的统计信息结果,这个的文件的信息有很多

## The file was created by bamdst
## Version : 1.0.9
## Files : /share/ofs1b/project/RNA_Seq/NLJKX20241028004_RnaSeq_T11_051/02_align/HF_N1_bam/HF_N1_Aligned.sort.out.bam 
                               [Total] Raw Reads (All reads)	95543390  #All reads in the bam file(s).
                                       [Total] QC Fail reads	0  #Reads number failed QC, this flag is marked by other software,like bwa. See flag in the bam structure.
                                        [Total] Raw Data(Mb)	14248.07  #Total reads data in the bam file(s).
                                        [Total] Paired Reads	95543390  #Paired reads numbers.
                                        [Total] Mapped Reads	95543390  #Mapped reads numbers.
                            [Total] Fraction of Mapped Reads	100.00%  #Ratio of mapped reads against raw reads.
                                     [Total] Mapped Data(Mb)	14248.07  #Mapped data in the bam file(s).
                         [Total] Fraction of Mapped Data(Mb)	100.00%  #Ratio of mapped data against raw data.
                                     [Total] Properly paired	95543384  #Paired reads with properly insert size. See bam format protocol for details.
                         [Total] Fraction of Properly paired	100.00%  #Ratio of properly paired reads against mapped reads
                                [Total] Read and mate paired	95543384  #Read (read1) and mate read (read2) paired.
                    [Total] Fraction of Read and mate paired	100.00%  #Ratio of read and mate paired against mapped reads
                                          [Total] Singletons	6  #Read mapped but mate read unmapped, and vice versa.
                       [Total] Read and mate map to diff chr	0  #Read and mate read mapped to different chromosome, usually because mapping error and structure variants.
                                               [Total] Read1	47771698  #First reads in mate paired sequencing
                                               [Total] Read2	47771692  #Mate reads
                                        [Total] Read1(rmdup)	47771698  #First reads after remove duplications.
                                        [Total] Read2(rmdup)	47771692  #Mate reads after remove duplications.
                                [Total] forward strand reads	47771695  #Number of forward strand reads.
                               [Total] backward strand reads	47771695  #Number of backward strand reads.
                                 [Total] PCR duplicate reads	0  #PCR duplications.
                     [Total] Fraction of PCR duplicate reads	0.00%  #Ratio of PCR duplications.
                            [Total] Map quality cutoff value	20  #Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads.
                       [Total] MapQuality above cutoff reads	93664629  #Number of reads with higher or equal quality score than cutoff value.
                 [Total] Fraction of MapQ reads in all reads	98.03%  #Ratio of reads with higher or equal Q score against raw reads.
              [Total] Fraction of MapQ reads in mapped reads	98.03%  #Ratio of reads with higher or equal Q score against mapped reads.
                                       [Target] Target Reads	91110291  #Number of reads covered target region (specified by bed file).
              [Target] Fraction of Target Reads in all reads	95.36%  #Ratio of target reads against raw reads.
           [Target] Fraction of Target Reads in mapped reads	95.36%  #Ratio of target reads against mapped reads.
                                    [Target] Target Data(Mb)	13435.50  #Total bases covered target region. If a read covered target region partly, only the covered bases will be counted.
                              [Target] Target Data Rmdup(Mb)	13224.10  #Total bases covered target region after remove PCR duplications. 
                [Target] Fraction of Target Data in all data	94.30%  #Ratio of target bases against raw bases.
             [Target] Fraction of Target Data in mapped data	94.30%  #Ratio of target bases against mapped bases.
                                      [Target] Len of region	164886854  #The length of target regions.
                                      [Target] Average depth	81.48  #Average depth of target regions. Calculated by "target bases / length of regions".
                               [Target] Average depth(rmdup)	80.20  #Average depth of target regions after remove PCR duplications.
                                     [Target] Coverage (>0x)	51.61%  #Ratio of bases with depth greater than 0x in target regions, which also means the ratio of covered regions in target regions.
                                    [Target] Coverage (>=4x)	43.23%  #Ratio of bases with depth greater than or equal to 4x in target regions.
                                   [Target] Coverage (>=10x)	36.09%  #Ratio of bases with depth greater than or equal to 10x in target regions.
                                   [Target] Coverage (>=30x)	26.76%  #Ratio of bases with depth greater than or equal to 30x in target regions.
                                  [Target] Coverage (>=100x)	15.74%  #Ratio of bases with depth greater than or equal to 100x in target regions.
                                  [Target] Coverage (>=Nx)              #This is addtional line for user self-defined cutoff value, see --cutoffdepth
                                [Target] Target Region Count	55104  #Number of target regions. In normal practise,it is the total number of exomes.
                                [Target] Region covered > 0x	25846  #The number of these regions with average depth greater than 0x.
                       [Target] Fraction Region covered > 0x	46.90%  #Ratio of these regions with average depth greater than 0x.
                      [Target] Fraction Region covered >= 4x	41.22%  #Ratio of these regions with average depth greater than or equal to 4x.
                     [Target] Fraction Region covered >= 10x	36.67%  #Ratio of these regions with average depth greater than or equal to 10x.
                     [Target] Fraction Region covered >= 30x	29.09%  #Ratio of these regions with average depth greater than or equal to 30x.
                    [Target] Fraction Region covered >= 100x	15.76%  #Ratio of these regions with average depth greater than or equal to 100x.
                                          [flank] flank size	200  #The flank size will be count. 200 bp in default. Oligos could also capture the nearby regions of target regions.
           [flank] Len of region (not include target region)	186299040  #The length of flank regions (target regions will not be count).
                                       [flank] Average depth	72.94  #Average depth of flank regions.
                                         [flank] flank Reads	91839874  #The total number of reads covered the flank regions. Note: some reads covered the edge of target regions, will be count in flank regions also. 
                [flank] Fraction of flank Reads in all reads	96.12%  #Ratio of reads covered in flank regions against raw reads.
             [flank] Fraction of flank Reads in mapped reads	96.12%  #Ration of reads covered in flank regions against mapped reads.
                                      [flank] flank Data(Mb)	13588.68  #Total bases in the flank regions.
                  [flank] Fraction of flank Data in all data	95.37%  #Ratio of total bases in the flank regions against raw data.
               [flank] Fraction of flank Data in mapped data	95.37%  #Ratio of total bases in the flank regions against mapped data.
                                      [flank] Coverage (>0x)	48.01%  #Ratio of flank bases with depth greater than 0x.
                                     [flank] Coverage (>=4x)	39.51%  #Ratio of flank bases with depth greater than or equal to 4x.
                                    [flank] Coverage (>=10x)	32.68%  #Ratio of flank bases with depth greater than or equal to 10x.
                                    [flank] Coverage (>=30x)	24.06%  #Ratio of flank bases with depth greater than or equal to 30x.
                                   [flank] Coverage (>=100x)	14.07%  #Ratio of flank bases with depth greater than or equal to 100x.(官网没这句,但我结果有,我自己加了一句)

depth_distribution.plot

结合R可以做出目标区域的深度分布图(.plot结尾的几个文件应该都可以作图,不过我没做这里,后面试试吧

0	79772141	0.483856	85095379	0.516144
1	5844883	0.035452	79250496	0.480692
2	4686107	0.028423	74564389	0.452269
3	3289529	0.019953	71274860	0.432316
4	2778824	0.016855	68496036	0.415461

左边三列分别代表:覆盖深度,对应该深度的碱基数,碱基占比(以目标区域的碱基数为分母)。
右边两列是高深度向低深度累加的值,分别是碱基数及其占比,累加到X=1为止。

depth.tsv

对于探针文档 (in.bed) 中的每个位置,三种深度值,包括原始深度raw depth、删除重复读取的深度Rmdup depth和覆盖深度Cover depth。原始深度是从输入 bam 中检索的。Rmdup depth为去除了PCR重复的reads, 次优比对reads, 低比对质量的reads(mapQ < 20),该值与samtools depth的输出深度类似
其中,输出深度为 samtools 深度。覆盖深度是考虑删除区域的原始深度,因此此值应等于或大于原始深度。我们使用原始深度raw depth来统计 coverage.report 文件中的覆盖率信息。如果想用rmdup depth来计算覆盖率,可使用参数 “--use_rmdup”。

#Chr    Pos     Raw Depth       Rmdup depth     Cover depth
Chr1    2904    2       2       2
Chr1    2905    2       2       2
Chr1    2906    2       2       2

insertsize.plot

做出两个接头之间的插入片段大小的图,理论上是根据read1和read2在染色体上的坐标信息求得,这时read1和read2要求比对到一条染色体上。

0	0	0.000000	45594984	1.000000
1	0	0.000000	45594984	1.000000
2	0	0.000000	45594984	1.000000
3	0	0.000000	45594984	1.000000

region.tsv

可以用来评估捕获效率,结合基因自身的结构能探究其对捕获测序的影响
目标区域文件每一个区域的统计,其中Coverage以X>0来统计

#Chr    Start   Stop    Avg depth       Median  Coverage        Coverage(FIX)
Chr1    2903    10817   72.74   31.0    70.86   70.86
Chr1    11218   12435   21.75   21.0    100.00  100.00

uncover.bed

没有捕获的区域
--uncover的值默认是5,包含低覆盖或者是未覆盖的,通过"--uncover"规定“未覆盖”的阈值。

Chr1	2903	2914
Chr1	3751	4360
Chr1	4596	5458

第一次写这种,markdown甚至用的还不太熟练(所以啥都试了试),有些翻译可能有点词不达意,建议看看官网,如果有误还望指出,之后应该会再整理一些自己用过的软件或者包,当然可能已经烂大街了(ˉ▽ˉ;)...整理一下前人智慧自己看着玩儿吧。


整理参考,感谢:
https://www.jianshu.com/p/ac7b8e4d76e4
https://zhuanlan.zhihu.com/p/548250737
https://github.com/shiquan/bamdst

标签:Target,覆盖度,测序,reads,regions,depth,flank,bamdst,Total
From: https://www.cnblogs.com/Janesy7499/p/18529377

相关文章

  • Mol Plant | 程时锋综述:植物基因组重测序与群体基因组学
    分享一篇2023年7月25日发表在MolecularPlant上的一篇综述文章:“Plantgenomeresequencingandpopulation genomics:Currentstatusandfutureprospects”,总结了植物基于群体的基因组重测序研究的进展以及这些研究对作物育种的影响。全文摘要这篇论文总结了植物基因组重......
  • 测序数据质控软件
    测序数据质控是高通量测序(Next-GenerationSequencing,NGS)流程中的重要环节。测序数据的质量直接影响下游分析结果的准确性,因此在分析之前,必须对原始数据进行质控,确保数据可靠。测序数据质控的目标去除低质量数据:过滤掉碱基质量较差的reads,减少噪音。去除接头序列:避免接......
  • 【GEE-PIE遥感】夜间灯光指数提取、长时间尺度植被覆盖度反演、水域动态监测、农作物
    随着航空、航天、近地空间等多个遥感平台的不断发展,近年来遥感技术突飞猛进。由此,遥感数据的空间、时间、光谱分辨率不断提高,数据量也大幅增长,使其越来越具有大数据特征。对于相关研究而言,遥感大数据的出现为其提供了前所未有的机遇,但同时也提出了巨大的挑战。传统的工作站和服......
  • Nat Genet | 发现8个新基因!外显子组测序揭示罕见编码基因突变对成人认知功能的影响
    认知功能是一种复杂的心理过程,包括注意力、记忆力、处理速度、空间能力、语言和解决问题的能力,很难单独评估。已有研究表明,成人认知功能受到遗传的强烈影响。基于常见变异的全基因组关联研究(GWAS),目前已经确定了近4000个个体效应较小的认知功能基因位点。同时,GWAS还证明了认知功能和......
  • 下载测序数据那些事儿(一)
    文章目录前言检索下载数据坑聊聊天前言  最近在下载(分析)公共数据,无法避免的从NCBISRA数据库下载已发表的“”优质“”数据。曾经一直以为,数据下载就是小case,直到我因为下载数据,折腾了几天……所以,聊一下目前下载拆分单细胞转录组测序数据踩过的SHIT。希望能为......
  • Cell Reports | 下一代混池测序技术(NG-BSA)助力育种4.0
    分享一篇去年由中国农科院作物所张红伟和华中农业大学李林团队发表在CellReports上的综述文章:Next-generationbulkedsegregantanalysisforBreeding4.0。该综述在总结传统混池测序技术(BSA)的基础上,提出了NG-BSA的研究策略。NG-BSA通过整合高通量表型技术、生物大数据技术与机......
  • Cell Reports | 下一代混池测序技术(NG-BSA)助力育种4.0
    分享一篇去年由中国农科院作物所张红伟和华中农业大学李林团队发表在CellReports上的综述文章:Next-generationbulkedsegregantanalysisforBreeding4.0。该综述在总结传统混池测序技术(BSA)的基础上,提出了NG-BSA的研究策略。NG-BSA通过整合高通量表型技术、生物大数据技术与......
  • 单细胞测序与分子育种
    植物细胞具有异质性,同一基因的表达量在同一组织的不同细胞类型中可能不同,当我们挖掘胁迫响应基因或发育相关基因时,如果利用传统的转录组测序(BulkRNA-seq)策略去检测基因的表达模式,一些细胞类型特异性基因的表达信号将会被掩盖,也使我们很难从细胞水平检测出最先响应的因子,以及探究......
  • 易基因:多组学测序分析揭示m5C甲基化上调E2F1表达以促进卵巢癌肿瘤进展|Nature子刊
    大家好,这里是专注表观组学十余年,领跑多组学科研服务的易基因。卵巢癌是全球死亡率最高的妇科癌症,其中上皮性卵巢癌是最常见的卵巢癌类型。由于缺乏可靠的卵巢癌早期筛查,导致诊断延迟,5年生存率仅为50%。尽管近些年来治疗技术取得了进展,但由于其发病机制的基因调控网络不明确,卵巢癌......
  • 全流程基于最新导则下的生态环境影响评价技术方法及图件制作(土地利用现状图的制作、植
    根据最新生态环境影响评价导则,结合生态环评内容庞杂、综合性强的特点,以既包括陆域、又包括水域的项目为主要案例,对生态环评的具体流程及所需内容进行系统阐述。利用Rstudio、Fragstats等软件分析计算生态环评中所需各种指数,利用ENVI、Maxent等软件分析制作生态环评中所需各种图......