Bamdst是一个轻量级工具,用于统计bam文件中目标区域的测序深度、覆盖度。
必须使用排序后的bam文件,bed文件和输出目录必须优先指定。
1.下载安装
git clone https://github.com/shiquan/bamdst.git
cd bamdst/
make
或者使用conda:
conda install xdgene::bamdst
安装后会出现可执行程序
2.使用方式
首先准备一个排序后的bam文件,STAR可以直接输出排好序的bam文件,--outSAMtype BAM SortedByCoordinate,也可以自己用samtools进行排序:
samtools sort {input.bam} --output-fmt BAM > {output.sorted.bam}
接下来准备一个bed文件,bed文件是一种以制表符为分隔的基因组注释文件,这里需要的格式很简单,只要必需的三列,染色体名、起始位点、终止位点(BED3),以该程序所给文件为例:
/example/MT-RNR1.bed
chrM 649 1603
这里bed的基础单位可以根据你的需求来选,chromosome、gene、exon等。请注意,间隔符为制表符,而不是连串空格
计算基因组全部数据,包含基因组所有染色体的位置。
可以用genome.fa.fai索引文件转化生成bed文件进行使用:
samtools faidx genome.fa ##建立索引
cat genome.fa.fai |awk '{print $1"\t1\t"$2}' > sample.bed
gene和exon等可以通过gff/gtf文件进行提取,前提是第三列有这些特征
#gff3文件示例
Chr1 MSU_osa1r7 gene 2903 10817 . + . ID=LOC_Os01g01010;Name=LOC_Os01g01010;...
Chr1 MSU_osa1r7 mRNA 2903 10817 . + . ID=LOC_Os01g01010.1;...
Chr1 MSU_osa1r7 exon 2903 3268 . + . ID=LOC_Os01g01010.1:exon_1;...
awk '$3=="exon"' file.gff3 |awk 'BEGIN{FS="\t|=|;";OFS="\t"}{print $1,$4,$5}' > exon.bed
两个文件都准备好后就可以使用了,找一个输出目录(这里我之前想给输出文件加个前缀来着,结果发现没输出(。_。)):
Normal:
bamdst -p <probe.bed> -o ./ in1.bam
Pipeline mode:
samtools view in1.bam -u | bamdst -p x.bed -o ./ -
(可以通过 bamdst --help 查看使用方法)
必须参数:
-o, --outdir output dir
--输出目录
-p, --bed probe or target regions file, the region file will be merged before calculate depths
--探测/探针(?)或目标区域文件,区域文件将在计算深度之前合并,也就是bed文件
可选参数:
参数 | 描述 |
---|---|
-f, --flank [200] | flank n bp of each region. --设置侧翼区域的大小,如果要计算侧翼区的覆盖率,请设置此值,默认值为 200。 [以5'侧翼区为例,侧翼区是指与基因5’末端毗连的DNA区域,这些区域通常包含启动子、增强子或者其他蛋白质结合位点,这部分的区域不能被转录为RNA。侧翼区(flanking region)和侧翼区单核苷酸多态性(Flanking SNPs):https://www.cnblogs.com/chenwenyan/p/5802874.html ] [侧翼序列:指存在于编码区第一个外显子和最末一个外显子的外侧是一段不被翻译的核苷酸序列。侧翼序列含有基因调控顺序,如5端含有的启动子、增强子,3端含有的终止子和多聚腺苷酸信号等,对基因表达起重要的调控作用。(之后总结) ] |
-q [20] | map quality cutoff value, greater or equal to the value will be count. --比对质量截止值,大于或等于该值将被计数。 影响结果中 [Total] Map quality cutoff value // Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads. |
--maxdepth [0] | set the max depth to stat the cumu distribution. --设置最大深度来统计累积分布。 对于一些项目来说,区域深度非常高,如果你不想在 Cumulation Distrbution 文件(depth_distribution.plot)中显示这些不正常的深度,可以设置 Cutoff 值来过滤它们。默认值为 0 (无过滤)。 |
--cutoffdepth [0] | list the coverage of above depths. --列出上述深度的覆盖度。 对于一些项目,人们关心指定深度的覆盖率,比如 10000x 等。bamdst 只计算 0x、4x、10x、30x、100x 的覆盖率,所以你可以设置这个值以在 coverage.report 文档中显示指定的 coverage。默认值为 0。 |
--isize [2000] | stat the inferred insert size under this value. --推断插入尺寸低于此值 对于错误的映射配对读长,推断的插入大小非常大。所以设置一个截止值。默认值为 2000。 |
--uncover [5] | region will included in uncover file if below it. --如果区域低于 Uncover 文档,则 Region 将包含在 Uncover 文档中。 为 Calculate the bad covered region (计算不良覆盖区域) 设置此截止值。默认值为 <5。 |
--bamout BAMFILE | target reads will be exported to this bam file. --目标reads将导出到此 BAM 文档。 |
-1 | begin position of bed file is 1-based. --bed的起始位置从1开始。 |
-h, --help | print this help info. --打印帮助信息。 |
还有一个参数已经被更新掉了,我用的版本是1.0.9
--use_rmdup (an invalid parament since v1.0.0 )
Use rmdup depth instead of cover depth to calculate the coverage of target regions and so on.
使用 rmdup depth 而不是 cover depth 来计算目标区域的覆盖率等等。
3.输出结果
主要输出7个文件
#--help中的介绍
* Five essential files would be created in the output dir.
* region.tsv.gz and depth.tsv.gz are zipped by bgzip, so you can use tabix
index these files.
- coverage.report a report of the coverage information and reads
information of whole target regions
- cumu.plot distribution data of depth values #(这里应该是指文件:depth_distribution.plot(?可能作者更新没改这里?))
- insert.plot distribution data of inferred insert size
- chromosome.report coverage information for each chromosome
- region.tsv.gz mean depth, median depth and coverage of each region
- depth.tsv.gz raw depth, rmdup depth, coverage depth of each position
- uncover.bed the bad covered or uncovered region in the probe file
* About depth.tsv.gz:
* There are five columns in this file, including chromosome, position, raw
* depth, rmdep depth, coverage depth
- chromosome the chromosome name
- position 1-based position of each chromosome
- raw depth raw depth of position, not filter
- rmdup depth remove duplication, and only calculate the reads which
are primary mapped and mapQ >= cutoff_mapQ (default 20)
- coverage depth calculate the deletions (CIGAR level) into depths,
for coverage use.
输出文件内容详解:
chromosomes.report
从bam文件中获取的每条染色体的深度、覆盖度等信息
#Chromosome DATA(%) Avg depth Median Coverage% Cov 4x % Cov 10x % Cov 30x % Cov 100x %
Chr1 13.38 90.92 4.0 60.75 51.37 43.06 31.89 18.56
coverage.report
总结的统计信息结果,这个的文件的信息有很多
## The file was created by bamdst
## Version : 1.0.9
## Files : /share/ofs1b/project/RNA_Seq/NLJKX20241028004_RnaSeq_T11_051/02_align/HF_N1_bam/HF_N1_Aligned.sort.out.bam
[Total] Raw Reads (All reads) 95543390 #All reads in the bam file(s).
[Total] QC Fail reads 0 #Reads number failed QC, this flag is marked by other software,like bwa. See flag in the bam structure.
[Total] Raw Data(Mb) 14248.07 #Total reads data in the bam file(s).
[Total] Paired Reads 95543390 #Paired reads numbers.
[Total] Mapped Reads 95543390 #Mapped reads numbers.
[Total] Fraction of Mapped Reads 100.00% #Ratio of mapped reads against raw reads.
[Total] Mapped Data(Mb) 14248.07 #Mapped data in the bam file(s).
[Total] Fraction of Mapped Data(Mb) 100.00% #Ratio of mapped data against raw data.
[Total] Properly paired 95543384 #Paired reads with properly insert size. See bam format protocol for details.
[Total] Fraction of Properly paired 100.00% #Ratio of properly paired reads against mapped reads
[Total] Read and mate paired 95543384 #Read (read1) and mate read (read2) paired.
[Total] Fraction of Read and mate paired 100.00% #Ratio of read and mate paired against mapped reads
[Total] Singletons 6 #Read mapped but mate read unmapped, and vice versa.
[Total] Read and mate map to diff chr 0 #Read and mate read mapped to different chromosome, usually because mapping error and structure variants.
[Total] Read1 47771698 #First reads in mate paired sequencing
[Total] Read2 47771692 #Mate reads
[Total] Read1(rmdup) 47771698 #First reads after remove duplications.
[Total] Read2(rmdup) 47771692 #Mate reads after remove duplications.
[Total] forward strand reads 47771695 #Number of forward strand reads.
[Total] backward strand reads 47771695 #Number of backward strand reads.
[Total] PCR duplicate reads 0 #PCR duplications.
[Total] Fraction of PCR duplicate reads 0.00% #Ratio of PCR duplications.
[Total] Map quality cutoff value 20 #Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads.
[Total] MapQuality above cutoff reads 93664629 #Number of reads with higher or equal quality score than cutoff value.
[Total] Fraction of MapQ reads in all reads 98.03% #Ratio of reads with higher or equal Q score against raw reads.
[Total] Fraction of MapQ reads in mapped reads 98.03% #Ratio of reads with higher or equal Q score against mapped reads.
[Target] Target Reads 91110291 #Number of reads covered target region (specified by bed file).
[Target] Fraction of Target Reads in all reads 95.36% #Ratio of target reads against raw reads.
[Target] Fraction of Target Reads in mapped reads 95.36% #Ratio of target reads against mapped reads.
[Target] Target Data(Mb) 13435.50 #Total bases covered target region. If a read covered target region partly, only the covered bases will be counted.
[Target] Target Data Rmdup(Mb) 13224.10 #Total bases covered target region after remove PCR duplications.
[Target] Fraction of Target Data in all data 94.30% #Ratio of target bases against raw bases.
[Target] Fraction of Target Data in mapped data 94.30% #Ratio of target bases against mapped bases.
[Target] Len of region 164886854 #The length of target regions.
[Target] Average depth 81.48 #Average depth of target regions. Calculated by "target bases / length of regions".
[Target] Average depth(rmdup) 80.20 #Average depth of target regions after remove PCR duplications.
[Target] Coverage (>0x) 51.61% #Ratio of bases with depth greater than 0x in target regions, which also means the ratio of covered regions in target regions.
[Target] Coverage (>=4x) 43.23% #Ratio of bases with depth greater than or equal to 4x in target regions.
[Target] Coverage (>=10x) 36.09% #Ratio of bases with depth greater than or equal to 10x in target regions.
[Target] Coverage (>=30x) 26.76% #Ratio of bases with depth greater than or equal to 30x in target regions.
[Target] Coverage (>=100x) 15.74% #Ratio of bases with depth greater than or equal to 100x in target regions.
[Target] Coverage (>=Nx) #This is addtional line for user self-defined cutoff value, see --cutoffdepth
[Target] Target Region Count 55104 #Number of target regions. In normal practise,it is the total number of exomes.
[Target] Region covered > 0x 25846 #The number of these regions with average depth greater than 0x.
[Target] Fraction Region covered > 0x 46.90% #Ratio of these regions with average depth greater than 0x.
[Target] Fraction Region covered >= 4x 41.22% #Ratio of these regions with average depth greater than or equal to 4x.
[Target] Fraction Region covered >= 10x 36.67% #Ratio of these regions with average depth greater than or equal to 10x.
[Target] Fraction Region covered >= 30x 29.09% #Ratio of these regions with average depth greater than or equal to 30x.
[Target] Fraction Region covered >= 100x 15.76% #Ratio of these regions with average depth greater than or equal to 100x.
[flank] flank size 200 #The flank size will be count. 200 bp in default. Oligos could also capture the nearby regions of target regions.
[flank] Len of region (not include target region) 186299040 #The length of flank regions (target regions will not be count).
[flank] Average depth 72.94 #Average depth of flank regions.
[flank] flank Reads 91839874 #The total number of reads covered the flank regions. Note: some reads covered the edge of target regions, will be count in flank regions also.
[flank] Fraction of flank Reads in all reads 96.12% #Ratio of reads covered in flank regions against raw reads.
[flank] Fraction of flank Reads in mapped reads 96.12% #Ration of reads covered in flank regions against mapped reads.
[flank] flank Data(Mb) 13588.68 #Total bases in the flank regions.
[flank] Fraction of flank Data in all data 95.37% #Ratio of total bases in the flank regions against raw data.
[flank] Fraction of flank Data in mapped data 95.37% #Ratio of total bases in the flank regions against mapped data.
[flank] Coverage (>0x) 48.01% #Ratio of flank bases with depth greater than 0x.
[flank] Coverage (>=4x) 39.51% #Ratio of flank bases with depth greater than or equal to 4x.
[flank] Coverage (>=10x) 32.68% #Ratio of flank bases with depth greater than or equal to 10x.
[flank] Coverage (>=30x) 24.06% #Ratio of flank bases with depth greater than or equal to 30x.
[flank] Coverage (>=100x) 14.07% #Ratio of flank bases with depth greater than or equal to 100x.(官网没这句,但我结果有,我自己加了一句)
depth_distribution.plot
结合R可以做出目标区域的深度分布图(.plot结尾的几个文件应该都可以作图,不过我没做这里,后面试试吧)
0 79772141 0.483856 85095379 0.516144
1 5844883 0.035452 79250496 0.480692
2 4686107 0.028423 74564389 0.452269
3 3289529 0.019953 71274860 0.432316
4 2778824 0.016855 68496036 0.415461
左边三列分别代表:覆盖深度,对应该深度的碱基数,碱基占比(以目标区域的碱基数为分母)。
右边两列是高深度向低深度累加的值,分别是碱基数及其占比,累加到X=1为止。
depth.tsv
对于探针文档 (in.bed) 中的每个位置,三种深度值,包括原始深度raw depth、删除重复读取的深度Rmdup depth和覆盖深度Cover depth。原始深度是从输入 bam 中检索的。Rmdup depth为去除了PCR重复的reads, 次优比对reads, 低比对质量的reads(mapQ < 20),该值与samtools depth的输出深度类似
其中,输出深度为 samtools 深度。覆盖深度是考虑删除区域的原始深度,因此此值应等于或大于原始深度。我们使用原始深度raw depth来统计 coverage.report 文件中的覆盖率信息。如果想用rmdup depth来计算覆盖率,可使用参数 “--use_rmdup”。
#Chr Pos Raw Depth Rmdup depth Cover depth
Chr1 2904 2 2 2
Chr1 2905 2 2 2
Chr1 2906 2 2 2
insertsize.plot
做出两个接头之间的插入片段大小的图,理论上是根据read1和read2在染色体上的坐标信息求得,这时read1和read2要求比对到一条染色体上。
0 0 0.000000 45594984 1.000000
1 0 0.000000 45594984 1.000000
2 0 0.000000 45594984 1.000000
3 0 0.000000 45594984 1.000000
region.tsv
可以用来评估捕获效率,结合基因自身的结构能探究其对捕获测序的影响
目标区域文件每一个区域的统计,其中Coverage以X>0来统计
#Chr Start Stop Avg depth Median Coverage Coverage(FIX)
Chr1 2903 10817 72.74 31.0 70.86 70.86
Chr1 11218 12435 21.75 21.0 100.00 100.00
uncover.bed
没有捕获的区域
--uncover的值默认是5,包含低覆盖或者是未覆盖的,通过"--uncover"规定“未覆盖”的阈值。
Chr1 2903 2914
Chr1 3751 4360
Chr1 4596 5458
第一次写这种,markdown甚至用的还不太熟练(所以啥都试了试),有些翻译可能有点词不达意,建议看看官网,如果有误还望指出,之后应该会再整理一些自己用过的软件或者包,当然可能已经烂大街了(ˉ▽ˉ;)...整理一下前人智慧自己看着玩儿吧。
整理参考,感谢:
https://www.jianshu.com/p/ac7b8e4d76e4
https://zhuanlan.zhihu.com/p/548250737
https://github.com/shiquan/bamdst