Three main steps in total:
step 1: extract the 503 EUR samples to get 1000G.EUR.QC.chr*, by batch-running plink via a Python script
step 2: extract the SNPs present in my-MDD to get 1000G.EUR.QC.chr*-sub-SNP, by batch-running plink via a Python script
step 3: run pruning to get MDD.chr*_plink_prune_EUR_filtered_LD0.8.prune.in, by batch-running plink via a Python script
For each Python script, prepare two parameter files in advance: 1. keep_sample/SNP_file  2. plink_list_file [column 1: file_name, column 2: chr_name]
Draft version
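The two-column plink_list_file used by every script below can itself be built programmatically. A minimal sketch (the `make_plink_list` helper and its regex are my own illustration, not from the source; it assumes the chromosome name is embedded in the .bim filename as `chrN`):

```python
import glob
import os
import re
import tempfile

def make_plink_list(bim_glob, out_file):
    """Write a two-column plink_list_file: bfile prefix and chromosome name."""
    rows = []
    for bim in sorted(glob.glob(bim_glob)):
        prefix = bim[:-len('.bim')]          # plink --bfile takes the path without .bim
        m = re.search(r'(chr[0-9XY]+)', os.path.basename(bim))
        if m:
            rows.append((prefix, m.group(1)))
    with open(out_file, 'w') as out:
        for prefix, chrom in rows:
            out.write(f'{prefix}\t{chrom}\n')
    return rows

# demo on dummy .bim files in a temporary directory
tmp = tempfile.mkdtemp()
for c in ('chr1', 'chr2'):
    open(os.path.join(tmp, f'1000G.EUR.QC.{c}.bim'), 'w').close()
rows = make_plink_list(os.path.join(tmp, '1000G.EUR.QC.chr*.bim'),
                       os.path.join(tmp, 'plink-list.txt'))
print(rows)
```

This reproduces what the `ls | cut | paste` shell pipelines below do by hand.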
## Extract samples: cat get_extract_sample.py
import sys

if __name__ == '__main__':
    keep_sample_file = sys.argv[1]
    plink_list_file = sys.argv[2]
    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
--keep {keep_sample_file} --make-bed --out 1000G.EUR.QC.{chrome}'''
            print(cmd)
            print('#' * 20)
            print()
##/data/home/lingw/bin/plink --bfile /data/home/lingw/workSpace/reference/1000Genomes/raw_vcf/plink/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf --keep for-subset-sample.txt --make-bed --out 1000G.EUR.QC.chr10
python get_extract_sample.py for-subset-sample.txt 1kg-plink.txt > run.sh
nohup sh run.sh
# after the run finishes, check that each output contains 503 samples
wc -l 1000G.EUR.QC.chr22.fam
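Rather than running wc -l on one chromosome, the 503-sample check can be applied to every .fam at once. A sketch (the `check_fam_counts` helper is my own illustration; a .fam file has one line per sample):

```python
import glob
import os
import tempfile

def check_fam_counts(fam_glob, expected=503):
    """Return {fam_file: line_count} for every .fam whose sample count != expected."""
    bad = {}
    for fam in sorted(glob.glob(fam_glob)):
        with open(fam) as f:
            n = sum(1 for _ in f)
        if n != expected:
            bad[fam] = n
    return bad

# demo: one correct file (503 lines) and one short file (500 lines)
tmp = tempfile.mkdtemp()
for name, n in (('1000G.EUR.QC.chr21.fam', 503), ('1000G.EUR.QC.chr22.fam', 500)):
    with open(os.path.join(tmp, name), 'w') as f:
        for i in range(n):
            f.write(f'FID{i} IID{i} 0 0 0 -9\n')
bad = check_fam_counts(os.path.join(tmp, '1000G.EUR.QC.chr*.fam'))
print(bad)
```

An empty dict means every chromosome kept exactly the 503 EUR samples.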
##plink
# prepare file 1: sub-sample-plink-list.txt
ls 1000G.EUR.QC.chr*.bim|cut -f1-4 -d '.' > aa.txt
ls 1000G.EUR.QC.chr*.bim|cut -f4 -d '.' > bb.txt
paste aa.txt bb.txt > sub-sample-plink-list.txt  # paste joins with a tab by default; `paste -d aa.txt bb.txt` is wrong (-d would consume aa.txt as the delimiter list)
# prepare file 2: extract_snp.txt
awk -F' ' '{print $3}' MDD_gwas_data.txt > extract_snp.txt  # awk splits on whitespace by default; -F sets the field separator explicitly
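The awk one-liner can be mirrored in Python. A sketch assuming, as the awk command does, that the SNP ID is the 3rd whitespace-separated column of MDD_gwas_data.txt (the helper name and demo rows are illustrative):

```python
import os
import tempfile

def extract_column(in_file, out_file, col=3):
    """Write the col-th whitespace-separated field of each line to out_file."""
    n = 0
    with open(in_file) as fin, open(out_file, 'w') as fout:
        for line in fin:
            fields = line.split()
            if len(fields) >= col:
                fout.write(fields[col - 1] + '\n')
                n += 1
    return n

# demo on a two-line dummy GWAS file (CHR BP SNP A1 A2)
tmp = tempfile.mkdtemp()
gwas = os.path.join(tmp, 'MDD_gwas_data.txt')
with open(gwas, 'w') as f:
    f.write('1 10583 rs58108140 A G\n')
    f.write('1 10611 rs189107123 G C\n')
n = extract_column(gwas, os.path.join(tmp, 'extract_snp.txt'))
snps = open(os.path.join(tmp, 'extract_snp.txt')).read().split()
```

The resulting extract_snp.txt is the SNP list handed to plink --extract below.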
##keep sample
##extract snp
cat > get_extract_snp.py
import sys

if __name__ == '__main__':
    keep_snp_file = sys.argv[1]
    plink_list_file = sys.argv[2]
    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
--extract {keep_snp_file} --make-bed --out 1000G.EUR.QC.{chrome}-sub-SNP'''
            print(cmd)
            print('#' * 20)
            print()
python get_extract_snp.py extract_snp.txt sub-sample-plink-list.txt > run-extract-snp.sh
nohup sh ./run-extract-snp.sh &
# nohup keeps the job running if the connection drops; & runs it in the background
jobs
## check how many SNPs are in the MDD GWAS file vs. how many were extracted
wc -l extract_snp.txt #8481297 extract_snp.txt
wc -l 1000G.EUR.QC.chr*-sub-SNP.bim #7829695 total
Losing ~600k SNPs is normal; if only a few hundred thousand had been extracted, too many would be missing and the extraction would need to be redone.
The causes of loss: 1. some SNP IDs differ from those in 1000G; 2. SNPs in the downloaded summary statistics are not covered by 1000G (the 1000G data is old). Usually this can be ignored unless the loss is large.
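The "is the loss acceptable" judgment above can be made explicit as a quick arithmetic check, using the counts from the wc -l output (the 50% redo threshold is an illustrative choice matching the "only a few hundred thousand extracted" scenario, not a number from the source):

```python
def snp_loss_fraction(n_requested, n_extracted):
    """Fraction of requested SNPs that were not found in the reference panel."""
    return 1 - n_extracted / n_requested

loss = snp_loss_fraction(8481297, 7829695)   # counts from extract_snp.txt / *.bim above
print(f'{loss:.1%} of SNPs lost')            # a few percent lost is normal
needs_redo = loss > 0.5                      # illustrative threshold: redo only if most SNPs vanished
```

Here the loss is under 8%, so the extraction stands.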
#### filter LD information
ls 1000G.EUR.QC.chr*-sub-SNP.bim|cut -f1-4 -d '.' > aa.txt  # strip the .bim extension to get the bfile prefix (cut -f1-3 -d '-' would leave .bim attached)
ls 1000G.EUR.QC.chr*-sub-SNP.bim|cut -f4 -d '.' |cut -f1 -d '-' >bb.txt
paste aa.txt bb.txt > sub-sample-SNP-list.txt
(plink --make-bed writes a new set of plink files; without --make-bed no new plink files are produced, so the originals are not overwritten.
--allow-no-sex is required: some samples have their sex coded as NA, and plink will error out without it.)
--indep-pairwise 50 5 0.8  # window of 50 SNPs, step of 5 SNPs, r^2 threshold of 0.8
## cat get_pruning.py
import sys

if __name__ == '__main__':
    plink_list_file = sys.argv[1]
    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
--indep-pairwise 50 5 0.8 --out MDD.{chrome}_plink_prune_EUR_filtered_LD0.8'''
            print(cmd)
            print('#' * 20)
            print()
python get_pruning.py sub-sample-SNP-list.txt > run-pruning.sh
wc -l MDD.chr7_plink_prune_EUR_filtered_LD0.8.prune.in MDD.chr7_plink_prune_EUR_filtered_LD0.8.prune.out
wc -l 1000G.EUR.QC.chr7-sub-SNP.bim
head MDD_EUR_LD0.8.prune
wc -l MDD_EUR_LD0.8.prune #2716381
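MDD_EUR_LD0.8.prune appears above without the command that produced it; presumably it is the concatenation of the per-chromosome .prune.in files (a shell `cat MDD.chr*_...prune.in` would do the same). A hedged sketch of that assumed step:

```python
import glob
import os
import tempfile

def concat_prune_in(pattern, out_file):
    """Concatenate all files matching pattern into out_file; return the total line count."""
    n = 0
    with open(out_file, 'w') as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:
                    out.write(line)
                    n += 1
    return n

# demo with two dummy per-chromosome prune.in files
tmp = tempfile.mkdtemp()
for chrom, snps in (('chr1', ['rs1', 'rs2']), ('chr2', ['rs3'])):
    path = os.path.join(tmp, f'MDD.{chrom}_plink_prune_EUR_filtered_LD0.8.prune.in')
    with open(path, 'w') as f:
        f.write('\n'.join(snps) + '\n')
total = concat_prune_in(os.path.join(tmp, 'MDD.chr*_plink_prune_EUR_filtered_LD0.8.prune.in'),
                        os.path.join(tmp, 'MDD_EUR_LD0.8.prune'))
```

The returned count should match the wc -l check on the combined file (2716381 above).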
From: https://www.cnblogs.com/corrschi/p/18162813