Three main steps in total:
step 1: extract the 503 EUR samples to get 1000G.EUR.QC.chr*, by batch-running plink via a Python script
step 2: extract the SNPs present in my-MDD to get 1000G.EUR.QC.chr*-sub-SNP, by batch-running plink via a Python script
step 3: run pruning to get MDD.chr*_plink_prune_EUR_filtered_LD0.8.prune.in, by batch-running plink via a Python script
For each Python script, prepare two parameter files in advance: 1. keep_sample/SNP_file  2. plink_list_file [column 1: file_name, column 2: chr_name]
Draft version
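The two-column plink_list_file used by every script below can itself be built programmatically. A minimal sketch (the `make_plink_list` helper and its regex are my own illustration, not from the source; it assumes the chromosome name is embedded in the .bim filename as `chrN`):

```python
import glob
import os
import re
import tempfile

def make_plink_list(bim_glob, out_file):
    """Write a two-column plink_list_file: bfile prefix and chromosome name."""
    rows = []
    for bim in sorted(glob.glob(bim_glob)):
        prefix = bim[:-len('.bim')]          # plink --bfile takes the path without .bim
        m = re.search(r'(chr[0-9XY]+)', os.path.basename(bim))
        if m:
            rows.append((prefix, m.group(1)))
    with open(out_file, 'w') as out:
        for prefix, chrom in rows:
            out.write(f'{prefix}\t{chrom}\n')
    return rows

# demo on dummy .bim files in a temporary directory
tmp = tempfile.mkdtemp()
for c in ('chr1', 'chr2'):
    open(os.path.join(tmp, f'1000G.EUR.QC.{c}.bim'), 'w').close()
rows = make_plink_list(os.path.join(tmp, '1000G.EUR.QC.chr*.bim'),
                       os.path.join(tmp, 'plink-list.txt'))
print(rows)
```

This reproduces what the `ls | cut | paste` shell pipelines below do by hand.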
## Extract samples: cat get_extract_sample.py
import sys

if __name__ == '__main__':
    keep_sample_file = sys.argv[1]
    plink_list_file = sys.argv[2]
    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
--keep {keep_sample_file} --make-bed --out 1000G.EUR.QC.{chrome}'''
            print(cmd)
            print('#' * 20)
            print()
##/data/home/lingw/bin/plink --bfile /data/home/lingw/workSpace/reference/1000Genomes/raw_vcf/plink/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf --keep for-subset-sample.txt --make-bed --out 1000G.EUR.QC.chr10
python get_extract_sample.py for-subset-sample.txt 1kg-plink.txt > run.sh
nohup sh run.sh
# after the run finishes, check that each output contains 503 samples
wc -l 1000G.EUR.QC.chr22.fam
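Rather than running wc -l on one chromosome, the 503-sample check can be applied to every .fam at once. A sketch (the `check_fam_counts` helper is my own illustration; a .fam file has one line per sample):

```python
import glob
import os
import tempfile

def check_fam_counts(fam_glob, expected=503):
    """Return {fam_file: line_count} for every .fam whose sample count != expected."""
    bad = {}
    for fam in sorted(glob.glob(fam_glob)):
        with open(fam) as f:
            n = sum(1 for _ in f)
        if n != expected:
            bad[fam] = n
    return bad

# demo: one correct file (503 lines) and one short file (500 lines)
tmp = tempfile.mkdtemp()
for name, n in (('1000G.EUR.QC.chr21.fam', 503), ('1000G.EUR.QC.chr22.fam', 500)):
    with open(os.path.join(tmp, name), 'w') as f:
        for i in range(n):
            f.write(f'FID{i} IID{i} 0 0 0 -9\n')
bad = check_fam_counts(os.path.join(tmp, '1000G.EUR.QC.chr*.fam'))
print(bad)
```

An empty dict means every chromosome kept exactly the 503 EUR samples.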
##plink
# prepare file 1: sub-sample-plink-list.txt
ls 1000G.EUR.QC.chr*.bim|cut -f1-4 -d '.' > aa.txt
ls 1000G.EUR.QC.chr*.bim|cut -f4 -d '.' > bb.txt
paste aa.txt bb.txt > sub-sample-plink-list.txt  # paste joins with a tab by default; `paste -d aa.txt bb.txt` is wrong (-d would consume aa.txt as the delimiter list)
# prepare file 2: extract_snp.txt
awk -F' ' '{print $3}' MDD_gwas_data.txt > extract_snp.txt  # awk splits on whitespace by default; -F sets the field separator explicitly
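The awk one-liner can be mirrored in Python. A sketch assuming, as the awk command does, that the SNP ID is the 3rd whitespace-separated column of MDD_gwas_data.txt (the helper name and demo rows are illustrative):

```python
import os
import tempfile

def extract_column(in_file, out_file, col=3):
    """Write the col-th whitespace-separated field of each line to out_file."""
    n = 0
    with open(in_file) as fin, open(out_file, 'w') as fout:
        for line in fin:
            fields = line.split()
            if len(fields) >= col:
                fout.write(fields[col - 1] + '\n')
                n += 1
    return n

# demo on a two-line dummy GWAS file (CHR BP SNP A1 A2)
tmp = tempfile.mkdtemp()
gwas = os.path.join(tmp, 'MDD_gwas_data.txt')
with open(gwas, 'w') as f:
    f.write('1 10583 rs58108140 A G\n')
    f.write('1 10611 rs189107123 G C\n')
n = extract_column(gwas, os.path.join(tmp, 'extract_snp.txt'))
snps = open(os.path.join(tmp, 'extract_snp.txt')).read().split()
```

The resulting extract_snp.txt is the SNP list handed to plink --extract below.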
##keep sample
##extract snp
cat > get_extract_snp.py
import sys

if __name__ == '__main__':
    keep_snp_file = sys.argv[1]
    plink_list_file = sys.argv[2]
    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
--extract {keep_snp_file} --make-bed --out 1000G.EUR.QC.{chrome}-sub-SNP'''
            print(cmd)
            print('#' * 20)
            print()
python get_extract_snp.py extract_snp.txt sub-sample-plink-list.txt > run-extract-snp.sh
nohup sh ./run-extract-snp.sh &
# nohup keeps the job running if the connection drops; & runs it in the background
jobs
## check how many SNPs are in the MDD GWAS file vs. how many were extracted
wc -l extract_snp.txt #8481297 extract_snp.txt
wc -l 1000G.EUR.QC.chr*-sub-SNP.bim #7829695 total
Losing ~600k SNPs is normal; if only a few hundred thousand had been extracted, too many would be missing and the extraction would need to be redone.
The causes of loss: 1. some SNP IDs differ from those in 1000G; 2. SNPs in the downloaded summary statistics are not covered by 1000G (the 1000G data is old). Usually this can be ignored unless the loss is large.
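The "is the loss acceptable" judgment above can be made explicit as a quick arithmetic check, using the counts from the wc -l output (the 50% redo threshold is an illustrative choice matching the "only a few hundred thousand extracted" scenario, not a number from the source):

```python
def snp_loss_fraction(n_requested, n_extracted):
    """Fraction of requested SNPs that were not found in the reference panel."""
    return 1 - n_extracted / n_requested

loss = snp_loss_fraction(8481297, 7829695)   # counts from extract_snp.txt / *.bim above
print(f'{loss:.1%} of SNPs lost')            # a few percent lost is normal
needs_redo = loss > 0.5                      # illustrative threshold: redo only if most SNPs vanished
```

Here the loss is under 8%, so the extraction stands.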
#### filter LD information
ls 1000G.EUR.QC.chr*-sub-SNP.bim|cut -f1-4 -d '.' > aa.txt  # strip the .bim extension to get the bfile prefix (cut -f1-3 -d '-' would leave .bim attached)
ls 1000G.EUR.QC.chr*-sub-SNP.bim|cut -f4 -d '.' |cut -f1 -d '-' >bb.txt
paste aa.txt bb.txt > sub-sample-SNP-list.txt
(plink --make-bed writes a new set of plink files; without --make-bed no new plink files are produced, so the originals are not overwritten.
--allow-no-sex is required: some samples have their sex coded as NA, and plink will error out without it.)
--indep-pairwise 50 5 0.8  # window of 50 SNPs, step of 5 SNPs, r^2 threshold of 0.8
## cat get_pruning.py
import sys

if __name__ == '__main__':
    plink_list_file = sys.argv[1]
    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
--indep-pairwise 50 5 0.8 --out MDD.{chrome}_plink_prune_EUR_filtered_LD0.8'''
            print(cmd)
            print('#' * 20)
            print()
python get_pruning.py sub-sample-SNP-list.txt > run-pruning.sh
wc -l MDD.chr7_plink_prune_EUR_filtered_LD0.8.prune.in MDD.chr7_plink_prune_EUR_filtered_LD0.8.prune.out
wc -l 1000G.EUR.QC.chr7-sub-SNP.bim
head MDD_EUR_LD0.8.prune
wc -l MDD_EUR_LD0.8.prune #2716381
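MDD_EUR_LD0.8.prune appears above without the command that produced it; presumably it is the concatenation of the per-chromosome .prune.in files (a shell `cat MDD.chr*_...prune.in` would do the same). A hedged sketch of that assumed step:

```python
import glob
import os
import tempfile

def concat_prune_in(pattern, out_file):
    """Concatenate all files matching pattern into out_file; return the total line count."""
    n = 0
    with open(out_file, 'w') as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:
                    out.write(line)
                    n += 1
    return n

# demo with two dummy per-chromosome prune.in files
tmp = tempfile.mkdtemp()
for chrom, snps in (('chr1', ['rs1', 'rs2']), ('chr2', ['rs3'])):
    path = os.path.join(tmp, f'MDD.{chrom}_plink_prune_EUR_filtered_LD0.8.prune.in')
    with open(path, 'w') as f:
        f.write('\n'.join(snps) + '\n')
total = concat_prune_in(os.path.join(tmp, 'MDD.chr*_plink_prune_EUR_filtered_LD0.8.prune.in'),
                        os.path.join(tmp, 'MDD_EUR_LD0.8.prune'))
```

The returned count should match the wc -l check on the combined file (2716381 above).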
From: https://www.cnblogs.com/corrschi/p/18162813