首页 > 数据库 >cBioPortal 数据库 API 使用

cBioPortal 数据库 API 使用

时间:2024-05-26 10:30:04浏览次数:32  
标签:... cBioPortal 05 res 数据库 TCGA API tcga luad

cBioPortal 数据库 API 使用

文章目录

前言

虽然 cBioPortal 数据库提供了很多交互式可视化图表来展示和探索不同组学癌症数据集中的基因组信息。但是,我们想要进行更深入的分析,就需要自己下载数据并进行个性化分析

cBioPortal 并没有一键式批量下载所有数据集的功能,只能选择对应的研究(study)一步步下载。当我们需要分析的数据集涉及多种癌型或想要进行泛癌分析时,这种方式还是偏繁琐。

因此,cBioPortal 为我们提供了 REST API 接口,我们可以通过代码来进行批量下载。接口使用的是 Swagger/OpenAPI 规范,该规范主要是用于描述 REST API,既可以作为文档给开发者阅读,也可以让机器根据这个文档自动生成客户端代码等

API 文档的链接地址:https://www.cbioportal.org/api/api-docs

文档是 json 格式的,里面详细描述了各函数及参数的格式和使用方式

使用这个 API 的方式较多,包括 RPython

R
  1. cBioPortalData: 推荐使用这个包
  2. rapiclient: 也可以用这个包,解析 API 文档
library(rapiclient)
client <- get_api(url = "https://www.cbioportal.org/api/api-docs")
  1. CGDSR:不推荐使用,将会废弃
Python
  1. bravado
from bravado.client import SwaggerClient
cbioportal = SwaggerClient.from_url(
    'https://www.cbioportal.org/api/api-docs',
    config={ 
        "validate_requests": False,
        "validate_responses": False
    }
)

我们分别介绍 cBioPortalData 包和 bravado 包的使用。如果你使用的是其他编程语言,只要使用对应的能解析 OpenAPI 规范的 API 文档的包就行

cBioPortalData

安装导入

BiocManager::install("cBioPortalData")

library(cBioPortalData)

1. 数据结构

cBioPortalData 会将每一个 study 保存为一个 MultiAssayExperiment 结构的对象(S4

例如,我们使用 cBioDataPack 函数来下载和解析 luad_tcga 所包含的所有数据

luad <- cBioDataPack(cancer_study_id = "luad_tcga", ask = FALSE)

luad 就是一个 MultiAssayExperiment 对象

> class(luad)
[1] "MultiAssayExperiment"
attr(,"package")
[1] "MultiAssayExperiment"

我们可以查看该对象所包含的 slot

> slotNames(luad)
[1] "ExperimentList" "colData"        "sampleMap"      "drops"          "metadata" 

其中,最重要的是

  • ExperimentList: 包含了所有的实验检测的数据,如突变、拷贝数、表达等
> experiments(luad)
ExperimentList class object of length 17:
 [1] cna_hg19.seg: RaggedExperiment with 81799 rows and 518 columns
 [2] CNA: SummarizedExperiment with 24776 rows and 516 columns
 [3] expression_median: SummarizedExperiment with 17814 rows and 32 columns
 [4] linear_CNA: SummarizedExperiment with 24776 rows and 516 columns
 [5] methylation_hm27_normals: SummarizedExperiment with 1788 rows and 24 columns
 [6] methylation_hm27: SummarizedExperiment with 1788 rows and 126 columns
 [7] methylation_hm450_normals: SummarizedExperiment with 16556 rows and 32 columns
 [8] methylation_hm450: SummarizedExperiment with 16556 rows and 460 columns
 [9] mRNA_median_all_sample_Zscores: SummarizedExperiment with 17814 rows and 32 columns
 [10] mRNA_median_Zscores: SummarizedExperiment with 16617 rows and 32 columns
 [11] mutations_extended: RaggedExperiment with 72541 rows and 230 columns
 [12] mutations_mskcc: RaggedExperiment with 72541 rows and 230 columns
 [13] RNA_Seq_v2_expression_median: SummarizedExperiment with 20531 rows and 517 columns
 [14] RNA_Seq_v2_mRNA_median_all_sample_Zscores: SummarizedExperiment with 20531 rows and 517 columns
 [15] RNA_Seq_v2_mRNA_median_Zscores: SummarizedExperiment with 20440 rows and 517 columns
 [16] rppa_Zscores: SummarizedExperiment with 222 rows and 365 columns
 [17] rppa: SummarizedExperiment with 223 rows and 365 columns
  • colData: 样本信息
> colData(luad)[1:4, 1:5]
DataFrame with 4 rows and 5 columns
               PATIENT_ID       SAMPLE_ID        OTHER_SAMPLE_ID SPECIMEN_CURRENT_WEIGHT DAYS_TO_COLLECTION
              <character>     <character>            <character>             <character>        <character>
TCGA-05-4244 TCGA-05-4244 TCGA-05-4244-01 bac0b02d-ac3b-4784-b..         [Not Available]    [Not Available]
TCGA-05-4249 TCGA-05-4249 TCGA-05-4249-01 80f196fe-1eaf-40cb-a..         [Not Available]    [Not Available]
TCGA-05-4250 TCGA-05-4250 TCGA-05-4250-01 8f274178-7a8e-46b6-8..         [Not Available]    [Not Available]
TCGA-05-4382 TCGA-05-4382 TCGA-05-4382-01 cce6d71f-369e-467f-b..         [Not Available]    [Not Available]
  • sampleMap: 样本与实验数据之间的对应关系
> sampleMap(luad)
DataFrame with 5029 rows and 3 columns
            assay      primary         colname
         <factor>  <character>     <character>
1    cna_hg19.seg TCGA-05-4244 TCGA-05-4244-01
2    cna_hg19.seg TCGA-05-4249 TCGA-05-4249-01
3    cna_hg19.seg TCGA-05-4250 TCGA-05-4250-01
4    cna_hg19.seg TCGA-05-4382 TCGA-05-4382-01
5    cna_hg19.seg TCGA-05-4384 TCGA-05-4384-01
...           ...          ...             ...
5025         rppa TCGA-NJ-A55O TCGA-NJ-A55O-01
5026         rppa TCGA-NJ-A55R TCGA-NJ-A55R-01
5027         rppa TCGA-NJ-A7XG TCGA-NJ-A7XG-01
5028         rppa TCGA-O1-A52J TCGA-O1-A52J-01
5029         rppa TCGA-S2-AA1A TCGA-S2-AA1A-01

如何获取实验数据呢?方法有很多种,用 assays 可以返回所有实验数据列表

> exp.all <- assays(luad)
> names(exp.all)
 [1] "cna_hg19.seg"                              "CNA"                                      
 [3] "expression_median"                         "linear_CNA"                               
 [5] "methylation_hm27_normals"                  "methylation_hm27"                         
 [7] "methylation_hm450_normals"                 "methylation_hm450"                        
 [9] "mRNA_median_all_sample_Zscores"            "mRNA_median_Zscores"                      
[11] "mutations_extended"                        "mutations_mskcc"                          
[13] "RNA_Seq_v2_expression_median"              "RNA_Seq_v2_mRNA_median_all_sample_Zscores"
[15] "RNA_Seq_v2_mRNA_median_Zscores"            "rppa_Zscores"                             
[17] "rppa" 

获取 CNV 数据

> exp.all$CNA[1:5, 1:5]
        TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3                -1              -1              -1               0               0
ACTRT2               -1              -1              -1               0               0
AGRN                 -1              -1              -1               0               0
ANKRD65              -1              -1              -1               0               0
ATAD3A               -1              -1              -1               0               0

或者,使用 assay 函数并传递实验名称

> assay(luad, "CNA")[1:5, 1:5]
        TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3                -1              -1              -1               0               0
ACTRT2               -1              -1              -1               0               0
AGRN                 -1              -1              -1               0               0
ANKRD65              -1              -1              -1               0               0
ATAD3A               -1              -1              -1               0               0

具体的就不在这介绍了,我们重点关注的是数据的下载和解析

2. API

上面我们介绍的是直接下载和解析一个 study 的数据,但有时,我们关注的并不是一个 study 里面的数据,而是想看某一类型的数据在泛癌中的表现情况。

例如,想要统计所有 TCGAstudyCNA 的变异情况,如果下载所有的 study 数据然后提取对应的 CNA 数据,则会显得很麻烦很慢,所以我们要使用 REST API 来获取特定类型的数据

首先,初始化 API

cbio <- cBioPortal()
> class(cbio)
[1] "cBioPortal"
attr(,"package")
[1] "cBioPortalData"

使用 getStudies 获取所有 studyID

studies <- getStudies(cbio)
> head(studies)
# A tibble: 6 × 12
  name     description     publicStudy pmid   citation groups status importDate allSampleCount studyId cancerTypeId
  <chr>    <chr>           <lgl>       <chr>  <chr>    <chr>   <int> <chr>               <int> <chr>   <chr>       
1 Pan-Lun… "Whole-exome s… TRUE        27158… TCGA, N… ""          0 2021-04-0…           1144 nsclc_… nsclc       
2 Head an… "TCGA Head and… TRUE        NA     NA       "PUBL…      0 2021-04-2…            530 hnsc_t… hnsc        
3 Breast … "Whole-exome s… TRUE        26451… TCGA, C… "PUBL…      0 2021-04-2…            817 brca_t… brca        
4 Ovarian… "Whole exome s… TRUE        21720… TCGA, N… "PUBL…      0 2021-04-2…            489 ov_tcg… hgsoc       
5 Uterine… "Whole exome s… TRUE        23636… TCGA, N… "PUBL…      0 2021-04-2…            373 ucec_t… ucec        
6 Bladder… "Whole-exome s… TRUE        28988… Roberts… "PUBL…      0 2021-04-2…            413 blca_t… blca        
# … with 1 more variable: referenceGenome <chr>

或者,使用 cBioPortal 的对象方法

resp <- cbio$getAllStudiesUsingGET()

该方法返回的是 response 类型的对象,我们使用 httr::content 函数来解析

parsedResponse <- httr::content(resp)

返回的是列表,可以计算 study 的数量

> length(parsedResponse)
[1] 318
> dim(studies)
[1] 318  12

统计这些 study 所涉及的癌型和总样本数

> length(unique(studies$cancerTypeId))
[1] 94
> sum(studies$allSampleCount)
[1] 133449

你可能会有一个疑问,我是怎么知道 getStudies 函数的,我们可以使用 ls 来列出包中所有的函数

> ls("package:cBioPortalData")
 [1] "allSamples"            "cBioCache"             "cBioDataPack"          "cBioPortal"           
 [5] "cBioPortalData"        "clinicalData"          "downloadStudy"         "genePanelMolecular"   
 [9] "genePanels"            "geneTable"             "getDataByGenePanel"    "getGenePanel"         
[13] "getGenePanelMolecular" "getSampleInfo"         "getStudies"            "loadStudy"            
[17] "molecularData"         "molecularProfiles"     "mutationData"          "removeDataCache"      
[21] "removePackCache"       "sampleLists"           "samplesInSampleLists"  "searchOps"            
[25] "setCache"              "studiesTable"          "untarStudy" 

例如,获取 sampleListId

> sampleLists(cbio, studyId = "luad_tcga")
# A tibble: 11 × 5
   category                                      name             description              sampleListId     studyId
   <chr>                                         <chr>            <chr>                    <chr>            <chr>  
 1 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 2 all_cases_with_mutation_data                  Samples with mu… Samples with mutation d… luad_tcga_seque… luad_t…
 3 all_cases_with_rppa_data                      Samples with pr… Samples protein data (R… luad_tcga_rppa   luad_t…
 4 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 5 all_cases_with_mutation_and_cna_data          Samples with mu… Samples with mutation a… luad_tcga_cnaseq luad_t…
 6 all_cases_with_mrna_array_data                Samples with mR… Samples with mRNA expre… luad_tcga_mrna   luad_t…
 7 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 8 all_cases_in_study                            All samples      All samples (586 sample… luad_tcga_all    luad_t…
 9 all_cases_with_cna_data                       Samples with CN… Samples with CNA data (… luad_tcga_cna    luad_t…
10 all_cases_with_mrna_rnaseq_data               Samples with mR… Samples with mRNA expre… luad_tcga_rna_s… luad_t…
11 all_cases_with_mutation_and_cna_and_mrna_data Complete samples Samples with mutation, … luad_tcga_3way_… luad_t…

根据 sampleListId 获取样本名称

> samplesInSampleLists(cbio, sampleListIds = c("luad_tcga_cna", "luad_tcga_mrna"))
CharacterList of length 2
[["luad_tcga_cna"]] TCGA-05-4384-01 TCGA-05-4390-01 TCGA-05-4425-01 ... TCGA-80-5611-01 TCGA-83-5908-01
[["luad_tcga_mrna"]] TCGA-35-3615-01 TCGA-44-2655-01 TCGA-44-2656-01 ... TCGA-44-3919-01 TCGA-44-4112-01

获取分子谱 molecularProfileId

> molecularProfiles(cbio, studyId = "luad_tcga")$molecularProfileId
 [1] "luad_tcga_rppa"                                      "luad_tcga_rppa_Zscores"                             
 [3] "luad_tcga_gistic"                                    "luad_tcga_mrna"                                     
 [5] "luad_tcga_mrna_median_Zscores"                       "luad_tcga_rna_seq_v2_mrna"                          
 [7] "luad_tcga_rna_seq_v2_mrna_median_Zscores"            "luad_tcga_linear_CNA"                               
 [9] "luad_tcga_methylation_hm27"                          "luad_tcga_methylation_hm450"                        
[11] "luad_tcga_mutations"                                 "luad_tcga_rna_seq_v2_mrna_median_all_sample_Zscores"
[13] "luad_tcga_mrna_median_all_sample_Zscores" 

根据 molecularProfileId 获取对应的数据,需要传入相应的样本编号和基因 ID

exps <- molecularData(
  api = cbio,
  molecularProfileIds = c("luad_tcga_mrna"),
  entrezGeneIds = 1:1000,
  sampleIds = c("TCGA-35-3615-01", "TCGA-44-2655-01")
)

获取临床数据

> clinicalData(cbio, studyId = "luad_tcga")
# A tibble: 586 × 24
   uniqueSampleKey   uniquePatientKey   sampleId  patientId  studyId CANCER_TYPE  CANCER_TYPE_DET… FRACTION_GENOME…
   <chr>             <chr>              <chr>     <chr>      <chr>   <chr>        <chr>            <chr>           
 1 VENHQS0wNS00MjQ0… VENHQS0wNS00MjQ0O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.4565          
 2 VENHQS0wNS00MjQ1… VENHQS0wNS00MjQ1O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… NA              
 3 VENHQS0wNS00MjQ5… VENHQS0wNS00MjQ5O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2221          
 4 VENHQS0wNS00MjUw… VENHQS0wNS00MjUwO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2362          
 5 VENHQS0wNS00Mzgy… VENHQS0wNS00MzgyO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.0854          
 6 VENHQS0wNS00Mzg0… VENHQS0wNS00Mzg0O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.0661          
 7 VENHQS0wNS00Mzg5… VENHQS0wNS00Mzg5O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.4579          
 8 VENHQS0wNS00Mzkw… VENHQS0wNS00MzkwO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.3056          
 9 VENHQS0wNS00Mzk1… VENHQS0wNS00Mzk1O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.5598          
10 VENHQS0wNS00Mzk2… VENHQS0wNS00Mzk2O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2340          
# … with 576 more rows, and 16 more variables: IS_FFPE <chr>, LONGEST_DIMENSION <chr>, ONCOTREE_CODE <chr>,
#   OTHER_SAMPLE_ID <chr>, PATHOLOGY_REPORT_FILE_NAME <chr>, PATHOLOGY_REPORT_UUID <chr>, SAMPLE_TYPE <chr>,
#   SAMPLE_TYPE_ID <chr>, SHORTEST_DIMENSION <chr>, SOMATIC_STATUS <chr>, SPECIMEN_SECOND_LONGEST_DIMENSION <chr>,
#   VIAL_NUMBER <chr>, MUTATION_COUNT <chr>, DAYS_TO_COLLECTION <chr>, OCT_EMBEDDED <chr>,
#   SAMPLE_INITIAL_WEIGHT <chr>

3. 可视化

3.1 绘制 K-M 生存曲线
clinical <- colData(luad)
clinical[which(clinical$OS_MONTHS == "[Not Available]"), "OS_MONTHS"] <- NA
clinical$OS_MONTHS <- as.numeric(clinical$OS_MONTHS)

library(survival)
library(survminer)

fit <- survfit(
  Surv(OS_MONTHS, as.numeric(substr(OS_STATUS, 1, 1))) ~ SEX,
  data = clinical
)
ggsurvplot(fit, data = clinical, risk.table = TRUE)

3.2 展示样本数最多的 20 种癌型
library(tidyverse)

cancerTypeCounts <-                                
  studies %>%
  filter(cancerTypeId != "mixed") %>%
  group_by(cancerTypeId) %>%                       
  summarise(totalSamples=sum(allSampleCount)) %>%  
  arrange(desc(totalSamples)) %>%                  
  top_n(20) %>%
  mutate(cancerTypeId = toupper(cancerTypeId))

cancerTypeCounts %>%
  mutate(cancerTypeId = factor(cancerTypeId, levels = rev(cancerTypeId))) %>%
  ggplot(aes(x = cancerTypeId, fill = cancerTypeId)) +
  geom_col(aes(y = totalSamples), show.legend = FALSE) +
  coord_flip() +
  theme_classic()

3.3 展示突变频率最高的 20 基因
# 获取 luad_tcga 项目的所有突变数据
mutation <- cbio$getMutationsInMolecularProfileBySampleListIdUsingGET(
  molecularProfileId='luad_tcga_mutations',
  sampleListId='luad_tcga_all',
  entrezGeneId
  projection='DETAILED'
)
# 提取基因 Symbol 和样本 ID
res <- httr::content(mutation)
res.mat <- do.call(rbind, lapply(res, function (x) {
  return(data.frame(
      gene = x$gene$hugoGeneSymbol,
      sample = x$sampleId)
    )}
  )
)
# 绘制柱状图
group_by(res.mat, gene) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20) %>%
  mutate(gene = factor(gene, levels = gene)) %>%
  ggplot(aes(gene, count)) +
  geom_col(fill = rainbow(20), show.legend = FALSE) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 0.7),
    panel.border = element_blank(),panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),axis.line = element_line(colour = "black")
  )

Python

Python 中,我们使用 bravado 库来连接 Open API/Swagger 规范的 REST API,使用这种规范的数据库还蛮多的,像 OncoKBGOGenome Nexus

安装

pip install --upgrade bravado

1. 基本使用

导入并创建对象

In [1]: from bravado.client import SwaggerClient

In [2]: cbioportal = SwaggerClient.from_url(
   ...:     'https://www.cbioportal.org/api/api-docs',
   ...:     config={
   ...:         "validate_requests": False,
   ...:         "validate_responses": False
   ...:     }
   ...: )

In [3]: type(cbioportal)
Out[3]: bravado.client.SwaggerClient

我们可以使用 Tab 键智能提示对象方法

也可以使用 dir 函数来列出所有属性

比如,我们想获取所有癌症类型,选择 Cancer_Types,然后再使用 Tab 键提示

可以看到两个对象方法,一个用于获取所有癌型,一个用于获取指定癌型

再使用 ? 来获取函数的文档

我们可以看到,getCancerTypeUsingGET 函数需要传递一个参数 cancerTypeId,而且传入的是字符串类型,而返回的是一个 bravado.http_future.HTTPFuture 对象,状态码为 200 表示查询成功

同样的,对于 getAllCancerTypesUsingGET 函数

5 个可选的参数,有相应的默认值,可以对返回结果排序或进行分页访问等

对于返回的 HTTPFuture 对象,我们需要调用 response() 方法来获取返回结果

In [4]: res = cbioportal.Cancer_Types.getAllCancerTypesUsingGET().response()

res 包含 3 个属性

result 属性保存了我们查询的所有结果列表

In [5]: type(res.result)
Out[5]: list
In [6]: res.result
Out[6]: [TypeOfCancer(cancerTypeId='aa', dedicatedColor='LightYellow', name='Aggressive Angiomyxoma', parent='soft_tissue', shortName='AA'),
 TypeOfCancer(cancerTypeId='aastr', dedicatedColor='Gray', name='Anaplastic Astrocytoma', parent='difg', shortName='AASTR'),
 TypeOfCancer(cancerTypeId='abc', dedicatedColor='LimeGreen', name='Activated B-cell Type', parent='dlbclnos', shortName='ABC'),
 TypeOfCancer(cancerTypeId='abl', dedicatedColor='LightSalmon', name='Acute Basophilic Leukemia', parent='amlnos', shortName='ABL'),
 TypeOfCancer(cancerTypeId='aca', dedicatedColor='Purple', name='Adrenocortical Adenoma', parent='adrenal_gland', shortName='ACA'),
 TypeOfCancer(cancerTypeId='acbc', dedicatedColor='HotPink', name='Adenoid Cystic Breast Cancer', parent='brca', shortName='ACBC'),
 ...]

使用 dir 来获取癌型类所包含的字段

In [7]: cancer_types = res.result

In [8]: len(cancer_types)
Out[8]: 869

In [9]: dir(cancer_types[0])
Out[9]: ['cancerTypeId', 'dedicatedColor', 'name', 'parent', 'shortName']

我们定义一个函数,用于将返回结果转换为 pd.DataFrame

In [10]: import pandas as pd
    ...: from collections import defaultdict

In [11]: def DataFrame(res):
    ...:     res_dict = defaultdict(list)
    ...:     for obj in res:
    ...:         for attr in dir(obj):
    ...:             res_dict[attr].append(obj[attr])
    ...:     return pd.DataFrame.from_dict(res_dict)
    ...: 
In [14]: DataFrame(cancer_types)
Out[14]: 
    cancerTypeId dedicatedColor                                  name         parent shortName
0             aa    LightYellow                Aggressive Angiomyxoma    soft_tissue        AA
1          aastr           Gray                Anaplastic Astrocytoma           difg     AASTR
2            abc      LimeGreen                 Activated B-cell Type       dlbclnos       ABC
3            abl    LightSalmon             Acute Basophilic Leukemia         amlnos       ABL
4            aca         Purple                Adrenocortical Adenoma  adrenal_gland       ACA
..           ...            ...                                   ...            ...       ...
864         wdls    LightYellow       Well-Differentiated Liposarcoma           lipo      WDLS
865         wdtc           Teal    Well-Differentiated Thyroid Cancer        thyroid      WDTC
866           wm      LimeGreen         Waldenstrom Macroglobulinemia            lpl        WM
867        wpscc           Blue  Warty Penile Squamous Cell Carcinoma           pscc     WPSCC
868           wt         Orange                          Wilms' Tumor         kidney        WT

[869 rows x 5 columns]

类似地,我们可以获取所有 study

res = cbioportal.Studies.getAllStudiesUsingGET().response()
studies = DataFrame(res.result)
获取基因信息
In [15]: res = cbioportal.Genes.getAllGenesUsingGET().response()

In [16]: gene_info = DataFrame(res.result)

In [17]: gene_info.query('hugoGeneSymbol == "TP53"')
Out[17]: 
        entrezGeneId geneticEntityId hugoGeneSymbol            type
113919          7157            None           TP53  protein-coding
获取样本信息
# 样本列表 ID
In [18]: res = cbioportal.Sample_Lists.getAllSampleListsInStudyUsingGET(studyId='acc_tcga').response()
    ...: sample_list = DataFrame(res.result)

# 所有样本
In [19]: res = cbioportal.Samples.getAllSamplesInStudyUsingGET(studyId='acc_tcga').response()
    ...: acc_samples = DataFrame(res.result)
获取临床信息
In [20]: res = cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId='acc_tcga').response()
    ...: clin_info = DataFrame(res.result)

In [21]: clin_info.head()
Out[21]: 
  clinicalAttribute      clinicalAttributeId  ...                   uniqueSampleKey                     value
0              None              CANCER_TYPE  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh  Adrenocortical Carcinoma
1              None     CANCER_TYPE_DETAILED  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh  Adrenocortical Carcinoma
2              None       DAYS_TO_COLLECTION  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh                      4691
3              None  FRACTION_GENOME_ALTERED  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh                    0.0585
4              None                  IS_FFPE  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh                        NO

[5 rows x 8 columns]

这些临床信息以长表的信息存储的,我们将其转换为宽表

In [22]: clin_long = clin_info[['clinicalAttributeId', 'sampleId', 'value']]

In [23]: clin_table = clin_long.pivot(index='clinicalAttributeId', columns='sampleId', values='value')

In [24]: clin_table.head()
Out[24]: 
sampleId                          TCGA-OR-A5J1-01           TCGA-OR-A5J2-01  ...           TCGA-PK-A5HB-01           TCGA-PK-A5HC-01
clinicalAttributeId                                                          ...                                                    
CANCER_TYPE              Adrenocortical Carcinoma  Adrenocortical Carcinoma  ...  Adrenocortical Carcinoma  Adrenocortical Carcinoma
CANCER_TYPE_DETAILED     Adrenocortical Carcinoma  Adrenocortical Carcinoma  ...  Adrenocortical Carcinoma  Adrenocortical Carcinoma
DAYS_TO_COLLECTION                           4691                      3124  ...                      3385                       359
FRACTION_GENOME_ALTERED                    0.0585                    0.4033  ...                    0.9279                    0.2901
IS_FFPE                                        NO                        NO  ...                        NO                        NO

[5 rows x 92 columns]
获取突变信息

先获取突变的分子谱 ID,一般是 studyId 加上 _mutations 后缀

In [25]: res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId='acc_tcga').response()
    ...: MP = DataFrame(res.result)

In [26]: MP.molecularProfileId
Out[26]: 
0                                        acc_tcga_rppa
1                                acc_tcga_rppa_Zscores
2                                      acc_tcga_gistic
3                             acc_tcga_rna_seq_v2_mrna
4              acc_tcga_rna_seq_v2_mrna_median_Zscores
5                                  acc_tcga_linear_CNA
6                           acc_tcga_methylation_hm450
7                                   acc_tcga_mutations
8    acc_tcga_rna_seq_v2_mrna_median_all_sample_Zsc...
Name: molecularProfileId, dtype: object

如果要获取所有样本的突变,可以传递以 _all 为后缀的 sampleListId

In [27]: studyId = 'acc_tcga'
    ...: res = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
    ...:     molecularProfileId = studyId + "_mutations",
    ...:     sampleListId = studyId + "_all",
    ...:     projection = 'DETAILED'
    ...: ).response()
    ...: mutations = DataFrame(res.result)
    
In [28]: mutations.head()
Out[28]: 
  alleleSpecificCopyNumber aminoAcidChange                                          center  ... validationStatus variantAllele variantType
0                     None            None                                   broad.mit.edu  ...         Untested             T         INS
1                     None            None                                    hgsc.bcm.edu  ...         Untested             T         SNP
2                     None            None  broad.mit.edu;ucsc.edu;bcgsc.ca;mdanderson.org  ...         Untested             T         SNP
3                     None            None  broad.mit.edu;ucsc.edu;bcgsc.ca;mdanderson.org  ...         Untested             T         SNP
4                     None            None                                   broad.mit.edu  ...         Untested             -         DEL

[5 rows x 40 columns]
获取 CNA 数据

获取拷贝数变异数据,需要注意一点,它的 molecularProfileId 并不是都以 gistic 结尾的,应该是流程不一样,有的是以 cna 结尾,例如

In [29]: res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId='nsclc_tcga_broad_2016').response()
    ...: MP = DataFrame(res.result)
    
In [30]: MP.molecularProfileId
Out[30]: 
0          nsclc_tcga_broad_2016_cna
1    nsclc_tcga_broad_2016_mutations
2       nsclc_tcga_broad_2016_fusion
Name: molecularProfileId, dtype: object

获取全部 CNA 数据

studyId = 'nsclc_tcga_broad_2016'
res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId=studyId).response()
MP = DataFrame(res.result)
molecularProfileId = MP.query('datatype == "DISCRETE"')['molecularProfileId'].values[0]

res = cbioportal.Discrete_Copy_Number_Alterations.getDiscreteCopyNumbersInMolecularProfileUsingGET(
    molecularProfileId = molecularProfileId,
    sampleListId = studyId + "_all",
    projection = 'DETAILED'
).response()
cna = DataFrame(res.result)
# 转换为宽矩阵型
cna_long = cna[['sampleId', 'entrezGeneId', 'alteration']]
cna_table = cna_long.pivot(index='entrezGeneId', columns='sampleId', values='alteration')

2. 可视化

获取完数据之后,可视化也就没什么难度了。但是我们还没介绍 Python 的可视化部分,这里就简单举个例子吧

我们使用的是 plotnine 模块,相当于 ggplo2Python 扩展,语法用起来几乎是一样的,这样看起来也就不会有什么难度了

2.1 样本的突变和拷贝数
from plotnine import *

studyId = 'acc_tcga'
# 拷贝数
res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId=studyId).response()
MP = DataFrame(res.result)
molecularProfileId = MP.query('datatype == "DISCRETE"')['molecularProfileId'].values[0]

res = cbioportal.Discrete_Copy_Number_Alterations.getDiscreteCopyNumbersInMolecularProfileUsingGET(
    molecularProfileId = molecularProfileId,
    sampleListId = studyId + "_all",
    projection = 'DETAILED'
).response()
cna = DataFrame(res.result)

cna_long = cna[['sampleId', 'entrezGeneId', 'alteration']]
cna_table = cna_long.pivot(index='entrezGeneId', columns='sampleId', values='alteration')
# 突变
res = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
    molecularProfileId = studyId + "_mutations",
    sampleListId = studyId + "_all",
    projection = 'DETAILED'
).response()
mutations = DataFrame(res.result)

# 统计样本的突变情况
sample_mut_count = mutations.groupby('sampleId').apply(lambda x: x.shape[0]).to_frame()
sample_mut_count.columns = ['mutation']
sample_mut_count = sample_mut_count.reset_index()
# 统计样本的 cna 情况
sample_cna_count = cna_table.count().to_frame()
sample_cna_count.columns = ['cna']
sample_cna_count = sample_cna_count.reset_index()
# 合并数据
mut_cna = pd.merge(sample_mut_count, sample_cna_count)

绘制图形
(
    ggplot(mut_cna)
    + geom_point(aes(x='cna', y='mutation', colour='sampleId'), 
                 show_legend=False, alpha=0.7)
    + theme_bw()
)

样本的突变和拷贝数变异好像有那么点像倒数函数曲线

2.2 样本占比
from collections import Counter

res = cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId='brca_tcga_pub').response()
clin_info = DataFrame(res.result)

clin_long = clin_info[['clinicalAttributeId', 'sampleId', 'value']]
clin_table = clin_long.pivot(index='clinicalAttributeId', columns='sampleId', values='value')
clin_table = clin_table.transpose()

df = clin_table[['ER_STATUS', 'HER2_STATUS', 'PR_STATUS']]
df = df.applymap(lambda x: x if x in ['Positive', 'Negative'] else 'NA')

(
    ggplot(df.melt()) 
    + geom_bar(aes('clinicalAttributeId', fill='value'))
    + scale_fill_manual(values=['#bebada', '#80b1d3', '#fb8072'])
    + theme_classic()
)

标签:...,cBioPortal,05,res,数据库,TCGA,API,tcga,luad
From: https://blog.csdn.net/dxs18459111694/article/details/139080439

相关文章

  • 京东JD商品sku信息API返回值挖掘:商业价值深度发现
    京东JD商品SKU(StockKeepingUnit,库存量单位)信息API的返回值挖掘可以为企业提供深度的商业价值发现。这些返回值通常包括商品的详细规格、价格、库存状态等关键信息,对于企业的商业决策、市场分析、用户体验等方面都有着重要的影响。首先,通过挖掘SKU信息API的返回值,企业可以获......
  • (免费领源码)Java/Mysql数据库+53102互联网美食分享平台,计算机毕业设计项目推荐上万套实
    springboot互联网互联网美食分享平台系   院XXXX学科门类XXX专   业 XXX班级XXX学   号XXX姓   名XXX指导菜谱大全 XXX菜谱大全职称XXX2023年2月摘 要大数据时代下,数据呈爆炸式地增长。为了迎合信息化时代的潮流和信息化......
  • (免费领源码)Java/Mysql数据库+53135高校大学生学科竞赛管理系统,计算机毕业设计项目推荐
    springboot高校大学生学科竞赛管理系统的设计与实现系   院XXXX学科门类XXX专   业 XXX班级XXX学   号XXX姓   名XXX2023年4月摘 要随着互联网趋势的到来,各行各业都在考虑利用互联网将自己推广出去,最好方式就是建立自己的互联......
  • (免费领源码)Java/Mysql数据库+53233基于SpringBoot的社区疫情防控系统,计算机毕业设计项
    springboot社区疫情防控管理系统摘 要信息化社会内需要与之针对性的信息获取途径,但是途径的扩展基本上为人们所努力的方向,由于站在的角度存在偏差,人们经常能够获得不同类型信息,这也是技术最为难以攻克的课题。针对社区疫情防控管理系统等问题,对社区疫情防控管理系统进行研......
  • APIO 2024 P3 爆标
    题目。关键想法:看成多项式,传点值。剩下在的等题出了再说。来了来了。可以做到\(n=94\)。提交记录。下面是做法。注:官方题解是\(n=4991\)。做法核心假设值域为\(m\)。结论:可以把\(n\)个值编码成\(2n\)个值,删\(n\)个元素后还能还原。做法:看成\(n-1\)次多项......
  • [集群聊天服务器]----(四)MySQL数据库模块
    上一节我们对网络模块和业务模块进行了解耦[集群聊天服务器]----(三)ChatService模块,解耦网络模块和业务模块,封装了回调函数。在项目中需要使用mysql数据库进行存储用户重要数据,并进行增删改查操作。,所以这一节我们对mysql的封装使用进行剖析,这里使用db类进行封装。数据库......
  • Vue3实战笔记(43)—Vue3组合式API下封装可复用ECharts图表组件
    文章目录前言一、封装echart图标钩子二、使用步骤总结前言接上文,已经安装好了ECharts,开始封装组件方便使用。一、封装echart图标钩子首先应用我们之前学习的钩子方式,在hooks目录下创建一个名为useECharts.js的文件,用于封装ECharts的逻辑:import{ref,onMo......
  • MySQL入门学习.数据库组成.触发器
        触发器是一种特殊的存储过程,它在特定的表上定义,并在特定的事件(如插入、更新或删除操作)发生时自动触发执行。触发器可以用于实现数据的完整性约束、日志记录、数据同步等功能。以下是一个简单的示例,展示了如何创建一个在插入数据时自动触发的触发器:CREATETRIGGERm......
  • 国产数据库TiDB的常用方法
    TiDB的常用方法主要涉及安装配置、数据操作、性能调优以及监控和维护等方面。以下是对这些常用方法的归纳和介绍:1.安装与配置安装TiDB:根据官方文档的指引,用户可以按照步骤进行TiDB的安装。配置TiDB:安装完成后,需要对TiDB的相关参数进行配置,包括端口号、数据目录等。这些配......
  • 数据库函数下拉式求和
    问题:如何用Dsum实现单条件求和的下拉函数解决:=DSUM($C$1:$E$9,D$1,$K$1:$K2)-SUM(L$1:L1)Dsum公式在第2行实现的是股票名称为A的求和结果;到第3行时变成股票名称为A和B的求和结果,这时需要减掉上一个单元格的数据;到第4行则需要减掉上两个单元格求和的数据。使用Sum(L$1:L1)......