Introduction to Hail
Hail is an open-source, general-purpose, Python-based analysis library and toolkit dedicated to processing genomic data.
Hail exists to support multi-dimensional, complex data structures such as those used in genome-wide association studies (GWAS).
Under the hood, Hail is implemented with Python, Scala, Java, and Apache Spark.
Hail is maintained by the Neale lab; questions can be sent to the team at [email protected]
Hail is a fairly young project that keeps improving and adding new features; see the release change log for details.
Who uses Hail
Hail is widely used both in academia and in the genomics industry.
Many academic papers have also been published using Hail.
Related references
Installing Hail
Prerequisites
1. JRE 8 or JDK 1.8. Java 8 is required; newer versions (9+) are not supported, mainly because the bundled Spark requires Java 8.
2. Python 3.6 or 3.7 (Miniconda Python 3.7 is recommended); Python 3.8 is not supported.
3. Whichever installation method you use, the Linux system needs a reasonably recent C and C++ standard library, for example:
GCC 5.0, and LLVM version 3.4 (which is Apple LLVM version 6.0) and later should install a compatible C++ standard library.
Most GNU/Linux distributions released since 2012 have a compatible C standard library
4. If you suspect your system's GCC is not recent enough, update the g++ package and install lz4. I am using a fairly recent CentOS system here, so I did not need to run the following commands:
Debian
sudo apt-get install g++ liblz4-dev
CentOS
sudo yum install gcc-c++ lz4 lz4-devel
Mac OS X
xcode-select --install
brew install lz4
Installing the conda command line
See the related article for details.
Once conda is installed, enter the conda environment.
Assuming conda was installed to /root/miniconda3, there is an activate script inside it:
/root/miniconda3/bin/activate
Run the following commands:
chmod 777 /root/miniconda3/bin/activate
source /root/miniconda3/bin/activate
Install Hail
Run the following commands:
conda create -n hail python'>=3.6,<3.8'
conda activate hail
pip install hail
If the commands succeed, the full output looks like this:
(base) [root@spark-client-test conda]# conda activate hail
(hail) [root@spark-client-test conda]#
(hail) [root@spark-client-test conda]#
(hail) [root@spark-client-test conda]# pip install hail
Collecting hail
Downloading hail-0.2.32-py3-none-any.whl (32.5 MB)
|????????????????????????????????| 32.5 MB 3.7 MB/s
Collecting hurry.filesize==0.9
Downloading hurry.filesize-0.9.tar.gz (2.8 kB)
Collecting requests<2.21.1,>=2.21.0
Downloading requests-2.21.0-py2.py3-none-any.whl (57 kB)
|????????????????????????????????| 57 kB 3.7 MB/s
Collecting asyncinit<0.3,>=0.2.4
Downloading asyncinit-0.2.4-py3-none-any.whl (2.8 kB)
Collecting bokeh<1.3,>1.1
Downloading bokeh-1.2.0.tar.gz (17.6 MB)
|????????????????????????????????| 17.6 MB 3.4 MB/s
Collecting nest-asyncio
Downloading nest_asyncio-1.2.3-py3-none-any.whl (4.5 kB)
Collecting gcsfs==0.2.1
Downloading gcsfs-0.2.1.tar.gz (51 kB)
|????????????????????????????????| 51 kB 393 kB/s
Collecting tabulate==0.8.3
Downloading tabulate-0.8.3.tar.gz (46 kB)
|????????????????????????????????| 46 kB 4.1 MB/s
Collecting python-json-logger==0.1.11
Downloading python-json-logger-0.1.11.tar.gz (6.0 kB)
Collecting pyspark<2.4.2,>=2.4
Downloading pyspark-2.4.1.tar.gz (215.7 MB)
|????????????????????????????????| 215.7 MB 61 kB/s
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)")': /simple/numpy/
Collecting numpy<2
Downloading numpy-1.18.1-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
|????????????????????????????????| 20.1 MB 10.4 MB/s
Collecting parsimonious<0.9
Downloading parsimonious-0.8.1.tar.gz (45 kB)
|????????????????????????????????| 45 kB 5.1 MB/s
Collecting aiohttp<3.7,>=3.6
Downloading aiohttp-3.6.2-cp37-cp37m-manylinux1_x86_64.whl (1.2 MB)
|????????????????????????????????| 1.2 MB 15.0 MB/s
Collecting scipy<1.4,>1.2
Downloading scipy-1.3.3-cp37-cp37m-manylinux1_x86_64.whl (25.2 MB)
|????????????????????????????????| 25.2 MB 8.2 MB/s
Collecting decorator<5
Downloading decorator-4.4.1-py2.py3-none-any.whl (9.2 kB)
Collecting PyJWT
Downloading PyJWT-1.7.1-py2.py3-none-any.whl (18 kB)
Collecting pandas<0.26,>0.24
Downloading pandas-0.25.3-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
|????????????????????????????????| 10.4 MB 13.4 MB/s
Collecting aiohttp-session<2.8,>=2.7
Downloading aiohttp_session-2.7.0-py3-none-any.whl (14 kB)
Requirement already satisfied: setuptools in /root/miniconda3/envs/hail/lib/python3.7/site-packages (from hurry.filesize==0.9->hail) (45.2.0.post20200210)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/envs/hail/lib/python3.7/site-packages (from requests<2.21.1,>=2.21.0->hail) (2019.11.28)
Collecting chardet<3.1.0,>=3.0.2
Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
|????????????????????????????????| 133 kB 21.5 MB/s
Collecting idna<2.9,>=2.5
Downloading idna-2.8-py2.py3-none-any.whl (58 kB)
|????????????????????????????????| 58 kB 10.3 MB/s
Collecting urllib3<1.25,>=1.21.1
Downloading urllib3-1.24.3-py2.py3-none-any.whl (118 kB)
|????????????????????????????????| 118 kB 16.8 MB/s
Collecting six>=1.5.2
Downloading six-1.14.0-py2.py3-none-any.whl (10 kB)
Collecting PyYAML>=3.10
Downloading PyYAML-5.3.tar.gz (268 kB)
|????????????????????????????????| 268 kB 18.5 MB/s
Collecting python-dateutil>=2.1
Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
|????????????????????????????????| 227 kB 11.4 MB/s
Collecting Jinja2>=2.7
Downloading Jinja2-2.11.1-py2.py3-none-any.whl (126 kB)
|????????????????????????????????| 126 kB 11.8 MB/s
Collecting pillow>=4.0
Downloading Pillow-7.0.0-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
|????????????????????????????????| 2.1 MB 10.3 MB/s
Collecting packaging>=16.8
Downloading packaging-20.1-py2.py3-none-any.whl (36 kB)
Collecting tornado>=4.3
Downloading tornado-6.0.3.tar.gz (482 kB)
|????????????????????????????????| 482 kB 13.0 MB/s
Collecting google-auth>=1.2
Downloading google_auth-1.11.1-py2.py3-none-any.whl (76 kB)
|????????????????????????????????| 76 kB 6.7 MB/s
Collecting google-auth-oauthlib
Downloading google_auth_oauthlib-0.4.1-py2.py3-none-any.whl (18 kB)
Collecting py4j==0.10.7
Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)
|????????????????????????????????| 197 kB 13.0 MB/s
Collecting multidict<5.0,>=4.5
Downloading multidict-4.7.4-cp37-cp37m-manylinux1_x86_64.whl (149 kB)
|????????????????????????????????| 149 kB 13.0 MB/s
Collecting yarl<2.0,>=1.0
Downloading yarl-1.4.2-cp37-cp37m-manylinux1_x86_64.whl (256 kB)
|????????????????????????????????| 256 kB 9.8 MB/s
Collecting async-timeout<4.0,>=3.0
Downloading async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
Collecting attrs>=17.3.0
Downloading attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Collecting pytz>=2017.2
Downloading pytz-2019.3-py2.py3-none-any.whl (509 kB)
|????????????????????????????????| 509 kB 11.3 MB/s
Collecting MarkupSafe>=0.23
Downloading MarkupSafe-1.1.1-cp37-cp37m-manylinux1_x86_64.whl (27 kB)
Collecting pyparsing>=2.0.2
Downloading pyparsing-2.4.6-py2.py3-none-any.whl (67 kB)
|????????????????????????????????| 67 kB 8.1 MB/s
Collecting cachetools<5.0,>=2.0.0
Downloading cachetools-4.0.0-py3-none-any.whl (10 kB)
Collecting pyasn1-modules>=0.2.1
Downloading pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
|????????????????????????????????| 155 kB 13.0 MB/s
Collecting rsa<4.1,>=3.1.4
Downloading rsa-4.0-py2.py3-none-any.whl (38 kB)
Collecting requests-oauthlib>=0.7.0
Downloading requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting pyasn1<0.5.0,>=0.4.6
Downloading pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
|????????????????????????????????| 77 kB 7.3 MB/s
Collecting oauthlib>=3.0.0
Downloading oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
|????????????????????????????????| 147 kB 11.5 MB/s
Building wheels for collected packages: hurry.filesize, bokeh, gcsfs, tabulate, python-json-logger, pyspark, parsimonious, PyYAML, tornado
Building wheel for hurry.filesize (setup.py) ... done
Created wheel for hurry.filesize: filename=hurry.filesize-0.9-py3-none-any.whl size=4132 sha256=a6b6a98980ee1a921fc77fe3c6d76c26b53408afef398beb000dca534e5cc3cd
Stored in directory: /root/.cache/pip/wheels/2c/99/7f/8c88c372b4bd642a731232e63cb89467554f6cea7708574e49
Building wheel for bokeh (setup.py) ... done
Created wheel for bokeh: filename=bokeh-1.2.0-py3-none-any.whl size=8233357 sha256=fadb5f1812eb81dea3d26393057a6ab08f0b269bea4a021a5b280ecee8e25bda
Stored in directory: /root/.cache/pip/wheels/e9/2a/b6/0ffbf2eac54e489186db36a12cd6a2cebee9e6aba3dc8bff89
Building wheel for gcsfs (setup.py) ... done
Created wheel for gcsfs: filename=gcsfs-0.2.1-py3-none-any.whl size=27182 sha256=6f9b48133558c2bd1f06e85b8bca22db450be12cd509046648d904ccde891a9d
Stored in directory: /root/.cache/pip/wheels/87/f6/97/5a407bea9c29e37733a1db25076fb17ce4b6d863034f52f898
Building wheel for tabulate (setup.py) ... done
Created wheel for tabulate: filename=tabulate-0.8.3-py3-none-any.whl size=23378 sha256=01710251a1960d89f05787fdd3e4fd02afa7f15f55a9c7ad45468d4ae04c20a6
Stored in directory: /root/.cache/pip/wheels/b8/a2/a6/812a8a9735b090913e109133c7c20aaca4cf07e8e18837714f
Building wheel for python-json-logger (setup.py) ... done
Created wheel for python-json-logger: filename=python_json_logger-0.1.11-py2.py3-none-any.whl size=5075 sha256=f30f8962b78fdd72c7027842d3138bd3eb5750e9e5beae03456208d45709fc54
Stored in directory: /root/.cache/pip/wheels/fa/7f/fd/92ccdbb9d1a65486406e0363d2ba5b4ce52f400a915f602ecb
Building wheel for pyspark (setup.py) ... done
Created wheel for pyspark: filename=pyspark-2.4.1-py2.py3-none-any.whl size=216054604 sha256=240829ce909e1c8e8c57a2483e638733b5ca755bd3ae296023d74882d701bb63
Stored in directory: /root/.cache/pip/wheels/7e/9e/dd/d9be11f6d64e6d1d6c105e7e53e808f6d38788f0903e2bc72f
Building wheel for parsimonious (setup.py) ... done
Created wheel for parsimonious: filename=parsimonious-0.8.1-py3-none-any.whl size=42709 sha256=2b6693e8021d511f95f950bbdf4dddcf2f9e90462dacbf35e1837358e06b8836
Stored in directory: /root/.cache/pip/wheels/88/5d/ba/f27d8af07306b65ee44f9d3f9cadea1db749a421a6db8a99bf
Building wheel for PyYAML (setup.py) ... done
Created wheel for PyYAML: filename=PyYAML-5.3-cp37-cp37m-linux_x86_64.whl size=44228 sha256=f9136de58f310b1005c22edd7196dbf3a5590cd11af8a78d457f7390d59bbf67
Stored in directory: /root/.cache/pip/wheels/8a/55/a4/c0a81d27c33462cfdcb904db018f5550197e88b2b6b85beed2
Building wheel for tornado (setup.py) ... done
Created wheel for tornado: filename=tornado-6.0.3-cp37-cp37m-linux_x86_64.whl size=418877 sha256=2294c7dd506acb7cd16a3a81c0690de38155a36aa913fe23066602f18332d24d
Stored in directory: /root/.cache/pip/wheels/d0/31/2c/9406ed59f0dcdce0c453a8664124d738551590e74fc087f604
Successfully built hurry.filesize bokeh gcsfs tabulate python-json-logger pyspark parsimonious PyYAML tornado
Installing collected packages: hurry.filesize, chardet, idna, urllib3, requests, asyncinit, six, PyYAML, python-dateutil, MarkupSafe, Jinja2, numpy, pillow, pyparsing, packaging, tornado, bokeh, nest-asyncio, cachetools, pyasn1, pyasn1-modules, rsa, google-auth, oauthlib, requests-oauthlib, google-auth-oauthlib, decorator, gcsfs, tabulate, python-json-logger, py4j, pyspark, parsimonious, multidict, yarl, async-timeout, attrs, aiohttp, scipy, PyJWT, pytz, pandas, aiohttp-session, hail
Successfully installed Jinja2-2.11.1 MarkupSafe-1.1.1 PyJWT-1.7.1 PyYAML-5.3 aiohttp-3.6.2 aiohttp-session-2.7.0 async-timeout-3.0.1 asyncinit-0.2.4 attrs-19.3.0 bokeh-1.2.0 cachetools-4.0.0 chardet-3.0.4 decorator-4.4.1 gcsfs-0.2.1 google-auth-1.11.1 google-auth-oauthlib-0.4.1 hail-0.2.32 hurry.filesize-0.9 idna-2.8 multidict-4.7.4 nest-asyncio-1.2.3 numpy-1.18.1 oauthlib-3.1.0 packaging-20.1 pandas-0.25.3 parsimonious-0.8.1 pillow-7.0.0 py4j-0.10.7 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-2.4.6 pyspark-2.4.1 python-dateutil-2.8.1 python-json-logger-0.1.11 pytz-2019.3 requests-2.21.0 requests-oauthlib-1.3.0 rsa-4.0 scipy-1.3.3 six-1.14.0 tabulate-0.8.3 tornado-6.0.3 urllib3-1.24.3 yarl-1.4.2
(hail) [root@spark-client-test conda]#
If pip reports that a package cannot be found or a version does not match, try upgrading pip as suggested:
pip install --upgrade pip
and then re-run:
pip install hail
As the installation output shows, Hail bundles its own PySpark.
Starting and testing Hail
After a successful installation you can try running Hail.
Use the following commands; first enter the Python environment:
conda activate hail
python
import hail as hl
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt.count()
A successful run produces output like this:
(hail) [root@spark-client-test conda]# python
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
Initializing Spark and Hail with default parameters...
20/02/14 10:45:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.4.4
SparkUI available at http://spark-client-test:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.32-a5876a0a2853
LOGGING: writing to /spark/spark-2.4.4-bin-hadoop2.7/conda/hail-20200214-1045-0.2.32-a5876a0a2853.log
2020-02-14 10:45:13 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 100 variants...
>>> mt.count()
(100, 50)
About Spark
By default Hail uses local Spark, i.e. Spark in local mode.
If you need a remote Spark cluster instead, you can point Hail at it via the PYSPARK_SUBMIT_ARGS environment variable,
for example:
export PYSPARK_SUBMIT_ARGS="--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT} --deploy-mode client --jars /usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar,/spark/jars/hadoop-aws-2.7.6.jar,/spark/jars/aws-java-sdk-1.7.4.jar,/spark/jars/mongo-java-driver-3.4.2.jar,/spark/jars/mongo-spark-connector_2.11-2.2.0.jar --conf spark.driver.extraClassPath=/usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.extraClassPath=/usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.instances=3 --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator --conf spark.kubernetes.container.image=123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/spark-py:vs245py37 --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --conf spark.hadoop.fs.s3a.fast.upload=true --conf spark.hadoop.fs.s3a.endpoint=https://s3.cn-northwest-1.amazonaws.com.cn --conf spark.hadoop.fs.s3a.access.key=123 --conf spark.hadoop.fs.s3a.secret.key=456 --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4 --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4
--conf spark.kubernetes.namespace=spark
--conf spark.driver.host=`hostname -i`
pyspark-shell
"
Alternatively, pass the parameters to spark-submit as follows:
HAIL_HOME=$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')
spark-submit \
--jars $HAIL_HOME/hail-all-spark.jar \
--conf spark.driver.extraClassPath=$HAIL_HOME/hail-all-spark.jar \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
your-hail-python-script-here.py
Increasing Hail's memory
In a Linux shell, run:
export PYSPARK_SUBMIT_ARGS="--driver-memory <YOUR_MEMORY_HERE>G pyspark-shell"
Adding information from the Annotation Database
The Annotation Database is an annotation database designed to plug into Hail analysis pipelines. The following commands add annotations from this database to the rows of our MatrixTable mt.
Use the commands:
conda activate hail
python
import hail as hl
db = hl.experimental.DB()
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt = db.annotate_rows_db(mt)
mt.show()
Datasets
Hail ships with a number of built-in genomic datasets, which can be loaded with load_dataset().
In Python:
>>> # Load 1000 Genomes MatrixTable with GRCh38 coordinates
>>> mt_1kg = hl.experimental.load_dataset(name='1000_genomes', version='phase3', reference_genome='GRCh38')
This includes the 1000 Genomes dataset (1000_genomes).
For more datasets, see Datasets.
Types
Normally we do not need to worry about data types, because Hail infers them automatically from the values.
A few points are worth noting, though (see the sketch after this list):
1. When importing a table with import_table(), if the type Hail infers for a field is not what we want, we can specify the type explicitly.
2. literal() is used to convert a Python value into a Hail expression.
3. Missing values of a given type are declared with null().
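A minimal sketch of these three points (the file path and column name here are just placeholders):
>>> # 1. force column C1 to be read as int32 instead of the default type
>>> ht = hl.import_table('data/example.tsv', types={'C1': hl.tint32})
>>> # 2. wrap a Python list as a Hail expression with literal()
>>> arr = hl.literal([1, 2, 3])
>>> # 3. declare a missing value of type float64 with null()
>>> f = hl.null('float64')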
The following walks through some type-related commands:
conda activate hail
python
import hail as hl
hl.int32(3).dtype
The output is:
dtype('int32')
Full output:
(base) [root@spark-client-test spark]# conda activate hail
(hail) [root@spark-client-test spark]# python
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> hl.int32(3).dtype
dtype('int32')
>>>
When we construct a value, its data type is also displayed:
>>> hl.int32(3)
<Int32Expression of type int32>
>>> hl.struct(name='Hail', dob=2015)
<StructExpression of type struct{name: str, dob: int32}>
>>> hl.struct(name='Hail', dob=2015).dtype
dtype('struct{name: str, dob: int32}')
Note that, unlike Python,
Hail's sets, arrays, dicts, and tuples can only hold values of a single type.
Python allows mixed containers such as ['1', 2, 3.0], holding both numbers and strings;
Hail does not.
In a dict, the key type and the value type may differ, but all keys must share one type and all values must share one type.
For example:
{'a': 1, 'b': 2}
has type
dict<str, int32>
Hail also supports a type called tstruct, which can hold a combination of different types, for example:
hl.struct(name='Hail', dob=2015)
which has data type
dtype('struct{name: str, dob: int32}')
where name is a string and dob is an integer.
Structs are very common in Hail, for example:
>>> new_table = table1.annotate(table2_fields = table2[table1.key])
This adds a field named table2_fields to table1
and assigns the resulting table to new_table.
The table2_fields field is a struct containing the nested fields from table2.
For more on types, see Types.
Expressions
A Hail expression is a concise representation of a computation over data.
In Hail, every data type has its own expression class.
For example, Int32Expression represents a 32-bit integer,
and BooleanExpression represents a boolean value of True or False:
>>> hl.int32(5)
<Int32Expression of type int32>
>>> hl.bool(True)
<BooleanExpression of type bool>
Expressions of the same type can be added, subtracted, multiplied, and divided:
>>> hl.int32(5) + hl.int32(6)
<Int32Expression of type int32>
Adding two Int32Expressions produces a new Int32Expression object.
Expressions are lazy: their values are only computed when they are actually used.
This is easiest to understand by comparing Python with Hail expressions, as follows.
In Python, the expression 5 + 6 is evaluated immediately; when we print the result we see the computed value:
>>> x = 5
>>> y = 6
>>> z = x + y
>>> z
11
Hail does not evaluate 5 + 6 immediately; it stores the expression and only computes it when it is used:
>>> x = hl.int32(5)
>>> y = hl.int32(6)
>>> z = x + y
>>> z
<Int32Expression of type int32>
>>> hl.eval(z)
11
>>> z.show()
+--------+
| <expr> |
+--------+
| int32 |
+--------+
| 11 |
+--------+
Hail expressions are used throughout Hail Tables and MatrixTables, for example for row and column computations and for adding new fields:
>>> ht2 = ht.annotate(C4 = ht.C3 + 3 * ht.C2 ** 2)
The code above adds a new field C4, computed from the C3 and C2 fields of the table ht.
Hail expressions cannot be used with numpy or scipy.
Hail expressions also support arrays and other container types:
>>> a = hl.array([1, 2, -3, 0, 5])
>>> a
<ArrayNumericExpression of type array<int32>>
>>> a.dtype
dtype('array<int32>')
>>> a[1]
<Int32Expression of type int32>
>>> a[1:-1]
<ArrayNumericExpression of type array<int32>>
>>> hl.literal([0,1,2])
<ArrayNumericExpression of type array<int32>>
Logical operators
Hail's logical operators differ from Python's:
Python uses and, or, not,
while Hail uses &, |, ~.
For example:
>>> s1 = hl.int32(3) == 4
>>> s2 = hl.int32(3) != 4
>>> s1 & s2
<BooleanExpression of type bool>
>>> s1 | s2
<BooleanExpression of type bool>
>>> ~s1
<BooleanExpression of type bool>
>>> hl.eval(~s1)
True
Conditionals
Python's if and else cannot be used inside Hail expressions;
Hail provides cond(), case(), and switch() instead.
For example, the Python code
if x > 0:
    return 1
else:
    return 0
needs to be rewritten as:
>>> hl.cond(x > 0, 1, 0)
<Int32Expression of type int32>
This expression can also be used in further computations:
>>> a + hl.cond(x > 0, 1, 0)
<ArrayNumericExpression of type array<int32>>
Using case()
To implement
return 1 if x < -1, 2 if -1 <= x <= 2, and 3 if x > 2:
>>> (hl.case()
... .when(x < -1, 1)
... .when((x >= -1) & (x <= 2), 2)
... .when(x > 2, 3)
... .or_missing())
<Int32Expression of type int32>
or_missing() means that if none of the conditions match, a missing value is returned.
Using switch()
>>> csq = hl.str('nonsense')
>>> (hl.switch(csq)
... .when("synonymous", False)
... .when("intron", False)
... .when("nonsense", True)
... .when("indel", True)
... .or_missing())
<BooleanExpression of type bool>
Missingness
To declare a missing (null) expression:
>>> hl.null('float64')
<Float64Expression of type float64>
>>> hl.cond(x > 2.0, x, hl.null(hl.tfloat))
<Float64Expression of type float64>
>>> cnull = hl.null('call')
>>> hl.eval(cnull.is_het())
None
See the documentation for more usage.
Tables
A Hail Table is the analogue of a SQL table, a pandas DataFrame, an R data.frame, a dplyr tibble, or a Spark DataFrame.
Importing
Hail can import tables from various data sources; the most common are TSV and CSV files.
For example:
>>> ht = hl.import_table("data/kt_example1.tsv", impute=True)
Many genetics-specific import methods also return Tables, such as import_locus_intervals(), import_fam(), and import_bed().
The table contents can be displayed as follows:
>>> ht.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| 1 | 65 | "M" | 5 | 4 | 2 | 50 | 5 |
| 2 | 72 | "M" | 6 | 3 | 2 | 61 | 1 |
| 3 | 70 | "F" | 7 | 3 | 10 | 81 | -5 |
| 4 | 60 | "F" | 8 | 2 | 11 | 90 | -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
Global fields
In addition to rows and columns like an ordinary table, a Hail Table also has global fields. A global field behaves like a column whose value is identical in every row; for example, G below is a global field whose value is 5:
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 | G |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| 1 | 65 | M | 5 | 4 | 2 | 50 | 5 | 5 |
| 2 | 72 | M | 6 | 3 | 2 | 61 | 1 | 5 |
| 3 | 70 | F | 7 | 3 | 10 | 81 | -5 | 5 |
| 4 | 60 | F | 8 | 2 | 11 | 90 | -10 | 5 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
However, the value 5 is stored only once rather than once per row.
The fields can be inspected with describe():
>>> ht.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'ID': int32
'HT': int32
'SEX': str
'X': int32
'Z': int32
'C1': int32
'C2': int32
'C3': int32
----------------------------------------
Key:
None
----------------------------------------
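A global field like the G column shown above can be added with Table.annotate_globals(); a minimal sketch:
>>> ht_g = ht.annotate_globals(G = 5)
>>> ht_g.describe()   # G now appears under "Global fields"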
Keys
Row keys are defined with Table.key_by(); they become important when joining tables. For example:
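A minimal sketch, keying our example table by its ID column (the same step appears again in the join section below):
>>> ht = ht.key_by('ID')
>>> ht.key
<StructExpression of type struct{ID: int32}>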
Accessing fields
>>> ht
<hail.table.Table at 0x110791a20>
>>> ht.ID
<Int32Expression of type int32>
>>> ht.G
>>> ht['G']
Updating fields
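An existing non-key field is updated simply by annotating it again under the same name (the annotate_entries example with GT later in this article does the same); a minimal sketch, where the +1 is just an illustration:
>>> ht_new = ht.annotate(HT = ht.HT + 1)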
Dropping fields
>>> ht.drop('C1', 'C2')
>>> ht.drop(*['C1', 'C2'])
Selecting fields
>>> ht.select(ht.ID, ht.SEX)
>>> ht.select(*['ID', 'C3'])
Filtering rows and adding fields
To keep matching rows:
>>> table_result = table1.filter(table1.C1 == 5)
To remove matching rows:
>>> table_result = table1.filter(table1.C1 == 10, keep=False)
To add a new field:
>>> ht_new = ht.annotate(id_times_2 = ht.ID * 2)
Aggregation
Hail provides a number of methods for computing aggregate values over a table:
>>> ht.aggregate(hl.agg.fraction(ht.SEX == 'F'))
0.5
>>> ht_agg = (ht.group_by(ht.SEX)
... .aggregate(mean = hl.agg.mean(ht.HT)))
>>> ht_agg.show()
+-----+----------+
| SEX | mean |
+-----+----------+
| str | float64 |
+-----+----------+
| "F" | 6.50e+01 |
| "M" | 6.85e+01 |
+-----+----------+
Joins
Two tables can be joined, but their keys must be set first.
For example:
>>> ht = ht.key_by('ID')
>>> ht2 = hl.import_table("data/kt_example2.tsv", impute=True).key_by('ID')
>>> ht_join = ht.join(ht2)
>>> ht_join.show(width=120)
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 | A | B |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 | str |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| 1 | 65 | "M" | 5 | 4 | 2 | 50 | 5 | 65 | "cat" |
| 2 | 72 | "M" | 6 | 3 | 2 | 61 | 1 | 72 | "dog" |
| 3 | 70 | "F" | 7 | 3 | 10 | 81 | -5 | 70 | "mouse" |
| 4 | 60 | "F" | 8 | 2 | 11 | 90 | -10 | 60 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
We can also add columns from the second table to the first:
>>> ht1 = ht.annotate(B = ht2[ht.ID].B)
>>> ht1.show(width=120)
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 | B |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | str |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| 1 | 65 | "M" | 5 | 4 | 2 | 50 | 5 | "cat" |
| 2 | 72 | "M" | 6 | 3 | 2 | 61 | 1 | "dog" |
| 3 | 70 | "F" | 7 | 3 | 10 | 81 | -5 | "mouse" |
| 4 | 60 | "F" | 8 | 2 | 11 | 90 | -10 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
Basic Table methods
>>> first3 = ht.take(3)
>>> first3
[Struct(ID=1, HT=65, SEX='M', X=5, Z=4, C1=2, C2=50, C3=5),
Struct(ID=2, HT=72, SEX='M', X=6, Z=3, C1=2, C2=61, C3=1),
Struct(ID=3, HT=70, SEX='F', X=7, Z=3, C1=10, C2=81, C3=-5)]
>>> first3[0].ID
1
>>> first3[0]['ID']
1
Table.head()
Table.describe()
Table.count()
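A few quick usage examples of these methods (a sketch; outputs omitted):
>>> ht.head(2).show()  # display only the first two rows
>>> ht.describe()      # print row, global, and key fields
>>> ht.count()         # number of rows (4 for our example table)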
Export
To export as TSV:
Table.export()
To write to disk in Hail's native file format:
Table.write()
Files written this way can be read back with read_table().
To convert to a pandas DataFrame:
Table.to_pandas()
To convert to a Spark DataFrame (PySpark):
Table.to_spark()
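Putting these together, a minimal sketch (the output paths are just placeholders):
>>> ht.export('output/example.tsv')           # TSV export
>>> ht.write('output/example.ht')             # Hail native format
>>> ht2 = hl.read_table('output/example.ht')  # read it back
>>> df = ht.to_pandas()                       # pandas DataFrame
>>> sdf = ht.to_spark()                       # Spark DataFrame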
MatrixTables
A MatrixTable is a distributed, two-dimensional extension of a Table.
An ordinary Hail Table has only two kinds of fields: row fields and global fields.
A MatrixTable has four:
1. entry fields, indexed by both the row and column fields
2. row fields
3. column fields
4. global fields
The available operations differ accordingly:
a Table has Table.select() and Table.select_globals(),
while a MatrixTable has MatrixTable.select_rows(), MatrixTable.select_cols(), MatrixTable.select_entries(), and MatrixTable.select_globals(),
as well as MatrixTable.rows(), MatrixTable.cols(), and MatrixTable.entries().
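The last three convert each part of a MatrixTable back into an ordinary Table; a minimal sketch:
>>> rows_ht = mt.rows()        # Table of the row fields
>>> cols_ht = mt.cols()        # Table of the column fields
>>> entries_ht = mt.entries()  # Table with one row per (row, column) entry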
MatrixTable keys
To get the keys:
MatrixTable.row_key and MatrixTable.col_key
To re-key:
MatrixTable.key_rows_by() and MatrixTable.key_cols_by()
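A minimal sketch, using the locus and alleles row fields of the datasets in this article:
>>> mt.row_key                               # inspect the current row key
>>> mt.col_key                               # inspect the current column key
>>> mt = mt.key_rows_by('locus', 'alleles')  # re-key the rows (here with the same fields)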
Accessing fields
>>> mt
<hail.matrixtable.MatrixTable at 0x1107e54a8>
>>> mt.locus
<LocusExpression of type locus<GRCh37>>
>>> mt.DP.describe()
--------------------------------------------------------
Type:
int32
--------------------------------------------------------
Source:
<class 'hail.matrixtable.MatrixTable'>
Index:
['row', 'column']
--------------------------------------------------------
Importing a MatrixTable
The general-purpose method is import_matrix_table().
For genetics data, import_vcf(), import_plink(), import_bgen(), and import_gen() are commonly used:
>>> mt = hl.import_vcf('data/sample.vcf.bgz')
>>> mt.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
NEGATIVE_TRAIN_SITE: bool,
AC: array<int32>,
...
DS: bool
}
----------------------------------------
Entry fields:
'GT': call
'AD': array<int32>
'DP': int32
'GQ': int32
'PL': array<int32>
----------------------------------------
Column key:
's': str
Row key:
'locus': locus<GRCh37>
'alleles': array<str>
----------------------------------------
Common operations
Filtering
>>> filt_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> filt_mt = mt.filter_cols(hl.agg.mean(mt.GQ) < 20)
>>> filt_mt = mt.filter_entries(mt.DP < 5)
Adding fields
MatrixTable.annotate_globals()
MatrixTable.annotate_rows()
MatrixTable.annotate_cols()
MatrixTable.annotate_entries()
>>> mt_new = mt.annotate_globals(foo = 5)
>>> print(mt_new.globals.dtype.pretty())
struct {
foo: int32
}
>>> mt_new = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
>>> mt_new = mt.annotate_entries(GT = hl.or_missing(mt.GQ >= 20, mt.GT))
Selecting fields
MatrixTable.select_globals()
MatrixTable.select_rows()
MatrixTable.select_cols()
MatrixTable.select_entries()
>>> mt_new = mt.select_entries('GT')
>>> print(mt_new.entry.dtype.pretty())
struct {
GT: call
}
>>> mt_new = mt.select_rows(**mt['info'])
>>> mt_new = mt.select_rows(AC = mt.info.AC,
... n_filters = hl.len(mt['filters']))
Dropping fields
>>> mt_new = mt.drop('GQ')
Exploding arrays and sets
MatrixTable.explode_rows()
MatrixTable.explode_cols()
For example, to replicate each row:
>>> mt_new = mt.annotate_rows(replicate_num = [1, 2])
>>> mt_new = mt_new.explode_rows(mt_new['replicate_num'])
>>> mt.count_rows()
346
>>> mt_new.count_rows()
692
>>> mt_new.replicate_num.show()
+---------------+------------+---------------+
| locus | alleles | replicate_num |
+---------------+------------+---------------+
| locus<GRCh37> | array<str> | int32 |
+---------------+------------+---------------+
| 20:10019093 | ["A","G"] | 1 |
| 20:10019093 | ["A","G"] | 2 |
| 20:10026348 | ["A","G"] | 1 |
| 20:10026348 | ["A","G"] | 2 |
| 20:10026357 | ["T","C"] | 1 |
| 20:10026357 | ["T","C"] | 2 |
| 20:10030188 | ["T","A"] | 1 |
| 20:10030188 | ["T","A"] | 2 |
| 20:10030452 | ["G","A"] | 1 |
| 20:10030452 | ["G","A"] | 2 |
+---------------+------------+---------------+
showing top 10 rows
Aggregation
MatrixTable.aggregate_rows()
MatrixTable.aggregate_cols()
MatrixTable.aggregate_entries()
>>> mt.aggregate_entries(hl.agg.mean(mt.GQ))
67.73196915777027
>>> mt.aggregate_entries((hl.agg.stats(mt.DP), hl.agg.stats(mt.GQ)))
(Struct(mean=41.83915800445897, stdev=41.93057654787303, min=0.0, max=450.0, n=34537, sum=1444998.9999999995),
Struct(mean=67.73196915777027, stdev=29.80840934057741, min=0.0, max=99.0, n=33720, sum=2283922.0000000135))
Group-by
MatrixTable.group_rows_by()
MatrixTable.group_cols_by()
For example:
>>> mt_ann = mt.annotate_cols(case_status = hl.cond(hl.rand_bool(0.5),
... "CASE",
... "CONTROL"))
>>> mt_grouped = (mt_ann.group_cols_by(mt_ann.case_status)
... .aggregate(gq_stats = hl.agg.stats(mt_ann.GQ)))
>>> print(mt_grouped.entry.dtype.pretty())
struct {
gq_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64,
n: int64,
sum: float64
}
}
>>> print(mt_grouped.col.dtype)
struct{case_status: str}
Joins
MatrixTable.union_cols()
MatrixTable.union_rows()
>>> mt_new = mt.annotate_rows(gnomad_ann = gnomad_data[mt.locus, mt.alleles])
>>> mt_new = mt.annotate_rows(gnomad_af = gnomad_data[mt.locus, mt.alleles]['AF'])
>>> mt_new = mt.annotate_rows(**gnomad_data[mt.locus, mt.alleles])
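union_rows() and union_cols() concatenate two MatrixTables; a sketch, where other_mt is a hypothetical MatrixTable with a compatible schema:
>>> mt_stacked = mt.union_rows(other_mt)   # append the rows of other_mt
>>> mt_widened = mt.union_cols(other_mt)   # append the columns of other_mt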
Other useful methods
MatrixTable.describe()
MatrixTable.head()
MatrixTable.sample()
MatrixTable.count_rows()
MatrixTable.count_cols()
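A few quick usage examples of these methods (a sketch; outputs omitted):
>>> mt.describe()          # print the full schema
>>> first10 = mt.head(10)  # MatrixTable containing only the first 10 rows
>>> n_rows = mt.count_rows()
>>> n_cols = mt.count_cols()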
Export
Files written with MatrixTable.write() can be read back with read_matrix_table().
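A minimal sketch (the output path is just a placeholder):
>>> mt.write('output/example.mt')
>>> mt2 = hl.read_matrix_table('output/example.mt')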