Introduction to Hail
Hail is an open-source, general-purpose, Python-based analysis library and toolkit dedicated to processing genomic data.
Hail exists to support multi-dimensional, complex data structures such as those used in genome-wide association studies (GWAS).
Under the hood, Hail is implemented with Python, Scala, Java, and Apache Spark.
Hail is maintained by the Neale lab; questions can be sent to the team at [email protected]
Hail is a fairly young project that keeps improving and adding new features; see the release change log for details.
Who uses Hail
Hail is widely used both in academia and in the genomics industry.
Many academic papers have also been published using Hail.
Related references
Installing Hail
Prerequisites
1. JRE 8 or JDK 1.8. Java 8 is required; newer versions (9+) are not supported, mainly because the bundled Spark requires Java 8.
2. Python 3.6 or 3.7 (Miniconda Python 3.7 is recommended); Python 3.8 is not supported.
3. Whichever installation method you use, the Linux system needs a reasonably recent C and C++ standard library, for example:
GCC 5.0, and LLVM version 3.4 (which is Apple LLVM version 6.0) and later should install a compatible C++ standard library.
Most GNU/Linux distributions released since 2012 have a compatible C standard library
4. If you suspect your system's GCC is not recent enough, update the g++ package and install lz4. I am using a fairly recent CentOS system here, so I did not need to run the following commands:
Debian
sudo apt-get install g++ liblz4-dev
CentOS
sudo yum install gcc-c++ lz4 lz4-devel
Mac OS X
xcode-select --install
brew install lz4
Installing the conda command line
See the related article for details.
Once conda is installed, enter the conda environment.
Assuming conda was installed to /root/miniconda3, there is an activate script inside it:
/root/miniconda3/bin/activate
Run the following commands:
chmod 777 /root/miniconda3/bin/activate
source /root/miniconda3/bin/activate
Install Hail
Run the following commands:
conda create -n hail python'>=3.6,<3.8'
conda activate hail
pip install hail
If the commands succeed, the full output looks like this:
(base) [root@spark-client-test conda]# conda activate hail
(hail) [root@spark-client-test conda]#
(hail) [root@spark-client-test conda]#
(hail) [root@spark-client-test conda]# pip install hail
Collecting hail
Downloading hail-0.2.32-py3-none-any.whl (32.5 MB)
|????????????????????????????????| 32.5 MB 3.7 MB/s
Collecting hurry.filesize==0.9
Downloading hurry.filesize-0.9.tar.gz (2.8 kB)
Collecting requests<2.21.1,>=2.21.0
Downloading requests-2.21.0-py2.py3-none-any.whl (57 kB)
|????????????????????????????????| 57 kB 3.7 MB/s
Collecting asyncinit<0.3,>=0.2.4
Downloading asyncinit-0.2.4-py3-none-any.whl (2.8 kB)
Collecting bokeh<1.3,>1.1
Downloading bokeh-1.2.0.tar.gz (17.6 MB)
|????????????????????????????????| 17.6 MB 3.4 MB/s
Collecting nest-asyncio
Downloading nest_asyncio-1.2.3-py3-none-any.whl (4.5 kB)
Collecting gcsfs==0.2.1
Downloading gcsfs-0.2.1.tar.gz (51 kB)
|????????????????????????????????| 51 kB 393 kB/s
Collecting tabulate==0.8.3
Downloading tabulate-0.8.3.tar.gz (46 kB)
|????????????????????????????????| 46 kB 4.1 MB/s
Collecting python-json-logger==0.1.11
Downloading python-json-logger-0.1.11.tar.gz (6.0 kB)
Collecting pyspark<2.4.2,>=2.4
Downloading pyspark-2.4.1.tar.gz (215.7 MB)
|????????????????????????????????| 215.7 MB 61 kB/s
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)")': /simple/numpy/
Collecting numpy<2
Downloading numpy-1.18.1-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
|????????????????????????????????| 20.1 MB 10.4 MB/s
Collecting parsimonious<0.9
Downloading parsimonious-0.8.1.tar.gz (45 kB)
|????????????????????????????????| 45 kB 5.1 MB/s
Collecting aiohttp<3.7,>=3.6
Downloading aiohttp-3.6.2-cp37-cp37m-manylinux1_x86_64.whl (1.2 MB)
|????????????????????????????????| 1.2 MB 15.0 MB/s
Collecting scipy<1.4,>1.2
Downloading scipy-1.3.3-cp37-cp37m-manylinux1_x86_64.whl (25.2 MB)
|????????????????????????????????| 25.2 MB 8.2 MB/s
Collecting decorator<5
Downloading decorator-4.4.1-py2.py3-none-any.whl (9.2 kB)
Collecting PyJWT
Downloading PyJWT-1.7.1-py2.py3-none-any.whl (18 kB)
Collecting pandas<0.26,>0.24
Downloading pandas-0.25.3-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
|????????????????????????????????| 10.4 MB 13.4 MB/s
Collecting aiohttp-session<2.8,>=2.7
Downloading aiohttp_session-2.7.0-py3-none-any.whl (14 kB)
Requirement already satisfied: setuptools in /root/miniconda3/envs/hail/lib/python3.7/site-packages (from hurry.filesize==0.9->hail) (45.2.0.post20200210)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/envs/hail/lib/python3.7/site-packages (from requests<2.21.1,>=2.21.0->hail) (2019.11.28)
Collecting chardet<3.1.0,>=3.0.2
Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
|????????????????????????????????| 133 kB 21.5 MB/s
Collecting idna<2.9,>=2.5
Downloading idna-2.8-py2.py3-none-any.whl (58 kB)
|????????????????????????????????| 58 kB 10.3 MB/s
Collecting urllib3<1.25,>=1.21.1
Downloading urllib3-1.24.3-py2.py3-none-any.whl (118 kB)
|????????????????????????????????| 118 kB 16.8 MB/s
Collecting six>=1.5.2
Downloading six-1.14.0-py2.py3-none-any.whl (10 kB)
Collecting PyYAML>=3.10
Downloading PyYAML-5.3.tar.gz (268 kB)
|????????????????????????????????| 268 kB 18.5 MB/s
Collecting python-dateutil>=2.1
Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
|????????????????????????????????| 227 kB 11.4 MB/s
Collecting Jinja2>=2.7
Downloading Jinja2-2.11.1-py2.py3-none-any.whl (126 kB)
|????????????????????????????????| 126 kB 11.8 MB/s
Collecting pillow>=4.0
Downloading Pillow-7.0.0-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
|????????????????????????????????| 2.1 MB 10.3 MB/s
Collecting packaging>=16.8
Downloading packaging-20.1-py2.py3-none-any.whl (36 kB)
Collecting tornado>=4.3
Downloading tornado-6.0.3.tar.gz (482 kB)
|????????????????????????????????| 482 kB 13.0 MB/s
Collecting google-auth>=1.2
Downloading google_auth-1.11.1-py2.py3-none-any.whl (76 kB)
|????????????????????????????????| 76 kB 6.7 MB/s
Collecting google-auth-oauthlib
Downloading google_auth_oauthlib-0.4.1-py2.py3-none-any.whl (18 kB)
Collecting py4j==0.10.7
Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)
|????????????????????????????????| 197 kB 13.0 MB/s
Collecting multidict<5.0,>=4.5
Downloading multidict-4.7.4-cp37-cp37m-manylinux1_x86_64.whl (149 kB)
|????????????????????????????????| 149 kB 13.0 MB/s
Collecting yarl<2.0,>=1.0
Downloading yarl-1.4.2-cp37-cp37m-manylinux1_x86_64.whl (256 kB)
|????????????????????????????????| 256 kB 9.8 MB/s
Collecting async-timeout<4.0,>=3.0
Downloading async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
Collecting attrs>=17.3.0
Downloading attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Collecting pytz>=2017.2
Downloading pytz-2019.3-py2.py3-none-any.whl (509 kB)
|????????????????????????????????| 509 kB 11.3 MB/s
Collecting MarkupSafe>=0.23
Downloading MarkupSafe-1.1.1-cp37-cp37m-manylinux1_x86_64.whl (27 kB)
Collecting pyparsing>=2.0.2
Downloading pyparsing-2.4.6-py2.py3-none-any.whl (67 kB)
|????????????????????????????????| 67 kB 8.1 MB/s
Collecting cachetools<5.0,>=2.0.0
Downloading cachetools-4.0.0-py3-none-any.whl (10 kB)
Collecting pyasn1-modules>=0.2.1
Downloading pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
|????????????????????????????????| 155 kB 13.0 MB/s
Collecting rsa<4.1,>=3.1.4
Downloading rsa-4.0-py2.py3-none-any.whl (38 kB)
Collecting requests-oauthlib>=0.7.0
Downloading requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting pyasn1<0.5.0,>=0.4.6
Downloading pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
|????????????????????????????????| 77 kB 7.3 MB/s
Collecting oauthlib>=3.0.0
Downloading oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
|????????????????????????????????| 147 kB 11.5 MB/s
Building wheels for collected packages: hurry.filesize, bokeh, gcsfs, tabulate, python-json-logger, pyspark, parsimonious, PyYAML, tornado
Building wheel for hurry.filesize (setup.py) ... done
Created wheel for hurry.filesize: filename=hurry.filesize-0.9-py3-none-any.whl size=4132 sha256=a6b6a98980ee1a921fc77fe3c6d76c26b53408afef398beb000dca534e5cc3cd
Stored in directory: /root/.cache/pip/wheels/2c/99/7f/8c88c372b4bd642a731232e63cb89467554f6cea7708574e49
Building wheel for bokeh (setup.py) ... done
Created wheel for bokeh: filename=bokeh-1.2.0-py3-none-any.whl size=8233357 sha256=fadb5f1812eb81dea3d26393057a6ab08f0b269bea4a021a5b280ecee8e25bda
Stored in directory: /root/.cache/pip/wheels/e9/2a/b6/0ffbf2eac54e489186db36a12cd6a2cebee9e6aba3dc8bff89
Building wheel for gcsfs (setup.py) ... done
Created wheel for gcsfs: filename=gcsfs-0.2.1-py3-none-any.whl size=27182 sha256=6f9b48133558c2bd1f06e85b8bca22db450be12cd509046648d904ccde891a9d
Stored in directory: /root/.cache/pip/wheels/87/f6/97/5a407bea9c29e37733a1db25076fb17ce4b6d863034f52f898
Building wheel for tabulate (setup.py) ... done
Created wheel for tabulate: filename=tabulate-0.8.3-py3-none-any.whl size=23378 sha256=01710251a1960d89f05787fdd3e4fd02afa7f15f55a9c7ad45468d4ae04c20a6
Stored in directory: /root/.cache/pip/wheels/b8/a2/a6/812a8a9735b090913e109133c7c20aaca4cf07e8e18837714f
Building wheel for python-json-logger (setup.py) ... done
Created wheel for python-json-logger: filename=python_json_logger-0.1.11-py2.py3-none-any.whl size=5075 sha256=f30f8962b78fdd72c7027842d3138bd3eb5750e9e5beae03456208d45709fc54
Stored in directory: /root/.cache/pip/wheels/fa/7f/fd/92ccdbb9d1a65486406e0363d2ba5b4ce52f400a915f602ecb
Building wheel for pyspark (setup.py) ... done
Created wheel for pyspark: filename=pyspark-2.4.1-py2.py3-none-any.whl size=216054604 sha256=240829ce909e1c8e8c57a2483e638733b5ca755bd3ae296023d74882d701bb63
Stored in directory: /root/.cache/pip/wheels/7e/9e/dd/d9be11f6d64e6d1d6c105e7e53e808f6d38788f0903e2bc72f
Building wheel for parsimonious (setup.py) ... done
Created wheel for parsimonious: filename=parsimonious-0.8.1-py3-none-any.whl size=42709 sha256=2b6693e8021d511f95f950bbdf4dddcf2f9e90462dacbf35e1837358e06b8836
Stored in directory: /root/.cache/pip/wheels/88/5d/ba/f27d8af07306b65ee44f9d3f9cadea1db749a421a6db8a99bf
Building wheel for PyYAML (setup.py) ... done
Created wheel for PyYAML: filename=PyYAML-5.3-cp37-cp37m-linux_x86_64.whl size=44228 sha256=f9136de58f310b1005c22edd7196dbf3a5590cd11af8a78d457f7390d59bbf67
Stored in directory: /root/.cache/pip/wheels/8a/55/a4/c0a81d27c33462cfdcb904db018f5550197e88b2b6b85beed2
Building wheel for tornado (setup.py) ... done
Created wheel for tornado: filename=tornado-6.0.3-cp37-cp37m-linux_x86_64.whl size=418877 sha256=2294c7dd506acb7cd16a3a81c0690de38155a36aa913fe23066602f18332d24d
Stored in directory: /root/.cache/pip/wheels/d0/31/2c/9406ed59f0dcdce0c453a8664124d738551590e74fc087f604
Successfully built hurry.filesize bokeh gcsfs tabulate python-json-logger pyspark parsimonious PyYAML tornado
Installing collected packages: hurry.filesize, chardet, idna, urllib3, requests, asyncinit, six, PyYAML, python-dateutil, MarkupSafe, Jinja2, numpy, pillow, pyparsing, packaging, tornado, bokeh, nest-asyncio, cachetools, pyasn1, pyasn1-modules, rsa, google-auth, oauthlib, requests-oauthlib, google-auth-oauthlib, decorator, gcsfs, tabulate, python-json-logger, py4j, pyspark, parsimonious, multidict, yarl, async-timeout, attrs, aiohttp, scipy, PyJWT, pytz, pandas, aiohttp-session, hail
Successfully installed Jinja2-2.11.1 MarkupSafe-1.1.1 PyJWT-1.7.1 PyYAML-5.3 aiohttp-3.6.2 aiohttp-session-2.7.0 async-timeout-3.0.1 asyncinit-0.2.4 attrs-19.3.0 bokeh-1.2.0 cachetools-4.0.0 chardet-3.0.4 decorator-4.4.1 gcsfs-0.2.1 google-auth-1.11.1 google-auth-oauthlib-0.4.1 hail-0.2.32 hurry.filesize-0.9 idna-2.8 multidict-4.7.4 nest-asyncio-1.2.3 numpy-1.18.1 oauthlib-3.1.0 packaging-20.1 pandas-0.25.3 parsimonious-0.8.1 pillow-7.0.0 py4j-0.10.7 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-2.4.6 pyspark-2.4.1 python-dateutil-2.8.1 python-json-logger-0.1.11 pytz-2019.3 requests-2.21.0 requests-oauthlib-1.3.0 rsa-4.0 scipy-1.3.3 six-1.14.0 tabulate-0.8.3 tornado-6.0.3 urllib3-1.24.3 yarl-1.4.2
(hail) [root@spark-client-test conda]#
If pip reports that a package cannot be found or a version does not match, try upgrading pip as suggested:
pip install --upgrade pip
and then re-run:
pip install hail
As the installation output shows, Hail bundles its own PySpark.
Starting and testing Hail
After a successful installation you can try running Hail.
Use the following commands; first enter the Python environment:
conda activate hail
python
import hail as hl
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt.count()
A successful run produces output like this:
(hail) [root@spark-client-test conda]# python
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
Initializing Spark and Hail with default parameters...
20/02/14 10:45:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.4.4
SparkUI available at http://spark-client-test:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.32-a5876a0a2853
LOGGING: writing to /spark/spark-2.4.4-bin-hadoop2.7/conda/hail-20200214-1045-0.2.32-a5876a0a2853.log
2020-02-14 10:45:13 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 100 variants...
>>> mt.count()
(100, 50)
About Spark
By default Hail uses local Spark, i.e. Spark in local mode.
If you need a remote Spark cluster instead, you can point Hail at it via the PYSPARK_SUBMIT_ARGS environment variable,
for example:
export PYSPARK_SUBMIT_ARGS="--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT} --deploy-mode client --jars /usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar,/spark/jars/hadoop-aws-2.7.6.jar,/spark/jars/aws-java-sdk-1.7.4.jar,/spark/jars/mongo-java-driver-3.4.2.jar,/spark/jars/mongo-spark-connector_2.11-2.2.0.jar --conf spark.driver.extraClassPath=/usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.extraClassPath=/usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.instances=3 --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator --conf spark.kubernetes.container.image=123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/spark-py:vs245py37 --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --conf spark.hadoop.fs.s3a.fast.upload=true --conf spark.hadoop.fs.s3a.endpoint=https://s3.cn-northwest-1.amazonaws.com.cn --conf spark.hadoop.fs.s3a.access.key=123 --conf spark.hadoop.fs.s3a.secret.key=456 --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4 --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4
--conf spark.kubernetes.namespace=spark
--conf spark.driver.host=`hostname -i`
pyspark-shell
"
Alternatively, pass the parameters to spark-submit as follows:
HAIL_HOME=$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')
spark-submit \
--jars $HAIL_HOME/hail-all-spark.jar \
--conf spark.driver.extraClassPath=$HAIL_HOME/hail-all-spark.jar \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
your-hail-python-script-here.py
Increasing Hail's memory
In a Linux shell, run:
export PYSPARK_SUBMIT_ARGS="--driver-memory <YOUR_MEMORY_HERE>G pyspark-shell"
Adding information from the Annotation Database
The Annotation Database is an annotation database designed to plug into Hail analysis pipelines. The following commands add annotations from this database to the rows of our MatrixTable mt.
Use the commands:
conda activate hail
python
import hail as hl
db = hl.experimental.DB()
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt = db.annotate_rows_db(mt)
mt.show()
Datasets
Hail ships with a number of built-in genomic datasets, which can be loaded with load_dataset().
In Python:
>>> # Load 1000 Genomes MatrixTable with GRCh38 coordinates
>>> mt_1kg = hl.experimental.load_dataset(name='1000_genomes', version='phase3', reference_genome='GRCh38')
This includes the 1000 Genomes dataset (1000_genomes).
For more datasets, see Datasets.
Types
Normally we do not need to worry about data types, because Hail infers them automatically from the values.
A few points are worth noting, though (see the sketch after this list):
1. When importing a table with import_table(), if the type Hail infers for a field is not what we want, we can specify the type explicitly.
2. literal() is used to convert a Python value into a Hail expression.
3. Missing values of a given type are declared with null().
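A minimal sketch of these three points (the file path and column name here are just placeholders):
>>> # 1. force column C1 to be read as int32 instead of the default type
>>> ht = hl.import_table('data/example.tsv', types={'C1': hl.tint32})
>>> # 2. wrap a Python list as a Hail expression with literal()
>>> arr = hl.literal([1, 2, 3])
>>> # 3. declare a missing value of type float64 with null()
>>> f = hl.null('float64')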
The following walks through some type-related commands:
conda activate hail
python
import hail as hl
hl.int32(3).dtype
The output is:
dtype('int32')
Full output:
(base) [root@spark-client-test spark]# conda activate hail
(hail) [root@spark-client-test spark]# python
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> hl.int32(3).dtype
dtype('int32')
>>>
When we construct a value, its data type is also displayed:
>>> hl.int32(3)
<Int32Expression of type int32>
>>> hl.struct(name='Hail', dob=2015)
<StructExpression of type struct{name: str, dob: int32}>
>>> hl.struct(name='Hail', dob=2015).dtype
dtype('struct{name: str, dob: int32}')
Note that, unlike Python,
Hail's sets, arrays, dicts, and tuples can only hold values of a single type.
Python allows mixed containers such as ['1', 2, 3.0], holding both numbers and strings;
Hail does not.
In a dict, the key type and the value type may differ, but all keys must share one type and all values must share one type.
For example:
{'a': 1, 'b': 2}
has type
dict<str, int32>
Hail also supports a type called tstruct, which can hold a combination of different types, for example:
hl.struct(name='Hail', dob=2015)
which has data type
dtype('struct{name: str, dob: int32}')
where name is a string and dob is an integer.
Structs are very common in Hail, for example:
>>> new_table = table1.annotate(table2_fields = table2[table1.key])
This adds a field named table2_fields to table1
and assigns the resulting table to new_table.
The table2_fields field is a struct containing the nested fields from table2.
For more on types, see Types.
Expressions
A Hail expression is a concise representation of a computation over data.
In Hail, every data type has its own expression class.
For example, Int32Expression represents a 32-bit integer,
and BooleanExpression represents a boolean value of True or False:
>>> hl.int32(5)
<Int32Expression of type int32>
>>> hl.bool(True)
<BooleanExpression of type bool>
Expressions of the same type can be added, subtracted, multiplied, and divided:
>>> hl.int32(5) + hl.int32(6)
<Int32Expression of type int32>
Adding two Int32Expressions produces a new Int32Expression object.
Expressions are lazy: their values are only computed when they are actually used.
This is easiest to understand by comparing Python with Hail expressions, as follows.
In Python, the expression 5 + 6 is evaluated immediately; when we print the result we see the computed value:
>>> x = 5
>>> y = 6
>>> z = x + y
>>> z
11
Hail does not evaluate 5 + 6 immediately; it stores the expression and only computes it when it is used:
>>> x = hl.int32(5)
>>> y = hl.int32(6)
>>> z = x + y
>>> z
<Int32Expression of type int32>
>>> hl.eval(z)
11
>>> z.show()
+--------+
| <expr> |
+--------+
| int32 |
+--------+
| 11 |
+--------+
Hail expressions are used throughout Hail Tables and MatrixTables, for example for row and column computations and for adding new fields:
>>> ht2 = ht.annotate(C4 = ht.C3 + 3 * ht.C2 ** 2)
The code above adds a new field C4, computed from the C3 and C2 fields of the table ht.
Hail expressions cannot be used with numpy or scipy.
Hail expressions also support arrays and other container types:
>>> a = hl.array([1, 2, -3, 0, 5])
>>> a
<ArrayNumericExpression of type array<int32>>
>>> a.dtype
dtype('array<int32>')
>>> a[1]
<Int32Expression of type int32>
>>> a[1:-1]
<ArrayNumericExpression of type array<int32>>
>>> hl.literal([0,1,2])
<ArrayNumericExpression of type array<int32>>
Logical operators
Hail's logical operators differ from Python's:
Python uses and, or, not,
while Hail uses &, |, ~.
For example:
>>> s1 = hl.int32(3) == 4
>>> s2 = hl.int32(3) != 4
>>> s1 & s2
<BooleanExpression of type bool>
>>> s1 | s2
<BooleanExpression of type bool>
>>> ~s1
<BooleanExpression of type bool>
>>> hl.eval(~s1)
True
Conditionals
Python's if and else cannot be used inside Hail expressions;
Hail provides cond(), case(), and switch() instead.
For example, the Python code
if x > 0:
    return 1
else:
    return 0
needs to be rewritten as:
>>> hl.cond(x > 0, 1, 0)
<Int32Expression of type int32>
This expression can also be used in further computations:
>>> a + hl.cond(x > 0, 1, 0)
<ArrayNumericExpression of type array<int32>>
Using case()
To implement
return 1 if x < -1, 2 if -1 <= x <= 2, and 3 if x > 2:
>>> (hl.case()
... .when(x < -1, 1)
... .when((x >= -1) & (x <= 2), 2)
... .when(x > 2, 3)
... .or_missing())
<Int32Expression of type int32>
or_missing() means that if none of the conditions match, a missing value is returned.
Using switch()
>>> csq = hl.str('nonsense')
>>> (hl.switch(csq)
... .when("synonymous", False)
... .when("intron", False)
... .when("nonsense", True)
... .when("indel", True)
... .or_missing())
<BooleanExpression of type bool>
Missingness
To declare a missing (null) expression:
>>> hl.null('float64')
<Float64Expression of type float64>
>>> hl.cond(x > 2.0, x, hl.null(hl.tfloat))
<Float64Expression of type float64>
>>> cnull = hl.null('call')
>>> hl.eval(cnull.is_het())
None
See the documentation for more usage.
Tables
A Hail Table is the analogue of a SQL table, a pandas DataFrame, an R data.frame, a dplyr tibble, or a Spark DataFrame.
Importing
Hail can import tables from various data sources; the most common are TSV and CSV files.
For example:
>>> ht = hl.import_table("data/kt_example1.tsv", impute=True)
Many genetics-specific import methods also return Tables, such as import_locus_intervals(), import_fam(), and import_bed().
The table contents can be displayed as follows:
>>> ht.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| 1 | 65 | "M" | 5 | 4 | 2 | 50 | 5 |
| 2 | 72 | "M" | 6 | 3 | 2 | 61 | 1 |
| 3 | 70 | "F" | 7 | 3 | 10 | 81 | -5 |
| 4 | 60 | "F" | 8 | 2 | 11 | 90 | -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
Global fields
In addition to rows and columns like an ordinary table, a Hail Table also has global fields. A global field behaves like a column whose value is identical in every row; for example, G below is a global field whose value is 5:
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 | G |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| 1 | 65 | M | 5 | 4 | 2 | 50 | 5 | 5 |
| 2 | 72 | M | 6 | 3 | 2 | 61 | 1 | 5 |
| 3 | 70 | F | 7 | 3 | 10 | 81 | -5 | 5 |
| 4 | 60 | F | 8 | 2 | 11 | 90 | -10 | 5 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
However, the value 5 is stored only once rather than once per row.
The fields can be inspected with describe():
>>> ht.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'ID': int32
'HT': int32
'SEX': str
'X': int32
'Z': int32
'C1': int32
'C2': int32
'C3': int32
----------------------------------------
Key:
None
----------------------------------------
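A global field like the G column shown above can be added with Table.annotate_globals(); a minimal sketch:
>>> ht_g = ht.annotate_globals(G = 5)
>>> ht_g.describe()   # G now appears under "Global fields"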
Keys
Row keys are defined with Table.key_by(); they become important when joining tables. For example:
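A minimal sketch, keying our example table by its ID column (the same step appears again in the join section below):
>>> ht = ht.key_by('ID')
>>> ht.key
<StructExpression of type struct{ID: int32}>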
Accessing fields
>>> ht
<hail.table.Table at 0x110791a20>
>>> ht.ID
<Int32Expression of type int32>
>>> ht.G
>>> ht['G']
Updating fields
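An existing non-key field is updated simply by annotating it again under the same name (the annotate_entries example with GT later in this article does the same); a minimal sketch, where the +1 is just an illustration:
>>> ht_new = ht.annotate(HT = ht.HT + 1)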
Dropping fields
>>> ht.drop('C1', 'C2')
>>> ht.drop(*['C1', 'C2'])
Selecting fields
>>> ht.select(ht.ID, ht.SEX)
>>> ht.select(*['ID', 'C3'])
Filtering rows and adding fields
To keep matching rows:
>>> table_result = table1.filter(table1.C1 == 5)
To remove matching rows:
>>> table_result = table1.filter(table1.C1 == 10, keep=False)
To add a new field:
>>> ht_new = ht.annotate(id_times_2 = ht.ID * 2)
Aggregation
Hail provides a number of methods for computing aggregate values over a table:
>>> ht.aggregate(hl.agg.fraction(ht.SEX == 'F'))
0.5
>>> ht_agg = (ht.group_by(ht.SEX)
... .aggregate(mean = hl.agg.mean(ht.HT)))
>>> ht_agg.show()
+-----+----------+
| SEX | mean |
+-----+----------+
| str | float64 |
+-----+----------+
| "F" | 6.50e+01 |
| "M" | 6.85e+01 |
+-----+----------+
Joins
Two tables can be joined, but their keys must be set first.
For example:
>>> ht = ht.key_by('ID')
>>> ht2 = hl.import_table("data/kt_example2.tsv", impute=True).key_by('ID')
>>> ht_join = ht.join(ht2)
>>> ht_join.show(width=120)
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 | A | B |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 | str |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| 1 | 65 | "M" | 5 | 4 | 2 | 50 | 5 | 65 | "cat" |
| 2 | 72 | "M" | 6 | 3 | 2 | 61 | 1 | 72 | "dog" |
| 3 | 70 | "F" | 7 | 3 | 10 | 81 | -5 | 70 | "mouse" |
| 4 | 60 | "F" | 8 | 2 | 11 | 90 | -10 | 60 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
We can also add columns from the second table to the first:
>>> ht1 = ht.annotate(B = ht2[ht.ID].B)
>>> ht1.show(width=120)
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| ID | HT | SEX | X | Z | C1 | C2 | C3 | B |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | str |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| 1 | 65 | "M" | 5 | 4 | 2 | 50 | 5 | "cat" |
| 2 | 72 | "M" | 6 | 3 | 2 | 61 | 1 | "dog" |
| 3 | 70 | "F" | 7 | 3 | 10 | 81 | -5 | "mouse" |
| 4 | 60 | "F" | 8 | 2 | 11 | 90 | -10 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
Basic Table methods
>>> first3 = ht.take(3)
>>> first3
[Struct(ID=1, HT=65, SEX='M', X=5, Z=4, C1=2, C2=50, C3=5),
Struct(ID=2, HT=72, SEX='M', X=6, Z=3, C1=2, C2=61, C3=1),
Struct(ID=3, HT=70, SEX='F', X=7, Z=3, C1=10, C2=81, C3=-5)]
>>> first3[0].ID
1
>>> first3[0]['ID']
1
Table.head()
Table.describe()
Table.count()
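A few quick usage examples of these methods (a sketch; outputs omitted):
>>> ht.head(2).show()  # display only the first two rows
>>> ht.describe()      # print row, global, and key fields
>>> ht.count()         # number of rows (4 for our example table)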
Export
To export as TSV:
Table.export()
To write to disk in Hail's native file format:
Table.write()
Files written this way can be read back with read_table().
To convert to a pandas DataFrame:
Table.to_pandas()
To convert to a Spark DataFrame (PySpark):
Table.to_spark()
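Putting these together, a minimal sketch (the output paths are just placeholders):
>>> ht.export('output/example.tsv')           # TSV export
>>> ht.write('output/example.ht')             # Hail native format
>>> ht2 = hl.read_table('output/example.ht')  # read it back
>>> df = ht.to_pandas()                       # pandas DataFrame
>>> sdf = ht.to_spark()                       # Spark DataFrame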
MatrixTables
A MatrixTable is a distributed, two-dimensional extension of a Table.
An ordinary Hail Table has only two kinds of fields: row fields and global fields.
A MatrixTable has four:
1. entry fields, indexed by both the row and column fields
2. row fields
3. column fields
4. global fields
The available operations differ accordingly:
a Table has Table.select() and Table.select_globals(),
while a MatrixTable has MatrixTable.select_rows(), MatrixTable.select_cols(), MatrixTable.select_entries(), and MatrixTable.select_globals(),
as well as MatrixTable.rows(), MatrixTable.cols(), and MatrixTable.entries().
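The last three convert each part of a MatrixTable back into an ordinary Table; a minimal sketch:
>>> rows_ht = mt.rows()        # Table of the row fields
>>> cols_ht = mt.cols()        # Table of the column fields
>>> entries_ht = mt.entries()  # Table with one row per (row, column) entry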
MatrixTable keys
To get the keys:
MatrixTable.row_key and MatrixTable.col_key
To re-key:
MatrixTable.key_rows_by() and MatrixTable.key_cols_by()
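A minimal sketch, using the locus and alleles row fields of the datasets in this article:
>>> mt.row_key                               # inspect the current row key
>>> mt.col_key                               # inspect the current column key
>>> mt = mt.key_rows_by('locus', 'alleles')  # re-key the rows (here with the same fields)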
Accessing fields
>>> mt
<hail.matrixtable.MatrixTable at 0x1107e54a8>
>>> mt.locus
<LocusExpression of type locus<GRCh37>>
>>> mt.DP.describe()
--------------------------------------------------------
Type:
int32
--------------------------------------------------------
Source:
<class 'hail.matrixtable.MatrixTable'>
Index:
['row', 'column']
--------------------------------------------------------
Importing a MatrixTable
The general-purpose method is import_matrix_table().
For genetics data, import_vcf(), import_plink(), import_bgen(), and import_gen() are commonly used:
>>> mt = hl.import_vcf('data/sample.vcf.bgz')
>>> mt.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
NEGATIVE_TRAIN_SITE: bool,
AC: array<int32>,
...
DS: bool
}
----------------------------------------
Entry fields:
'GT': call
'AD': array<int32>
'DP': int32
'GQ': int32
'PL': array<int32>
----------------------------------------
Column key:
's': str
Row key:
'locus': locus<GRCh37>
'alleles': array<str>
----------------------------------------
Common operations
Filtering
>>> filt_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> filt_mt = mt.filter_cols(hl.agg.mean(mt.GQ) < 20)
>>> filt_mt = mt.filter_entries(mt.DP < 5)
Adding fields
MatrixTable.annotate_globals()
MatrixTable.annotate_rows()
MatrixTable.annotate_cols()
MatrixTable.annotate_entries()
>>> mt_new = mt.annotate_globals(foo = 5)
>>> print(mt_new.globals.dtype.pretty())
struct {
foo: int32
}
>>> mt_new = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
>>> mt_new = mt.annotate_entries(GT = hl.or_missing(mt.GQ >= 20, mt.GT))
Selecting fields
MatrixTable.select_globals()
MatrixTable.select_rows()
MatrixTable.select_cols()
MatrixTable.select_entries()
>>> mt_new = mt.select_entries('GT')
>>> print(mt_new.entry.dtype.pretty())
struct {
GT: call
}
>>> mt_new = mt.select_rows(**mt['info'])
>>> mt_new = mt.select_rows(AC = mt.info.AC,
... n_filters = hl.len(mt['filters']))
Dropping fields
>>> mt_new = mt.drop('GQ')
Exploding arrays and sets
MatrixTable.explode_rows()
MatrixTable.explode_cols()
For example, to replicate each row:
>>> mt_new = mt.annotate_rows(replicate_num = [1, 2])
>>> mt_new = mt_new.explode_rows(mt_new['replicate_num'])
>>> mt.count_rows()
346
>>> mt_new.count_rows()
692
>>> mt_new.replicate_num.show()
+---------------+------------+---------------+
| locus | alleles | replicate_num |
+---------------+------------+---------------+
| locus<GRCh37> | array<str> | int32 |
+---------------+------------+---------------+
| 20:10019093 | ["A","G"] | 1 |
| 20:10019093 | ["A","G"] | 2 |
| 20:10026348 | ["A","G"] | 1 |
| 20:10026348 | ["A","G"] | 2 |
| 20:10026357 | ["T","C"] | 1 |
| 20:10026357 | ["T","C"] | 2 |
| 20:10030188 | ["T","A"] | 1 |
| 20:10030188 | ["T","A"] | 2 |
| 20:10030452 | ["G","A"] | 1 |
| 20:10030452 | ["G","A"] | 2 |
+---------------+------------+---------------+
showing top 10 rows
Aggregation
MatrixTable.aggregate_rows()
MatrixTable.aggregate_cols()
MatrixTable.aggregate_entries()
>>> mt.aggregate_entries(hl.agg.mean(mt.GQ))
67.73196915777027
>>> mt.aggregate_entries((hl.agg.stats(mt.DP), hl.agg.stats(mt.GQ)))
(Struct(mean=41.83915800445897, stdev=41.93057654787303, min=0.0, max=450.0, n=34537, sum=1444998.9999999995),
Struct(mean=67.73196915777027, stdev=29.80840934057741, min=0.0, max=99.0, n=33720, sum=2283922.0000000135))
Group-by
MatrixTable.group_rows_by()
MatrixTable.group_cols_by()
For example:
>>> mt_ann = mt.annotate_cols(case_status = hl.cond(hl.rand_bool(0.5),
... "CASE",
... "CONTROL"))
>>> mt_grouped = (mt_ann.group_cols_by(mt_ann.case_status)
... .aggregate(gq_stats = hl.agg.stats(mt_ann.GQ)))
>>> print(mt_grouped.entry.dtype.pretty())
struct {
gq_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64,
n: int64,
sum: float64
}
}
>>> print(mt_grouped.col.dtype)
struct{case_status: str}
Joins
MatrixTable.union_cols()
MatrixTable.union_rows()
>>> mt_new = mt.annotate_rows(gnomad_ann = gnomad_data[mt.locus, mt.alleles])
>>> mt_new = mt.annotate_rows(gnomad_af = gnomad_data[mt.locus, mt.alleles]['AF'])
>>> mt_new = mt.annotate_rows(**gnomad_data[mt.locus, mt.alleles])
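union_rows() and union_cols() concatenate two MatrixTables; a sketch, where other_mt is a hypothetical MatrixTable with a compatible schema:
>>> mt_stacked = mt.union_rows(other_mt)   # append the rows of other_mt
>>> mt_widened = mt.union_cols(other_mt)   # append the columns of other_mt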
Other useful methods
MatrixTable.describe()
MatrixTable.head()
MatrixTable.sample()
MatrixTable.count_rows()
MatrixTable.count_cols()
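A few quick usage examples of these methods (a sketch; outputs omitted):
>>> mt.describe()          # print the full schema
>>> first10 = mt.head(10)  # MatrixTable containing only the first 10 rows
>>> n_rows = mt.count_rows()
>>> n_cols = mt.count_cols()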
Export
Files written with MatrixTable.write() can be read back with read_matrix_table().
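A minimal sketch (the output path is just a placeholder):
>>> mt.write('output/example.mt')
>>> mt2 = hl.read_matrix_table('output/example.mt')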