Hail is an open-source, scalable framework for exploring and analyzing genomic data. See https://hail.is/docs/0.2/index.html for more information.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive -c 16 --mem 40g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load hail
[+] Loading hail 0.2.3 on cn3144
[+] Loading singularity on cn3144

[user@cn3144]$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/\
20100804/ALL.2of4intersection.20100804.sites.vcf.gz

[user@cn3144]$ ipython
Python 3.6.7 (default, Oct 25 2018, 09:16:13)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import hail as hl

In [2]: hl.init()
using hail jar at /usr/local/lib/python3.6/dist-packages/hail/hail-all-spark.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.2.2
SparkUI available at http://10.2.9.172:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2-a2eaf89baa0c
LOGGING: writing to /spin1/scratch/teacher/hail-20181129-1008-0.2-a2eaf89baa0c.log

In [4]: hl.import_vcf('ALL.2of4intersection.20100804.sites.vcf.gz', force_bgz=True).write('sample.vds')
[Stage 1:===============================>            (7 + 6) / 13]
2018-11-29 15:10:40 Hail: INFO: Coerced sorted dataset
[Stage 2:===========================================> (10 + 3) / 13]
2018-11-29 15:11:47 Hail: INFO: wrote 25488488 items in 13 partitions to sample.vds

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
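Once the VCF has been written to Hail's native format, it can be read back and inspected in a later session. The following is a minimal sketch (the file name sample.vds comes from the session above; the summary calls are just one way to look at the data):

import hail as hl

hl.init()

# read the dataset written by the import step above
mt = hl.read_matrix_table('sample.vds')

# quick sanity checks: dimensions and schema
print(mt.count())   # (number of variants, number of samples)
mt.describe()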
Run Hail with a Jupyter notebook on a single node:
[user@biowulf]$ sinteractive -c 16 --mem 40g --tunnel
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

Created 1 generic SSH tunnel(s) from this compute node to biowulf for your use
at port numbers defined in the $PORTn ($PORT1, ...) environment variables.

Please create a SSH tunnel from your workstation to these ports on biowulf.
On Linux/MacOS, open a terminal and run:

    ssh -L 33327:localhost:33327 biowulf.nih.gov

For Windows instructions, see https://hpc.nih.gov/docs/tunneling

[user@cn3144]$ module load hail

[user@cn3144]$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/\
20100804/ALL.2of4intersection.20100804.sites.vcf.gz

[user@cn3144]$ jupyter lab --ip localhost --port $PORT1 --no-browser
[I 17:11:40.505 NotebookApp] Serving notebooks from local directory
[I 17:11:40.505 NotebookApp] Jupyter Notebook 6.4.10 is running at:
[I 17:11:40.505 NotebookApp] http://localhost:37859/?token=xxxxxxxx
[I 17:11:40.506 NotebookApp] or http://127.0.0.1:37859/?token=xxxxxxx
[I 17:11:40.506 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 17:11:40.512 NotebookApp]
    To access the notebook, open this file in a browser:
        file:///home/apptest1/.local/share/jupyter/runtime/nbserver-29841-open.html
    Or copy and paste one of these URLs:
        http://localhost:37859/?token=xxxxxxx
     or http://127.0.0.1:37859/?token=xxxxxxx
Then open a browser on your local computer and connect to the Jupyter notebook using one of the URLs (including the token) printed above.
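Inside the notebook, a first cell might look like the sketch below. It simply initializes Hail and imports the VCF downloaded above; force_bgz=True tells Hail to treat the .gz file as block-gzipped.

import hail as hl

hl.init()   # starts the local Spark backend

# import the 1000 Genomes sites VCF downloaded above
mt = hl.import_vcf('ALL.2of4intersection.20100804.sites.vcf.gz', force_bgz=True)
mt.count_rows()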
Create a Python script (e.g. hail-script.py). For example:
#!/usr/bin/env python3
import hail as hl

# simulate a dataset of 500 samples x 500,000 variants from 3 populations
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=500,
                              n_variants=500_000,
                              n_partitions=32)

# add a random binary phenotype to each sample
mt = mt.annotate_cols(drinks_coffee = hl.rand_bool(0.33))

# per-variant linear regression of the phenotype on genotype
gwas = hl.linear_regression_rows(y=mt.drinks_coffee,
                                 x=mt.GT.n_alt_alleles(),
                                 covariates=[1.0])

# print the 25 most significant variants
gwas.order_by(gwas.p_value).show(25)
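If you also want the results on disk rather than just printed, the script could end with something like the following sketch (the output names are placeholders, not part of the example above):

# export the full results table as a block-gzipped TSV
gwas.export('gwas_results.tsv.bgz')

# or keep it in Hail's native Table format for faster re-use
gwas.write('gwas_results.ht', overwrite=True)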
Create a batch input file (e.g. hail.sh). For example:
#!/bin/bash
module load hail
python3 hail-script.py
Submit this job using the Slurm sbatch command.
sbatch -c 16 --mem 40g hail.sh
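Hail runs Spark in local mode on the node. If you find that Spark is not using the CPUs you requested, one possible adjustment (an assumption; the hail module may already configure this for you) is to derive the Spark master string from the Slurm allocation at the top of your script:

import os
import hail as hl

# size the local Spark master to the number of CPUs Slurm allocated to the job
cpus = os.environ.get('SLURM_CPUS_PER_TASK', '1')
hl.init(master=f'local[{cpus}]')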
Create a swarmfile (e.g. hail.swarm). For example:
python3 hail1.py
python3 hail2.py
python3 hail3.py
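The hail1.py, hail2.py, and hail3.py names are placeholders; each would typically be a self-contained Hail script operating on its own input. A hypothetical hail1.py might look like the sketch below (the per-chromosome VCF name is an assumption for illustration):

#!/usr/bin/env python3
import hail as hl

hl.init()

# hypothetical per-task input: one VCF per swarm line
mt = hl.import_vcf('chr1.vcf.gz', force_bgz=True)

# compute basic per-variant QC metrics, then write out in Hail's native format
mt = hl.variant_qc(mt)
mt.write('chr1.mt', overwrite=True)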
Submit this job using the swarm command.
swarm -f hail.swarm -g 30 -t 16 --module hail
where
-g # | Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module hail | Loads the hail module for each subjob in the swarm