High-Performance Computing at the NIH
Shapeit on Biowulf & Helix

SHAPEIT is a fast and accurate method for estimation of haplotypes (aka phasing) from genotype or sequencing data.

SHAPEIT has primarily been developed by Dr Olivier Delaneau through a collaborative project between the research groups of Prof Jean-Francois Zagury at CNAM and Prof Jonathan Marchini at Oxford. Funding for this project has been received from several sources: CNAM, Peptinov, MRC, Leverhulme, and The Wellcome Trust.

SHAPEIT has several notable features:

Running on Helix

$ cd /data/$USER/dir
$ module load shapeit
$ cp /usr/local/apps/shapeit/current/example/* .
$ shapeit --input-bed gwas.bed gwas.bim gwas.fam \
	--input-map genetic_map.txt \
	--output-max gwas.phased.haps gwas.phased.sample \
	--thread 4
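
The Helix command above hard-codes the thread count, while the Biowulf examples on this page read it from $SLURM_CPUS_PER_TASK. A script meant to run in both places could fall back to a fixed value when that variable is unset. The following is a minimal sketch of that idea; the helper name pick_threads is hypothetical, and the final line only prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical helper: use the Slurm allocation when it is set,
# otherwise fall back to the first argument (or 1 if none is given).
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-${1:-1}}"
}

# 4 matches the thread count used in the Helix example.
threads=$(pick_threads 4)

# Print the command that would be run (short options as in the Biowulf examples).
echo "shapeit -B gwas -M genetic_map.txt -O gwas.phased -T $threads"
```

Outside a Slurm job this prints the command with -T 4; inside a job started with --cpus-per-task=8 it prints -T 8.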

Running a single batch job on Biowulf

1. Create a batch script along the following lines:

#!/bin/bash

module load shapeit

# Copy the example data into /data/$USER/example and run from inside it
cd /data/$USER
cp -r /usr/local/apps/shapeit/current/example .
cd example
shapeit -B gwas -M genetic_map.txt -O gwas.phased -T $SLURM_CPUS_PER_TASK

2. On the Biowulf login node, submit the job. The value given to '--cpus-per-task' is automatically assigned to $SLURM_CPUS_PER_TASK in the script:

$ sbatch --cpus-per-task=8 myscript 

If more memory is required (the default here is 8 x 2 = 16 GB), specify --mem=Mg, where M is the amount in GB, for example --mem=30g:

$ sbatch --cpus-per-task=8 --mem=30g myscript
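
The memory arithmetic above can be sketched as a small helper that decides whether --mem is needed at all. This assumes the default allocation is 2 GB per CPU, as the 8 x 2 = 16 GB note suggests; check the current Biowulf defaults before relying on it. The function names default_mem_gb and sbatch_cmd are hypothetical, and the script only prints the sbatch command instead of submitting it.

```shell
#!/bin/bash
# Assumption: 2 GB of memory per allocated CPU, inferred from the note above.
GB_PER_CPU=2

# Default memory (in GB) for a given number of CPUs.
default_mem_gb() {
    echo $(( $1 * GB_PER_CPU ))
}

# Print an sbatch command line, adding --mem only when the requested
# memory exceeds the assumed default for that CPU count.
sbatch_cmd() {
    cpus=$1; need_gb=$2; script=$3
    if [ "$need_gb" -gt "$(default_mem_gb "$cpus")" ]; then
        echo "sbatch --cpus-per-task=$cpus --mem=${need_gb}g $script"
    else
        echo "sbatch --cpus-per-task=$cpus $script"
    fi
}

sbatch_cmd 8 30 myscript
```

With 8 CPUs and a 30 GB requirement this prints the same command as the --mem=30g example; with a requirement of 16 GB or less it omits --mem.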

Running a swarm of batch jobs on Biowulf

1. Create a swarm file along the following lines:

cd /data/$USER/dir1; shapeit -B gwas -M genetic_map.txt -O gwas.phased -T $SLURM_CPUS_PER_TASK
cd /data/$USER/dir2; shapeit -B gwas -M genetic_map.txt -O gwas.phased -T $SLURM_CPUS_PER_TASK
cd /data/$USER/dir3; shapeit -B gwas -M genetic_map.txt -O gwas.phased -T $SLURM_CPUS_PER_TASK
[....]

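2. Submit this swarm with the following command, where -t 8 requests 8 threads and -g 20 requests 20 GB of memory for each command in the swarm file: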

swarm -t 8 -g 20 -f swarmfile --module shapeit

For more information regarding running swarm, see swarm.html
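
When there are many input directories, the swarm file can be generated instead of written by hand. The sketch below is one way to do that; the function name make_swarm_lines is hypothetical, and the directory names are the placeholders used on this page. Single quotes keep $USER and $SLURM_CPUS_PER_TASK literal so that they are expanded when each swarm command runs, not when the file is generated.

```shell
#!/bin/bash
# Hypothetical helper: print one swarm command line per directory argument.
make_swarm_lines() {
    for dir in "$@"; do
        # $SLURM_CPUS_PER_TASK stays literal inside the single-quoted format.
        printf 'cd %s; shapeit -B gwas -M genetic_map.txt -O gwas.phased -T $SLURM_CPUS_PER_TASK\n' "$dir"
    done
}

# Prints to the terminal; append "> swarmfile" to write the swarm file.
make_swarm_lines '/data/$USER/dir1' '/data/$USER/dir2' '/data/$USER/dir3'
```

The output is the three-line swarm file shown in step 1, which can then be submitted with the swarm command above.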

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive --cpus-per-task=8   
salloc.exe: Granted job allocation 16535

cn999$ module load shapeit
cn999$ cd /data/$USER/dir
cn999$ shapeit -B gwas -M genetic_map.txt -O gwas.phased -T $SLURM_CPUS_PER_TASK
[...etc...]

cn999$ exit
exit

biowulf$

Make sure to exit the job once you have finished your run.
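
When debugging interactively, it can help to confirm that the input files are present before starting a run. The sketch below assumes that -B gwas refers to gwas.bed, gwas.bim and gwas.fam, as in the long-form Helix command on this page; the function name missing_inputs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list any missing input file for PLINK prefix $1
# and genetic map $2. Prints nothing when all four files exist.
missing_inputs() {
    prefix=$1; map=$2
    for f in "$prefix.bed" "$prefix.bim" "$prefix.fam" "$map"; do
        [ -f "$f" ] || echo "$f"
    done
}

missing=$(missing_inputs gwas genetic_map.txt)
if [ -n "$missing" ]; then
    echo "Missing input files:" $missing >&2
else
    echo "All inputs found; ready to run shapeit."
fi
```

Run it from the directory that holds the data (for example /data/$USER/dir) before calling shapeit.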

Shapeit benchmark

The following benchmark was done using the example files and command mentioned above.

Threads   Time (sec)
1         727
2         455
4         252
8         188
16        108
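
To help choose a thread count, the timings above can be turned into speedup (time on 1 thread divided by time on N threads) and parallel efficiency (speedup divided by N). The following awk sketch does that arithmetic on the benchmark numbers; the function name speedup_table is hypothetical.

```shell
#!/bin/bash
# Read "threads seconds" pairs on stdin, using the first row as the baseline.
# Print: threads, speedup, parallel efficiency.
speedup_table() {
    awk 'NR == 1 { base = $2 }
         { printf "%d %.2f %.2f\n", $1, base / $2, base / $2 / $1 }'
}

# Benchmark figures from this page.
printf '%s\n' '1 727' '2 455' '4 252' '8 188' '16 108' | speedup_table
```

For these figures the speedups are 1.00, 1.60, 2.88, 3.87 and 6.73, with efficiencies of 1.00, 0.80, 0.72, 0.48 and 0.42, so most of the benefit on this example is reached by about 4 threads.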

Documentation

https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html