Biowulf High Performance Computing at the NIH
GLNEXUS on Biowulf

Scalable gVCF merging and joint variant calling for population sequencing projects.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ wget
[user@cn3144 ~]$ tar -xvf dv_platinum6_chr21_gvcf.tar
[user@cn3144 ~]$ rm dv_platinum6_chr21_gvcf.tar
[user@cn3144 ~]$ ls
[user@cn3144 ~]$ module load glnexus
[user@cn3144 ~]$ echo -e "chr21\t0\t48129895" > hg19_chr21.bed
[user@cn3144 ~]$ glnexus --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
[31659] [2019-11-20 13:26:20.979] [GLnexus] [info] glnexus_cli v1.1.11-0-gf4ce0ff, built Mon Oct  7 17:10:41 2019
[31659] [2019-11-20 13:26:20.981] [GLnexus] [warning] jemalloc absent, which will impede performance with high thread counts. See
[31659] [2019-11-20 13:26:20.981] [GLnexus] [info] Loading config preset DeepVariant
[31659] [2019-11-20 13:26:20.986] [GLnexus] [info] config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 0
  min_AQ2: 0
  min_GQ: 0
  max_alleles_per_site: 0
  monoallelic_sites_for_lost_alleles: true
  preference: common
  revise_genotypes: false
  min_assumed_allele_frequency: 0.0001
  required_dp: 0
  allow_partial_data: false
  allele_dp_format: AD
  ref_dp_format: MIN_DP
  output_residuals: false
  squeeze: false
  output_format: BCF
    - {orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [AD], name: AD, description: "##FORMAT=", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}
    - {orig_names: [GQ], name: GQ, description: "##FORMAT=", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [PL], name: PL, description: "##FORMAT=", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}
[31659] [2019-11-20 13:26:20.987] [GLnexus] [info] config CRC32C = 754566310
[31659] [2019-11-20 13:26:20.987] [GLnexus] [info] init database, exemplar_vcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] Initialized GLnexus database in GLnexus.DB
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] bucket size: 30000
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] contigs: chrM chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chr1_gl000191_random chr1_gl000192_random chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random chr8_gl000197_random chr9_gl000198_random chr9_gl000199_random chr9_gl000200_random chr9_gl000201_random chr11_gl000202_random chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random chrUn_gl000211 chrUn_gl000212 chrUn_gl000213 chrUn_gl000214 chrUn_gl000215 chrUn_gl000216 chrUn_gl000217 chrUn_gl000218 chrUn_gl000219 chrUn_gl000220 chrUn_gl000221 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224 chrUn_gl000225 chrUn_gl000226 chrUn_gl000227 chrUn_gl000228 chrUn_gl000229 chrUn_gl000230 chrUn_gl000231 chrUn_gl000232 chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236 chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
[31659] [2019-11-20 13:26:21.155] [GLnexus] [info] db_get_contigs GLnexus.DB
[31659] [2019-11-20 13:26:21.218] [GLnexus] [info] Beginning bulk load with no range filter.
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Loaded 6 datasets with 6 samples; 239726128 bytes in 2572592 BCF records (10 duplicate) in 7062 buckets. Bucket max 551480 bytes, 5645 records. 0 BCF records skipped due to caller-specific exceptions
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Created sample set *@6
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Flushing and compacting database...
[31659] [2019-11-20 13:26:29.064] [GLnexus] [info] Bulk load complete!
[31659] [2019-11-20 13:26:29.089] [GLnexus] [info] found sample set *@6
[31659] [2019-11-20 13:26:29.089] [GLnexus] [info] discovering alleles in 1 range(s)
[31659] [2019-11-20 13:26:35.132] [GLnexus] [info] discovered 258742 alleles
[31659] [2019-11-20 13:26:35.964] [GLnexus] [info] unified to 124038 sites cleanly with 130071 ALT alleles. 140 ALT alleles were additionally included in monoallelic sites and 0 were filtered out on quality thresholds.
[31659] [2019-11-20 13:26:35.964] [GLnexus] [info] Lifting over 4 fields
[31659] [2019-11-20 13:26:35.987] [GLnexus] [info] found sample set *@6
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] genotyping complete!
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] worker threads were cumulatively stalled for 343210ms
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] Num BCF records read 4821531  query hits 771764
[user@cn3144 ~]$ ls
dv_platinum6_chr21.bcf	dv_platinum6_chr21_gvcf  GLnexus.DB  hg19_chr21.bed
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
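
The session above writes the BED line for chr21 by hand. Where a FASTA index (.fai) for the reference is available, the same three-column line can be derived from it. This is a minimal sketch: the function name contig_bed and the index path in the usage line are illustrative assumptions, not part of the Biowulf installation.

```shell
#!/bin/bash
# Write a three-column BED line (contig, 0, length) for one contig,
# reading the length from a FASTA index (.fai: name<TAB>length<TAB>...).
# The function name is a hypothetical helper for illustration only.
contig_bed() {
    local fai="$1" contig="$2"
    # BED coordinates are 0-based and half-open, so the whole contig
    # is the interval from 0 to its length.
    awk -v c="$contig" 'BEGIN { OFS = "\t" } $1 == c { print $1, 0, $2 }' "$fai"
}
```

Usage, assuming an index named hg19.fa.fai exists in the working directory: contig_bed hg19.fa.fai chr21 > hg19_chr21.bed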

Batch job
Most jobs should be run as batch jobs.

Create a batch input file. For example:

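#!/bin/bash
# Stop at the first failing command
set -e
module load glnexus
# Directory holding the gVCF inputs and the BED file
cd /data/user
glnexus --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf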

Submit this job using the Slurm sbatch command, giving the batch input file as the final argument.

sbatch [--cpus-per-task=#] [--mem=#]
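
The values given to --cpus-per-task and --mem are not automatically applied by the program itself. The sketch below shows one way a batch script might pass them on. It rests on several assumptions: that the Biowulf glnexus command accepts the upstream glnexus_cli options --threads and --mem-gbytes, and that Slurm exports SLURM_CPUS_PER_TASK and SLURM_MEM_PER_NODE (in megabytes) when the matching sbatch options are set. The helper name glnexus_resource_opts and its fallback values are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print glnexus resource options derived from the
# Slurm allocation. Outside a Slurm job it falls back to 2 threads and
# 4 GB, which are arbitrary illustrative defaults.
glnexus_resource_opts() {
    local threads="${SLURM_CPUS_PER_TASK:-2}"
    local mem_mb="${SLURM_MEM_PER_NODE:-4096}"
    # SLURM_MEM_PER_NODE is in megabytes; --mem-gbytes (assumed option)
    # expects gigabytes, so convert and never go below 1.
    local mem_gb=$(( mem_mb / 1024 ))
    [ "$mem_gb" -ge 1 ] || mem_gb=1
    echo "--threads $threads --mem-gbytes $mem_gb"
}

# Inside the batch script the options could be spliced into the command:
#   glnexus --config DeepVariant $(glnexus_resource_opts) \
#       --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
```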

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. glnexus.swarm). For example:

glnexus --config DeepVariant --bed genomic_range1.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output1.bcf
glnexus --config DeepVariant --bed genomic_range2.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output2.bcf
glnexus --config DeepVariant --bed genomic_range3.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output3.bcf
glnexus --config DeepVariant --bed genomic_range4.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output4.bcf

Submit this job using the swarm command.

swarm -f glnexus.swarm [-g #] [-t #] --module glnexus

  -g #              Number of gigabytes of memory required for each process (1 line in the swarm command file)
  -t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
  --module glnexus  Loads the glnexus module for each subjob in the swarm
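
The swarm file above expects one BED file per genomic range. As a minimal sketch, the BED files and the swarm file could be generated by splitting chr21 into equal windows. Two things here are assumptions rather than documented Biowulf behavior: the --dir option (upstream glnexus_cli uses it to choose the scratch directory, which otherwise defaults to ./GLnexus.DB and would be shared by subjobs running in the same directory), and the scratch directory names GLnexus.DB.N, which are hypothetical.

```shell
#!/bin/bash
contig=chr21
length=48129895   # hg19 chr21 length, as used in the interactive example
chunks=4

# Window size, rounded up so the final window reaches the contig end.
step=$(( (length + chunks - 1) / chunks ))

: > glnexus.swarm   # start with an empty swarm file
for i in $(seq 1 "$chunks"); do
    start=$(( (i - 1) * step ))
    end=$(( i * step ))
    if [ "$end" -gt "$length" ]; then end=$length; fi
    # BED intervals are 0-based and half-open, so consecutive windows
    # share a boundary value without overlapping.
    printf '%s\t%d\t%d\n' "$contig" "$start" "$end" > "genomic_range${i}.bed"
    # --dir (assumed option) gives each subjob its own scratch directory.
    echo "glnexus --config DeepVariant --dir GLnexus.DB.${i} --bed genomic_range${i}.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output${i}.bcf" >> glnexus.swarm
done
```

The resulting glnexus.swarm has one command per window and can be submitted with the swarm command shown above.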