GLnexus on Biowulf

Scalable gVCF merging and joint variant calling for population sequencing projects.


Interactive job
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ wget
[user@cn3144 ~]$ tar -xvf dv_platinum6_chr21_gvcf.tar
[user@cn3144 ~]$ rm dv_platinum6_chr21_gvcf.tar
[user@cn3144 ~]$ ls
[user@cn3144 ~]$ module load glnexus
[user@cn3144 ~]$ echo -e "chr21\t0\t48129895" > hg19_chr21.bed
[user@cn3144 ~]$ glnexus --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
[31659] [2019-11-20 13:26:20.979] [GLnexus] [info] glnexus_cli v1.1.11-0-gf4ce0ff, built Mon Oct  7 17:10:41 2019
[31659] [2019-11-20 13:26:20.981] [GLnexus] [warning] jemalloc absent, which will impede performance with high thread counts. See
[31659] [2019-11-20 13:26:20.981] [GLnexus] [info] Loading config preset DeepVariant
[31659] [2019-11-20 13:26:20.986] [GLnexus] [info] config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 0
  min_AQ2: 0
  min_GQ: 0
  max_alleles_per_site: 0
  monoallelic_sites_for_lost_alleles: true
  preference: common
  revise_genotypes: false
  min_assumed_allele_frequency: 0.0001
  required_dp: 0
  allow_partial_data: false
  allele_dp_format: AD
  ref_dp_format: MIN_DP
  output_residuals: false
  squeeze: false
  output_format: BCF
    - {orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [AD], name: AD, description: "##FORMAT=", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}
    - {orig_names: [GQ], name: GQ, description: "##FORMAT=", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [PL], name: PL, description: "##FORMAT=", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}
[31659] [2019-11-20 13:26:20.987] [GLnexus] [info] config CRC32C = 754566310
[31659] [2019-11-20 13:26:20.987] [GLnexus] [info] init database, exemplar_vcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] Initialized GLnexus database in GLnexus.DB
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] bucket size: 30000
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] contigs: chrM chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chr1_gl000191_random chr1_gl000192_random chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random chr8_gl000197_random chr9_gl000198_random chr9_gl000199_random chr9_gl000200_random chr9_gl000201_random chr11_gl000202_random chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random chrUn_gl000211 chrUn_gl000212 chrUn_gl000213 chrUn_gl000214 chrUn_gl000215 chrUn_gl000216 chrUn_gl000217 chrUn_gl000218 chrUn_gl000219 chrUn_gl000220 chrUn_gl000221 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224 chrUn_gl000225 chrUn_gl000226 chrUn_gl000227 chrUn_gl000228 chrUn_gl000229 chrUn_gl000230 chrUn_gl000231 chrUn_gl000232 chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236 chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
[31659] [2019-11-20 13:26:21.155] [GLnexus] [info] db_get_contigs GLnexus.DB
[31659] [2019-11-20 13:26:21.218] [GLnexus] [info] Beginning bulk load with no range filter.
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Loaded 6 datasets with 6 samples; 239726128 bytes in 2572592 BCF records (10 duplicate) in 7062 buckets. Bucket max 551480 bytes, 5645 records. 0 BCF records skipped due to caller-specific exceptions
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Created sample set *@6
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Flushing and compacting database...
[31659] [2019-11-20 13:26:29.064] [GLnexus] [info] Bulk load complete!
[31659] [2019-11-20 13:26:29.089] [GLnexus] [info] found sample set *@6
[31659] [2019-11-20 13:26:29.089] [GLnexus] [info] discovering alleles in 1 range(s)
[31659] [2019-11-20 13:26:35.132] [GLnexus] [info] discovered 258742 alleles
[31659] [2019-11-20 13:26:35.964] [GLnexus] [info] unified to 124038 sites cleanly with 130071 ALT alleles. 140 ALT alleles were additionally included in monoallelic sites and 0 were filtered out on quality thresholds.
[31659] [2019-11-20 13:26:35.964] [GLnexus] [info] Lifting over 4 fields
[31659] [2019-11-20 13:26:35.987] [GLnexus] [info] found sample set *@6
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] genotyping complete!
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] worker threads were cumulatively stalled for 343210ms
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] Num BCF records read 4821531  query hits 771764
[user@cn3144 ~]$ ls
dv_platinum6_chr21.bcf	dv_platinum6_chr21_gvcf  GLnexus.DB  hg19_chr21.bed
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
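
The BED file in the session above covers all of chr21 as a single zero-based, half-open interval. A minimal sketch of the same step using printf, which treats \t the same way in every shell (echo -e does not); the function name make_contig_bed is hypothetical and not part of glnexus:

```shell
#!/bin/bash
# Hypothetical helper: print a three-column BED line spanning a whole contig.
# BED coordinates are zero-based and half-open, so the interval is [0, length).
make_contig_bed() {
    local contig="$1" length="$2"
    printf '%s\t%s\t%s\n' "$contig" 0 "$length"
}

# Same interval as in the session above (chr21 in hg19).
make_contig_bed chr21 48129895 > hg19_chr21.bed
```

The resulting file can be passed to glnexus with --bed exactly as in the session above.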

Batch job
#!/bin/bash
set -e
module load glnexus
cd /data/user
glnexus --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf

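Submit the script with sbatch (glnexus.sh is an example name for the file holding the commands above):
sbatch [--cpus-per-task=#] [--mem=#] glnexus.sh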
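
The CPU count requested from sbatch and the number of threads glnexus uses can be kept in step. The sketch below writes a variant of the batch script that reads SLURM_CPUS_PER_TASK and passes it to glnexus. Assumptions: the file name glnexus_threads.sh is made up for illustration, and the --threads and --dir options should be confirmed with glnexus --help for the installed version. A per-job --dir is used because the session log above shows glnexus creating GLnexus.DB in the working directory and leaving it there, so a second run in the same directory would find it already present.

```shell
#!/bin/bash
# Write a hypothetical batch script, glnexus_threads.sh, for submission with sbatch.
cat > glnexus_threads.sh <<'EOF'
#!/bin/bash
set -e
module load glnexus
cd /data/user
# Use as many threads as Slurm allocated (falls back to 2 outside a job).
threads="${SLURM_CPUS_PER_TASK:-2}"
# Per-job scratch database directory, so repeated runs do not collide.
dbdir="GLnexus.DB.${SLURM_JOB_ID:-manual}"
glnexus --config DeepVariant --bed hg19_chr21.bed \
    --threads "$threads" --dir "$dbdir" \
    dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
EOF

# Check the generated script for syntax errors without running it.
bash -n glnexus_threads.sh
```

It would then be submitted in the usual way, for example sbatch --cpus-per-task=8 --mem=16g glnexus_threads.sh, where the CPU and memory values are placeholders rather than recommendations.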

Swarm of Jobs
Create a swarm command file (e.g. glnexus.swarm) with one command per line:
glnexus --config DeepVariant --bed genomic_range1.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output1.bcf
glnexus --config DeepVariant --bed genomic_range2.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output2.bcf
glnexus --config DeepVariant --bed genomic_range3.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output3.bcf
glnexus --config DeepVariant --bed genomic_range4.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output4.bcf

Submit the swarm file with:
swarm -f glnexus.swarm [-g #] [-t #] --module glnexus
where
-g #              Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
--module glnexus  Loads the glnexus module for each subjob in the swarm
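
For more than a handful of regions, the swarm command file can be generated rather than typed. A minimal sketch, assuming one BED file per region: the helper name make_glnexus_swarm is hypothetical, the outputs are named after the BED files rather than output1.bcf and so on, and each command is given its own --dir so that subjobs running in the same directory do not all try to create ./GLnexus.DB (confirm the option with glnexus --help for the installed version).

```shell
#!/bin/bash
# Hypothetical helper: print one glnexus command per BED file given as an argument.
make_glnexus_swarm() {
    local bed base
    for bed in "$@"; do
        base="$(basename "$bed" .bed)"
        # Separate --dir per command so concurrent subjobs do not share ./GLnexus.DB.
        printf 'glnexus --config DeepVariant --bed %s --dir GLnexus.DB.%s dv_platinum6_chr21_gvcf/*.gvcf.gz > %s.bcf\n' \
            "$bed" "$base" "$base"
    done
}

make_glnexus_swarm genomic_range1.bed genomic_range2.bed \
    genomic_range3.bed genomic_range4.bed > glnexus.swarm
```

The generated glnexus.swarm is submitted with the same swarm command as a hand-written one.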