Scalable gVCF merging and joint variant calling for population sequencing projects.
Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ wget https://raw.githubusercontent.com/wiki/dnanexus-rnd/GLnexus/data/dv_platinum6_chr21_gvcf.tar.gz
[user@cn3144 ~]$ tar -xvf dv_platinum6_chr21_gvcf.tar.gz
[user@cn3144 ~]$ rm dv_platinum6_chr21_gvcf.tar.gz
[user@cn3144 ~]$ ls
dv_platinum6_chr21_gvcf
[user@cn3144 ~]$ module load glnexus
[user@cn3144 ~]$ echo -e "chr21\t0\t48129895" > hg19_chr21.bed
[user@cn3144 ~]$ glnexus --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
[31659] [2019-11-20 13:26:20.979] [GLnexus] [info] glnexus_cli v1.1.11-0-gf4ce0ff, built Mon Oct 7 17:10:41 2019
[31659] [2019-11-20 13:26:20.981] [GLnexus] [warning] jemalloc absent, which will impede performance with high thread counts. See https://github.com/dnanexus-rnd/GLnexus/wiki/Performance
[31659] [2019-11-20 13:26:20.981] [GLnexus] [info] Loading config preset DeepVariant
[31659] [2019-11-20 13:26:20.986] [GLnexus] [info] config:
unifier_config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 0
  min_AQ2: 0
  min_GQ: 0
  max_alleles_per_site: 0
  monoallelic_sites_for_lost_alleles: true
  preference: common
genotyper_config:
  revise_genotypes: false
  min_assumed_allele_frequency: 0.0001
  required_dp: 0
  allow_partial_data: false
  allele_dp_format: AD
  ref_dp_format: MIN_DP
  output_residuals: false
  squeeze: false
  output_format: BCF
  liftover_fields:
    - {orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [AD], name: AD, description: "##FORMAT= ", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}
    - {orig_names: [GQ], name: GQ, description: "##FORMAT= ", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [PL], name: PL, description: "##FORMAT= ", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}
[31659] [2019-11-20 13:26:20.987] [GLnexus] [info] config CRC32C = 754566310
[31659] [2019-11-20 13:26:20.987] [GLnexus] [info] init database, exemplar_vcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] Initialized GLnexus database in GLnexus.DB
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] bucket size: 30000
[31659] [2019-11-20 13:26:21.139] [GLnexus] [info] contigs: chrM chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chr1_gl000191_random chr1_gl000192_random chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random chr8_gl000197_random chr9_gl000198_random chr9_gl000199_random chr9_gl000200_random chr9_gl000201_random chr11_gl000202_random chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random chrUn_gl000211 chrUn_gl000212 chrUn_gl000213 chrUn_gl000214 chrUn_gl000215 chrUn_gl000216 chrUn_gl000217 chrUn_gl000218 chrUn_gl000219 chrUn_gl000220 chrUn_gl000221 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224 chrUn_gl000225 chrUn_gl000226 chrUn_gl000227 chrUn_gl000228 chrUn_gl000229 chrUn_gl000230 chrUn_gl000231 chrUn_gl000232 chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236 chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
[31659] [2019-11-20 13:26:21.155] [GLnexus] [info] db_get_contigs GLnexus.DB
[31659] [2019-11-20 13:26:21.218] [GLnexus] [info] Beginning bulk load with no range filter.
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Loaded 6 datasets with 6 samples; 239726128 bytes in 2572592 BCF records (10 duplicate) in 7062 buckets. Bucket max 551480 bytes, 5645 records. 0 BCF records skipped due to caller-specific exceptions
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Created sample set *@6
[31659] [2019-11-20 13:26:27.835] [GLnexus] [info] Flushing and compacting database...
[31659] [2019-11-20 13:26:29.064] [GLnexus] [info] Bulk load complete!
[31659] [2019-11-20 13:26:29.089] [GLnexus] [info] found sample set *@6
[31659] [2019-11-20 13:26:29.089] [GLnexus] [info] discovering alleles in 1 range(s)
[31659] [2019-11-20 13:26:35.132] [GLnexus] [info] discovered 258742 alleles
[31659] [2019-11-20 13:26:35.964] [GLnexus] [info] unified to 124038 sites cleanly with 130071 ALT alleles. 140 ALT alleles were additionally included in monoallelic sites and 0 were filtered out on quality thresholds.
[31659] [2019-11-20 13:26:35.964] [GLnexus] [info] Lifting over 4 fields
[31659] [2019-11-20 13:26:35.987] [GLnexus] [info] found sample set *@6
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] genotyping complete!
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] worker threads were cumulatively stalled for 343210ms
[31659] [2019-11-20 13:26:48.314] [GLnexus] [info] Num BCF records read 4821531 query hits 771764
[user@cn3144 ~]$ ls
dv_platinum6_chr21.bcf  dv_platinum6_chr21_gvcf  GLnexus.DB  hg19_chr21.bed

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
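The BED file in the session above limits calling to a single range on chr21 (0 to 48129895, the hg19 length of that contig). A minimal sketch of writing and sanity-checking such a file follows; the variable names are illustrative only, and the awk check is an optional extra rather than anything GLnexus requires.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical variable names; the values mirror the interactive session.
contig="chr21"
start=0          # BED start coordinates are 0-based
end=48129895     # BED end coordinates are exclusive
bed="hg19_chr21.bed"

# printf expands \t itself, so the result does not depend on how echo treats -e.
printf '%s\t%d\t%d\n' "$contig" "$start" "$end" > "$bed"

# Sanity check: every line has exactly three tab-separated columns and start < end.
awk -F'\t' 'NF != 3 || $2 >= $3 { bad = 1 } END { exit bad }' "$bed" \
  && echo "$bed looks valid"
```

Spaces instead of tabs are an easy mistake to make when typing a BED file by hand, which is why the check splits on tabs only.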
Create a batch input file (e.g. glnexus.sh). For example:
#!/bin/bash
set -e
module load glnexus
cd /data/user
glnexus --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] glnexus.sh
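When --cpus-per-task is passed to sbatch, Slurm exports SLURM_CPUS_PER_TASK inside the job, and the batch script can hand that value to GLnexus so the thread count matches the allocation. The sketch below writes a hypothetical variant of the script, glnexus_threads.sh. It assumes that the glnexus command provided by the module accepts the --threads and --dir options listed in the glnexus_cli help text, so confirm with glnexus --help for the installed version. The fallback of 2 threads and the scratch directory name are illustrative choices.

```shell
# Write a hypothetical batch script; nothing is submitted or run here.
cat > glnexus_threads.sh <<'EOF'
#!/bin/bash
set -e
module load glnexus
cd /data/user

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back to 2 otherwise.
threads="${SLURM_CPUS_PER_TASK:-2}"

# A job-specific scratch directory, so a GLnexus.DB left by an earlier run is not in the way.
db_dir="GLnexus.DB.${SLURM_JOB_ID:-manual}"

glnexus --config DeepVariant \
        --bed hg19_chr21.bed \
        --threads "$threads" \
        --dir "$db_dir" \
        dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
EOF

# Check the script's syntax without executing it.
bash -n glnexus_threads.sh && echo "syntax OK"
```

It could then be submitted with, for example, sbatch --cpus-per-task=8 --mem=16g glnexus_threads.sh, where both resource values are placeholders to adjust for the data set.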
Create a swarmfile (e.g. glnexus.swarm). For example:
glnexus --config DeepVariant --bed genomic_range1.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output1.bcf
glnexus --config DeepVariant --bed genomic_range2.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output2.bcf
glnexus --config DeepVariant --bed genomic_range3.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output3.bcf
glnexus --config DeepVariant --bed genomic_range4.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output4.bcf
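The genomic_rangeN.bed files named in the swarmfile are not created anywhere on this page. One possible way to produce them, sketched here under the assumption that the goal is to split hg19 chr21 into four contiguous, non-overlapping ranges, is shown below. The split into equal-sized pieces is an illustrative choice; how calls near a range boundary should be handled when the outputs are later combined is not covered here.

```shell
#!/bin/bash
set -euo pipefail

contig="chr21"
length=48129895   # hg19 length of chr21, as used in the interactive example
parts=4           # one BED file per line of the swarmfile

# Ceiling division, so that parts * step always covers the whole contig.
step=$(( (length + parts - 1) / parts ))

for i in $(seq 1 "$parts"); do
    start=$(( (i - 1) * step ))
    end=$(( i * step ))
    # Clamp the final range to the contig length.
    if [ "$end" -gt "$length" ]; then end=$length; fi
    printf '%s\t%d\t%d\n' "$contig" "$start" "$end" > "genomic_range${i}.bed"
done

cat genomic_range*.bed
```

Because BED ends are exclusive, each range starts at exactly the coordinate where the previous one ends, so no base is covered twice.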
Submit this job using the swarm command.
swarm -f glnexus.swarm [-g #] [-t #] --module glnexus
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module glnexus | Loads the glnexus module for each subjob in the swarm |
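The interactive session shows GLnexus creating its database as GLnexus.DB in the working directory. If several swarm subjobs start in the same directory, they would presumably all try to use that one path. A hedged sketch of one way around this is to generate the swarmfile with a separate scratch directory per line, assuming the installed glnexus accepts the --dir option from the glnexus_cli help text (check glnexus --help). The directory names GLnexus.DB.1 through GLnexus.DB.4 are hypothetical.

```shell
#!/bin/bash
set -euo pipefail

swarmfile="glnexus.swarm"
: > "$swarmfile"   # start from an empty file

for i in 1 2 3 4; do
    # Single quotes keep the glob and the redirection literal, so they are
    # expanded later by each subjob rather than by this generator.
    printf 'glnexus --config DeepVariant --bed genomic_range%d.bed --dir GLnexus.DB.%d dv_platinum6_chr21_gvcf/*.gvcf.gz > output%d.bcf\n' \
        "$i" "$i" "$i" >> "$swarmfile"
done

cat "$swarmfile"
```

The resulting file is submitted with the same swarm command as above.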