Scalable gVCF merging and joint variant calling for population sequencing projects.
Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ cp ${GLNEXUS_TEST_DATA}/* .
[user@cn3144 ~]$ tar -xvf dv_platinum6_chr21_gvcf.tar
dv_platinum6_chr21_gvcf/
dv_platinum6_chr21_gvcf/NA12890.chr21.gvcf.gz
dv_platinum6_chr21_gvcf/NA12892.chr21.gvcf.gz
dv_platinum6_chr21_gvcf/NA12891.chr21.gvcf.gz
dv_platinum6_chr21_gvcf/NA12889.chr21.gvcf.gz
dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz
dv_platinum6_chr21_gvcf/NA12878.chr21.gvcf.gz
[user@cn3144 ~]$ rm dv_platinum6_chr21_gvcf.tar
[user@cn3144 ~]$ ls
dv_platinum6_chr21_gvcf
[user@cn3144 ~]$ module load glnexus
[user@cn3144 ~]$ echo -e "chr21\t0\t48129895" > hg19_chr21.bed
[user@cn3144 ~]$ glnexus_cli --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
[1485789] [2023-09-08 12:31:07.444] [GLnexus] [info] glnexus_cli release v1.4.1-0-g68e25e5 Aug 13 2021
[1485789] [2023-09-08 12:31:07.445] [GLnexus] [info] detected jemalloc 5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756
[1485789] [2023-09-08 12:31:07.446] [GLnexus] [info] Loading config preset DeepVariant
[1485789] [2023-09-08 12:31:07.449] [GLnexus] [info] config:
unifier_config:
drop_filtered: false
min_allele_copy_number: 1
min_AQ1: 10
min_AQ2: 10
min_GQ: 0
max_alleles_per_site: 32
monoallelic_sites_for_lost_alleles: true
preference: common
genotyper_config:
revise_genotypes: true
min_assumed_allele_frequency: 9.99999975e-05
snv_prior_calibration: 0.600000024
indel_prior_calibration: 0.449999988
required_dp: 0
allow_partial_data: true
allele_dp_format: AD
ref_dp_format: MIN_DP
output_residuals: false
more_PL: true
squeeze: false
trim_uncalled_alleles: true
top_two_half_calls: false
output_format: BCF
liftover_fields:
[...]
[1485789] [2023-09-08 12:31:07.605] [GLnexus] [info] db_get_contigs GLnexus.DB
[1485789] [2023-09-08 12:31:07.674] [GLnexus] [info] Beginning bulk load with no range filter.
[1485789] [2023-09-08 12:31:10.919] [GLnexus] [info] Loaded 6 datasets with 6 samples; 239726128 bytes in 2572592 BCF records (10 duplicate) in 7062 buckets. Bucket max 551480 bytes, 5645 records. 0 BCF records skipped due to caller-specific exceptions
[1485789] [2023-09-08 12:31:10.919] [GLnexus] [info] Created sample set *@6
[1485789] [2023-09-08 12:31:10.919] [GLnexus] [info] Flushing database...
[1485789] [2023-09-08 12:31:11.545] [GLnexus] [info] Bulk load complete!
[1485789] [2023-09-08 12:31:11.558] [GLnexus] [info] found sample set *@6
[1485789] [2023-09-08 12:31:11.558] [GLnexus] [info] discovering alleles in 1 range(s) on 126 threads
[1485789] [2023-09-08 12:31:14.064] [GLnexus] [info] discovered 258742 alleles
[1485789] [2023-09-08 12:31:14.469] [GLnexus] [info] unified to 117841 sites cleanly with 122084 ALT alleles. 66 ALT alleles were additionally included in monoallelic sites and 8061 were filtered out on quality thresholds.
[1485789] [2023-09-08 12:31:14.469] [GLnexus] [info] Finishing database compaction...
[1485789] [2023-09-08 12:31:14.498] [GLnexus] [info] genotyping 117841 sites; sample set = *@6 mem_budget = 0 threads = 128
[1485789] [2023-09-08 12:31:20.343] [GLnexus] [info] genotyping complete!
[1485789] [2023-09-08 12:31:20.343] [GLnexus] [info] worker threads were cumulatively stalled for 456500ms
[1485789] [2023-09-08 12:31:20.343] [GLnexus] [info] Num BCF records read 4574092 query hits 727711
[user@cn3144 ~]$ ls
dv_platinum6_chr21.bcf dv_platinum6_chr21_gvcf GLnexus.DB hg19_chr21.bed
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
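The final `ls` in the session shows that the run leaves a `GLnexus.DB` directory next to the output file. The `glnexus_cli --help` text describes this scratch directory as one that must not already exist, so a second run in the same directory would be expected to fail until it is removed. A minimal cleanup sketch, assuming the directory holds only intermediate data that is no longer needed once the BCF has been written:

```shell
# Remove the scratch database left by a previous glnexus_cli run.
# Assumption: GLnexus.DB holds only intermediate data and the final BCF
# has already been written, so nothing of value is lost.
scratch_dir="GLnexus.DB"
if [ -d "$scratch_dir" ]; then
    rm -rf -- "$scratch_dir"
    echo "removed $scratch_dir"
else
    echo "no $scratch_dir to remove"
fi
```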
Create a batch input file (e.g. glnexus.sh). For example:
#!/bin/bash
set -e
module load glnexus
cd /data/user
glnexus_cli --config DeepVariant --bed hg19_chr21.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] glnexus.sh
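The interactive log reports allele discovery on 126 threads and genotyping with 128, which suggests that glnexus_cli sizes its thread pool from the node hardware rather than from the Slurm allocation. The sketch below writes a variant of the batch file that passes the allocated CPU count explicitly. It assumes the `--threads` and `--dir` options behave as listed in `glnexus_cli --help`, and relies on the standard Slurm variables `SLURM_CPUS_PER_TASK` and `SLURM_JOB_ID`; the scratch directory name and the fallback thread count are illustrative only.

```shell
# Write a variant of glnexus.sh that caps glnexus_cli at the allocated CPUs.
# Assumptions: --threads and --dir behave as described in `glnexus_cli --help`;
# the scratch directory name GLnexus.DB.<jobid> is illustrative only.
cat > glnexus.sh <<'EOF'
#!/bin/bash
set -e
module load glnexus
cd /data/user

# Fall back to a small default if the job was submitted without --cpus-per-task.
threads="${SLURM_CPUS_PER_TASK:-2}"

# Use a per-job scratch database so repeated runs do not collide.
scratch="GLnexus.DB.${SLURM_JOB_ID:-manual}"

glnexus_cli --config DeepVariant --bed hg19_chr21.bed \
    --threads "$threads" --dir "$scratch" \
    dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
EOF
```

With this variant, the value given to `--cpus-per-task` at submission becomes the thread budget, so the two stay in step.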
Create a swarmfile (e.g. glnexus.swarm). For example:
glnexus_cli --config DeepVariant --bed genomic_range1.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output1.bcf
glnexus_cli --config DeepVariant --bed genomic_range2.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output2.bcf
glnexus_cli --config DeepVariant --bed genomic_range3.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output3.bcf
glnexus_cli --config DeepVariant --bed genomic_range4.bed dv_platinum6_chr21_gvcf/*.gvcf.gz > output4.bcf
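For more than a handful of ranges, the swarmfile can be generated rather than typed. The loop below is a sketch that assumes the `genomic_rangeN.bed` files already exist. It also adds a separate `--dir` scratch path per line, on the assumption (from `glnexus_cli --help`) that every command would otherwise try to create the same default `./GLnexus.DB`, so subjobs sharing a working directory are likely to collide. The `GLnexus.DB.N` names are hypothetical.

```shell
# Generate glnexus.swarm with one glnexus_cli command per BED file.
# Assumptions: the genomic_rangeN.bed files already exist, and --dir
# (scratch database path) is available as listed in `glnexus_cli --help`.
n_ranges=4
: > glnexus.swarm   # start from an empty swarmfile
for i in $(seq 1 "$n_ranges"); do
    # A separate scratch directory per line keeps concurrent subjobs that
    # share a working directory from colliding on the default ./GLnexus.DB.
    printf 'glnexus_cli --config DeepVariant --bed genomic_range%d.bed --dir GLnexus.DB.%d dv_platinum6_chr21_gvcf/*.gvcf.gz > output%d.bcf\n' \
        "$i" "$i" "$i" >> glnexus.swarm
done
cat glnexus.swarm
```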
Submit this job using the swarm command.
swarm -f glnexus.swarm [-g #] [-t #] --module glnexus
where
| Option | Description |
| --- | --- |
| -g # | Number of gigabytes of memory required for each process (1 line in the swarm command file). |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
| --module glnexus | Loads the glnexus module for each subjob in the swarm. |
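The swarm example reads one BED file per command (`genomic_range1.bed` through `genomic_range4.bed`) but does not show how those files are made. One possible approach, sketched here, splits the chr21 interval from the interactive session into four equal-width pieces. The equal-width split is an assumption chosen for illustration; it takes no account of where variants fall relative to the chunk boundaries.

```shell
# Split one contig into equal-width BED intervals, one file per swarm line.
# BED coordinates are 0-based and half-open, so adjacent chunks share a
# boundary value without overlapping.
contig="chr21"
contig_len=48129895   # chr21 end coordinate used in the interactive session
n_ranges=4

# Ceiling division so n_ranges chunks always cover the whole contig.
chunk=$(( (contig_len + n_ranges - 1) / n_ranges ))

for i in $(seq 1 "$n_ranges"); do
    start=$(( (i - 1) * chunk ))
    end=$(( i * chunk ))
    # Clamp the final chunk to the contig length.
    if [ "$end" -gt "$contig_len" ]; then
        end=$contig_len
    fi
    printf '%s\t%d\t%d\n' "$contig" "$start" "$end" > "genomic_range${i}.bed"
done
cat genomic_range*.bed
```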