The Pangenome Graph Builder (pggb) builds pangenome variation graphs from a set of input sequences. A pangenome variation graph is a kind of generic multiple sequence alignment. It lets us understand any kind of sequence variation between a collection of genomes. It shows us similarity where genomes walk through the same parts of the graph, and differences where they do not.
pggb generates this kind of graph using an all-to-all alignment of input sequences (wfmash), graph induction (seqwish), and progressive normalization (smoothxg, gfaffix). After construction, pggb generates diagnostic visualizations of the graph (odgi). A variant call report (in VCF) representing both small and large variants can be generated based on any reference genome included in the graph (vg). pggb writes its output in GFAv1 format, which can be used as input by numerous "genome graph" and pangenome tools, such as the vg and odgi toolkits.
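Because the output is plain-text GFAv1, a quick sanity check needs nothing more than standard shell tools. The sketch below counts segments (S), links (L), and paths (P) in a graph and sums the segment sequence lengths. The toy graph and the names sampleA and sampleB are hypothetical stand-ins, not pggb output; to inspect a real graph, point the gfa variable at a .gfa file in the pggb output directory.

```shell
#!/bin/bash
set -euo pipefail

# Toy GFAv1 graph standing in for a pggb output file (hypothetical content).
# sampleA walks segments 1,2,3; sampleB skips segment 2, so the two paths
# share segments 1 and 3 and differ at segment 2.
workdir=$(mktemp -d)
gfa="$workdir/toy.gfa"
{
    printf 'H\tVN:Z:1.0\n'
    printf 'S\t1\tACGT\n'
    printf 'S\t2\tG\n'
    printf 'S\t3\tTTA\n'
    printf 'L\t1\t+\t2\t+\t0M\n'
    printf 'L\t2\t+\t3\t+\t0M\n'
    printf 'L\t1\t+\t3\t+\t0M\n'
    printf 'P\tsampleA\t1+,2+,3+\t*\n'
    printf 'P\tsampleB\t1+,3+\t*\n'
} > "$gfa"

# The first tab-separated field of each GFA line is the record type.
n_segments=$(awk -F'\t' '$1 == "S" {n++} END {print n + 0}' "$gfa")
n_links=$(awk -F'\t' '$1 == "L" {n++} END {print n + 0}' "$gfa")
n_paths=$(awk -F'\t' '$1 == "P" {n++} END {print n + 0}' "$gfa")

# Field 3 of an S line is the segment sequence; sum the lengths.
total_bp=$(awk -F'\t' '$1 == "S" {bp += length($3)} END {print bp + 0}' "$gfa")

echo "segments: $n_segments"
echo "links:    $n_links"
echo "paths:    $n_paths"
echo "bases:    $total_bp"
```

The odgi and vg toolkits provide richer statistics; this is only a lightweight check that a graph file is populated and contains one path per expected input sequence.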
pggb has been tested at scale in the Human Pangenome Reference Consortium (HPRC) as a method to build a graph from the draft human pangenome.
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive --gres lscratch:10 --cpus-per-task 6 --mem 16g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load pggb
[user@cn3144 ~]$ pggb -i $PGGB_HOME/data/HLA/DRB1-3123.fa.gz -n 12 -t $SLURM_CPUS_PER_TASK -o out -M --temp-dir /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
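The session above passes -n 12. Assuming -n is meant to match the number of haplotypes in the input, and that each haplotype is a single FASTA record (both are assumptions, not stated on this page), the value can be derived by counting header lines. The toy FASTA and its sequence names below are hypothetical.

```shell
#!/bin/bash
set -euo pipefail

# Toy gzipped FASTA with three records (hypothetical data).
workdir=$(mktemp -d)
fasta="$workdir/toy.fa.gz"
printf '>hapA\nACGT\n>hapB\nACGA\n>hapC\nACTT\n' | gzip -c > "$fasta"

# Every FASTA record begins with a '>' header line, so counting those
# lines counts the sequences in the file.
n_seqs=$(gzip -dc "$fasta" | grep -c '^>')

echo "sequences: $n_seqs"
```

For your own data, set the fasta variable to the input file and use the reported count as a starting point for -n. If one haplotype spans several contigs, the record count will overestimate the number of haplotypes.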
Create a batch input file (e.g. pggb.sh). For example:
#!/bin/bash
set -e
module load pggb
pggb -i $PGGB_HOME/data/HLA/DRB1-3123.fa.gz -n 12 -t $SLURM_CPUS_PER_TASK -V 'gi|568815561' -o out -M --temp-dir /lscratch/$SLURM_JOB_ID
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=# [--mem=#] --gres lscratch:# pggb.sh
Create a swarmfile (e.g. pggb.swarm). For example:
pggb -i output/DRB1-3123.fa.gz.bf3285f.community.0.fa \
     -o output/DRB1-3123.fa.gz.bf3285f.community.0.fa.out \
     -s 5000 -l 25000 -p 90 -c 1 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10M \
     -n 12 -j 0 -e 0 -G 700,900,1100 -P 1,4,6,2,26,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" --temp-dir /lscratch/$SLURM_JOB_ID --threads $SLURM_CPUS_PER_TASK --poa-threads $SLURM_CPUS_PER_TASK
pggb -i output/DRB1-3123.fa.gz.bf3285f.community.1.fa \
     -o output/DRB1-3123.fa.gz.bf3285f.community.1.fa.out \
     -s 5000 -l 25000 -p 90 -c 1 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10M \
     -n 12 -j 0 -e 0 -G 700,900,1100 -P 1,4,6,2,26,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" --temp-dir /lscratch/$SLURM_JOB_ID --threads $SLURM_CPUS_PER_TASK --poa-threads $SLURM_CPUS_PER_TASK
You can use the partition-before-pggb command to help create the swarmfile for you by preparing the input and generating the commands.
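If you prefer to write the swarmfile by hand, a short loop can emit one pggb command per community FASTA. This is a minimal sketch, not a replacement for partition-before-pggb: the toy file names are hypothetical, and only a few options from the examples on this page are used (-i, -o, -n, --temp-dir, --threads), with -n 12 copied from those examples rather than tuned.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical per-community FASTA files standing in for partitioned input.
workdir=$(mktemp -d)
mkdir -p "$workdir/output"
for i in 0 1 2; do
    printf '>seq%s\nACGT\n' "$i" > "$workdir/output/toy.community.$i.fa"
done

swarmfile="$workdir/pggb.swarm"
: > "$swarmfile"   # start with an empty swarmfile

for fa in "$workdir"/output/*.community.*.fa; do
    # The single-quoted format string keeps $SLURM_JOB_ID and
    # $SLURM_CPUS_PER_TASK literal, so they are expanded inside each
    # subjob at run time rather than when this script runs.
    printf 'pggb -i %s -o %s.out -n 12 --temp-dir /lscratch/$SLURM_JOB_ID --threads $SLURM_CPUS_PER_TASK\n' \
        "$fa" "$fa" >> "$swarmfile"
done

cat "$swarmfile"
n_cmds=$(grep -c '^pggb ' "$swarmfile")
echo "commands written: $n_cmds"
```

Each line of the resulting file is one independent subjob, which is why the per-process memory and thread settings passed to swarm apply to every pggb command individually.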
Submit this job using the swarm command.
swarm -f pggb.swarm [-g #] [-t #] --module pggb
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module pggb | Loads the pggb module for each subjob in the swarm |