GT-Pro on Biowulf

GT-Pro utilizes an exact matching algorithm to perform ultra-rapid and accurate genotyping of known SNPs from metagenomes.

To genotype a microbiome, GT-Pro takes as input one or more shotgun metagenomic sequencing libraries in FASTQ format. It returns counts of reads exactly matching each allele of each SNP in a concise tabular format, with one row per bi-allelic SNP site and exactly 8 fields per row: species, global position, contig, contig position, allele 1, allele 2, coverage of allele 1, and coverage of allele 2. The k-mer exact-match genotyping algorithm is optimized for machine specifications and can run on a personal computer.
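
As an illustration of how the 8-field table can be consumed, here is a minimal sketch that counts SNP sites and sums allele coverage per species. It assumes the table is tab-separated, has no header row, and lists the columns in the order given above. The function name `summarize_gtpro` is hypothetical and not part of GT-Pro; check the delimiter and layout of your own output before relying on it.

```shell
#!/bin/bash
# Summarize a GT-Pro style table: per species, count SNP sites and sum coverage.
# Assumed layout (tab-separated, no header), following the field list above:
#   1 species, 2 global position, 3 contig, 4 contig position,
#   5 allele 1, 6 allele 2, 7 coverage of allele 1, 8 coverage of allele 2
summarize_gtpro() {
    awk -F'\t' '
        NF == 8 {                      # ignore rows that do not have 8 fields
            sites[$1]++                # one row per bi-allelic SNP site
            depth[$1] += $7 + $8       # reads matching either allele
        }
        END {
            for (s in sites)
                printf "%s\t%d\t%d\n", s, sites[s], depth[s]
        }
    ' "$1" | sort
}
```

Calling `summarize_gtpro table.tsv` (with `table.tsv` standing in for your own file) prints one line per species: species, number of SNP sites, and total coverage.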

Important Notes

This application requires a graphical connection using NX

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows each shell prompt):

[user@biowulf]$ sinteractive -c16 --mem 48g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load gt-pro
[+] Loading gt-pro-db, version 20190723_881species...
[+] Loading gt-pro, version 1.0.1-20230507...
[user@cn3144 ~]$ ln -s $GTPRO_DB* .
[user@cn3144 ~]$ ls -l
total 0
lrwxrwxrwx 1 user user 59 Aug  5 10:03 20190723_881species_optimized_db_kmer_index.bin -> /fdb/gt-pro/20190723_881species_optimized_db_kmer_index.bin
lrwxrwxrwx 1 user user 53 Aug  5 10:03 20190723_881species_optimized_db_snps.bin -> /fdb/gt-pro/20190723_881species_optimized_db_snps.bin
[user@cn3144 ~]$ GT_Pro optimize --db $(basename $GTPRO_DB) --in $GTPRO_HOME/test/SRR413665_2.fastq.gz
[OK] start initial optimization
[OK] database found
[OK] optimize from 20190723_881species_optimized_db_kmer_index.bin and 20190723_881species_optimized_db_snps.bin
[OK] initial optimization done
[OK] finalize optimization with a break-in test
[OK] optimization done
[user@cn3144 ~]$ GT_Pro genotype -d $(basename $GTPRO_DB) -C $PWD $GTPRO_HOME/test/SRR413665_2.fastq.gz
gt_pro  20190723_881species     72      no_overwrite
1722869727070:  [Info] Starting to load DB: 20190723_881species
1722869727081:  [Info] MMAPPING ./20190723_881species_optimized_db_snps.bin
1722869728016:  [Info] MMAPPING ./20190723_881species_optimized_db_kmer_index.bin
1722869745073:  [Info] Using -l 32 -m 36 as optimal for system RAM
1722869745073:  [Info] MMAPPING ./20190723_881species_optimized_db_mmer_bloom_36.bin
1722869746084:  [Info] MMAPPING ./20190723_881species_optimized_db_lmer_index_32.bin
1722869776733:  [Info] Done with init for optimized DB with 2856121626 kmers.  That took 49 seconds.
1722869776740: [WARNING] Ignoring specified -C prefix for non-relative input path: /usr/local/apps/gt-pro/1.0.1-20230507/test/SRR413665_2.fastq.gz
1722869778292:  [Info] Waiting for all readers to quiesce
1722869878492:  [Done] searching is completed for the 600000 reads input from /usr/local/apps/gt-pro/1.0.1-20230507/test/SRR413665_2.fastq.gz
1722869878527:  [Stats] 102624 snps, 600000 reads, 1.28 hits/snp, for /usr/local/apps/gt-pro/1.0.1-20230507/test/SRR413665_2.fastq.gz
1722869878542:  0.6 million reads were scanned after 101 seconds
1722869878542:  Successfully processed 1 input files containing 600000 reads.
1722869878639:  Totally done: 101 seconds elapsed processing reads, after DB was loaded.
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
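
The session above passes the database to GT_Pro by base name, so the symlinks created with `ln -s` need to be present in the working directory. A minimal sketch of a pre-flight check follows. The function name `check_gtpro_db` is hypothetical; the two file-name suffixes are taken from the `ls -l` listing in the sample session.

```shell
#!/bin/bash
# Return 0 if both database files for the given prefix exist in the directory.
# Usage: check_gtpro_db PREFIX [DIR]   (DIR defaults to the current directory)
check_gtpro_db() {
    local prefix=$1 dir=${2:-.}
    local f missing=0
    for f in "${prefix}_optimized_db_kmer_index.bin" "${prefix}_optimized_db_snps.bin"; do
        # -e follows symlinks, so a dangling link is reported as missing
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

After `module load gt-pro`, running `check_gtpro_db "$(basename $GTPRO_DB)"` before `GT_Pro optimize` or `GT_Pro genotype` reports any link that is absent or broken.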

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. gt-pro.sh). For example:

#!/bin/bash
set -e
module load gt-pro

ln -sf $GTPRO_DB* .
GT_Pro optimize --db $(basename $GTPRO_DB) --in $GTPRO_HOME/test/SRR413665_2.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) -C $PWD $GTPRO_HOME/test/SRR413665_2.fastq.gz

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] gt-pro.sh

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. gt-pro.swarm). For example:

GT_Pro genotype -d $(basename $GTPRO_DB) sample1.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) sample2.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) sample3.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) sample4.fastq.gz
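
For more than a handful of samples, a swarmfile like the one above can be generated rather than typed. The sketch below prints one command per `*.fastq.gz` file in a directory. The function name `make_gtpro_swarm` is hypothetical, and the `.fastq.gz` suffix is an assumption based on the sample names above.

```shell
#!/bin/bash
# Print one GT_Pro genotype command per FASTQ file found in a directory.
# $(basename $GTPRO_DB) is kept literal (single-quoted format string) so that
# it is expanded inside each subjob, after the gt-pro module has set GTPRO_DB.
make_gtpro_swarm() {
    local dir=${1:-.} f
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        printf 'GT_Pro genotype -d $(basename $GTPRO_DB) %s\n' "$f"
    done
}
```

For example, `make_gtpro_swarm /path/to/fastq > gt-pro.swarm` writes the swarmfile. As in the sample session, this presumes the database files are linked into the directory where the subjobs run.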

Submit this job using the swarm command.

swarm -f gt-pro.swarm [-g #] [-t #] --module gt-pro

where

-g #              Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
--module gt-pro   Loads the GT-Pro module for each subjob in the swarm