GT-Pro utilizes an exact matching algorithm to perform ultra-rapid and accurate genotyping of known SNPs from metagenomes.
To genotype a microbiome, GT-Pro takes as input one or more shotgun metagenomic sequencing libraries in FASTQ format. It returns counts of reads exactly matching each allele of each SNP in a concise table, with one row per bi-allelic SNP site and exactly 8 fields: species, global position, contig, contig position, allele 1, allele 2, coverage of allele 1, and coverage of allele 2. The k-mer exact-match genotyping algorithm is optimized for machine specifications and can run on a personal computer.
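As a rough illustration of how the eight-field table described above might be used downstream, the sketch below computes the allele 1 frequency at each site with awk. It assumes the output is available as a tab-separated text file with no header line and with the eight columns in the order listed; that layout, the function name `allele_freq`, and the demo rows are assumptions made for illustration, not documented GT-Pro behavior.

```shell
#!/bin/bash
# Hypothetical helper: print species, global position, total coverage and
# allele 1 frequency for every SNP row in a GT-Pro-style table.
# Assumption: tab-separated, no header, columns in the order
# species, global_pos, contig, contig_pos, allele1, allele2, cov1, cov2.
allele_freq() {
    awk -F'\t' '
        NF == 8 {
            total = $7 + $8
            # Skip sites with no coverage to avoid dividing by zero.
            if (total > 0)
                printf "%s\t%s\t%d\t%.3f\n", $1, $2, total, $7 / total
        }' "$1"
}

# Made-up demo rows (not real GT-Pro output).
demo=$(mktemp)
printf '100001\t12345\tcontig_1\t345\tA\tG\t3\t1\n' >  "$demo"
printf '100001\t67890\tcontig_2\t90\tC\tT\t0\t5\n'  >> "$demo"

result=$(allele_freq "$demo")
echo "$result"
rm -f "$demo"
```

If the real output carries a header line or is compressed, the rows would need to be decompressed and the header skipped before this kind of filter applies.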
This application requires a graphical connection using NX
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive -c16 --mem 48g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load gt-pro
[+] Loading gt-pro-db, version 20190723_881species...
[+] Loading gt-pro, version 1.0.1-20230507...
[user@cn3144 ~]$ ln -s $GTPRO_DB* .
[user@cn3144 ~]$ ls -l
total 0
lrwxrwxrwx 1 user user 59 Aug 5 10:03 20190723_881species_optimized_db_kmer_index.bin -> /fdb/gt-pro/20190723_881species_optimized_db_kmer_index.bin
lrwxrwxrwx 1 user user 53 Aug 5 10:03 20190723_881species_optimized_db_snps.bin -> /fdb/gt-pro/20190723_881species_optimized_db_snps.bin
[user@cn3144 ~]$ GT_Pro optimize --db $(basename $GTPRO_DB) --in $GTPRO_HOME/test/SRR413665_2.fastq.gz
[OK] start initial optimization
[OK] database found
[OK] optimize from 20190723_881species_optimized_db_kmer_index.bin and 20190723_881species_optimized_db_snps.bin
[OK] initial optimization done
[OK] finalize optimization with a break-in test
[OK] optimization done
[user@cn3144 ~]$ GT_Pro genotype -d $(basename $GTPRO_DB) -C $PWD $GTPRO_HOME/test/SRR413665_2.fastq.gz
gt_pro 20190723_881species 72 no_overwrite
1722869727070: [Info] Starting to load DB: 20190723_881species
1722869727081: [Info] MMAPPING ./20190723_881species_optimized_db_snps.bin
1722869728016: [Info] MMAPPING ./20190723_881species_optimized_db_kmer_index.bin
1722869745073: [Info] Using -l 32 -m 36 as optimal for system RAM
1722869745073: [Info] MMAPPING ./20190723_881species_optimized_db_mmer_bloom_36.bin
1722869746084: [Info] MMAPPING ./20190723_881species_optimized_db_lmer_index_32.bin
1722869776733: [Info] Done with init for optimized DB with 2856121626 kmers. That took 49 seconds.
1722869776740: [WARNING] Ignoring specified -C prefix for non-relative input path: /usr/local/apps/gt-pro/1.0.1-20230507/test/SRR413665_2.fastq.gz
1722869778292: [Info] Waiting for all readers to quiesce
1722869878492: [Done] searching is completed for the 600000 reads input from /usr/local/apps/gt-pro/1.0.1-20230507/test/SRR413665_2.fastq.gz
1722869878527: [Stats] 102624 snps, 600000 reads, 1.28 hits/snp, for /usr/local/apps/gt-pro/1.0.1-20230507/test/SRR413665_2.fastq.gz
1722869878542: 0.6 million reads were scanned after 101 seconds
1722869878542: Successfully processed 1 input files containing 600000 reads.
1722869878639: Totally done: 101 seconds elapsed processing reads, after DB was loaded.
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. gt-pro.sh). For example:
#!/bin/bash
set -e
module load gt-pro
ln -sf $GTPRO_DB* .
GT_Pro optimize --db $(basename $GTPRO_DB) --in $GTPRO_HOME/test/SRR413665_2.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) -C $PWD $GTPRO_HOME/test/SRR413665_2.fastq.gz
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] gt-pro.sh
Create a swarmfile (e.g. gt-pro.swarm). For example:
GT_Pro genotype -d $(basename $GTPRO_DB) sample1.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) sample2.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) sample3.fastq.gz
GT_Pro genotype -d $(basename $GTPRO_DB) sample4.fastq.gz
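With many samples, a swarmfile like the one above could be generated rather than typed by hand. The following is a minimal sketch: the function name `make_swarmfile` and the temporary demo directory are hypothetical, and it assumes input files end in `.fastq.gz` and have no whitespace in their paths.

```shell
#!/bin/bash
# Hypothetical helper: write one GT_Pro genotype line per FASTQ file.
# $1 = directory containing *.fastq.gz files, $2 = swarmfile to create.
make_swarmfile() {
    local dir=$1 out=$2 fq
    : > "$out"                      # start with an empty swarmfile
    for fq in "$dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue    # skip the unexpanded glob when nothing matches
        # Single quotes keep $(basename $GTPRO_DB) literal, so it is expanded
        # only when the swarm subjob runs with the gt-pro module loaded.
        printf 'GT_Pro genotype -d $(basename $GTPRO_DB) %s\n' "$fq" >> "$out"
    done
}

# Demo with empty placeholder files in a temporary directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
touch "$workdir/sample1.fastq.gz" "$workdir/sample2.fastq.gz"
make_swarmfile "$workdir" "$workdir/gt-pro.swarm"
cat "$workdir/gt-pro.swarm"
```

In real use the first argument would be the directory holding the sequencing libraries and the second would be `gt-pro.swarm`.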
Submit this job using the swarm command.
swarm -f gt-pro.swarm [-g #] [-t #] --module gt-pro
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module gt-pro | Loads the GT-Pro module for each subjob in the swarm |