GangSTR on Biowulf
GangSTR is a tool for genome-wide profiling of tandem repeats (TRs) from short reads. It can handle repeats that are longer than the read length.
References:
- Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019 Sep 5;47(15):e90. doi: 10.1093/nar/gkz501.
Documentation
- GangSTR GitHub repository
Important Notes
- Module Name: GangSTR (see the modules page for more information)
- After loading the module, GangSTR is run from the command line as:
GangSTR
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session:
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load GangSTR
[user@cn3144 ~]$ mkdir /data/$USER/GangSTR_test/
[user@cn3144 ~]$ cd /data/$USER/GangSTR_test/
[user@cn3144 ~]$ cp $GANGSTR_TEST_DATA/* .
[user@cn3144 ~]$ GangSTR --help
Usage: GangSTR [OPTIONS] --bam --ref --regions --out

Required options:
  --bam                 Comma separated list of input BAM files
  --ref                 FASTA file for the reference genome
  --regions             BED file containing TR coordinates
  --out                 Prefix to name output files

Additional general options:
  --targeted            Targeted mode
  --chrom               Only genotype regions on this chromosome
  --bam-samps           Comma separated list of sample IDs for --bam
  --samp-sex            Comma separated list of sample sex for each sample ID (--bam-samps must be provided)
  --str-info            Tab file with additional per-STR info (see docs)
  --period              Only genotype loci with periods (motif lengths) in this comma-separated list.
  --skip-qscore         Skip calculation of Q-score

Options for different sequencing settings
  --readlength          Read length. Default: -1
  --coverage            Average coverage. must be set for exome/targeted data. Comma separated list to specify for each BAM
  --model-gc-coverage   Model coverage as a function of GC content. Requires genome-wide data
  --insertmean          Fragment length mean. Comma separated list to specify for each BAM separately.
  --insertsdev          Fragment length standard deviation. Comma separated list to specify for each BAM separately.
  --nonuniform          Indicate whether data has non-uniform coverage (i.e., exome)
  --min-sample-reads    Minimum number of reads per sample.

Advanced paramters for likelihood model:
  --frrweight           Weight for FRR reads. Default: 1
  --enclweight          Weight for enclosing reads. Default: 1
  --spanweight          Weight for spanning reads. Default: 1
  --flankweight         Weight for flanking reads. Default: 1
  --ploidy              Indicate whether data is haploid (1) or diploid (2). Default: -1
  --skipofftarget       Skip off target regions included in the BED file.
  --read-prob-mode      Use only read probability (ignore class probability)
  --numbstrap           Number of bootstrap samples. Default: 100
  --grid-threshold      Use optimization rather than grid search to find MLE if more than this many possible alleles. Default: 10000
  --rescue-count        Number of regions that GangSTR attempts to rescue mates from (excluding off-target regions) Default: 0
  --max-proc-read       Maximum number of processed reads per sample before a region is skipped. Default: 3000

Parameters for local realignment:
  --minscore            Minimum alignment score (out of 100). Default: 75
  --minmatch            Minimum number of matching basepairs on each end of enclosing reads. Default: 5

Default stutter model parameters:
  --stutterup           Stutter insertion probability. Default: 0.05
  --stutterdown         Stutter deletion probability. Default: 0.05
  --stutterprob         Stutter step size parameter. Default: 0.9

Parameters for more detailed info about each locus:
  --output-bootstraps   Output file with bootstrap samples
  --output-readinfo     Output read class info (for debugging)
  --include-ggl         Output GGL (special GL field) in VCF

Additional optional paramters:
  -h,--help             display this help screen
  --seed                Random number generator initial seed
  -v,--verbose          Print out useful progress messages
  --very                Print out more detailed progress messages for debugging
  --quiet               Don't print anything
  --version             Print out the version of this software.

This program takes in aligned reads in BAM format
and outputs estimated genotypes at each TR in VCF format.

[user@cn3144 ~]$ GangSTR --bam nc10_25.sorted.bam --ref \
    /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa --regions HTT.bed --out GangSTR_out
[GangSTR-2.5.0] ProgressMeter: Loading read group id CompMultiLoc_1_cov50_readLen_100 for sample 10_25
[GangSTR-2.5.0] ProgressMeter: Processing chr4:3074877
[GangSTR-2.5.0] ProgressMeter: Genotyper Results: 10, 25 likelihood = 1275.52
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. GangSTR.sh). For example:
#!/bin/bash
set -e
module load GangSTR
GangSTR \
    --bam nc10_25.sorted.bam \
    --ref /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    --regions HTT.bed \
    --out GangSTR_out
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=2g GangSTR.sh
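The help output lists --bam as a comma-separated list of input BAM files, so a batch script for several samples needs to join the file names with commas. The following is a minimal sketch of one way to do that, not a prescribed method: join_bams is a hypothetical helper, and sample1.bam, sample2.bam, genome.fa and multi_sample_out are placeholder names. The command is printed rather than executed so it can be checked before being placed in a batch script.

```shell
#!/bin/bash
# Hypothetical helper (not part of GangSTR): join all arguments with commas,
# producing the comma-separated form that --bam expects.
join_bams() {
    local IFS=,
    echo "$*"
}

# Print the command instead of running it, so it can be inspected first.
# All file names below are placeholders.
echo GangSTR \
    --bam "$(join_bams sample1.bam sample2.bam)" \
    --ref genome.fa \
    --regions HTT.bed \
    --out multi_sample_out
```

Once the printed command looks right, removing the leading echo inside a batch script such as GangSTR.sh would run it.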
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. GangSTR.swarm). For example:
cd dir1;GangSTR --bam test1.bam --ref genome.fa --regions test1.bed --out test1
cd dir1;GangSTR --bam test2.bam --ref genome.fa --regions test2.bed --out test2
Submit this job using the swarm command.
swarm -f GangSTR.swarm [-t #] [-g #] --module GangSTR
where
  -g #              Number of gigabytes of memory required for each process (1 line in the swarm command file)
  -t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
  --module GangSTR  Loads the GangSTR module for each subjob in the swarm
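For more than a handful of samples, the swarmfile can be generated rather than written by hand. The sketch below is one possible approach under stated assumptions: make_swarm_lines is a hypothetical helper, and it assumes each sample name.bam has a matching name.bed regions file, following the test1/test2 naming in the swarmfile example. It omits the cd step, so paths are taken relative to the directory from which the swarm is submitted.

```shell
#!/bin/bash
# Hypothetical helper (not part of GangSTR or swarm): print one swarm line
# per BAM file. Assumes <name>.bam is paired with <name>.bed.
make_swarm_lines() {
    local ref=$1
    shift
    local bam sample
    for bam in "$@"; do
        # Strip the directory and the .bam suffix to get the sample name.
        sample=$(basename "$bam" .bam)
        echo "GangSTR --bam $bam --ref $ref --regions $sample.bed --out $sample"
    done
}

# Example: print swarm lines for two placeholder BAM files.
make_swarm_lines genome.fa test1.bam test2.bam
```

Redirecting the output to GangSTR.swarm gives a file that can be passed to swarm -f as shown above.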