Biowulf High Performance Computing at the NIH
TagIt on Biowulf

Tag(ging) It(erative) of SNVs in multiple populations.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows each shell prompt):

[user@biowulf]$ sinteractive --mem=4g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load tagit

[user@cn3144 ~]$ cp -r $TAGIT_TEST_DATA/ .

[user@cn3144 ~]$ cd TEST_DATA

[user@cn3144 ~]$ tagit --help
Tag(ging)It(erative) v 1.0.8


Usage: tagit --af  ...  --ld  ...  --r2 

Mandatory arguments
 --af     List of tab delimited files (*.gz is supported) with markers and allele frequencies. The mandatory columns are: CHROM, POS, N_ALLELES, N_CHR, {ALLELE:FREQ}. Typically, such files are generated with the VCFtools software using '--freq' option.

 --ld     List of tab delimited LD files (*.gz is supported) with the pairwise r2 correlation coefficients. The mandatory columns are: MARKER1, MARKER2, AF1, AF2, R2, R. The header starts with the '#' symbol.

 --r2     An LD threshold. Threshold values are from the (0, 1] interval. Marker M1 tags marker M2 if r^2 between them is greater than or equal to the specified threshold.

Output files
 --out-summary    Tab delimited output file (gzip compressed) of all markers before the tagging process. Output format is identical to --out-tags, but stores the information for all markers.

 --out-tags   Tab delimited output file (gzip compressed) of tag markers with columns: MARKER, WEIGHT.ALL, WEIGHT.UNIQUE, [1/].WEIGHT, [2/].WEIGHT, ..., [N/].WEIGHT. The MARKER column stores the marker name. The WEIGHT.ALL column stores the tag weight which is the sum of weights of all tagged markers across populations. The WEIGHT.UNIQUE column stores the tag weight which is the sum of weights of unique tagged markers across populations. The order of the [i/].WEIGHT columns follows the order of specified files in --af, --ld and --label(if specified) commands. Every [i/].WEIGHT column stores the tag weight (analogous to WEIGHT.ALL) only for population i.

 --out-tagged   Tab delimited output file (gzip compressed) of tagged markers with columns: MARKER, 1/, 2/, ..., N/. The MARKER column stores the marker name. The order of the i/ columns follows the order of specified files in --af, --ld and --label(if specified) commands. Every i/ column stores 0 or 1, where 1 means that the marker was tagged in population i.

Filtering markers
 --fix      Optional file (*.gz is supported) with fixed tag markers. These markers are set as tags before any other. One marker per line. Marker identifiers should be in the '[chr]:[position]' format. No header.

 --exclude    Optional file (*.gz is supported) with excluded markers. These markers are set as non-tag and can't be tagged by other markers. One marker per line. Marker identifiers should be in the '[chr]:[position]' format. No header.

 --hide     Optional file (*.gz is supported) with markers to hide. These markers are set as non-tag, but can be tagged by other markers. One marker per line. Marker identifiers should be in the '[chr]:[position]' format. No header.

Filtering markers by frequency
 --exclude-maf    A minor allele frequency (MAF) threshold. Threshold values are from the [0.0, 0.5) interval, default threshold is >= 0.0. If a marker doesn't satisfy the MAF threshold, then it will not tag any marker in a population and will not be tagged by other markers in that population. This is the default option with the default MAF threshold >= 0.0.

 --hide-maf   A minor allele frequency (MAF) threshold. Threshold values are from the [0.0, 0.5) interval. If a marker doesn't satisfy the MAF threshold, then it will not tag any marker in a population but can be tagged by other markers in that population.

Weighting
 --weight   Optional list of weights for every specific LD file. Weights must be greater than 0. Default weights are all set to 1.

 --marker-weight  Optional list of files (*.gz is supported) with SNP weights. Weights must be greater than 0. For those SNPs that are not in the specified files, the weight is set to 1.

 --unique   When computing tag weight, consider only unique markers tagged across all populations. By default all markers are considered.

Other
 --label    Optional list of short labels for every specified LD file. The short labels will be used in the output.

 --help     Prints this message.

[user@cn3144 ~]$ tagit --af Data/AFR.chr20.phase1_release_v3.20101123.freq Data/EUR.chr20.phase1_release_v3.20101123.freq --ld Data/AFR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz Data/EUR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz --r2 0.3 --out-summary summary.txt.gz --out-tagged tagged.txt.gz --out-tags tags.txt.gz
Loading markers... 
 Data/AFR.chr20.phase1_release_v3.20101123.freq: 574008
 Data/EUR.chr20.phase1_release_v3.20101123.freq: 377174
Done (2.53 sec)
Synchronizing markers... 
 Data/AFR.chr20.phase1_release_v3.20101123.freq: 573551
 Data/EUR.chr20.phase1_release_v3.20101123.freq: 376768
Done (0.07 sec)
------------------------------------------------------------------------------
Loading allele frequencies... 
 Data/AFR.chr20.phase1_release_v3.20101123.freq (MAF >= 0): 573551
 Data/EUR.chr20.phase1_release_v3.20101123.freq (MAF >= 0): 376768
Done (2.24 sec)
[...]
Writing results... 
 Tags file: tags.txt.gz
 Tagged file: tagged.txt.gz
Done (1.34 sec)
==============================================================================
FINAL DATASET HAS 89693 TAG(S).
==============================================================================

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
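
Once the run has finished, the tag list can be inspected from the shell. The sketch below defines a hypothetical helper, top_tags (not part of TagIt), that prints the tag markers with the largest WEIGHT.ALL. It assumes the --out-tags file starts with a single header line and that WEIGHT.ALL is the second column, as the column order in the help text suggests; check the header of your own file before relying on it.

```bash
# top_tags FILE [N]: print the N tag markers with the largest WEIGHT.ALL.
# Assumption: FILE is the gzip-compressed, tab-delimited --out-tags output
# with one header line and WEIGHT.ALL in column 2.
top_tags() (
    tab=$(printf '\t')
    gzip -dc "$1" |        # decompress to stdout
        tail -n +2 |       # skip the assumed header line
        sort -t "$tab" -k2,2gr |   # sort by column 2, numeric, descending
        head -n "${2:-10}" # keep the first N rows (default 10)
)
```

For example, `top_tags tags.txt.gz 20` would list the 20 highest-weight tags from the session above.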

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. tagit.sh). For example:

#!/bin/bash
set -e
module load tagit
tagit --af Data/AFR.chr20.phase1_release_v3.20101123.freq Data/EUR.chr20.phase1_release_v3.20101123.freq --ld Data/AFR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz Data/EUR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz --r2 0.3 --out-summary summary.txt.gz --out-tagged tagged.txt.gz --out-tags tags.txt.gz
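
Because the script uses set -e and relative Data/ paths, it can help to stop before tagit starts if an input file is missing. A minimal sketch of such a check follows; require_files is a hypothetical helper, not part of TagIt or Slurm.

```bash
# require_files FILE...: succeed only if every listed file exists and is non-empty.
# On the first missing or empty file, report it on stderr and return 1.
require_files() {
    for _f in "$@"; do
        if [ ! -s "$_f" ]; then
            echo "missing or empty input: $_f" >&2
            return 1
        fi
    done
    return 0
}
```

Calling require_files with the --af and --ld files on a line before the tagit command would make the job exit early under set -e. The relative paths resolve only if the job is submitted from the directory that contains Data/.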

Submit this job using the Slurm sbatch command.

sbatch --mem=4g tagit.sh

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. tagit.swarm). For example:

tagit --af sample.freq --ld pairLD.txt.gz [...] --out-summary out1.txt.gz 
tagit --af sample.freq --ld pairLD.txt.gz [...] --out-summary out2.txt.gz 
tagit --af sample.freq --ld pairLD.txt.gz [...] --out-summary out3.txt.gz 

Submit this job using the swarm command.

swarm -f tagit.swarm -g 4 --module tagit

where
-g #              Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
--module tagit    Loads the tagit module for each subjob in the swarm
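
A swarmfile with many similar lines can be generated rather than typed. The sketch below writes one tagit command per r2 threshold. write_swarm is a hypothetical helper, and sample.freq, pairLD.txt.gz, and the output file names are placeholders; only options listed in the tagit help text are used.

```bash
# write_swarm OUTFILE THRESHOLD...: write one tagit command line per r2 threshold.
# Input and output file names are placeholders; adjust them to your data.
write_swarm() {
    _out=$1
    shift
    : > "$_out"    # create or truncate the swarmfile
    for _r2 in "$@"; do
        printf 'tagit --af sample.freq --ld pairLD.txt.gz --r2 %s --out-summary summary_r2_%s.txt.gz --out-tagged tagged_r2_%s.txt.gz --out-tags tags_r2_%s.txt.gz\n' \
            "$_r2" "$_r2" "$_r2" "$_r2" >> "$_out"
    done
}

# Three thresholds produce a three-line swarmfile.
write_swarm tagit.swarm 0.3 0.5 0.8
```

The resulting tagit.swarm can then be submitted with the swarm command shown above.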