TagIt (Tag(ging) It(erative)): iterative tagging of SNVs in multiple populations.
Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):
[user@biowulf]$ sinteractive --mem=4g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load tagit
[user@cn3144 ~]$ cp -r $TAGIT_TEST_DATA/ .
[user@cn3144 ~]$ cd TEST_DATA
[user@cn3144 ~]$ tagit --help

Tag(ging)It(erative) v 1.0.8

Usage: tagit --af... --ld ... --r2

Mandatory arguments
  --af
      List of tab delimited files (*.gz is supported) with markers and
      allele frequencies. The mandaroty columns are: CHROM, POS, N_ALLELES,
      N_CHR, {ALLELE:FREQ}. Typically, such files are generated with the
      VCFtools software using '--freq' option.
  --ld
      List of tab delimited LD files (*.gz is supported) with the pairwise
      r2 correlation coefficients. The mandatory columns are: MARKER1,
      MARKER2, AF1, AF2, R2, R. The header starts with the '#' symbol.
  --r2
      An LD threshold. Threshold values are from the (0, 1] interval.
      Marker M1 tags marker M2 if r^2 between them is greater or equal
      than the specified threshold.

Output files
  --out-summary
      Tab delimited output file (gzip compressed) of all markers before the
      tagging process. Output format is identical to --out-tags, but stores
      the information for all markers.
  --out-tags
      Tab delimited output file (gzip compressed) of tag markers with
      columns: MARKER, WEIGHT.ALL, WEIGHT.UNIQUE, [1/ ].WEIGHT,
      [2/ ].WEIGHT, ..., [N/ ].WEIGHT. The MARKER column stores the marker
      name. The WEIGHT.ALL column stores the tag weight which is the sum of
      weights of all tagged markers across populations. The WEIGHT.UNIQUE
      column stores the tag weight which is the sum of weights of unique
      tagged markers across populations. The order of the [i/ ].WEIGHT
      columns follows the order of specified files in --af, --ld and
      --label(if specified) commands. Every [i/ ].WEIGHT column stores the
      tag weight (analogous to WEIGHT.ALL) only for population i.
  --out-tagged
      Tab delimited output file (gzip compressed) of tagged markers with
      columns: MARKER, 1/ , 2/ , ..., N/ . The MARKER column stores the
      marker name. The order of the i/ columns follows the order of
      specified files in --af, --ld and --label(if specified) commands.
      Every i/ column stores 0 or 1, where 1 means that the marker was
      tagged in population i.

Filtering markers
  --fix
      Optional file (*.gz is supported) with fixed tag markers. These
      markers are set as tags before any other. One marker per line. Marker
      identifiers should be in the '[chr]:[position]' format. No header.
  --exclude
      Optional file (*.gz is supported) with excluded markers. These
      markers are set as non-tag and can't ne tagged by other markers. One
      marker per line. Marker identifiers should be in the
      '[chr]:[position]' format. No header.
  --hide
      Optional file (*.gz is supported) with markers to hide. These markers
      are set as non-tag, but can be tagged by other markers. One marker
      per line. Marker identifiers should be in the '[chr]:[position]'
      format. No header.

Filtering markers by frequency
  --exclude-maf
      A minor allele frequency (MAF) threshold. Threshold values are from
      the [0.0, 0.5) interval, default threshold is >= 0.0. If marker
      doesn't satisfy the MAF threshold, then it will not tag any marker in
      a population and will be not tagged by other markers in that
      population. This is the default option with the default MAF threshold
      >= 0.0.
  --hide-maf
      A minor allele frequency (MAF) threshold. Threshold values are from
      the [0.0, 0.5) interval.If marker doesn't satisfy the MAF threshold,
      then it will not tag any marker in a population but can be tagged by
      other markers in that population.

Weighting
  --weight
      Optional list of weights for every specific LD file. Weights must be
      greater than 0. Default weights are all set to 1.
  --marker-weight
      Optional list of files (*.gz is supported) with SNP weights. Weights
      must be greater than 0. For those SNPs that are not in the specified
      files, the weight is set to 1.
  --unique
      When computing tag weight, consider only unique markers tagged across
      all populations. By default all markers are considered.

Other
  --label
      Optional list of short labels for every specified LD file. The short
      labels will be used in the output.
  --help
      Prints this message.

[user@cn3144 ~]$ tagit --af Data/AFR.chr20.phase1_release_v3.20101123.freq Data/EUR.chr20.phase1_release_v3.20101123.freq \
    --ld Data/AFR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz Data/EUR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz \
    --r2 0.3 \
    --out-summary summary.txt.gz \
    --out-tagged tagged.txt.gz \
    --out-tags tags.txt.gz
Loading markers...
 Data/AFR.chr20.phase1_release_v3.20101123.freq: 574008
 Data/EUR.chr20.phase1_release_v3.20101123.freq: 377174
Done (2.53 sec)
Synchronizing markers...
 Data/AFR.chr20.phase1_release_v3.20101123.freq: 573551
 Data/EUR.chr20.phase1_release_v3.20101123.freq: 376768
Done (0.07 sec)
------------------------------------------------------------------------------
Loading allele frequencies...
 Data/AFR.chr20.phase1_release_v3.20101123.freq (MAF >= 0): 573551
 Data/EUR.chr20.phase1_release_v3.20101123.freq (MAF >= 0): 376768
Done (2.24 sec)
[...]
Writing results...
 Tags file: tags.txt.gz
 Tagged file: tagged.txt.gz
Done (1.34 sec)
==============================================================================
FINAL DATASET HAS 89693 TAG(S).
==============================================================================

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
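The --fix, --exclude and --hide options each expect one marker per line in the '[chr]:[position]' format, with no header. As a minimal sketch, assuming a tab delimited frequency file whose first two columns are CHROM and POS (as listed in the --af description), such a list could be derived with awk. The file names toy.freq and exclude.txt and the two data rows are hypothetical and only serve to make the example runnable.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical toy frequency table (tab delimited) using the column names
# from the --af description; real files would typically come from VCFtools.
printf 'CHROM\tPOS\tN_ALLELES\tN_CHR\t{ALLELE:FREQ}\n'  > toy.freq
printf '20\t61098\t2\t492\tC:0.9\tT:0.1\n'             >> toy.freq
printf '20\t61795\t2\t492\tG:0.7\tT:0.3\n'             >> toy.freq

# Skip the header line and print one "chr:position" identifier per line,
# without a header, which is the layout described for --fix/--exclude/--hide.
awk -F'\t' 'NR > 1 { print $1 ":" $2 }' toy.freq > exclude.txt

cat exclude.txt
```

The resulting file could then be passed as, for example, --exclude exclude.txt. In practice the awk filter would usually be preceded by whatever selection criterion defines the markers to exclude.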
Create a batch input file (e.g. tagit.sh). For example:
#!/bin/bash
set -e
module load tagit
tagit --af Data/AFR.chr20.phase1_release_v3.20101123.freq Data/EUR.chr20.phase1_release_v3.20101123.freq \
      --ld Data/AFR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz Data/EUR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz \
      --r2 0.3 \
      --out-summary summary.txt.gz \
      --out-tagged tagged.txt.gz \
      --out-tags tags.txt.gz
Submit this job using the Slurm sbatch command.
sbatch --mem=4g tagit.sh
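Because --af, --ld and --label each take a list of files or labels in population order, a batch script could build all three lists from one set of population labels. The following is a minimal sketch under that assumption: it reuses the file naming pattern of the test data, while the POPS and PREFIX variables, the tagit_cmd.txt file and the dry-run echo are illustrative choices rather than part of tagit.

```shell
#!/bin/bash
set -euo pipefail

# Population labels; file names follow the pattern of the test data.
POPS=(AFR EUR)
PREFIX="chr20.phase1_release_v3.20101123"

af=()
ld=()
for pop in "${POPS[@]}"; do
    af+=("Data/${pop}.${PREFIX}.freq")
    ld+=("Data/${pop}.${PREFIX}.r2_0.7.pairLD.txt.gz")
done

# Assemble the command, keeping the same population order in --af, --ld
# and --label.
cmd=(tagit --af "${af[@]}" --ld "${ld[@]}" --label "${POPS[@]}" --r2 0.3
     --out-summary summary.txt.gz --out-tagged tagged.txt.gz --out-tags tags.txt.gz)

# Dry run: print the command and keep a copy in tagit_cmd.txt (hypothetical
# file name). To execute instead, run "${cmd[@]}" after 'module load tagit'.
echo "${cmd[@]}" | tee tagit_cmd.txt
```

Adding a third population then only requires extending POPS, provided the corresponding frequency and LD files exist under the same naming pattern.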
Create a swarmfile (e.g. tagit.swarm). For example:
tagit --af sample.freq --ld pairLD.txt.gz [...] --out-summary out1.txt.gz
tagit --af sample.freq --ld pairLD.txt.gz [...] --out-summary out2.txt.gz
tagit --af sample.freq --ld pairLD.txt.gz [...] --out-summary out3.txt.gz
Submit this job using the swarm command.
swarm -f tagit.swarm -g 4 --module tagit
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module tagit | Loads the tagit module for each subjob in the swarm |
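A swarmfile with many similar lines can be generated by a short loop instead of being written by hand. The sketch below assumes the test data files from the interactive example and varies only the --r2 threshold. The threshold values and the output file names are hypothetical choices for illustration.

```shell
#!/bin/bash
set -euo pipefail

# Inputs follow the test data used in the interactive example.
AF="Data/AFR.chr20.phase1_release_v3.20101123.freq Data/EUR.chr20.phase1_release_v3.20101123.freq"
LD="Data/AFR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz Data/EUR.chr20.phase1_release_v3.20101123.r2_0.7.pairLD.txt.gz"

: > tagit.swarm   # start with an empty swarmfile

# Hypothetical thresholds; each iteration appends one independent tagit
# command, i.e. one line (one subjob) in the swarmfile.
for r2 in 0.7 0.8 0.9; do
    echo "tagit --af $AF --ld $LD --r2 $r2 --out-summary summary.r2_${r2}.txt.gz --out-tagged tagged.r2_${r2}.txt.gz --out-tags tags.r2_${r2}.txt.gz" >> tagit.swarm
done

cat tagit.swarm
```

The generated tagit.swarm can then be submitted with the swarm command shown above. Giving every line its own output names keeps the subjobs from overwriting each other's results.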