Biowulf High Performance Computing at the NIH
SEQMIX on Biowulf

SEQMIX is a program that takes advantage of off-targeted sequence reads from exome/targeted sequencing experiments for accurate local ancestry inference.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf ~]$ sinteractive --gres lscratch:1
salloc.exe: Pending job allocation 53362357
salloc.exe: job 53362357 queued and waiting for resources
salloc.exe: job 53362357 has been allocated resources
salloc.exe: Granted job allocation 53362357
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3126 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[user@cn3126 ~]$ module load seqmix
[+] Loading seqmix, version 0.1...
[user@cn3126 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3126 53362357]$ mkdir seqmix-example/
[user@cn3126 53362357]$ cp $SEQMIX_HOME/examples/* seqmix-example/
[user@cn3126 53362357]$ cd !$
cd seqmix-example/
[user@cn3126 seqmix-example]$ cat example.sh # look at what we're about to run
## step 1, LD pruning
echo "LD pruning"
ldPruneVCF.release.pl LD_plink_w1M_r2_snp1000_Rsq_0.01_Eur_20.ld.merged.txt.Rsq0.1.txt 0.1 example.vcf 9 example_pruned PL GD

## step 2, running SEQMIX
echo "running SEQMIX"
localAncestry_release --fileToDecompose example_pruned.out.sub9.R0.1.pruned.vcf --freqCEUfile EUR.frq.example.txt --freqYRIfile AFR.frq.example.txt --geneticDistfile geneticDistance.txt --outputAncestryF resultLocalAncestry --GLLkey PL --T 4 --SubjCol 9
[user@cn3126 seqmix-example]$ ./example.sh
LD pruning
writing to file example_pruned.out.sub9.R0.1.pruned.vcf
Openning vcf file example.vcf ...
the genotype likelihood in GT:GD:GQ:PL is 3 (PL) 
the coverage in GT:GD:GQ:PL is 1 (GD) 
Reading in the LD information from the file LD_plink_w1M_r2_snp1000_Rsq_0.01_Eur_20.ld.merged.txt.Rsq0.1.txt ... 
There are 965 keys in the Rpairs hash
Openning vcf file example.vcf ...
There are 986 in the cvg hash
There are 230 in the out hash
running SEQMIX
SEQMIX 0.1 -- local ancestry inference for SEQuenced adMIXed individuals
(c) 2013 Youna Hu, Goncalo Abecasis

The following parameters are in effect:

Local ancestry decomposition:
          I/O files : --fileToDecompose [example_pruned.out.sub9.R0.1.pruned.vcf],
                      --freqCEUfile [EUR.frq.example.txt],
                      --freqYRIfile [AFR.frq.example.txt],
                      --geneticDistfile [geneticDistance.txt],
                      --outputAncestryF [resultLocalAncestry]
   Input Parameters : --GLLkey [PL], --T(Admixed Time) [4], --SubjCol [9]

The FORMAT entry is GT:GD:PL (size 3).
Subject to decompose ( ID: NA19700 ).
The size of variantID: 986
The size of refA: 986
The size of altA: 986
reading file EUR.frq.example.txt
The number of refefence SNPs is 9999.
reading file AFR.frq.example.txt
The number of refefence SNPs is 9999.
reading file geneticDistance.txt
update geno - size of ppln1 9999
update geno - size of ppln2 9999
update geno --size to decomposed 986
update gDsit -- size of gDist 51753
size of overlapGLs: 986
size of overlapAFs: 986
size of overlapVariantID: 986
size of overlapGDist: 986
The size of emission probabilities: 986 4
The size of transition probabilities: 986

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
size of alpha = 986 4
size of beta = 986 4
size of condP = 986
Writing result to file resultLocalAncestryNA19700.txt
Done!
[user@cn3126 seqmix-example]$ exit
salloc.exe: Relinquishing job allocation 53362357
salloc.exe: Job allocation 53362357 has been revoked.
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. seqmix.sh). For example:

#!/bin/sh
set -e
module load seqmix

cp $SEQMIX_HOME/examples/* .

# from example.sh
## step 1, LD pruning
ldPruneVCF.release.pl \
 LD_plink_w1M_r2_snp1000_Rsq_0.01_Eur_20.ld.merged.txt.Rsq0.1.txt \
 0.1 example.vcf 9 example_pruned PL GD

## step 2, running SEQMIX
localAncestry_release \
 --fileToDecompose example_pruned.out.sub9.R0.1.pruned.vcf \
 --freqCEUfile EUR.frq.example.txt \
 --freqYRIfile AFR.frq.example.txt \
 --geneticDistfile geneticDistance.txt \
 --outputAncestryF resultLocalAncestry \
 --GLLkey PL \
 --T 4 \
 --SubjCol 9

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] seqmix.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. seqmix.swarm). For example:

ldPruneVCF.release.pl LD-paired-Rsq.txt 0.1 sample1.vcf 9 sample1-pruned PL GD \
&& localAncestry_release \
    --fileToDecompose sample1-pruned.out.sub9.R0.1.pruned.vcf \
    --freqCEUfile EUR.frq.txt \
    --freqYRIfile AFR.frq.txt \
    --geneticDistfile geneticDistance.txt \
    --outputAncestryF sample1-local-ancestry \
    --GLLkey PL \
    --T 4 \
    --SubjCol 9
ldPruneVCF.release.pl LD-paired-Rsq.txt 0.1 sample2.vcf 9 sample2-pruned PL GD \
&& localAncestry_release \
    --fileToDecompose sample2-pruned.out.sub9.R0.1.pruned.vcf \
    --freqCEUfile EUR.frq.txt \
    --freqYRIfile AFR.frq.txt \
    --geneticDistfile geneticDistance.txt \
    --outputAncestryF sample2-local-ancestry \
    --GLLkey PL \
    --T 4 \
    --SubjCol 9
ldPruneVCF.release.pl LD-paired-Rsq.txt 0.1 sample3.vcf 9 sample3-pruned PL GD \
&& localAncestry_release \
    --fileToDecompose sample3-pruned.out.sub9.R0.1.pruned.vcf \
    --freqCEUfile EUR.frq.txt \
    --freqYRIfile AFR.frq.txt \
    --geneticDistfile geneticDistance.txt \
    --outputAncestryF sample3-local-ancestry \
    --GLLkey PL \
    --T 4 \
    --SubjCol 9
ldPruneVCF.release.pl LD-paired-Rsq.txt 0.1 sample4.vcf 9 sample4-pruned PL GD \
&& localAncestry_release \
    --fileToDecompose sample4-pruned.out.sub9.R0.1.pruned.vcf \
    --freqCEUfile EUR.frq.txt \
    --freqYRIfile AFR.frq.txt \
    --geneticDistfile geneticDistance.txt \
    --outputAncestryF sample4-local-ancestry \
    --GLLkey PL \
    --T 4 \
    --SubjCol 9

Submit this job using the swarm command.

swarm -f seqmix.swarm [-g #] [-t #] --module seqmix
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module seqmix Loads the seqmix module for each subjob in the swarm