High-Performance Computing at the NIH
fastStructure on Biowulf & Helix

Description

fastStructure infers population structure from large single nucleotide polymorphism (SNP) data sets. It is based on a variational Bayesian framework for posterior inference.

It contains three tools:

  • structure: performs inference for a simple, independent-loci admixture model, with two possible priors
  • chooseK: chooses the number of model components
  • distruct: visualizes admixture proportions

References

  • A. Raj, M. Stephens, and J. K. Pritchard. fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets. Genetics, 2014, 197:573-589.

Web sites

On Helix

Please use fastStructure on Helix only for small datasets.

Set up the environment for fastStructure

helix$ module load fastStructure
helix$ module help fastStructure
---------- Module Specific Help for "fastStructure/1.0" ----------
This module sets up the environment for using fastStructure.

There is a wrapper script for the three main functions of this package:

    fastStructure structure [options]
    fastStructure chooseK [options]
    fastStructure distruct [options]

Alternatively, call python scripts directly with

    structure.py [options]
    chooseK.py [options]
    distruct.py [options]

There are two ways to use fastStructure on our systems. Either call the Python scripts directly, as described in the fastStructure documentation:

helix$ structure.py -K 3 --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
  --output=testoutput_simple --full --seed=100

Or use the wrapper script for the three tools:

helix$ fastStructure structure -K 3 \
  --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
  --output=testoutput_simple --full --seed=100

Run fastStructure for a range of model components (K), then pick the best model with chooseK:

helix$ for i in {2..10}; do
  fastStructure structure -K $i \
    --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
    --output=testoutput_simple --full --seed=100
done
helix$ fastStructure chooseK --input testoutput_simple
Model complexity that maximizes marginal likelihood = 2
Model components used to explain structure in data = 4
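
The chosen K can also be picked up in a script rather than read off the screen. The following is a minimal sketch, assuming the chooseK output keeps the wording shown above; the helper name best_k is hypothetical and not part of fastStructure. It extracts the value from the "maximizes marginal likelihood" line and is demonstrated on the sample output, so it runs without fastStructure installed.

```shell
#!/bin/bash
# best_k (hypothetical helper): read chooseK output on standard input and
# print the K from the "maximizes marginal likelihood" line.
best_k() {
  awk -F' = ' '/maximizes marginal likelihood/ { print $2; exit }'
}

# Demonstration on the chooseK output shown above.
sample_output='Model complexity that maximizes marginal likelihood = 2
Model components used to explain structure in data = 4'

k=$(printf '%s\n' "$sample_output" | best_k)
echo "K chosen by marginal likelihood: $k"
```

chooseK reports two values (2 and 4 in the example above), and this sketch uses only the first. On the cluster the function could feed distruct, assuming distruct accepts -K, --input, and --output as in the upstream fastStructure documentation; the .svg file name is an assumption:

    helix$ k=$(fastStructure chooseK --input testoutput_simple | best_k)
    helix$ fastStructure distruct -K "$k" --input=testoutput_simple --output=testoutput_simple_distruct.svg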

Batch job on Biowulf

Create a batch script (faststruct_batch.sh in this example) similar to the following:

#!/bin/bash

module load fastStructure || exit 1
fastStructure structure -K 3 \
  --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
  --output=testoutput_simple --full --seed=100

Submit to the queue with

b2$ sbatch --mem=Xg faststruct_batch.sh

Choose an amount of memory X (in GB) suitable for the size of the input data.
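
A single batch job can also scan several values of K and run chooseK at the end. The sketch below writes such a script to a file and checks it for syntax errors before submission. The file name faststruct_range.sh, the K range 2..10, and the 8g memory request in the #SBATCH line are assumptions for illustration, not site recommendations.

```shell
#!/bin/bash
# Write a batch script that runs fastStructure for K = 2..10 and then chooseK.
# File name, K range, and the 8g memory request are illustrative assumptions.
cat > faststruct_range.sh <<'EOF'
#!/bin/bash
#SBATCH --mem=8g

module load fastStructure || exit 1

input=/usr/local/apps/fastStructure/TEST_DATA/testdata
output=testoutput_simple

# One structure run per K; stop the job if any run fails.
for k in {2..10}; do
  fastStructure structure -K "$k" \
    --input="$input" --output="$output" --full --seed=100 || exit 1
done

# Pick the best model from the runs above.
fastStructure chooseK --input "$output"
EOF

# Parse the generated script without executing it.
bash -n faststruct_range.sh && echo "faststruct_range.sh: syntax OK"
```

With the memory request inside the script, it would be submitted as "sbatch faststruct_range.sh"; a --mem value given on the sbatch command line takes precedence over the #SBATCH line.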

Swarm of jobs on Biowulf

Create a swarm command file (faststruct.swarm in this example) with one command per line (line continuations are allowed):

fastStructure structure -K 2 \
  --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
  --output=testoutput_simple --full --seed=100
fastStructure structure -K 3 \
  --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
  --output=testoutput_simple --full --seed=100
fastStructure structure -K 4 \
  --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
  --output=testoutput_simple --full --seed=100

Submit the commands as a swarm to the job queue

b2$ swarm -f faststruct.swarm -g X

Again, choose an appropriate amount of memory X in GB.
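
For a larger range of K, the swarm command file can be generated with a loop instead of written by hand. This is a minimal sketch that reuses the file name faststruct.swarm and the test data path from above; the range 2..10 is an assumption. Each command is written on a single line, which swarm accepts just as it does the continued lines shown earlier.

```shell
#!/bin/bash
# Generate a swarm command file with one fastStructure command per value of K.
input=/usr/local/apps/fastStructure/TEST_DATA/testdata
output=testoutput_simple
swarmfile=faststruct.swarm

: > "$swarmfile"   # start from an empty file
for k in {2..10}; do
  printf 'fastStructure structure -K %d --input=%s --output=%s --full --seed=100\n' \
    "$k" "$input" "$output" >> "$swarmfile"
done

# Report how many commands were written.
echo "commands written: $(grep -c 'fastStructure structure' "$swarmfile")"
```

All commands share one output prefix, as in the hand-written file above.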

Interactive job on Biowulf

Allocate an interactive session and then use fastStructure as described above:

b2$ sinteractive --mem=8G
salloc.exe: Pending job allocation 5653353
salloc.exe: job 5653353 queued and waiting for resources
salloc.exe: job 5653353 has been allocated resources
salloc.exe: Granted job allocation 5653353
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1699 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
cn1699$ for i in {2..10}; do
  fastStructure structure -K $i \
    --input=/usr/local/apps/fastStructure/TEST_DATA/testdata \
    --output=testoutput_simple --full --seed=100
done
cn1699$ exit
b2$

Documentation