High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
smc++ on Biowulf

SMC++ is used to esimate the history of effective population size for populations. The general workflow is

  1. Convert VCF to the input format for SMC++ with smc++ vcf2smc
  2. Fit the model with smc++ estimate
  3. Visualize the results with smc++ plot

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load smc
[+] Loading smc++ 1.9.4

[user@cn3144 ~]$ smc++ -h
usage: smc++ [-h] {cite,estimate,plot,posterior,split,vcf2smc,version} ...

positional arguments:
  {cite,estimate,plot,posterior,split,vcf2smc,version}
    cite                Print citation information for SMC++
    estimate            Estimate size history for one population
    plot                Plot size history from fitted model
    posterior           Store/visualize posterior decoding of TMRCA
    split               Estimate split time in two population model
    vcf2smc             Convert VCF to SMC++ format
    version             Print version string

optional arguments:
  -h, --help            show this help message and exit

[user@cn3144 ~]$ cp $SMCPP_TEST_DATA/* .
[user@cn3144 ~]$ ls -lh

[user@cn3144 ~]$ smc++ vcf2smc example.vcf.gz smc/chr1.smc.gz 1 Pop1:msp_0,msp_1,msp_2,msp_3,msp_4
966 smcpp.commands.vcf2smc WARNING Neither missing cutoff (-c) or mask (-m) has been specified. This means that stretches of the chromosome 
that do not have any VCF entries (for example, centromeres) will be interpreted as homozygous recessive.                                    
967 smcpp.commands.vcf2smc INFO Population 1:                                                                                               
967 smcpp.commands.vcf2smc INFO Distinguished lineages: msp_0:0, msp_0:1                                                                    
967 smcpp.commands.vcf2smc INFO Undistinguished lineages: msp_1:0, msp_1:1, msp_2:0, msp_2:1, msp_3:0, msp_3:1, msp_4:0, msp_4:1            
100%|#############################################################################################9| 1000K/1.00M
1067 smcpp.util   INFO Wrote 2343 observations                                                                                              

[user@cn3144 ~]$ smc++ estimate -o analysis/ 1.25e-8 smc/*.smc.gz
[...snip...]
166388 smcpp.optimize.plugins.loglikelihood_monitor INFO Log-likelihood improvement < tol=0.0001; terminating 

[user@cn3144 ~]$ smc++ plot example.png analysis/model.final.json

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

This results in a plot similar to this:

smc++ example plot

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. smc++.sh), which uses the input file 'smc++.in'. For example:

#!/bin/bash
# this file is smcpp.sh
set -e

module load smc++ || exit 1

smc++ vcf2smc example.vcf.gz smc/chr1.smc.gz 1 Pop1:msp_0,msp_1,msp_2,msp_3,msp_4
smc++ estimate -o analysis/ 1.25e-8 smc/*.smc.gz
smc++ plot example.png analysis/model.final.json

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] smcpp.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit set of independent commands requiring identical resources.

Create a swarmfile (e.g. smc++.swarm). For example:

# this file is smcpp.swarm
smc++ vcf2smc my.vcf.gz smc/chr1.smc.gz chr1 Pop1:S1,S2,S3
smc++ vcf2smc my.vcf.gz smc/chr2.smc.gz chr2 Pop1:S1,S2,S3
smc++ vcf2smc my.vcf.gz smc/chr3.smc.gz chr3 Pop1:S1,S2,S3
smc++ vcf2smc my.vcf.gz smc/chr4.smc.gz chr4 Pop1:S1,S2,S3
smc++ vcf2smc my.vcf.gz smc/chr5.smc.gz chr5 Pop1:S1,S2,S3

Submit this job using the swarm command.

swarm -f smcpp.swarm [-g #] [-t #] --module smc++
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module smc++ Loads the smc++ module for each subjob in the swarm