Biowulf High Performance Computing at the NIH
genomestrip on Biowulf

From the Genome STRiP page:

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovering and genotyping structural variations using sequencing data. The methods are designed to detect shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation. The methods are adaptive and support heterogeneous data sets, including variations in sequencing depth, read lengths and mixtures of paired and single-end reads. A minimum of 20 to 30 genomes are required to get acceptable results, but the method gains power across genomes and processing more genomes provide better results. To run discovery or genotyping on a single sequenced genome or a small set of genomes, you need to call your data against a background population, such as a set of genomes from the 1000 Genomes Project. The background population does not need to be matched to the target individuals. Genome STRiP can be used for discovery of novel structural variations or to genotype known variants in new samples.

References:

  • R. E. Handsaker, V. Van Doren, J. R. Berman, G. Genovese, S. Kashin, L. M. Boettger LM and S. A. McCarroll. Large multiallelic copy number variations in humans Nature Genetics 47, 296-303 (2015). PubMed |  PMC |  Journal
  • R. E. Handsaker, J. M. Korn, J. Nemesh and S. A. McCarroll. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics 43, 269-276 (2011). PubMed |  PMC |  Journal
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=20g --gres=lscratch:20 -c4
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load genomestrip
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ cp -r $GENOMESTRIP_TEST_DATA test
[user@cn3144]$ cd test
[user@cn3144]$ mkdir -p test1/logs test1/metadata
[user@cn3144]$ java -cp $GENOMESTRIP_CP -Xmx4g -jar $SV_DIR/lib/SVToolkit.jar
SVToolkit version 2.00 (build 1833)
Build date: 2018/04/10 11:05:26
Web site: http://www.broadinstitute.org/software/genomestrip

Preprocessing data. Using a config file adapted for analyzing a 200kb chunk of aligned data rather than a whole genome

[user@cn3144]$ java -cp ${GENOMESTRIP_CP} -Xmx8g \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVPreprocess.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${GENOMESTRIP_CP} \
    -configFile conf/genstrip_installtest_parameters.txt \
    -tempDir ./tmp \
    -R data/human_b36_chr1.fasta \
    -genomeMaskFile data/human_b36_chr1.svmask.fasta \
    -copyNumberMaskFile data/human_b36_chr1.gcmask.fasta \
    -genderMapFile data/installtest_gender.map \
    -runDirectory test1 \
    -md test1/metadata \
    -disableGATKTraversal \
    -useMultiStep \
    -L 1:61700000-61900000 \
    -reduceInsertSizeDistributions false \
    -computeGCProfiles true \
    -computeReadCounts true \
    -jobLogDir test1/logs \
    -I data/installtest.bam \
    -P chimerism.use.correction:false \
    -run

INFO  12:17:15,130 QScriptManager - Compiling 2 QScripts
....

SV discovery


[user@cn3144]$ java -cp ${GENOMESTRIP_CP} -Xmx8g \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVDiscovery.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${GENOMESTRIP_CP} \
    -configFile conf/genstrip_installtest_parameters.txt \
    -tempDir ./tmp \
    -R data/human_b36_chr1.fasta \
    -genomeMaskFile data/human_b36_chr1.svmask.fasta \
    -genderMapFile data/installtest_gender.map \
    -runDirectory test1 \
    -md test1/metadata \
    -disableGATKTraversal \
    -jobLogDir temp1/logs \
    -L 1 \
    -minimumSize 100 \
    -maximumSize 1000000 \
    -suppressVCFCommandLines \
    -I data/installtest.bam \
    -O test1.discovery.vcf \
    -run

...

Genotyping


[user@cn3144]$ java -cp ${GENOMESTRIP_CP} -Xmx8g \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVGenotyper.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${GENOMESTRIP_CP} \
    -configFile conf/genstrip_installtest_parameters.txt \
    -tempDir ./tmp \
    -R data/human_b36_chr1.fasta \
    -genomeMaskFile data/human_b36_chr1.svmask.fasta \
    -genderMapFile data/installtest_gender.map \
    -runDirectory test1 \
    -md test1/metadata \
    -disableGATKTraversal \
    -jobLogDir test1/logs \
    -I data/installtest.bam \
    -vcf test1.discovery.vcf \
    -P chimerism.use.correction:false \
    -O test1.genotypes.vcf \
    -run
...

[user@cn3144]$ ls -lh test1.*
-rw-r--r-- 1 user group 13K Aug  6 12:41 test1.discovery.vcf
-rw-r--r-- 1 user group 735 Aug  6 12:41 test1.discovery.vcf.idx
-rw-r--r-- 1 user group 27K Aug  6 12:45 test1.genotypes.vcf
-rw-r--r-- 1 user group 735 Aug  6 12:45 test1.genotypes.vcf.idx

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. genomestrip.sh) that runs a pipeline similar to the following:

#!/bin/bash

module load genomestrip/2.00.1833 || exit 1
cd /lscratch/$SLURM_JOB_ID || exit 1
SV_TMPDIR=./tmpdir
cp $GENOMESTRIP_TEST_DATA/* .

runDir=test1
inputFile=data/installtest.bam
sites=test1.discovery.vcf
genotypes=test1.genotypes.vcf

mx="-Xmx4g"

mkdir -p ${runDir}/logs || exit 1
mkdir -p ${runDir}/metadata || exit 1

java -cp ${GENOMESTRIP_CP} ${mx} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVPreprocess.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${GENOMESTRIP_CP} \
    -configFile conf/genstrip_installtest_parameters.txt \
    -tempDir ${SV_TMPDIR} \
    -R data/human_b36_chr1.fasta \
    -genomeMaskFile data/human_b36_chr1.svmask.fasta \
    -copyNumberMaskFile data/human_b36_chr1.gcmask.fasta \
    -genderMapFile data/installtest_gender.map \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -disableGATKTraversal \
    -useMultiStep \
    -L 1:61700000-61900000 \
    -reduceInsertSizeDistributions false \
    -computeGCProfiles true \
    -computeReadCounts true \
    -jobLogDir ${runDir}/logs \
    -I ${inputFile} \
    -P chimerism.use.correction:false \
    -run \
    || exit 1

# Run discovery.
java -cp ${GENOMESTRIP_CP} ${mx} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVDiscovery.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${GENOMESTRIP_CP} \
    -configFile conf/genstrip_installtest_parameters.txt \
    -tempDir ${SV_TMPDIR} \
    -R data/human_b36_chr1.fasta \
    -genomeMaskFile data/human_b36_chr1.svmask.fasta \
    -genderMapFile data/installtest_gender.map \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -disableGATKTraversal \
    -jobLogDir ${runDir}/logs \
    -L 1 \
    -minimumSize 100 \
    -maximumSize 1000000 \
    -suppressVCFCommandLines \
    -I ${inputFile} \
    -O ${sites} \
    -run \
    || exit 1

(grep -v ^##fileDate= ${sites} | grep -v ^##source= | grep -v ^##reference= | diff -q - benchmark/${sites}) \
    || { echo "Error: test results do not match benchmark data"; exit 1; }

# Run genotyping on the discovered sites.
java -cp ${GENOMESTRIP_CP} ${mx} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVGenotyper.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${GENOMESTRIP_CP} \
    -configFile conf/genstrip_installtest_parameters.txt \
    -tempDir ${SV_TMPDIR} \
    -R data/human_b36_chr1.fasta \
    -genomeMaskFile data/human_b36_chr1.svmask.fasta \
    -genderMapFile data/installtest_gender.map \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -disableGATKTraversal \
    -jobLogDir ${runDir}/logs \
    -I ${inputFile} \
    -vcf ${sites} \
    -P chimerism.use.correction:false \
    -O ${genotypes} \
    -run \
    || exit 1

(grep -v ^##fileDate= ${genotypes} | grep -v ^##source= | grep -v ^##reference= | diff -q - benchmark/${genotypes}) \
    || { echo "Error: test results do not match benchmark data"; exit 1; }

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=4 --mem=10g --gres=lscratch:20 genomestrip.sh