lumpy on Biowulf & Helix

Description

Lumpy detects structural variants from high throughput sequencing data using read-pair, split read, and other signals in parallel.

There are two ways of using lumpy:

- lumpyexpress: a wrapper script that automates the full breakpoint-detection pipeline with sensible defaults
- lumpy: the underlying executable, which exposes all options for custom workflows

Either way, lumpy expects input alignments created with bwa mem.
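A typical alignment step might look like the following sketch (reference and read file names are placeholders; adjust the read-group string for your samples):

helix$ module load bwa samtools
helix$ bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1' ref.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
        | samtools sort -o input.bam -
helix$ samtools index input.bam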

There may be multiple versions of lumpy available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail lumpy 

To select a module use

module load lumpy/[version]

where [version] is the version of choice.
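For example, to load the version shown in the transcript below (check module avail lumpy for what is currently installed):

helix$ module load lumpy/0.2.11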

lumpy is a multithreaded application. Make sure to match the number of CPUs requested with the number of threads (-t).

Environment variables set

    $LUMPY_CONFIG - path to the lumpyexpress.config file recording dependency locations

Dependencies

Paths to dependencies are recorded in $LUMPY_CONFIG, which lumpyexpress finds automatically. The dependency modules are therefore not loaded by the lumpy module.
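To see which tools a given lumpy version uses, inspect the config file directly (a minimal sketch; $LUMPY_CONFIG is set by the module as described above):

helix$ module load lumpy
helix$ cat "$LUMPY_CONFIG"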

References

Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biology 2014, 15:R84.

Documentation

lumpy on GitHub: https://github.com/arq5x/lumpy-sv

On Helix

Lumpy is a multithreaded application. Please limit use on Helix to 2 threads, or better yet, use an interactive session on Biowulf.

Set up the environment and run a standard structural variant analysis without separately extracting split reads or discordant pairs:

helix$ module load lumpy
helix$ lumpyexpress -t2 -B /path/to/input.bam
Sourcing executables from
/usr/local/apps/lumpy/0.2.11/bin/lumpyexpress.config ...

Checking for required python modules
(/usr/local/Anaconda/envs/py2.7.9/bin/python)...
samblaster: Version 0.1.22
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Opening gcat_set_053.bam.vcf.Kzd6atukx0F1/disc_pipe for write.
samblaster: Opening gcat_set_053.bam.vcf.Kzd6atukx0F1/spl_pipe for write.
samblaster: Loaded 25 header sequence entries.
samblaster: Output 27571 discordant read pairs to gcat_set_053.bam.vcf.Kzd6atukx0F1/disc_pipe
samblaster: Output 0 split reads to gcat_set_053.bam.vcf.Kzd6atukx0F1/spl_pipe
samblaster: Marked 0 of 39432108 (0.00%) read ids as duplicates using 1188k memory in 1M36S(96.300S) CPU seconds and 19M47S(1187S) wall time.
Removed 5 outliers with isize >= 404
Running LUMPY... 
chrM    1000000
[...snip...]
helix$ lumpyexpress -h
usage:   lumpyexpress [options]

options:
     -B FILE  full BAM file(s) (comma separated) (required)
     -S FILE  split reads BAM file(s) (comma separated)
     -D FILE  discordant reads BAM file(s) (comma separated)
     -o FILE  output file [fullBam.bam.vcf]
     -x FILE  BED file to exclude
     -P       output probability curves for each variant
     -m INT   minimum sample weight for a call [4]
     -r FLOAT trim threshold [0]
     -T DIR   temp directory [./output_prefix.XXXXXXXXXXXX]
     -t N     number of threads [1]
     -k       keep temporary files

     -K FILE  path to lumpyexpress.config file
                (default: same directory as lumpyexpress)
     -v       verbose
     -h       show this message

Running lumpy directly is more flexible; a sketch of the manual workflow is shown below. See the GitHub page for more detailed documentation and examples.
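The outline below is a rough sketch of the manual workflow from the lumpy README. The library statistics (mean, stdev, read_length) and the parameter values are illustrative assumptions and must be derived from your own data (e.g. with the bundled pairend_distro.py script), and it assumes the lumpy module puts the bundled helper scripts on your PATH:

helix$ module load lumpy samtools

# extract discordant read pairs (excludes proper pairs, unmapped reads,
# secondary alignments, and duplicates)
helix$ samtools view -b -F 1294 input.bam > input.discordants.unsorted.bam

# extract split-read alignments with the helper script bundled with lumpy
helix$ samtools view -h input.bam \
        | extractSplitReads_BwaMem -i stdin \
        | samtools view -Sb - > input.splitters.unsorted.bam

# coordinate-sort both
helix$ samtools sort -o input.discordants.bam input.discordants.unsorted.bam
helix$ samtools sort -o input.splitters.bam input.splitters.unsorted.bam

# call structural variants from the paired-end and split-read signals
helix$ lumpy -mw 4 -tt 0 \
        -pe id:sample,bam_file:input.discordants.bam,histo_file:sample.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
        -sr id:sample,bam_file:input.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
        > output.vcf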

Batch job on Biowulf

Create a batch script similar to the following

#!/bin/bash

module load lumpy || exit 1
lumpyexpress -t $SLURM_CPUS_PER_TASK -B input.bam -o output.vcf

Submit to the batch queue with

b2$ sbatch --cpus-per-task 10 --mem=10G lumpy_batch.sh
Swarm of jobs on Biowulf

Create a swarm command file similar to the following

lumpyexpress -t $SLURM_CPUS_PER_TASK -B input1.bam -o output1.vcf
lumpyexpress -t $SLURM_CPUS_PER_TASK -B input2.bam -o output2.vcf
lumpyexpress -t $SLURM_CPUS_PER_TASK -B input3.bam -o output3.vcf

Submit to the batch queue with

b2$ swarm -t 10 -g 10 lumpy.swarm
Interactive job on Biowulf

Allocate an interactive session and use lumpy as described above

b2$ sinteractive --cpus-per-task=6 --mem=10G
[...snip...]
cn0054$ module load lumpy
cn0054$ lumpyexpress -t 6 -B input.bam -o output.vcf
[...snip...]
cn0054$ exit
b2$