High-Performance Computing at the NIH
deFuse on Biowulf & Helix

Description

deFuse uses clusters of discordant paired end reads to guide split read alignments across gene-gene fusion boundaries in RNA-Seq data. Filters are applied to reduce false positives and results are annotated.

Reference data sets required by deFuse are stored under

/fdb/defuse

Note that the versions of gmap/gsnap available on Biowulf make use of a new index format. Only reference data sets ending in _newgmap are compatible with these versions of gmap.
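
For example, the compatible reference data sets can be listed with

ls -d /fdb/defuse/*_newgmap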

deFuse runs are set up using a configuration file, which may change between deFuse versions. Use the configuration file included with the version you are using as a starting point for your analysis. These files can be found under

/usr/local/apps/defuse/[version]/config_*.txt
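
For example, to start from the hg19/Ensembl 69 configuration that ships with version 0.8.0 (my_config.txt is an arbitrary name for the working copy):

cp /usr/local/apps/defuse/0.8.0/config_hg19_ens69.txt my_config.txt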

There are multiple versions of deFuse available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail defuse 

To select a module use

module load defuse/[version]

where [version] is the version of choice.
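
For example, the following sequence loads version 0.8.0 (the version used in the examples below) and confirms the result:

module avail defuse        # list the installed defuse versions
module load defuse/0.8.0   # load a specific version
module list                # verify the loaded modules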

deFuse is a pipeline that makes use of internal and external tools. Pipeline steps can be run on the same machine as the main driver script (defuse.pl -s direct ...) or submitted to compute nodes (defuse.pl -s slurm ...). The -p option determines how many jobs are run in parallel.

defuse.pl option summary (for version 0.8.0):

Usage: defuse.pl [options]
Run the deFuse pipeline for fusion discovery.
  -h, --help      Displays this information
  -c, --config    Configuration Filename
  -d, --dataset   Dataset Directory
  -o, --output    Output Directory
  -r, --res       Main results filename (default: results.tsv 
                  in Output Directory)
  -a, --rescla    Results with a probability column filename 
                  (default: results.classify.tsv in Output Directory)
  -b, --resfil    Filtered by the probability threshold results filename 
                  (default: results.filtered.tsv in Output Directory)
  -1, --1fastq    Fastq filename 1
  -2, --2fastq    Fastq filename 2
  -n, --name      Library Name (default: Output Directory Suffix)
  -l, --local     Job Local Directory (default: Output Directory)
  -s, --submit    Submitter Type (default: direct)
  -p, --parallel  Maximum Number of Parallel Jobs (default: 1)

Note that the driver script was renamed to defuse_run.pl in version 0.8.0; defuse.pl is still available as a symbolic link. Note also that starting with version 0.8.0 the dataset directory has to be provided on the command line with -d.
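
For example, under version 0.8.0 the following two invocations are equivalent (the fastq and output names here are placeholders):

defuse.pl     -c config.txt -d /fdb/defuse/hg19_ens69_newgmap -o out -1 r1.fastq -2 r2.fastq
defuse_run.pl -c config.txt -d /fdb/defuse/hg19_ens69_newgmap -o out -1 r1.fastq -2 r2.fastq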


On Helix
In order to achieve acceptable run times, deFuse should be allowed to start multiple parallel processes, either by submitting subjobs to compute nodes or by running them on the same machine as the main deFuse script. Running deFuse on Helix is therefore not recommended.
Batch job on Biowulf

A deFuse batch job can run in two different ways: either all jobs started by the main defuse.pl script run on the same compute node, or they are submitted to other nodes via Slurm.

Here is an example script that runs all jobs on the same node as the main deFuse script. It makes use of a small data set of simulated RNA-Seq reads. Note that bowtie is allowed 2 threads in the sample config file, so the number of parallel jobs is limited to half the number of allocated CPUs.

#! /bin/bash
# filename: small.sh
set -e

DATA=/usr/local/apps/defuse/TEST_DATA/small
DEFUSE_VER=0.8.0
module load defuse/$DEFUSE_VER || exit 1

cp -r $DATA . || exit 1
cp /usr/local/apps/defuse/$DEFUSE_VER/config_hg19_ens69.txt config.txt

defuse.pl -c config.txt -o small.out \
    -d /fdb/defuse/hg19_ens69_newgmap \
    -1 small/rna/spiked.1.fastq -2 small/rna/spiked.2.fastq \
    -s direct -p $(( SLURM_CPUS_PER_TASK / 2 ))

The batch file is submitted to the queue with a command similar to the following:

biowulf$ sbatch --cpus-per-task=20 small.sh

The other approach is shown in the following batch script, which runs the main script on a compute node with just 2 CPUs allocated. The main script in turn submits subjobs via Slurm. This example uses data obtained from the Gerstein lab for the cell line NCI-H660, which contains a known TMPRSS2-ERG fusion.

#! /bin/bash
# this file is large.sh
set -e

# defuse version
DEFUSE_VER=0.8.0
module load defuse/$DEFUSE_VER || exit 1
cp /usr/local/apps/defuse/$DEFUSE_VER/config_hg19_ens69.txt config.txt

# large test data - copy if it doesn't already exist
if [[ ! -d large ]]; then
    mkdir large
    cd large
    DATA=/usr/local/apps/defuse/TEST_DATA/NCIH660
    cp $DATA/NCIH660.fastq.tar.gz .
    tar -xzf NCIH660.fastq.tar.gz
    rm NCIH660.fastq.tar.gz
    cd ..
fi

defuse.pl -c config.txt -o ncih660.out \
    -d /fdb/defuse/hg19_ens69_newgmap \
    -1 large/NCIH660_1.fastq \
    -2 large/NCIH660_2.fastq \
    -s slurm -p 25

This script is submitted as follows:

biowulf$ sbatch large.sh
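
While the main job is running, the subjobs it submits can be monitored with standard Slurm tools. For example:

biowulf$ squeue -u $USER
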
Swarm of jobs on Biowulf

To set up a swarm of deFuse jobs, each running its subjobs in local mode, use a swarm file like this:

defuse.pl -c config.txt -o defuse1.out \
  -d /fdb/defuse/hg19_ens69_newgmap \
  -1 defuse1.1.fastq \
  -2 defuse1.2.fastq \
  -s direct -p 12 
defuse.pl -c config.txt -o defuse2.out \
  -d /fdb/defuse/hg19_ens69_newgmap \
  -1 defuse2.1.fastq \
  -2 defuse2.2.fastq \
  -s direct -p 12 
[...]

Then submit the swarm, requesting 24 CPUs and 10 GB of memory for each task:

biowulf$ swarm -g 10 -t 24 swarmfile

See the swarm documentation for more details.
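
Rather than writing the swarm file by hand, it can be generated with a short shell loop. This is a minimal sketch that assumes paired files named [sample].1.fastq and [sample].2.fastq in the current directory:

# append one defuse command per fastq pair to the swarm file
for fq1 in *.1.fastq; do
    sample=${fq1%.1.fastq}   # strip the .1.fastq suffix to get the sample name
    echo "defuse.pl -c config.txt -o ${sample}.out -d /fdb/defuse/hg19_ens69_newgmap -1 ${sample}.1.fastq -2 ${sample}.2.fastq -s direct -p 12" >> swarmfile
done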

Interactive job on Biowulf

For preliminary experiments or debugging it may be useful to work with deFuse interactively. Because deFuse starts multiple parallel processes, this should be done on an interactively allocated compute node. For example:

biowulf$ sinteractive --cpus-per-task=24 --mem=20G
salloc.exe: Granted job allocation 240602
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn0044 are ready for job
cn0044$ DATA=/usr/local/apps/defuse/TEST_DATA/small
cn0044$ module load defuse/0.8.0
cn0044$ cp /usr/local/apps/defuse/0.8.0/config_hg19_ens69.txt config.txt
cn0044$ defuse.pl -c config.txt -o small.out \
  -d /fdb/defuse/hg19_ens69_newgmap \
  -1 $DATA/rna/spiked.1.fastq -2 $DATA/rna/spiked.2.fastq \
  -s direct -p 12
[...]
cn0044$ exit
salloc.exe: Relinquishing job allocation 240602
biowulf$