FusionCatcher on Biowulf

Description

FusionCatcher searches for novel/known somatic fusion genes, translocations, and chimeras in RNA-seq data (paired-end or single-end reads from Illumina NGS platforms like Solexa/HiSeq/NextSeq/MiSeq) from diseased samples.

Reference

Nicorici D, et al. FusionCatcher - a tool for finding somatic fusion genes in paired-end RNA-sequencing data. bioRxiv 011650 (2014).

Web sites

https://sourceforge.net/projects/fusioncatcher/
https://github.com/ndaniel/fusioncatcher

On Helix

To prepare your environment for using FusionCatcher, enter the following:

[user@helix ~]$ module load fusioncatcher

The commands fusioncatcher and fusioncatcher-batch will now be available to you. To learn more, use the help (-h) argument like so:

[user@helix ~]$ fusioncatcher -h
[user@helix ~]$ fusioncatcher-batch -h 

To begin running a pipeline, fusioncatcher needs the following:

    --input  : a directory containing FASTQ files, or the FASTQ files themselves
    --output : a directory where the results will be written

URLs can also be given as input. See the FusionCatcher manual for details.
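
Because URLs are accepted as input, a quick sanity check can even be run without downloading anything first. The following is a minimal sketch, assuming --input accepts a comma-separated list of URLs (these are the same test reads used in the next example):

[user@helix ~]$ fusioncatcher \
        --input http://sourceforge.net/projects/fusioncatcher/files/test/reads_1.fq.gz,http://sourceforge.net/projects/fusioncatcher/files/test/reads_2.fq.gz \
        --output ./url-test-results/ \
        --threads=2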

IMPORTANT NOTE: Memory Usage
FusionCatcher jobs may use a great deal of memory. In some instances memory usage may run into the hundreds of GB range. Large jobs are therefore inappropriate for Helix.

The following example downloads two FASTQ files to the user's data directory and then runs a quick test using FusionCatcher to analyze a small portion of the human genome.

[user@helix ~]$ mkdir -p /data/$USER/fusioncatcher/test
[user@helix ~]$ cd /data/$USER/fusioncatcher/test
[user@helix ~]$ wget http://sourceforge.net/projects/fusioncatcher/files/test/reads_1.fq.gz
[user@helix ~]$ wget http://sourceforge.net/projects/fusioncatcher/files/test/reads_2.fq.gz
[user@helix ~]$ cd ..
[user@helix ~]$ fusioncatcher \
        --input ./test/ \
        --output ./test-results/ \
        --threads=2

Note the --threads argument, which limits the number of CPUs used to 2.

Batch jobs on Biowulf

As with any sequence of commands, the last example could be written into a script and submitted to SLURM with the sbatch command. Assuming you have already downloaded or copied data to /data/$USER/fusioncatcher/test (see the previous example), the script might look something like the following:

#!/bin/bash
# this file is called myjob.sh

module load fusioncatcher

# run the analysis, writing output to local scratch space
fusioncatcher \
    --input /data/$USER/fusioncatcher/test/ \
    --output /lscratch/$SLURM_JOBID/test-results/ \
    --threads=2

# copy the results back to network storage before the job ends
mkdir -p /data/$USER/fusioncatcher/$SLURM_JOBID
mv /lscratch/$SLURM_JOBID/test-results /data/$USER/fusioncatcher/$SLURM_JOBID/test-results

Submit the job to SLURM with:

[user@biowulf ~]$ sbatch --mem=20g --gres=lscratch:10 --time=30 myjob.sh

Note that larger jobs may need more memory, more lscratch space, and a longer walltime. A typical single-sample paired-end RNA-Seq run should require less than 50 GB of memory and 100 GB of lscratch space.
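
Scaled to those estimates, a full-size single-sample submission might look like the following. The values are illustrative starting points, not requirements; check the actual usage of completed jobs and adjust, and if you raise --cpus-per-task, raise the --threads value in the script to match:

[user@biowulf ~]$ sbatch --cpus-per-task=8 --mem=50g --gres=lscratch:100 --time=1-00:00:00 myjob.sh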

IMPORTANT NOTE: Disk Usage
FusionCatcher jobs may use a great deal of disk space for temporary files. In some instances disk usage may run into the hundreds of GB range. If the --output directory is located on network storage, the I/O load can lead to poor performance and affect other users. Local scratch space (lscratch) should therefore be used when jobs are submitted to the batch system (as in the example above). See the Biowulf user guide for more information about using lscratch.
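
Local scratch can also be allocated to an interactive session for testing before committing to a batch submission. A minimal sketch (the node name cn1234 is illustrative; size the allocations to your data):

[user@biowulf ~]$ sinteractive --cpus-per-task=2 --mem=20g --gres=lscratch:10
[user@cn1234 ~]$ module load fusioncatcher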

Swarm of jobs on Biowulf

FusionCatcher has a built-in "batch mode" which can be accessed using the fusioncatcher-batch command. This command accepts a filename as input. The file should be tab-delimited text listing paired inputs and outputs. Interested users should see the FusionCatcher manual for more details. However, users should note that the batch files used for fusioncatcher-batch jobs can easily be converted to swarm files and run in parallel for a substantial speed increase.
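
As a sketch, such a run might be launched like so, assuming the batch file is saved as batch.txt (a hypothetical name) and that fusioncatcher-batch takes it via -i, as fusioncatcher does for its input; check fusioncatcher-batch -h for the exact options:

[user@biowulf ~]$ fusioncatcher-batch -i batch.txt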

In this section, we will begin by considering a fusioncatcher-batch job and see how to convert it to a swarm job for increased efficiency. An example FusionCatcher batch file that uses ftp URLs as input can be downloaded here. For brevity, we will refer to the contents of this file in abbreviated form below.

The batch file looks something like this:

ftp.uk/72.fastq   thyroid
ftp.uk/73.fastq   testis
ftp.uk/74.fastq   ovary
ftp.uk/75.fastq   leukocyte
ftp.uk/76.fastq   skeletal muscle
ftp.uk/77.fastq   prostate
ftp.uk/78.fastq   lymph node
ftp.uk/79.fastq   lung
ftp.uk/80.fastq   adipose
ftp.uk/81.fastq   adrenal
ftp.uk/82.fastq   brain
ftp.uk/83.fastq   breast
ftp.uk/84.fastq   colon
ftp.uk/85.fastq   kidney
ftp.uk/86.fastq   heart
ftp.uk/87.fastq   liver 

The ftp URLs on the left specify the input for each run, and the names on the right give the directories where the corresponding output will be written.

This syntax can be adapted to make a swarm file:

fusioncatcher -i ftp.uk/72.fastq -o /lscratch/$SLURM_JOBID/thyroid         -p 2; \
    mv /lscratch/$SLURM_JOBID/thyroid /data/$USER/thyroid
fusioncatcher -i ftp.uk/73.fastq -o /lscratch/$SLURM_JOBID/testis          -p 2; \
    mv /lscratch/$SLURM_JOBID/testis /data/$USER/testis
fusioncatcher -i ftp.uk/74.fastq -o /lscratch/$SLURM_JOBID/ovary           -p 2; \
    mv /lscratch/$SLURM_JOBID/ovary /data/$USER/ovary
fusioncatcher -i ftp.uk/75.fastq -o /lscratch/$SLURM_JOBID/leukocyte       -p 2; \
    mv /lscratch/$SLURM_JOBID/leukocyte /data/$USER/leukocyte
fusioncatcher -i ftp.uk/76.fastq -o /lscratch/$SLURM_JOBID/skeletal_muscle -p 2; \
    mv /lscratch/$SLURM_JOBID/skeletal_muscle /data/$USER/skeletal_muscle
fusioncatcher -i ftp.uk/77.fastq -o /lscratch/$SLURM_JOBID/prostate        -p 2; \
    mv /lscratch/$SLURM_JOBID/prostate /data/$USER/prostate
fusioncatcher -i ftp.uk/78.fastq -o /lscratch/$SLURM_JOBID/lymph_node      -p 2; \
    mv /lscratch/$SLURM_JOBID/lymph_node /data/$USER/lymph_node
fusioncatcher -i ftp.uk/79.fastq -o /lscratch/$SLURM_JOBID/lung            -p 2; \
    mv /lscratch/$SLURM_JOBID/lung /data/$USER/lung
fusioncatcher -i ftp.uk/80.fastq -o /lscratch/$SLURM_JOBID/adipose         -p 2; \
    mv /lscratch/$SLURM_JOBID/adipose /data/$USER/adipose
fusioncatcher -i ftp.uk/81.fastq -o /lscratch/$SLURM_JOBID/adrenal         -p 2; \
    mv /lscratch/$SLURM_JOBID/adrenal /data/$USER/adrenal
fusioncatcher -i ftp.uk/82.fastq -o /lscratch/$SLURM_JOBID/brain           -p 2; \
    mv /lscratch/$SLURM_JOBID/brain /data/$USER/brain
fusioncatcher -i ftp.uk/83.fastq -o /lscratch/$SLURM_JOBID/breast          -p 2; \
    mv /lscratch/$SLURM_JOBID/breast /data/$USER/breast
fusioncatcher -i ftp.uk/84.fastq -o /lscratch/$SLURM_JOBID/colon           -p 2; \
    mv /lscratch/$SLURM_JOBID/colon /data/$USER/colon
fusioncatcher -i ftp.uk/85.fastq -o /lscratch/$SLURM_JOBID/kidney          -p 2; \
    mv /lscratch/$SLURM_JOBID/kidney /data/$USER/kidney
fusioncatcher -i ftp.uk/86.fastq -o /lscratch/$SLURM_JOBID/heart           -p 2; \
    mv /lscratch/$SLURM_JOBID/heart /data/$USER/heart
fusioncatcher -i ftp.uk/87.fastq -o /lscratch/$SLURM_JOBID/liver           -p 2; \
    mv /lscratch/$SLURM_JOBID/liver /data/$USER/liver
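
A swarm file like the one above need not be written by hand. As a minimal sketch, assuming the batch file shown earlier is saved as batch.txt (a hypothetical name) with tab-separated columns, an awk one-liner can generate it; note that the $SLURM_JOBID references are inside single quotes so they reach the swarm file unexpanded, to be evaluated at run time:

[user@biowulf ~]$ awk -F'\t' '{d=$2; gsub(/ /,"_",d); printf "fusioncatcher -i %s -o /lscratch/$SLURM_JOBID/%s -p 2; mv /lscratch/$SLURM_JOBID/%s /data/$USER/%s\n", $1, d, d, d}' batch.txt > myjobs.swarm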

Note the substitution of underscores for spaces in the output directory names. Note also the -p argument limiting each subjob to 2 CPUs. This is appropriate because SLURM assigns each subjob 2 CPUs (one hyperthreaded core) by default. Assuming this file is saved as myjobs.swarm, it could be submitted like so:

[user@biowulf ~]$ swarm -f myjobs.swarm -g 20 --time 20 --module fusioncatcher --gres=lscratch:10

See the swarm webpage for more information, or contact the Biowulf staff at staff@hpc.nih.gov.

Documentation

FusionCatcher manual: https://github.com/ndaniel/fusioncatcher/blob/master/doc/manual.md