High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
Fusionseq on Biowulf & Helix

FusionSeq is a computational framework to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts such as misalignments or random pairing of transcript fragments and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including 8 cancers with and without known rearrangements. Fusionseq was developed by the Gerstein Lab.

Fusionseq requires a resource file (/home/$USER/.fusioneqrc) that sets the location for the bowtie indexes and so on. When you run 'module load fusionseq' to set up the environment for fusionseq, if this file does not exist, the default template will be copied into /home/$USER/.fusionseq. Example below:

Running on Helix

$ module load fusionseq
/home/$USER/.fusionseqrc does not exist; creating it

This is what the resource file looks like:

# --------------------------------- This section is required ---------------------------------
# Location of the bowtie indexes of the human genome and the composite model
# the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
# the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build

# Pointer to the program twoBitToFa part of the blat suite
# Location and filename of the reference genome in 2bit format (to be used by blat)

# Location and name of the transcript composite model sequence and interval files

# location of the annotation files
# conversion of knownGenes to gene symbols, description, etc. 
# conversion of knownGenes to TreeFam

# Location and filename of the ribosomal library

The environment variables in the upper required section have been set to point to the locations of indexes on Biowulf. If you have your own indexes, you may want to modify these variables.

Running a single batch job on Biowulf

1. Create a script file similar to the lines below.


module load fusionseq
cd /data/$USER/dir
geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log
(gfrMitochondrialFilter < data.1.gfr | gfrCountPairTypes | gfrExpressionConsistencyFilter | gfrAbnormalInsertSizeFilter 0.01 | \
gfrPCRFilter 4 4 | gfrProximityFilter 1000 | gfrAddInfo | gfrAnnotationConsistencyFilter ribosomal | \
gfrAnnotationConsistencyFilter pseudogenes | gfrLargeScaleHomologyFilter | gfrRibosomalFilter | \
gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
gfrConfidenceValues data < data.gfr > data.confidence.gfr
gfr2bpJunctions data.confidence.gfr 40 200 minDASPER

2. Submit the script on biowulf:

$ sbatch jobscript

For more memory requirement (default 4gb), use --mem flag:

$ sbatch --mem=10g jobscript

Running a swarm of jobs on Biowulf

Setup a swarm command file:

  cd /data/$USER/dir1; fusionseq commands
  cd /data/$USER/dir2; fusionseq commands
  cd /data/$USER/dir3; fusionseq commands

Submit the swarm file:

  $ swarm -f swarmfile --module fusionseq

-f: specify the swarmfile name
--module: set environmental variables for each command line in the file

To allocate more memory, use -g flag:

  $ swarm -f swarmfile -g 10 --module fusionseq

-g: allocate more memory

For more information regarding running swarm, see swarm.html

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive 
salloc.exe: Granted job allocation 16535

cn999$ module load fusionseq
cn999$ cd /data/$USER/Examples
cn999$ fusionseq commands

cn999$ exit


Make sure to exit the job once finished.

If more memory is needed, use --mem flag. For example

biowulf$ sinteractive --mem=10g


List of programs and usage


FusionSeq FAQ