High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
Fusionseq on Biowulf & Helix

FusionSeq is a computational framework to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts such as misalignments or random pairing of transcript fragments and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including 8 cancers with and without known rearrangements. Fusionseq was developed by the Gerstein Lab.

Fusionseq requires a resource file (/home/$USER/.fusioneqrc) that sets the location for the bowtie indexes and so on. When you run 'module load fusionseq' to set up the environment for fusionseq, if this file does not exist, the default template will be copied into /home/$USER/.fusionseq. Example below:

Running on Helix

$ module load fusionseq
/home/$USER/.fusionseqrc does not exist; creating it

This is what the resource file looks like:

# --------------------------------- This section is required ---------------------------------
# Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/fdb/fusionseq/indexes" 
# the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg19_nh"
# the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg19_knownGeneAnnotationTranscriptCompositeModel"

# Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/usr/local/apps/blat/3.5/twoBitToFa" 
# Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/fdb/fusionseq/indexes/hg19_nh"
BLAT_TWO_BIT_DATA_FILENAME="hg19.2bit"

# Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/fdb/fusionseq/indexes/FusionSeq_Data"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt" 

# location of the annotation files
ANNOTATION_DIR="/fdb/fusionseq/indexes/FusionSeq_Data" 
# conversion of knownGenes to gene symbols, description, etc. 
KNOWN_GENE_XREF_FILENAME="kgXref.txt" 
# conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt" 

# Location and filename of the ribosomal library
RIBOSOMAL_DIR="/fdb/fusionseq/indexes/FusionSeq_Data"
RIBOSOMAL_FILENAME="ribosomal.2bit"

The environment variables in the upper required section have been set to point to the locations of indexes on Biowulf. If you have your own indexes, you may want to modify these variables.

Running a single batch job on Biowulf

1. Create a script file similar to the lines below.

#!/bin/bash

module load fusionseq
cd /data/$USER/dir
geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log
(gfrMitochondrialFilter < data.1.gfr | gfrCountPairTypes | gfrExpressionConsistencyFilter | gfrAbnormalInsertSizeFilter 0.01 | \
gfrPCRFilter 4 4 | gfrProximityFilter 1000 | gfrAddInfo | gfrAnnotationConsistencyFilter ribosomal | \
gfrAnnotationConsistencyFilter pseudogenes | gfrLargeScaleHomologyFilter | gfrRibosomalFilter | \
gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
gfrConfidenceValues data < data.gfr > data.confidence.gfr
gfr2bpJunctions data.confidence.gfr 40 200 minDASPER

2. Submit the script on biowulf:

$ sbatch jobscript

For more memory requirement (default 4gb), use --mem flag:

$ sbatch --mem=10g jobscript

Running a swarm of jobs on Biowulf

Setup a swarm command file:

  cd /data/$USER/dir1; fusionseq commands
  cd /data/$USER/dir2; fusionseq commands
  cd /data/$USER/dir3; fusionseq commands
	[......]

Submit the swarm file:

  $ swarm -f swarmfile --module fusionseq

-f: specify the swarmfile name
--module: set environmental variables for each command line in the file

To allocate more memory, use -g flag:

  $ swarm -f swarmfile -g 10 --module fusionseq

-g: allocate more memory

For more information regarding running swarm, see swarm.html

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive 
salloc.exe: Granted job allocation 16535

cn999$ module load fusionseq
cn999$ cd /data/$USER/Examples
cn999$ fusionseq commands

cn999$ exit
exit

biowulf$

Make sure to exit the job once finished.

If more memory is needed, use --mem flag. For example

biowulf$ sinteractive --mem=10g

Documentation

List of programs and usage

http://info.gersteinlab.org/FusionSeq

FusionSeq FAQ