FuSeq: discovering fusion genes from paired-end RNA sequencing data

FuSeq is a fast and accurate method to discover fusion genes based on quasi-mapping to quickly map the reads, extract initial candidates from split reads and fusion equivalence classes of mapped reads, and finally apply multiple filters and statistical tests to get the final candidates.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=4g --gres=lscratch:10 
[user@cn3144 ~]$ module load FuSeq 
[+] Loading gcc  9.2.0  ...
[+] Loading GSL 2.6 for GCC 9.2.0 ...
[+] Loading openmpi 3.1.4  for GCC 9.2.0
[+] Loading ImageMagick  7.0.8  on cn2367
[+] Loading HDF5  1.10.4
[+] Loading NetCDF 4.7.4_gcc9.2.0
[+] Loading pandoc  2.10.1  on cn2367
[+] Loading pcre2 10.21  ...
[+] Loading R 4.0.0
[+] Loading FuSeq / 1.1.3  ...
Create soft links to the sample read data:
[user@cn3144 ~]$ cp -r $FUSEQ_DATA ./ 
Index the sample transcript data if needed, GRCh37_release75 and GRCh38_release106 are already indexed at /fdb/fuseq/:
[user@cn3144 ~]$ TxIndexer -t Homo_sapiens.GRCh37.75.cdna.all.fa -o TxIndexer_idx

--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---


writing log to TxIndexer_idx/LogDir/FuSeq_index.log
RapMap Indexer

[Step 1 of 4] : counting k-mers
counted k-mers for 180000 transcriptsElapsed time: 4.19777s

Replaced 0 non-ATCG nucleotides
Clipped poly-A tails from 1401 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0257811s
Writing sequence data to file . . . done
Elapsed time: 0.137891s
[info] Building 32-bit suffix array (length of generalized text is 287322183)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 0.440665s
Elapsed time: 41.0055s
processed 287000000 positions
khash had 102501816 keys
saving hash to disk . . . done
Elapsed time: 9.4689s
Extract fusion equivalence classes and split reads by running FuSeq with 16 cpus:
[user@cn3144 ~]$ FuSeq -i /fdb/fuseq/TxIndexer_idx/GRCh37.75/ \
  -l IU -1 SRR064287_1.fastq -2 SRR064287_2.fastq -p 16 \
    -g $FUSEQ_REF/Homo_sapiens.GRCh37.gtf -o feqDir

- a folder feqDir will be produced.

Finally, discover the fusion genes. This will be done in two steps.
First, create .sqlite file containing the annotation file:
[user@cn3144 ~]$ Rscript $FUSEQ_R/createSqlite.R $FUSEQ_REF/Homo_sapiens.GRCh37.gtf Homo_sapiens.GRCh37.sqlite 
Second, use the file Homo_sapiens.GRCh37.sqlite, together with the supporting annotation file Homo_sapiens.GRCh37.txAnno.RData containing information of paralogs, gene types, etc., to perform the discovery of fusion genes:
[user@cn3144 ~]$ Rscript $FUSEQ_R/FuSeq.R in=feqDir \
   txfasta=$FUSEQ_REF/Homo_sapiens.GRCh37.cdna.all.fa \
      sqlite=Homo_sapiens.GRCh37.sqlite \
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 18320304
[user@biowulf ~]$