Biowulf High Performance Computing at the NIH
FuSeq: discovering fusion genes from paired-end RNA sequencing data

FuSeq is a fast and accurate method to discover fusion genes based on quasi-mapping to quickly map the reads, extract initial candidates from split reads and fusion equivalence classes of mapped reads, and finally apply multiple filters and statistical tests to get the final candidates.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=4g --gres=lscratch:10 
[user@cn3144 ~]$ module load FuSeq 
[+] Loading gcc  9.2.0  ...
[+] Loading GSL 2.6 for GCC 9.2.0 ...
[+] Loading openmpi 3.1.4  for GCC 9.2.0
[+] Loading ImageMagick  7.0.8  on cn2367
[+] Loading HDF5  1.10.4
[+] Loading NetCDF 4.7.4_gcc9.2.0
[+] Loading pandoc  2.10.1  on cn2367
[+] Loading pcre2 10.21  ...
[+] Loading R 4.0.0
[+] Loading FuSeq / 1.1.3  ...
Create soft links to the sample read data:
[user@cn3144 ~]$ lndir $FUSEQ_DATA ./ 
Index the sample transcript data:
 
[user@cn3144 ~]$ TxIndexer -t Homo_sapiens.GRCh37.75.cdna.all.fa -o TxIndexer_idx
---------------------------------------------------------------

--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---

---------------------------------------------------------------

writing log to TxIndexer_idx/LogDir/FuSeq_index.log
RapMap Indexer

[Step 1 of 4] : counting k-mers
counted k-mers for 180000 transcriptsElapsed time: 4.19777s

Replaced 0 non-ATCG nucleotides
Clipped poly-A tails from 1401 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0257811s
Writing sequence data to file . . . done
Elapsed time: 0.137891s
[info] Building 32-bit suffix array (length of generalized text is 287322183)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 0.440665s
done
Elapsed time: 41.0055s
processed 287000000 positions
khash had 102501816 keys
saving hash to disk . . . done
Elapsed time: 9.4689s
Extract fusion equivalence classes and split reads by running FuSeq with 16 cpus:
[user@cn3144 ~]$ FuSeq -i TxIndexer_idx -l IU -1 read1.fa -2 read2.fa -p 16 -g $FUSEQ_REF/Homo_sapiens.GRCh38.95.gtf -o feqDir 
Logs will be written to feqDir/LogDir
[2020-09-29 08:25:15.205] [jointLog] [info] parsing read library format
there is 1 lib
Loading 32-bit quasi index[2020-09-29 08:25:16.126] [stderrLog] [info] Loading Suffix Array
[2020-09-29 08:25:16.125] [jointLog] [info] Loading Quasi index
[2020-09-29 08:25:17.680] [stderrLog] [info] Loading Transcript Info
[2020-09-29 08:25:18.080] [stderrLog] [info] Loading Rank-Select Bit Array
[2020-09-29 08:25:18.224] [stderrLog] [info] There were 187626 set bits in the bit array
[2020-09-29 08:25:18.282] [stderrLog] [info] Computing transcript lengths
[2020-09-29 08:25:18.283] [stderrLog] [info] Waiting to finish loading hash
Index contained 187626 targets
[2020-09-29 08:25:25.486] [stderrLog] [info] Successfully loaded position hash
[2020-09-29 08:25:25.486] [stderrLog] [info] Done loading index
Loaded targets

 [FuSeq] -- Use transcript-gene map for fitering fusion-equivalence classes
[2020-09-29 08:25:25.486] [jointLog] [info] done

processed 18000000 fragments
hits: 73000189, hits per frag (may not be concordant):  4.05654
[2020-09-29 08:26:46.624] [jointLog] [info] Gathered fragment lengths from all threads
[2020-09-29 08:26:46.627] [jointLog] [warning] Sailfish saw fewer then 10000 uniquely mapped reads so 200 will be used as the mean fragment length and 80 as the standard deviation for effective length correction
[2020-09-29 08:26:46.633] [jointLog] [info] Estimating effective lengths
Done Quasi-Mapping

[2020-09-29 08:26:46.956] [jointLog] [info] Computed 287469 rich equivalence classes for further processing
[2020-09-29 08:26:46.956] [jointLog] [info] Counted 13490516 total reads in the equivalence classes
[2020-09-29 08:26:46.980] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2020-09-29 08:26:46.980] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2020-09-29 08:26:47.005] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2020-09-29 08:26:47.005] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2020-09-29 08:26:47.029] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2020-09-29 08:26:47.029] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2020-09-29 08:26:47.054] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2020-09-29 08:26:47.054] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2020-09-29 08:26:47.079] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2020-09-29 08:26:47.079] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2020-09-29 08:26:47.103] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2020-09-29 08:26:47.103] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2020-09-29 08:26:47.104] [jointLog] [info] Start to export fusion-equivalence classes:

=========================================================================

[FuSeq] -- We are export some data here --

=========================================================================

[FuSeq] -- Export fragment information

Read length 50 fragLengthMedian 201 fragLengthMean 201.157 fragLengthSd 77.7927Observed Fragments 18145504 Mapped Fragments 13490516 Total hits 73594484 kmer 21

[FuSeq] -- Extracting equivalence class

[FuSeq] -- Extracting RR fusion equivalence classes

[FuSeq] -- Extracting RF fusion equivalence classes

[FuSeq] -- Extracting FF fusion equivalence classes

[FuSeq] -- Extracting FR fusion equivalence classes

[FuSeq] -- Extracting UN fusion equivalence classes

[FuSeq] -- Extracting ALL fusion equivalence classes

[2020-09-29 08:26:51.392] [jointLog] [info] Finished exporting fusion-equivalence classes
- a folder feqDir will be produced.

Finally, discover the fusion genes. This will be done in two steps.
First, create .sqlite file containing the annotation file:
[user@cn3144 ~]$ Rscript $FUSEQ_R/createSqlite.R $FUSEQ_REF/Homo_sapiens.GRCh38.95.gtf Homo_sapiens.GRCh37.75.sqlite 
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges

Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Import genomic features from the file as a GRanges object ...
... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(type, phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: Link to the source
# Organism: Homo sapiens
# Taxonomy ID: 9606
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 206601
# exon_nrow: 702523
# cds_nrow: 273970
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-01-17 08:11:23 -0500 (Thu, 17 Jan 2019)
# GenomicFeatures version at creation time: 1.34.1
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2
Second, use the file Homo_sapiens.GRCh37.75.sqliteR, together with the supporting annotation file Homo_sapiens.GRCh37.75.txAnno.RData containing information of paralogs, gene types, etc., to perform the discovery of fusion genes:
[user@cn3144 ~]$ Rscript $FUSEQ_R/FuSeq.R in=$FUSEQ_DATA/Homo_sapiens.GRCh37.75.cdna.all.fa sqlite=$FUSEQ_DATA/Homo_sapiens.GRCh37.75.sqlite  txanno=$FUSEQ_DATA/Homo_sapiens.GRCh37.75.txAnno.RData 

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: Link to the source
# Organism: Homo sapiens
# Taxonomy ID: 9606
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 206601
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-11-04 09:19:51 -0400 (Thu, 04 Nov 2021)
# GenomicFeatures version at creation time: 1.42.3
# RSQLite version at creation time: 2.2.7
# DBSCHEMAVERSION: 1.2
[user@cn3144 ~]$ exit exit salloc.exe: Relinquishing job allocation 18320304 [user@biowulf ~]$