FuSeq: discovering fusion genes from paired-end RNA sequencing data
FuSeq is a fast and accurate method to discover fusion genes based on quasi-mapping to quickly map the reads, extract initial candidates from split reads and fusion equivalence classes of mapped reads, and finally apply multiple filters and statistical tests to get the final candidates.
References:
-
Trung Nghia Vu, Wenjiang Deng, Quang Thinh Trac, Stefano Calza, Woochang Hwang and Yudi Pawitan.
A fast detection of fusion genes from paired-end RNA-seq data.
BMC Genomics, 2018, 19, p.786. https://doi.org/10.1186/s12864-018-5156-1.
Documentation
Important Notes
- Module Name: FuSeq (see the modules page for more information)
- Unusual environment variables set
- FUSEQ_HOME FuSeq installation directory
- FUSEQ_BIN FuSeq executable folder
- FUSEQ_DATA sample data for running FuSeq
- FUSEQ_R FuSeq R source files directory
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=4g --gres=lscratch:10 [user@cn3144 ~]$ module load FuSeq [+] Loading gcc 9.2.0 ... [+] Loading GSL 2.6 for GCC 9.2.0 ... [+] Loading openmpi 3.1.4 for GCC 9.2.0 [+] Loading ImageMagick 7.0.8 on cn2367 [+] Loading HDF5 1.10.4 [+] Loading NetCDF 4.7.4_gcc9.2.0 [+] Loading pandoc 2.10.1 on cn2367 [+] Loading pcre2 10.21 ... [+] Loading R 4.0.0 [+] Loading FuSeq / 1.1.3 ...Create soft links to the sample read data:
[user@cn3144 ~]$ lndir $FUSEQ_DATA ./Index the sample transcript data:
[user@cn3144 ~]$ TxIndexer -t Homo_sapiens.GRCh37.75.cdna.all.fa -o TxIndexer_idx --------------------------------------------------------------- --- FuSeq acknowledges the Sailfish and Rapmap for this indexer --- --------------------------------------------------------------- writing log to TxIndexer_idx/LogDir/FuSeq_index.log RapMap Indexer [Step 1 of 4] : counting k-mers counted k-mers for 180000 transcriptsElapsed time: 4.19777s Replaced 0 non-ATCG nucleotides Clipped poly-A tails from 1401 transcripts Building rank-select dictionary and saving to disk done Elapsed time: 0.0257811s Writing sequence data to file . . . done Elapsed time: 0.137891s [info] Building 32-bit suffix array (length of generalized text is 287322183) Building suffix array . . . success saving to disk . . . done Elapsed time: 0.440665s done Elapsed time: 41.0055s processed 287000000 positions khash had 102501816 keys saving hash to disk . . . done Elapsed time: 9.4689sExtract fusion equivalence classes and split reads by running FuSeq with 16 cpus:
[user@cn3144 ~]$ FuSeq -i TxIndexer_idx -l IU -1 read1.fa -2 read2.fa -p 16 -g $FUSEQ_REF/Homo_sapiens.GRCh38.95.gtf -o feqDir Logs will be written to feqDir/LogDir [2020-09-29 08:25:15.205] [jointLog] [info] parsing read library format there is 1 lib Loading 32-bit quasi index[2020-09-29 08:25:16.126] [stderrLog] [info] Loading Suffix Array [2020-09-29 08:25:16.125] [jointLog] [info] Loading Quasi index [2020-09-29 08:25:17.680] [stderrLog] [info] Loading Transcript Info [2020-09-29 08:25:18.080] [stderrLog] [info] Loading Rank-Select Bit Array [2020-09-29 08:25:18.224] [stderrLog] [info] There were 187626 set bits in the bit array [2020-09-29 08:25:18.282] [stderrLog] [info] Computing transcript lengths [2020-09-29 08:25:18.283] [stderrLog] [info] Waiting to finish loading hash Index contained 187626 targets [2020-09-29 08:25:25.486] [stderrLog] [info] Successfully loaded position hash [2020-09-29 08:25:25.486] [stderrLog] [info] Done loading index Loaded targets [FuSeq] -- Use transcript-gene map for fitering fusion-equivalence classes [2020-09-29 08:25:25.486] [jointLog] [info] done processed 18000000 fragments hits: 73000189, hits per frag (may not be concordant): 4.05654 [2020-09-29 08:26:46.624] [jointLog] [info] Gathered fragment lengths from all threads [2020-09-29 08:26:46.627] [jointLog] [warning] Sailfish saw fewer then 10000 uniquely mapped reads so 200 will be used as the mean fragment length and 80 as the standard deviation for effective length correction [2020-09-29 08:26:46.633] [jointLog] [info] Estimating effective lengths Done Quasi-Mapping [2020-09-29 08:26:46.956] [jointLog] [info] Computed 287469 rich equivalence classes for further processing [2020-09-29 08:26:46.956] [jointLog] [info] Counted 13490516 total reads in the equivalence classes [2020-09-29 08:26:46.980] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2020-09-29 08:26:46.980] [jointLog] [info] Counted 0 total reads in the equivalence classes [2020-09-29 08:26:47.005] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2020-09-29 08:26:47.005] [jointLog] [info] Counted 0 total reads in the equivalence classes [2020-09-29 08:26:47.029] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2020-09-29 08:26:47.029] [jointLog] [info] Counted 0 total reads in the equivalence classes [2020-09-29 08:26:47.054] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2020-09-29 08:26:47.054] [jointLog] [info] Counted 0 total reads in the equivalence classes [2020-09-29 08:26:47.079] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2020-09-29 08:26:47.079] [jointLog] [info] Counted 0 total reads in the equivalence classes [2020-09-29 08:26:47.103] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2020-09-29 08:26:47.103] [jointLog] [info] Counted 0 total reads in the equivalence classes [2020-09-29 08:26:47.104] [jointLog] [info] Start to export fusion-equivalence classes: ========================================================================= [FuSeq] -- We are export some data here -- ========================================================================= [FuSeq] -- Export fragment information Read length 50 fragLengthMedian 201 fragLengthMean 201.157 fragLengthSd 77.7927Observed Fragments 18145504 Mapped Fragments 13490516 Total hits 73594484 kmer 21 [FuSeq] -- Extracting equivalence class [FuSeq] -- Extracting RR fusion equivalence classes [FuSeq] -- Extracting RF fusion equivalence classes [FuSeq] -- Extracting FF fusion equivalence classes [FuSeq] -- Extracting FR fusion equivalence classes [FuSeq] -- Extracting UN fusion equivalence classes [FuSeq] -- Extracting ALL fusion equivalence classes [2020-09-29 08:26:51.392] [jointLog] [info] Finished exporting fusion-equivalence classes- a folder feqDir will be produced.
Finally, discover the fusion genes. This will be done in two steps.
First, create .sqlite file containing the annotation file:
[user@cn3144 ~]$ Rscript $FUSEQ_R/createSqlite.R $FUSEQ_REF/Homo_sapiens.GRCh38.95.gtf Homo_sapiens.GRCh37.75.sqlite Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colMeans, colnames, colSums, dirname, do.call, duplicated, eval, evalq, anyDuplicated, append, as.data.frame, basename, cbind, colMeans, colnames, colSums, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which, which.max, which.min Loading required package: S4Vectors Loading required package: stats4 Attaching package: ‘S4Vectors’ The following object is masked from ‘package:base’: expand.grid Loading required package: IRanges Loading required package: GenomeInfoDb Loading required package: GenomicRanges Loading required package: AnnotationDbi Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Import genomic features from the file as a GRanges object ... ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Warning message: In .get_cds_IDX(type, phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. TxDb object: # Db type: TxDb # Supporting package: GenomicFeatures # Data source: Link to the source # Organism: Homo sapiens # Taxonomy ID: 9606 # miRBase build ID: NA # Genome: NA # transcript_nrow: 206601 # exon_nrow: 702523 # cds_nrow: 273970 # Db created by: GenomicFeatures package from Bioconductor # Creation time: 2019-01-17 08:11:23 -0500 (Thu, 17 Jan 2019) # GenomicFeatures version at creation time: 1.34.1 # RSQLite version at creation time: 2.1.1 # DBSCHEMAVERSION: 1.2Second, use the file Homo_sapiens.GRCh37.75.sqliteR, together with the supporting annotation file Homo_sapiens.GRCh37.75.txAnno.RData containing information of paralogs, gene types, etc., to perform the discovery of fusion genes:
[user@cn3144 ~]$ Rscript $FUSEQ_R/FuSeq.R in=$FUSEQ_DATA/Homo_sapiens.GRCh37.75.cdna.all.fa sqlite=$FUSEQ_DATA/Homo_sapiens.GRCh37.75.sqlite txanno=$FUSEQ_DATA/Homo_sapiens.GRCh37.75.txAnno.RData Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: S4Vectors Loading required package: stats4 Attaching package: ‘S4Vectors’ The following object is masked from ‘package:base’: expand.grid Loading required package: IRanges Loading required package: GenomeInfoDb Loading required package: GenomicRanges Loading required package: AnnotationDbi Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Warning message: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. TxDb object: # Db type: TxDb # Supporting package: GenomicFeatures # Data source: Link to the source # Organism: Homo sapiens # Taxonomy ID: 9606 # miRBase build ID: NA # Genome: NA # Nb of transcripts: 206601 # Db created by: GenomicFeatures package from Bioconductor # Creation time: 2021-11-04 09:19:51 -0400 (Thu, 04 Nov 2021) # GenomicFeatures version at creation time: 1.42.3 # RSQLite version at creation time: 2.2.7 # DBSCHEMAVERSION: 1.2[user@cn3144 ~]$ exit exit salloc.exe: Relinquishing job allocation 18320304 [user@biowulf ~]$