Biowulf High Performance Computing at the NIH
ACFS: de novo circRNA identification on Biowulf

ACFS is an Accurate CircRNA Finder Suite for discovering circRNAs from RNA-Seq data. CircRNAs are generated through splicing, or to be precise, back-splicing where the downstream splice donor attacks an upstream splice acceptor. Identifying the exact site of back-splice lies in the heart of circRNA discovery. No prior knowledge of gene annotation is needed for circRNA prediection. ACFS is designed for Single-end RNA-Seq reads. Paired-end data is also supported, albeit with lower sensitivity.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --cpus-per-task=30 --mem=45g
[user@@cn3316 ~]$ module load ACFS
Example of processing single-end RNA-Seq Raw data (from mouse).

(The steps 1) through 6) described below are the data pre-processing steps, whereas the step 7) corresponds to an actual run of the ACFS pipeline.)

1) Download sample data from the Short Read Archive:
[user@@cn3316 ~]$ wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1772422/SRR1772422.sra

2) Produce a fastq file:
[user@@cn3316 ~]$ acfs fastq-dump.2 -fasta 0 --split-files SRR1772422.sra
NOTE: all the commands starting with acfs employ the software installed inside the ACFS singularity container.

The previous command produces file SRR1772422_1.fasta, such that
[user@@cn3316 ~]$  head SRR1772422_1.fasta 
>SRR1772422.1 DBV2SVN1:140:D1N4GACXX:6:1101:1221:2191 length=101
CAGCCAGGAATGCAGGCAACATAATCAAGGTGGGCTGGCAAAAGTGGTCTAGTCAAAGGATGAGGGAAGACTGCTCCCTTCACTCCCGCTTCTTGTAGCTC
>SRR1772422.2 DBV2SVN1:140:D1N4GACXX:6:1101:1199:2234 length=101
GTCCTTTTTGTACAATAGGAGTGTGGTGGCCTTGGTAGGTTCCTTCACGAATTACGTCTCGTCATCATTGATATATTGTGAGGATATTGGTGAGTAGGCCA
...
3) Change fasta/fastq header format to allow processing multiple samples in one run:
[user@@cn3316 ~]$ acfs perl /acfs/change_fastq_header.pl SRR1772422_1.fasta SRR1772422.fa Truseq_HippoSyn
File SRR1772422.fa will be produced, such that
[user@@cn3316 ~]$ head SRR1772422.fa
>Truseq_HippoSyn_SRR1772422.1
CAGCCAGGAATGCAGGCAACATAATCAAGGTGGGCTGGCAAAAGTGGTCTAGTCAAAGGATGAGGGAAGACTGCTCCCTTCACTCCCGCTTCTTGTAGCTC
>Truseq_HippoSyn_SRR1772422.2
GTCCTTTTTGTACAATAGGAGTGTGGTGGCCTTGGTAGGTTCCTTCACGAATTACGTCTCGTCATCATTGATATATTGTGAGGATATTGGTGAGTAGGCCA
...
4) Merge sequences from multiple fasta/fastq files into one fasta file, which saves time for mapping.
[user@@cn3316 ~]$ acfs perl /acfs/Truseq_merge_unique_fa.pl UNMAP SRR1772422.fa
Files UNMAP and UNMAP_expr will be produced:
[user@@cn3316 ~]$ head UNMAP
>newid-1__185127
GGGGTCTCGCTATGTTGCCCAGGCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGACC
>newid-2__178431
CGGGGTCTCGCTATGTTGCCCAGGCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGAC
...
[user@@cn3316 ~]$ head UNMAP_expr
newid   HippoSyn
newid-1__185127 185127
newid-2__178431 178431
newid-3__93111  93111
newid-4__92508  92508
...
5) Build BWA index (this has been done already)
[user@@cn3316 ~]$ acfs bwa index /fdb/ACFS/Mus_musculus/Ensembl/NCBIM37/Sequence/BWAIndex/genome.fa
(files genome.fa.amb, genome.fa.ann, genome.fa.bwt, genome.fa.pac and genome.fa.sa will be produced in folder /fdb/ACFS/Mus_musculus/Ensembl/NCBIM37/Sequence/BWAIndex).

6) Prepare for annotation (this has been done already)
[user@@cn3316 ~]$ acfs perl /acfs/get_split_exon_border_biotype_genename.pl /fdb/ACFS/Mus_musculus/Ensembl/NCBIM37/Annotation/Genes/genes.gtf /fdb/ACFS/Mus_musculus/Ensembl/NCBIM37/Annotation/Genes/genes_split_exon.gtf
(file genes_split_exon.gtf is produced in folder /fdb/ACFS/Mus_musculus/Ensembl/NCBIM37/Annotation/Genes).

Finally,
7) Run the pipeline
[user@@cn3316 ~]$ acfs perl /acfs/ACF_MAKE.pl  $ACFS_TEST/SPEC_Mus_musculus_Ensembl_NCBIM37.txt BASH_Mus_musculus_Ensembl_NCBIM37.sh 
[user@@cn3316 ~]$ acfs bash BASH_Mus_musculus_Ensembl_NCBIM37.sh 
The results will be stored in files
circle_candidates_MEA.bed12
circle_candidates_CBR.bed12
The paired-end RNA-Seq Raw data can be processed in similar steps. Example below illustrates this using a human sample generated by unstranded RNA-Seq experiment.

1) Download sample data:
[user@@cn3316 ~]$ wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR650/SRR650317/SRR650317.sra
2) Produce SRR650317_1.fasta and SRR650317_2.fasta
[user@@cn3316 ~]$ acfs fastq-dump.2 --fasta 0 --split-files SRR650317.sra 
3) Produce the reverse complements SRR650317_1.fasta.rc and SRR650317_2.fasta.rc,
then SRR650317_1.fa, SRR650317_2.fa, SRR650317_1.fa.rc and SRR650317_2.fa.rc:
[user@@cn3316 ~]$ acfs perl /acfs/reverse_complement.pl SRR650317_1.fasta 
[user@@cn3316 ~]$ acfs perl /acfs/reverse_complement.pl SRR650317_2.fasta 
[user@@cn3316 ~]$ acfs perl /acfs/change_fastq_header.pl SRR650317_1.fasta SRR650317_1.fa Truseq_SRR650317left
[user@@cn3316 ~]$ acfs perl /acfs/change_fastq_header.pl SRR650317_2.fasta SRR650317_2.fa Truseq_SRR650317right 
[user@@cn3316 ~]$ acfs perl /acfs/change_fastq_header.pl SRR650317_1.fasta.rc SRR650317_1.fa.rc Truseq_SRR650317left 
[user@@cn3316 ~]$ acfs perl /acfs/change_fastq_header.pl SRR650317_2.fasta.rc SRR650317_2.fa.rc Truseq_SRR650317right 
4) Produce files UNMAP and UNMAP_expr:
[user@@cn3316 ~]$ acfs perl Truseq_merge_unique_fa.pl UNMAP SRR650317_1.fa SRR650317_1.fa.rc SRR650317_2.fa SRR650317_2.fa.rc 
5) Build BWA index using human genome reference data (has been done already)
[user@@cn3316 ~]$ acfs bwa index /fdb/ACFS/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa 
6) Prepare for annotation (has been done already)
[user@@cn3316 ~]$ acfs perl /acfs/get_split_exon_border_biotype_genename.pl /fdb/ACFS/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf /fdb/ACFS/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes_split_exon.gtf 
7) Create and run the ACFS pipeline script:
[user@@cn3316 ~]$ acfs perl /acfs/ACF_MAKE.pl  $ACFS_TEST/SPEC_Homo_sapiens_Ensembl_GRCh37.txt BASH_Homo_sapiens_Ensembl_GRCh37.sh 
[user@@cn3316 ~]$ acfs bash BASH_Homo_sapiens_Ensembl_GRCh37.sh 
End the session:
[user@@cn3316 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. ACFS.sh). For example:

#!/bin/bash
module load ACFS 
(enter other relevant commands, like those listed above)

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] ACFS.sh