High-Performance Computing at the NIH
scallop on Biowulf & Helix

Description

From the scallop repository:

Scallop is an accurate reference-based transcript assembler. Scallop features high accuracy in assembling multi-exon transcripts as well as lowly expressed transcripts. It achieves this improvement through a novel algorithm that provably preserves all phasing paths from paired-end reads while also achieving both transcript parsimony and coverage deviation minimization.

There may be multiple versions of scallop available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail scallop 

To select a module use

module load scallop/[version]

where [version] is the version of choice.

Environment variables set

References

Documentation

Interactive job on Biowulf

Allocate an interactive session with sinteractive and run scallop as in the sample session below.

biowulf$ sinteractive --mem=7g
salloc.exe: Pending job allocation 38002478
salloc.exe: job 38002478 queued and waiting for resources
[...snip...]
salloc.exe: Nodes cn2312 are ready for job
node$ module load scallop/0.9.8
[+] Loading scallop 0.9.8

node$ # copy some example data - paired end RNA-Seq (101nt) of human
      # skin aligned with STAR for about a quarter of chr8 including Myc.
      # this is stranded RNA-Seq data
node$ cp $SCALLOP_TEST_DATA/ENCSR862RGX.bam .
node$ ls -lh ENCSR862RGX.bam
-rw-r--r-- 1 user group 81M Apr 18 07:30 ENCSR862RGX.bam

node$ # run scallop
node$ scallop --library_type second --min_transcript_length 300 \
             -i ENCSR862RGX.bam -o ENCSR862RGX.gtf
command line: scallop --library_type second --min_transcript_length 300 -i ENCSR862RGX.bam -o ENCSR862RGX.gtf


Bundle 0: tid = 7, #hits = 93, #partial-exons = 22, range = chr8:101915822-102124299, orient = + (93, 0, 0)
process splice graph bundle.0.0 type = 0, vertices = 5, edges = 4
process splice graph bundle.0.1 type = 1, vertices = 3, edges = 0
process splice graph bundle.0.2 type = 1, vertices = 3, edges = 0
process splice graph bundle.0.3 type = 1, vertices = 3, edges = 0
[...snip...]
node$ wc -l ENCSR862RGX.gtf
5297 ENCSR862RGX.gtf
node$ egrep '"bundle.1.5"' ENCSR862RGX.gtf | head -3
chr8    scallop transcript      102204502       102239040       1000    +       .       gene_id "bundle.1.5"; transcript_id "bundle.1.5.2"; RPKM "22.4072"; cov "1.9556";
chr8    scallop exon    102204502       102205959       1000    +       .       gene_id "bundle.1.5"; transcript_id "bundle.1.5.2"; exon "1";
chr8    scallop exon    102208095       102208285       1000    +       .       gene_id "bundle.1.5"; transcript_id "bundle.1.5.2"; exon "2";

node$ exit
biowulf$
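
The GTF file written in the session above can be summarized with standard tools. The following is a minimal sketch, assuming the tab-separated nine-column GTF layout in which column 3 holds the feature type (transcript or exon). The function name count_gtf_features is hypothetical and is not part of scallop.

```shell
# count_gtf_features is a hypothetical helper, not part of scallop.
# It tallies the feature types (column 3) of a tab-separated GTF file
# and prints one "feature<TAB>count" line per type, sorted by name.
count_gtf_features() {
    local gtf="$1"
    awk -F'\t' '
        # skip comment lines and anything without the nine GTF columns
        !/^#/ && NF >= 9 { n[$3]++ }
        END { for (f in n) printf "%s\t%d\n", f, n[f] }
    ' "$gtf" | sort
}

# Example call (assumes the file from the session above exists):
# count_gtf_features ENCSR862RGX.gtf
```

The number reported for "transcript" is the number of assembled transcripts; the total of all counts should match the line count from wc -l as long as the file has no comment lines.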

Example of scallop assembled transcripts for chr8:103,394,722-103,446,765. The scallop transcripts are shown in black. Gencode v24 annotation is shown in blue.

[Figure: scallop example results]

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is scallop.batch

module load scallop/0.9.8 || exit 1
scallop --verbose 0 --library_type second \
    -i ENCSR862RGX.bam -o ENCSR862RGX.gtf

Submit to the queue with sbatch:

biowulf$ sbatch --mem=7g scallop.batch
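
When several BAM files need to be assembled, one scallop command line per sample can be generated with the same options as the batch script above. The following is a minimal sketch, assuming all alignments sit in one directory, end in .bam, and have no whitespace in their paths. The function name make_scallop_cmds and the output naming scheme are hypothetical.

```shell
# make_scallop_cmds is a hypothetical helper, not part of scallop.
# For every *.bam file in the given directory (default: current directory)
# it prints one scallop command line that writes <sample>.gtf next to the
# input. Options mirror the batch script above.
make_scallop_cmds() {
    local dir="${1:-.}" bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue   # skip when the glob matches nothing
        printf 'scallop --verbose 0 --library_type second -i %s -o %s\n' \
            "$bam" "${bam%.bam}.gtf"
    done
}

# Example call: write the commands to a file for later inspection
# make_scallop_cmds /path/to/bams > scallop.cmds
```

The resulting file holds one independent command per line, so each line can be reviewed before anything is submitted.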