High-Performance Computing at the NIH
supernova on Biowulf & Helix

Description

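Supernova is used to de novo assemble diploid genomes from Chromium Linked-Reads generated from a single genome. Linked-Reads are generated by attaching a unique barcode to the reads originating from a single large DNA molecule. The assembly is done in 3 steps: converting the bcl files to fastq (supernova mkfastq), creating the assembly (supernova run), and creating fasta output from the assembly (supernova mkoutput).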

Note that supernova is optimized to run on a single node. Assembly of a 56x coverage human genome should be possible with 248GB of memory.

There may be multiple versions of supernova available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail supernova 

To select a module use

module load supernova/[version]

where [version] is the version of choice.

supernova is a multithreaded application. Make sure the number of threads it uses matches the number of CPUs requested for the job.
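As a minimal sketch of matching threads and memory to the allocation, the values for the --localcores and --localmem options can be derived from the job environment. This assumes that Slurm sets SLURM_CPUS_PER_TASK and SLURM_MEM_PER_NODE (in MB) for the job; the helper name localmem_gb and the 4 GB headroom are illustrative choices, not part of supernova.

```shell
# Sketch only: derive supernova resource limits from the Slurm allocation.
# Assumptions: SLURM_CPUS_PER_TASK and SLURM_MEM_PER_NODE (in MB) are set by
# Slurm; localmem_gb and the default 4 GB headroom are hypothetical choices.

# Convert an allocation in MB to a --localmem value in GB, leaving some
# headroom below the allocation and never going below 1 GB.
localmem_gb() {
    lm_mem_mb=$1
    lm_headroom_gb=${2:-4}
    lm_gb=$(( lm_mem_mb / 1024 - lm_headroom_gb ))
    if [ "$lm_gb" -lt 1 ]; then
        lm_gb=1
    fi
    echo "$lm_gb"
}

# Fall back to small defaults when run outside of a Slurm job.
cores=${SLURM_CPUS_PER_TASK:-2}
mem=$(localmem_gb "${SLURM_MEM_PER_NODE:-8192}")
echo "--localcores=${cores} --localmem=${mem}"
```

With a 64 GB allocation (65536 MB) this yields --localmem=60, the value used in the examples on this page.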

Environment variables set

Dependencies

Dependencies are loaded automatically

References

Documentation

Interactive job on Biowulf

Allocate an interactive session with sinteractive, set up the environment, and copy the test data

biowulf$ sinteractive --mem=64g --cpus-per-task=16
node$ module load supernova
[+] Loading bcl2fastq 2.19.0
[+] Loading supernova 1.2.0
node$ cp $SUPERNOVA_TEST_DATA/tiny-bcl-2.0.0.tar.gz .
node$ cp $SUPERNOVA_TEST_DATA/tiny-bcl-samplesheet-2.1.0.csv .
node$ tar -xzf tiny-bcl-2.0.0.tar.gz && rm -f tiny-bcl-2.0.0.tar.gz

Convert the bcl files to fastq. Note that in this and the following step, supernova assumes that it can use the whole node. If the node is not allocated exclusively, use the --localcores and --localmem options to limit resource usage to the allocated resources.

node$ supernova mkfastq \
    --run=tiny-bcl-2.0.0 \
    --samplesheet=tiny-bcl-samplesheet-2.1.0.csv \
    --delete-undetermined \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60
supernova mkfastq (1.2.0)
Copyright (c) 2017 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - 1.2.0 (2.1.3)
[...snip...]
Outputs:
- Run QC metrics:        H77WWBBXX/outs/qc_summary.json
- FASTQ output folder:   H77WWBBXX/outs/fastq_path
- Interop output folder: H77WWBBXX/outs/interop_path
- Input samplesheet:     H77WWBBXX/outs/input_samplesheet.csv

Pipestance completed successfully!

Create the assembly

node$ supernova run \
    --id=Sample1 \
    --fastqs=H77WWBBXX/outs/fastq_path \
    --sample=Sample1 \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60
supernova run (1.2.0)
Copyright (c) 2017 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - 1.2.0 (2.1.3)
Running preflight checks (please wait)...
[...snip...]
Outputs:
- Run summary:        Sample1/outs/summary.csv
- Run report:         Sample1/outs/report.txt
- Raw assembly files: Sample1/outs/assembly

node$ cat Sample1/outs/report.txt
--------------------------------------------------------------------------------
SUMMARY
--------------------------------------------------------------------------------
- Tue Jul 18 10:55:58 2017
- [Sample1]  
- commit hash = b5305e091
- assembly checksum = 2,484,468,531,946,829
--------------------------------------------------------------------------------
INPUT
-    5.50 M  = READS          = number of reads; ideal 800M-1200M for human
-  138.50  b = MEAN READ LEN  = mean read length after trimming; ideal 140
-   17.66  x = EFFECTIVE COV  = effective read coverage; ideal ~42 for nominal 56x cov
-   77.72  % = READ TWO Q30   = fraction of Q30 bases in read 2; ideal 75-85
-  318.00  b = MEDIAN INSERT  = median insert size; ideal 0.35-0.40
-   87.12  % = PROPER PAIRS   = fraction of proper read pairs; ideal >= 75
-    0.00  b = MOLECULE LEN   = weighted mean molecule size; ideal 50-100
-    0.00  b = HETDIST        = mean distance between heterozygous SNPs
-    5.50  % = UNBAR          = fraction of reads that are not barcoded
-    6.00    = BARCODE N50    = N50 reads per barcode
-    2.52  % = DUPS           = fraction of reads that are duplicates
-    2.86  % = PHASED         = nonduplicate and phased reads; ideal 45-50
--------------------------------------------------------------------------------
OUTPUT
-    0.00    = LONG SCAFFOLDS = number of scaffolds >= 10 kb
-  356.00  b = EDGE N50       = N50 edge size
-    0.00  b = CONTIG N50     = N50 contig size
-    1.07 Kb = PHASEBLOCK N50 = N50 phase block size
-    0.00  b = SCAFFOLD N50   = N50 scaffold size
-    0.00  b = ASSEMBLY SIZE  = assembly size (only scaffolds >= 10 kb)
--------------------------------------------------------------------------------
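When several assemblies need to be compared, it can help to pull individual values out of report.txt. The following is a rough sketch, assuming the report keeps the "- value unit = NAME = description" layout shown above; the function name report_metric is hypothetical and not part of supernova.

```shell
# Sketch only: print the value (and unit, if any) of one metric from a
# supernova report.txt. Assumes the "- value unit = NAME = description"
# layout shown above; report_metric is a hypothetical helper name.
report_metric() {
    # $1 = path to report.txt, $2 = metric name, e.g. "CONTIG N50"
    awk -F'=' -v name="$2" '
        NF >= 3 {
            key = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim spaces around the name
            if (key == name) {
                n = split($1, parts, " ")      # parts: "-", value, optional unit
                out = parts[2]
                if (n >= 3) out = out " " parts[3]
                print out
                exit
            }
        }' "$1"
}

# Example call (not executed here):
#   report_metric Sample1/outs/report.txt "EFFECTIVE COV"
```

Header lines such as the commit hash have only one "=" and are skipped by the NF >= 3 condition.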

Create fasta outputs for the 2 phased haplotypes

node$ supernova mkoutput \
    --asmdir=Sample1/outs/assembly \
    --outprefix=Sample1 \
    --style=pseudohap2
[...snip...]
node$ ls -lh
drwxr-xr-x 4 user group 4.0K Jul 18 10:03 H77WWBBXX
-rw-r--r-- 1 user group  970 Jul 18 10:00 __H77WWBBXX.mro
drwxr-xr-x 4 user group 4.0K Jul 18 10:56 Sample1
-rw-r--r-- 1 user group  81K Jul 18 10:59 Sample1.1.fasta.gz
-rw-r--r-- 1 user group  81K Jul 18 10:59 Sample1.2.fasta.gz
drwxr-xr-x 4 user group 4.0K Apr 15  2016 tiny-bcl-2.0.0
-rw-r--r-- 1 user group  127 Jul 18 09:54 tiny-bcl-samplesheet-2.1.0.csv
node$ zcat Sample1.1.fasta.gz | head -5
>1 edges=1274..4555 left=5450 right=1915 ver=1.7 style=4
GGCATCAAAGCGCTCCAAATGTCCACATCCAGATACTCCAGAAAGAGTGTTTCAAACCTGCTCTATGAAAGGGAATCTTC
AACTCTATGAGTTGAATGCAGACATCAGAAAGAAATTTCTGAGAATGCTGCTGTCTACCTTTTATTTGAATTCCCGCTTC
CAACGAAATCCTCCAAGCTATCCAAATATCCACTTGCAGATTCCACAAAAAGAGTGTTTCAAAACTGCTCTCTATCAATG
GCAAAGTTCAACTCTGTTAGTTGAGGACACATATCACCAACAAGTTTCTGAGAATGCTTCTGTCCATTTTTTATGGGAAG
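A quick sanity check of the two pseudohaplotype files is to count their records and bases. This is a small sketch using only standard tools (gzip, grep, awk); the function names are hypothetical helpers, not supernova commands.

```shell
# Sketch only: basic sanity checks on a gzipped FASTA file.
# count_fasta_records and fasta_total_bases are hypothetical helper names.

# Number of records, i.e. header lines starting with ">".
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Total number of sequence characters, ignoring header lines.
fasta_total_bases() {
    gzip -dc "$1" | awk '!/^>/ { n += length($0) } END { print n + 0 }'
}

# Example calls (not executed here):
#   count_fasta_records Sample1.1.fasta.gz
#   fasta_total_bases Sample1.2.fasta.gz
```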

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is supernova.batch

module load supernova/1.2.0 || exit 1

supernova mkfastq \
    --run=tiny-bcl-2.0.0 \
    --samplesheet=tiny-bcl-samplesheet-2.1.0.csv \
    --delete-undetermined \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60

# run supernova on each sample separately; here there is only one sample
supernova run \
    --id=Sample1 \
    --fastqs=H77WWBBXX/outs/fastq_path \
    --sample=Sample1 \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60

# generate fasta of assembly.
supernova mkoutput \
    --asmdir=Sample1/outs/assembly \
    --outprefix=Sample1 \
    --style=pseudohap2

Submit to the queue with sbatch:

biowulf$ sbatch --mem=64g --cpus-per-task=16 supernova.batch
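If a run contains more than one sample, the supernova run step of the batch script could be repeated once per sample. The sketch below only prints the commands so they can be reviewed first; it uses the same options as the example above. The function name supernova_run_cmds, the LOCALMEM_GB variable, and the sample name Sample2 are hypothetical.

```shell
# Sketch only: print one "supernova run" command per sample.
# Assumptions: supernova_run_cmds and LOCALMEM_GB are hypothetical names;
# Sample2 is a made-up sample used for illustration.
supernova_run_cmds() {
    # $1 = fastq folder written by mkfastq, remaining arguments = sample names
    src_fastqs=$1
    shift
    for src_sample in "$@"; do
        printf 'supernova run --id=%s --fastqs=%s --sample=%s --localcores=%s --localmem=%s\n' \
            "$src_sample" "$src_fastqs" "$src_sample" \
            "${SLURM_CPUS_PER_TASK:-16}" "${LOCALMEM_GB:-60}"
    done
}

# Preview the commands for two samples.
supernova_run_cmds H77WWBBXX/outs/fastq_path Sample1 Sample2
```

Because supernova is optimized to use a single node, the printed commands are meant to be run one after the other within the allocation rather than in parallel.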