High-Performance Computing at the NIH
supernova on Biowulf

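Supernova assembles diploid genomes de novo from Chromium Linked-Reads generated from a single genome. Linked-Reads are produced by attaching a unique barcode to the reads originating from the same large DNA molecule. The assembly is done in 3 steps:

1. Convert the BCL files to FASTQ with supernova mkfastq
2. Create the assembly with supernova run
3. Generate FASTA output with supernova mkoutput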

Note that supernova is optimized to run on a single node. Assembly of a 56x coverage human genome should be possible with 248GB of memory.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=64g --cpus-per-task=6
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load supernova
[user@cn3144 ~]$ cp $SUPERNOVA_TEST_DATA/tiny-bcl-2.0.0.tar.gz .
[user@cn3144 ~]$ cp $SUPERNOVA_TEST_DATA/tiny-bcl-samplesheet-2.1.0.csv .
[user@cn3144 ~]$ tar -xzf tiny-bcl-2.0.0.tar.gz && rm -f tiny-bcl-2.0.0.tar.gz

Convert the BCL files to FASTQ. Note that in this and the following step, supernova assumes that it can use the whole node. If the node is not allocated exclusively, the --localcores and --localmem options must be used to limit resource usage to the allocated resources. The --localmem value is given in GB.

[user@cn3144 ~]$ supernova mkfastq \
    --run=tiny-bcl-2.0.0 \
    --samplesheet=tiny-bcl-samplesheet-2.1.0.csv \
    --delete-undetermined \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60
supernova mkfastq (1.2.0)
Copyright (c) 2017 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - 1.2.0 (2.1.3)
[...snip...]
Outputs:
- Run QC metrics:        H77WWBBXX/outs/qc_summary.json
- FASTQ output folder:   H77WWBBXX/outs/fastq_path
- Interop output folder: H77WWBBXX/outs/interop_path
- Input samplesheet:     H77WWBBXX/outs/input_samplesheet.csv

Pipestance completed successfully!
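
The hard-coded --localmem=60 used above only fits when the job was allocated at least that much memory. A minimal sketch of deriving the value from the allocation instead, assuming the job was requested with --mem so that Slurm sets SLURM_MEM_PER_NODE (in MB). The helper name localmem_gb and the 4 GB default headroom are illustrative assumptions, not part of supernova or the Biowulf setup:

```shell
# Print a --localmem value (in GB) that fits inside the Slurm allocation.
# $1: allocated memory in MB (defaults to SLURM_MEM_PER_NODE, which Slurm
#     sets when the job is requested with --mem)
# $2: headroom in GB to leave unused (assumed default: 4)
localmem_gb() {
    local mem_mb=${1:-${SLURM_MEM_PER_NODE:-0}}
    local headroom_gb=${2:-4}
    local gb=$(( mem_mb / 1024 - headroom_gb ))
    # Never report less than 1 GB, even for tiny or unknown allocations.
    if [ "$gb" -lt 1 ]; then
        gb=1
    fi
    echo "$gb"
}
```

Inside a job this could be used as --localmem=$(localmem_gb). For an allocation of 64 GB (65536 MB) the helper prints 60.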

Create the assembly

[user@cn3144 ~]$ supernova run \
    --id=Sample1 \
    --fastqs=H77WWBBXX/outs/fastq_path \
    --sample=Sample1 \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60
supernova run (1.2.0)
Copyright (c) 2017 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - 1.2.0 (2.1.3)
Running preflight checks (please wait)...
[...snip...]
Outputs:
- Run summary:        Sample1/outs/summary.csv
- Run report:         Sample1/outs/report.txt
- Raw assembly files: Sample1/outs/assembly

[user@cn3144 ~]$ cat Sample1/outs/report.txt
--------------------------------------------------------------------------------
SUMMARY
--------------------------------------------------------------------------------
- Tue Jul 18 10:55:58 2017
- [Sample1]  
- commit hash = b5305e091
- assembly checksum = 2,484,468,531,946,829
--------------------------------------------------------------------------------
INPUT
-    5.50 M  = READS          = number of reads; ideal 800M-1200M for human
-  138.50  b = MEAN READ LEN  = mean read length after trimming; ideal 140
-   17.66  x = EFFECTIVE COV  = effective read coverage; ideal ~42 for nominal 56x cov
-   77.72  % = READ TWO Q30   = fraction of Q30 bases in read 2; ideal 75-85
-  318.00  b = MEDIAN INSERT  = median insert size; ideal 0.35-0.40
-   87.12  % = PROPER PAIRS   = fraction of proper read pairs; ideal >= 75
-    0.00  b = MOLECULE LEN   = weighted mean molecule size; ideal 50-100
-    0.00  b = HETDIST        = mean distance between heterozygous SNPs
-    5.50  % = UNBAR          = fraction of reads that are not barcoded
-    6.00    = BARCODE N50    = N50 reads per barcode
-    2.52  % = DUPS           = fraction of reads that are duplicates
-    2.86  % = PHASED         = nonduplicate and phased reads; ideal 45-50
--------------------------------------------------------------------------------
OUTPUT
-    0.00    = LONG SCAFFOLDS = number of scaffolds >= 10 kb
-  356.00  b = EDGE N50       = N50 edge size
-    0.00  b = CONTIG N50     = N50 contig size
-    1.07 Kb = PHASEBLOCK N50 = N50 phase block size
-    0.00  b = SCAFFOLD N50   = N50 scaffold size
-    0.00  b = ASSEMBLY SIZE  = assembly size (only scaffolds >= 10 kb)
--------------------------------------------------------------------------------
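
The report lines follow a "- value unit = NAME = description" layout, so individual metrics can be pulled out for scripted checks. A rough sketch, assuming that layout holds for every metric line; report_metric is a hypothetical helper, not a supernova command, and it drops the unit (for example "Kb"):

```shell
# Print the numeric value of one metric from a supernova report.txt.
# $1: metric name exactly as printed in the report (e.g. "EFFECTIVE COV")
# $2: path to the report file
report_metric() {
    awk -F'=' -v name="$1" '
        /^-/ && NF >= 3 {
            key = $2
            gsub(/^ +| +$/, "", key)      # trim spaces around the metric name
            if (key == name) {
                split($1, parts, " ")      # parts: "-", value, optional unit
                print parts[2]
                exit
            }
        }' "$2"
}
```

For the report shown above, report_metric "EFFECTIVE COV" Sample1/outs/report.txt prints 17.66. Nothing is printed when the metric name is not found.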

Create fasta outputs for the 2 phased haplotypes

[user@cn3144 ~]$ supernova mkoutput \
    --asmdir=Sample1/outs/assembly \
    --outprefix=Sample1 \
    --style=pseudohap2
[...snip...]
[user@cn3144 ~]$ ls -lh
drwxr-xr-x 4 user group 4.0K Jul 18 10:03 H77WWBBXX
-rw-r--r-- 1 user group  970 Jul 18 10:00 __H77WWBBXX.mro
drwxr-xr-x 4 user group 4.0K Jul 18 10:56 Sample1
-rw-r--r-- 1 user group  81K Jul 18 10:59 Sample1.1.fasta.gz
-rw-r--r-- 1 user group  81K Jul 18 10:59 Sample1.2.fasta.gz
drwxr-xr-x 4 user group 4.0K Apr 15  2016 tiny-bcl-2.0.0
-rw-r--r-- 1 user group  127 Jul 18 09:54 tiny-bcl-samplesheet-2.1.0.csv
[user@cn3144 ~]$ zcat Sample1.1.fasta.gz | head -5
>1 edges=1274..4555 left=5450 right=1915 ver=1.7 style=4
GGCATCAAAGCGCTCCAAATGTCCACATCCAGATACTCCAGAAAGAGTGTTTCAAACCTGCTCTATGAAAGGGAATCTTC
AACTCTATGAGTTGAATGCAGACATCAGAAAGAAATTTCTGAGAATGCTGCTGTCTACCTTTTATTTGAATTCCCGCTTC
CAACGAAATCCTCCAAGCTATCCAAATATCCACTTGCAGATTCCACAAAAAGAGTGTTTCAAAACTGCTCTCTATCAATG
GCAAAGTTCAACTCTGTTAGTTGAGGACACATATCACCAACAAGTTTCTGAGAATGCTTCTGTCCATTTTTTATGGGAAG
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

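A quick sanity check on the gzipped FASTA files written by supernova mkoutput is to count their sequence records. A small sketch; count_fasta_records is a hypothetical helper name:

```shell
# Count the sequence records (header lines starting with ">") in a
# gzipped FASTA file.
# $1: path to a .fasta.gz file
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}
```

For example, count_fasta_records Sample1.1.fasta.gz and count_fasta_records Sample1.2.fasta.gz report the number of records in each of the two pseudohaplotype files.
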
Batch job
Most jobs should be run as batch jobs.

Create a batch script (e.g. supernova.sh). For example:

#! /bin/bash

module load supernova/2.0.1 || exit 1

supernova mkfastq \
    --run=tiny-bcl-2.0.0 \
    --samplesheet=tiny-bcl-samplesheet-2.1.0.csv \
    --delete-undetermined \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60

# supernova run on each sample separately; here only one sample
supernova run \
    --id=Sample1 \
    --fastqs=H77WWBBXX/outs/fastq_path \
    --sample=Sample1 \
    --localcores=${SLURM_CPUS_PER_TASK} \
    --localmem=60

# generate fasta of assembly.
supernova mkoutput \
    --asmdir=Sample1/outs/assembly \
    --outprefix=Sample1 \
    --style=pseudohap2

Submit this job using the Slurm sbatch command. The memory request should be at least as large as the --localmem value used in the script.

sbatch --cpus-per-task=16 --mem=64g supernova.sh
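
The comment in the batch script notes that supernova run is executed once per sample. When a sample sheet contains several samples, that step could be wrapped in a loop. A sketch under assumptions: run_all_samples and LOCALMEM_GB are hypothetical names, Sample2 in the usage line is a made-up sample, and the samples are assembled one after another within the same allocation:

```shell
# Run "supernova run" once per sample, one after another, in the same job.
# $1: FASTQ folder produced by "supernova mkfastq"
# remaining arguments: sample names as given in the sample sheet
# LOCALMEM_GB is a hypothetical variable; it defaults to the 60 GB used in
# the batch script.
run_all_samples() {
    local fastqs=$1
    shift
    local sample
    for sample in "$@"; do
        # Stop at the first failing sample instead of continuing silently.
        supernova run \
            --id="$sample" \
            --fastqs="$fastqs" \
            --sample="$sample" \
            --localcores="${SLURM_CPUS_PER_TASK:-1}" \
            --localmem="${LOCALMEM_GB:-60}" || return 1
    done
}
```

In the batch script this could replace the single supernova run call, for example run_all_samples H77WWBBXX/outs/fastq_path Sample1 Sample2. Because each assembly can run for a long time, submitting one job per sample may be the better choice for real data.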