Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser. It can also assembly long-read-only sets (PacBio or Nanopore) where it runs a miniasm+Racon pipeline.
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load unicycler
[+] Loading python 3.5 ...
[+] Loading bowtie 2-2.3.5.1
[+] Loading racon 1.3.2 ...
[+] Loading blast 2.9.0+ ...
[+] Loading pilon 1.23
[+] Loading unicycler 0.4.8
[user@cn3144 ~]$ unicycler --help
usage: unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2] [-s UNPAIRED] [-l LONG] -o OUT [--verbosity VERBOSITY] [--min_fasta_length MIN_FASTA_LENGTH]
[--keep KEEP] [-t THREADS] [--mode {conservative,normal,bold}] [--linear_seqs LINEAR_SEQS] [--vcf]
__
\ \___
\ ___\
//
____// _ _ _ _
//_ //\\ | | | | |_| | |
// \// \\ | | | | _ __ _ ___ _ _ ___ | | ___ _ __
|| (O) || | | | || '_ \ | | / __|| | | | / __|| | / _ \| '__|
\\ \_ // | |__| || | | || || (__ | |_| || (__ | || __/| |
\\_____// \____/ |_| |_||_| \___| \__, | \___||_| \___||_|
__/ |
|___/
Unicycler: an assembly pipeline for bacterial genomes
Help:
-h, --help Show this help message and exit
--help_all Show a help message with all program options
--version Show Unicycler's version number
Input:
-1 SHORT1, --short1 SHORT1 FASTQ file of first short reads in each pair (required)
-2 SHORT2, --short2 SHORT2 FASTQ file of second short reads in each pair (required)
-s UNPAIRED, --unpaired UNPAIRED FASTQ file of unpaired short reads (optional)
-l LONG, --long LONG FASTQ or FASTA file of long reads (optional)
Output:
-o OUT, --out OUT Output directory (required)
--verbosity VERBOSITY Level of stdout and log file information (default: 1)
0 = no stdout, 1 = basic progress indicators, 2 = extra info, 3 = debugging info
--min_fasta_length MIN_FASTA_LENGTH Exclude contigs from the FASTA file which are shorter than this length (default: 100)
--keep KEEP Level of file retention (default: 1)
0 = only keep final files: assembly (FASTA, GFA and log), 1 = also save graphs at main checkpoints,
2 = also keep SAM (enables fast rerun in different mode), 3 = keep all temp files and save all graphs (for debugging)
--vcf Produce a VCF by mapping the short reads to the final assembly (experimental, default: do not produce a vcf file)
Other:
-t THREADS, --threads THREADS Number of threads used (default: 8)
--mode {conservative,normal,bold} Bridging mode (default: normal)
conservative = smaller contigs, lowest misassembly rate
normal = moderate contig size and misassembly rate
bold = longest contigs, higher misassembly rate
--linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in the underlying sequence (default: 0)
[user@cn3144 ~]$ unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -o output_dir
Starting Unicycler (2019-11-22 10:42:56)
Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you provided only short reads, Unicycler will essentially function as a SPAdes-optimiser. It
will try many k-mer sizes, choose the best based on contig length and graph connectivity, and scaffold the graph using SPAdes repeat resolution.
For more information, please see https://github.com/rrwick/Unicycler
Command: unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -o output_dir
Unicycler version: v0.4.8
Using 8 threads
Dependencies:
Program Version Status
spades.py 3.13.1 good
racon not used
makeblastdb 2.9.0+ good
tblastn 2.9.0+ good
bowtie2-build 2.3.5.1 good
bowtie2 2.3.5.1 good
samtools 1.9 good
java 1.8.0_181 good
pilon 1.23 good
bcftools not used
SPAdes read error correction (2019-11-22 10:42:59)
Unicycler uses the SPAdes read error correction module to reduce the number of errors in the short read before SPAdes assembly. This can make the assembly faster and
simplify the assembly graph structure.
Command: spades.py -1 TEST_DATA/short_reads_1.fastq.gz -2 TEST_DATA/short_reads_2.fastq.gz -o TEST_DATA/output_dir/spades_assembly/read_correction --threads 8 --only-error-correction
Corrected reads:
TEST_DATA/output_dir/spades_assembly/corrected_1.fastq.gz
TEST_DATA/output_dir/spades_assembly/corrected_2.fastq.gz
Choosing k-mer range for assembly (2019-11-22 10:43:38)
Unicycler chooses a k-mer range for SPAdes based on the length of the input reads. It uses a wide range of many k-mer sizes to maximise the chance of finding an ideal
assembly.
SPAdes maximum k-mer: 127
Median read length: 125
K-mer range: 25, 43, 59, 73, 83, 93, 101, 107, 113, 119
SPAdes assemblies (2019-11-22 10:43:39)
Unicycler now uses SPAdes to assemble the short reads. It scores the assembly graph for each k-mer using the number of contigs (fewer is better) and the number of dead
ends (fewer is better). The score function is 1/(c*(d+2)), where c is the contig count and d is the dead end count.
K-mer Contigs Dead ends Score
25 failed
43 failed
59 failed
73 failed
83 failed
93 failed
101 failed
107 failed
113 failed
119 111 28 3.00e-04 ← best
Read depth filter: removed 0 contigs totalling 0 bp
Deleting TEST_DATA/output_dir/spades_assembly/
Determining graph multiplicity (2019-11-22 10:44:38)
Multiplicity is the number of times a sequence occurs in the underlying sequence. Single-copy contigs (those with a multiplicity of one, occurring only once in the
underlying sequence) are particularly useful.
Saving TEST_DATA/output_dir/001_best_spades_graph.gfa
Cleaning graph (2019-11-22 10:44:38)
Unicycler now performs various cleaning procedures on the graph to remove overlaps and simplify the graph structure. The end result is a graph ready for bridging.
Graph overlaps removed
Removed zero-length segments:
83, 84, 86, 87, 91, 94, 100, 103
Removed zero-length segments:
85
Merged small segments:
108, 109, 110, 111
Saving TEST_DATA/output_dir/002_overlaps_removed.gfa
Unicycler now selects a set of anchor contigs from the single-copy contigs. These are the contigs which will be connected via bridges to form the final assembly.
36 anchor segments (174,096 bp) out of 98 total segments (191,318 bp)
Creating SPAdes contig bridges (2019-11-22 10:44:38)
SPAdes uses paired-end information to perform repeat resolution (RR) and produce contigs from the assembly graph. SPAdes saves the graph paths corresponding to these
contigs in the contigs.paths file. When one of these paths contains two or more anchor contigs, Unicycler can create a bridge from the path.
Bridge
Start Path End quality
-4 12 55.1
1 -22 62.8
9 -34 1 9.2
11 77 → -62 22 59.9
12 47 28 54.7
18 -77 → 76 → 80 33 42.8
31 -72 -32 62.2
33 -80 → 85 30 51.2
37 75 11 60.3
Creating loop unrolling bridges (2019-11-22 10:44:38)
When a SPAdes contig path connects an anchor contig with the middle contig of a simple loop, Unicycler concludes that the sequences are contiguous (i.e. the loop is
not a separate piece of DNA). It then uses the read depth of the middle and repeat contigs to guess the number of times to traverse the loop and makes a bridge.
No loop unrolling bridges made
Applying bridges (2019-11-22 10:44:38)
Unicycler now applies to the graph in decreasing order of quality. This ensures that when multiple, contradictory bridges exist, the most supported option is used.
Bridge type Start → end Path Quality
SPAdes 1 → -22 62.822
SPAdes 31 → -32 -72 62.196
SPAdes 37 → 11 75 60.300
SPAdes 11 → 22 77, -62 59.937
SPAdes -4 → 12 55.117
SPAdes 12 → 28 47 54.728
SPAdes 33 → 30 -80, 85 51.155
SPAdes 18 → 33 -77, 76, 80 42.846
Saving TEST_DATA/output_dir/003_bridges_applied.gfa
Bridged assembly graph (2019-11-22 10:44:38)
The assembly is now mostly finished and no more structural changes will be made. Ideally the assembly graph should now have one contig per replicon and no erroneous
contigs (i.e a complete assembly). If there are more contigs, then the assembly is not complete.
Saving TEST_DATA/output_dir/004_final_clean.gfa
Component Segments Links Length N50 Longest segment Status
total 66 77 189,317 7,866 43,777
1 63 76 178,782 8,115 43,777 incomplete
2 1 1 5,153 5,153 5,153 complete
3 1 0 3,283 3,283 3,283 incomplete
4 1 0 2,099 2,099 2,099 incomplete
Polishing assembly with Pilon (2019-11-22 10:44:38)
Unicycler now conducts multiple rounds of Pilon in an attempt to repair any remaining small-scale errors with the assembly.
Aligning reads to find appropriate insert size range...
Insert size 1st percentile: 264
Insert size 99th percentile: 549
Pilon polish round 1
Unable to polish assembly using Pilon: Pilon did not produce FASTA file
Rotating completed replicons (2019-11-22 10:45:09)
Any completed circular contigs (i.e. single contigs which have one link connecting end to start) can have their start position changed without altering the sequence.
For consistency, Unicycler now searches for a starting gene (dnaA or repA) in each such contig, and if one is found, the contig is rotated to start with that gene on the
forward strand.
Segment Length Depth Starting gene Position Strand Identity Coverage
11 5,153 4.01x none found
Assembly complete (2019-11-22 10:45:23)
Saving TEST_DATA/output_dir/assembly.gfa
Saving TEST_DATA/output_dir/assembly.fasta
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. unicycler.sh). For example:
#!/bin/bash
# sbatch --gres=lscratch:10 --cpus-per-task=16 unicycler.sh
set -e
module load unicycler
cp /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/short_reads_1.fastq.gz /lscratch/$SLURM_JOB_ID/.
cp /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/short_reads_2.fastq.gz /lscratch/$SLURM_JOB_ID/.
unicycler -1 /lscratch/$SLURM_JOB_ID/short_reads_1.fastq.gz \
-2 /lscratch/$SLURM_JOB_ID/short_reads_2.fastq.gz \
-o /lscratch/$SLURM_JOB_ID/OUTPUT
cp -fr /lscratch/$SLURM_JOB_ID/OUTPUT /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/.
Submit this job using the Slurm sbatch command.
sbatch [--gres=lscratch:#] [--cpus-per-task=#] [--mem=#] unicycler.sh
Create a swarmfile (e.g. unicycler.swarm). For example:
#
# swarm -f unicycler.swarm --gres=lscratch:10 --threads-per-process=16 --module unicycler
#
cp /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/short_reads_1.fastq.gz /lscratch/$SLURM_JOB_ID/; \
cp /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/short_reads_2.fastq.gz /lscratch/$SLURM_JOB_ID/; \
unicycler -1 /lscratch/$SLURM_JOB_ID/short_reads_1.fastq.gz \
-2 /lscratch/$SLURM_JOB_ID/short_reads_2.fastq.gz \
-o /lscratch/$SLURM_JOB_ID/OUTPUT; \
mv /lscratch/$SLURM_JOB_ID/OUTPUT /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/.
cp /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/long_reads_high_depth.fastq.gz /lscratch/$SLURM_JOB_ID/ ; \
unicycler -l /lscratch/$SLURM_JOB_ID/long_reads_high_depth.fastq.gz \
-o /lscratch/$SLURM_JOB_ID/OUTPUT2; \
mv /lscratch/$SLURM_JOB_ID/OUTPUT2 /data/$SLURM_JOB_USER/UNICYCLER_TEST/TEST_DATA/.
Submit this job using the swarm command.
swarm -f unicycler.swarm [--gres=lscratch:#] [-g #] [-t #] --module unicyclerwhere
| -g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
| --module unicycler | Loads the unicycler module for each subjob in the swarm |