Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454.
Velvet takes in short read sequences, removes errors and produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
Velvet is multithreaded via OpenMP; the $OMP_NUM_THREADS environment variable is set automatically to match $SLURM_CPUS_PER_TASK.
Here is a heuristic first posted by Simon Gladman to estimate memory usage for velvet:
```
velvetg RAM [GB] = (-109635 + 18977*RS + 86326*GS + 233353*NR - 51092*K) / 1048576
```

where

- RS = read size in nts
- GS = genome size in megabases (Mb)
- NR = number of reads in millions
- K = kmer hash value used in velveth

"The results are +/- 0.5 - 0.8 Gbytes on my system (64 bit Fedora 10 - quad core - 16Gb RAM)."
For example, for an assembly with k = 31 using 100 million 50nt reads and a genome size of 5 megabases:

```
  -109635
+  18977 * 50  =    948850
+  86326 * 5   =    431630
+ 233353 * 100 =  23335300
-  51092 * 31  = -1583852
                 ---------
                  23022293 kB ≈ 22 GB
```
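The heuristic can be wrapped in a small helper for quick estimates. This is a sketch; `velvet_mem` is a hypothetical function name, not part of Velvet:

```shell
#!/bin/bash
# Estimate velvetg memory (GB) with Simon Gladman's heuristic.
# Usage: velvet_mem <read_size_nt> <genome_size_Mb> <reads_millions> <kmer>
velvet_mem() {
    awk -v rs="$1" -v gs="$2" -v nr="$3" -v k="$4" 'BEGIN {
        printf "%.1f\n", (-109635 + 18977*rs + 86326*gs + 233353*nr - 51092*k) / 1048576
    }'
}

# The worked example above: 50nt reads, 5 Mb genome, 100 million reads, k = 31
velvet_mem 50 5 100 31    # prints 22.0
```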
The following example session will make use of data from Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4. This data contains short insert paired end reads (SRR292678) and long insert mate pair reads (SRR292770); the mate pair reads have to be reverse complemented, since Velvet expects reads in FR (inwards) orientation. The data can be obtained and preprocessed as follows:
```
[user@biowulf]$ sinteractive --cpus-per-task=6 --mem=15g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load sratoolkit cutadapt fastxtoolkit
[user@cn3144 ~]$ # paired end data with nominal insert size of 6kb
[user@cn3144 ~]$ fastq-dump --gzip --split-files -O fastq SRR292770
[user@cn3144 ~]$ # paired end data with nominal insert size of 470nts
[user@cn3144 ~]$ fastq-dump -O fastq --gzip --split-files SRR292678
[user@cn3144 ~]$ cd fastq
[user@cn3144 ~]$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
                     --minimum-length 30 -q 20,20 \
                     -o SRR292770_filt_1.fastq.gz -p SRR292770_filt_2.fastq.gz \
                     SRR292770_1.fastq.gz SRR292770_2.fastq.gz
[user@cn3144 ~]$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
                     --minimum-length 30 -q 20,20 \
                     -o SRR292678_filt_1.fastq.gz -p SRR292678_filt_2.fastq.gz \
                     SRR292678_1.fastq.gz SRR292678_2.fastq.gz
[user@cn3144 ~]$ zcat SRR292770_filt_1.fastq.gz \
                   | fastx_reverse_complement -z -o SRR292770_filt_rc_1.fastq.gz
[user@cn3144 ~]$ zcat SRR292770_filt_2.fastq.gz \
                   | fastx_reverse_complement -z -o SRR292770_filt_rc_2.fastq.gz
[user@cn3144 ~]$ cd ..
```
Assembly with Velvet is a two step process. The first step, velveth, builds the kmer hash index. One important parameter is the kmer size; in this example, we will try kmer sizes from 37 to 45. Note that kmer sizes have to be odd and that libraries should be ordered from smallest to largest insert size. Note also that each library goes into its own channel, as indicated by the number after the read type option:
```
[user@cn3144 ~]$ module load velvet
[user@cn3144 ~]$ mkdir velvet && cd velvet
[user@cn3144 ~]$ velveth auto 37,47,2 \
                     -shortPaired -fastq.gz -separate \
                     ../fastq/SRR292678_filt_1.fastq.gz ../fastq/SRR292678_filt_2.fastq.gz \
                     -shortPaired2 -fastq.gz -separate \
                     ../fastq/SRR292770_filt_rc_1.fastq.gz ../fastq/SRR292770_filt_rc_2.fastq.gz
[0.000000] Reading FastQ file ../fastq/SRR292678_filt_1.fastq.gz;
[0.069126] Reading FastQ file ../fastq/SRR292678_filt_2.fastq.gz;
[54.729141] 12221686 sequences found in total in the paired sequence files
...
```
Next, velvetg builds the actual graph and does the assembly. Note that the values for insert length and standard deviation (sd) were taken from the publication:
```
[user@cn3144 ~]$ velvetg auto_37 -exp_cov auto -cov_cutoff auto \
                     -ins_length 468 -ins_length_sd 31 -ins_length2 6193 \
                     -ins_length2_sd 566 -shortMatePaired2 yes
....
[565.945619] Estimated Coverage = 136.638609
[565.945642] Estimated Coverage cutoff = 68.319305
Final graph has 611 nodes and n50 of 1451863, max 1806834, total 5455514, using 20748350/22424018 reads

[user@cn3144 ~]$ velvetg auto_39 -exp_cov auto -cov_cutoff auto \
                     -ins_length 468 -ins_length_sd 31 -ins_length2 6193 \
                     -ins_length2_sd 566 -shortMatePaired2 yes
...
[521.462552] Estimated Coverage = 128.491151
[521.462575] Estimated Coverage cutoff = 64.245575
Final graph has 562 nodes and n50 of 3226047, max 3226047, total 5467489, using 20530065/22424018 reads

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
```
Create a batch input file (e.g. velvet.sh) similar to the following example:
```
#!/bin/bash
set -e

cd /data/$USER/test_data/velvet
module load velvet/1.2.10 || exit 1

velveth k39 39 \
    -shortPaired -fastq.gz -separate \
    ../fastq/SRR292678_filt_1.fastq.gz ../fastq/SRR292678_filt_2.fastq.gz \
    -shortPaired2 -fastq.gz -separate \
    ../fastq/SRR292770_filt_rc_1.fastq.gz ../fastq/SRR292770_filt_rc_2.fastq.gz

velvetg k39 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
```
Submit this job using the Slurm sbatch command.
```
sbatch --cpus-per-task=6 --mem=12g velvet.sh
```
Note that there is a limit to how much time can be saved by adding more cores to the calculation:
| cores | time [s] |
|-------|----------|
| 2     | 639      |
| 4     | 411      |
| 6     | 348      |
| 8     | 319      |
This is likely a combination of diminishing returns with increasing thread counts and the fact that only parts of the algorithm are parallelized.
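A rough Amdahl's law fit to the timings above illustrates the point: solving T(p) = ts + tp/p against the 2- and 8-core rows suggests a serial portion of roughly 210 s, which caps the achievable speedup. This is a back-of-the-envelope sketch, not a measured decomposition of Velvet's runtime:

```shell
# Fit T(p) = ts + tp/p to the measured 2- and 8-core times,
# then check the prediction against the 4- and 6-core rows.
awk 'BEGIN {
    t2 = 639; t8 = 319                # measured times [s]
    tp = (t2 - t8) / (1/2 - 1/8)      # parallelizable portion at 1 core
    ts = t2 - tp/2                    # serial portion
    printf "serial ~%.0f s, parallel ~%.0f s\n", ts, tp
    printf "predicted T(4) = %.0f s (measured 411)\n", ts + tp/4
    printf "predicted T(6) = %.0f s (measured 348)\n", ts + tp/6
}'
```

The predictions (426 s and 355 s) land close to the measured 4- and 6-core times, consistent with a fixed serial fraction limiting scaling.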
Create a swarmfile (e.g. velvet.swarm). For example:
```
velvetg auto_39 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
velvetg auto_41 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
velvetg auto_43 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
```
Submit this job using the swarm command.
```
swarm -f velvet.swarm -g 10 -t 4 --module velvet/1.2.10
```

where
| option | description |
|--------|-------------|
| -g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file) |
| --module velvet | Loads the velvet module for each subjob in the swarm |