velvet on Biowulf

Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454.

Velvet takes in short read sequences, removes errors and produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

Important Notes
Memory usage

Here is a heuristic first posted by Simon Gladman to estimate memory usage for velvet:

velvetg RAM [GB] = (-109635 + 18977 * RS + 86326 * GS + 233353 * NR - 51092 * K) / 1048576

where
   RS = Read size in nts
   GS = Genome size in megabases (Mb)
   NR = number of reads in millions
   K  = kmer hash value used in velveth

According to the original posting, the estimate was accurate to within about 0.5-0.8 GB on a 64-bit Fedora 10 quad-core system with 16 GB RAM.

For example, for an assembly with k = 31 using 100 million 50 nt reads and a genome size of 5 Mb:

                 -109635
  18977 x  50 =   948850
  86326 x   5 =   431630
 233353 x 100 = 23335300
 -51092 x  31 = -1583852
                --------
                23022293 kB
              ≈       22 GB
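
The same arithmetic is easy to script. Below is a minimal bash sketch (the helper name velvet_mem_gb is our own, not part of Velvet) that evaluates the heuristic with integer arithmetic; treat the result as a rough estimate and request some extra headroom in your job:

#!/bin/bash
# Rough velvetg memory estimate in GB.
# Arguments: read size (nt), genome size (Mb), number of reads (millions), k-mer size.
velvet_mem_gb() {
    local rs=$1 gs=$2 nr=$3 k=$4
    echo $(( (-109635 + 18977*rs + 86326*gs + 233353*nr - 51092*k) / 1048576 ))
}

velvet_mem_gb 50 5 100 31    # prints 21 (integer division truncates ~21.96 GB)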
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

The following example session uses data from Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4. The data include a short insert paired-end library (SRR292678) and a long insert mate pair library (SRR292770); the mate pair reads have to be reverse complemented since Velvet expects reads in FR (inwards) orientation. The data can be obtained and preprocessed as follows:

[user@biowulf]$ sinteractive --cpus-per-task=6 --mem=15g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load sratoolkit cutadapt fastxtoolkit
[user@cn3144 ~]$ # mate pair data with nominal insert size of 6kb
[user@cn3144 ~]$ fastq-dump --gzip --split-files -O fastq SRR292770
[user@cn3144 ~]$ # paired end data with nominal insert size of 470nts 
[user@cn3144 ~]$ fastq-dump -O fastq --gzip --split-files SRR292678
[user@cn3144 ~]$ cd fastq
[user@cn3144 ~]$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC  \
           --minimum-length 30 -q 20,20 \
           -o SRR292770_filt_1.fastq.gz -p SRR292770_filt_2.fastq.gz \
           SRR292770_1.fastq.gz SRR292770_2.fastq.gz
[user@cn3144 ~]$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC  \
           --minimum-length 30 -q 20,20 \
           -o SRR292678_filt_1.fastq.gz -p SRR292678_filt_2.fastq.gz \
           SRR292678_1.fastq.gz SRR292678_2.fastq.gz
[user@cn3144 ~]$ zcat SRR292770_filt_1.fastq.gz \
               | fastx_reverse_complement -z -o SRR292770_filt_rc_1.fastq.gz
[user@cn3144 ~]$ zcat SRR292770_filt_2.fastq.gz \
               | fastx_reverse_complement -z -o SRR292770_filt_rc_2.fastq.gz
[user@cn3144 ~]$ cd ..

Assembly with Velvet is a two-step process; the first step, velveth, builds the hashes. One important parameter is the k-mer size. In this example, we will try k-mer sizes from 37 to 45. Note that k-mer sizes have to be odd and that libraries should be ordered from smallest to largest insert size. Note also that each library goes into its own channel, as indicated by the number after the read type option:

[user@cn3144 ~]$ module load velvet
[user@cn3144 ~]$ mkdir velvet && cd velvet
[user@cn3144 ~]$ velveth auto 37,47,2 \
    -shortPaired -fastq.gz -separate \
        ../fastq/SRR292678_filt_1.fastq.gz ../fastq/SRR292678_filt_2.fastq.gz \
    -shortPaired2 -fastq.gz -separate \
        ../fastq/SRR292770_filt_rc_1.fastq.gz ../fastq/SRR292770_filt_rc_2.fastq.gz
[0.000000] Reading FastQ file ../fastq/SRR292678_filt_1.fastq.gz;
[0.069126] Reading FastQ file ../fastq/SRR292678_filt_2.fastq.gz;
[54.729141] 12221686 sequences found in total in the paired sequence files
...
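
When velveth is given a range of k-mer values, it creates one output directory per value, named with the prefix from the command line (here auto) plus the k-mer size. After the run above you should therefore see something like:

[user@cn3144 ~]$ ls -d auto_*
auto_37  auto_39  auto_41  auto_43  auto_45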

Next, velvetg builds the actual graph and performs the assembly. Note that the values for insert length and insert length standard deviation (sd) were taken from the publication.

[user@cn3144 ~]$ velvetg auto_37 -exp_cov auto -cov_cutoff auto \
           -ins_length 468 -ins_length_sd 31 -ins_length2 6193 \
           -ins_length2_sd 566 -shortMatePaired2 yes
....
[565.945619] Estimated Coverage = 136.638609
[565.945642] Estimated Coverage cutoff = 68.319305
Final graph has 611 nodes and n50 of 1451863, max 1806834, total 5455514, using 20748350/22424018 reads
[user@cn3144 ~]$ velvetg auto_39 -exp_cov auto -cov_cutoff auto \
           -ins_length 468 -ins_length_sd 31 -ins_length2 6193 \
           -ins_length2_sd 566 -shortMatePaired2 yes
...
[521.462552] Estimated Coverage = 128.491151
[521.462575] Estimated Coverage cutoff = 64.245575
Final graph has 562 nodes and n50 of 3226047, max 3226047, total 5467489, using 20530065/22424018 reads

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. velvet.sh) similar to the following example:

#!/bin/bash

set -e 

cd /data/$USER/test_data/velvet
module load velvet/1.2.10 || exit 1
velveth k39 39 \
    -shortPaired -fastq.gz -separate \
        ../fastq/SRR292678_filt_1.fastq.gz ../fastq/SRR292678_filt_2.fastq.gz \
    -shortPaired2 -fastq.gz -separate \
        ../fastq/SRR292770_filt_rc_1.fastq.gz ../fastq/SRR292770_filt_rc_2.fastq.gz
velvetg k39 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=6 --mem=12g velvet.sh

Note that there is a limit to the amount of time that can be saved by adding more cores to the calculations:

cores  time [s]
    2      639
    4      411
    6      348
    8      319

This is likely a combination of diminishing returns with increasing thread count and the fact that only parts of the algorithm are parallelized.
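
Velvet's multithreading is implemented with OpenMP, so the number of threads can be matched to the CPUs allocated by Slurm through the standard OpenMP environment variables. A minimal sketch, assuming the module does not already set these for you, is to add the following to the batch script before the velveth/velvetg calls:

# match Velvet's OpenMP threads to the CPUs allocated by Slurm
# (falls back to 2 threads outside of a Slurm allocation)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-2}
export OMP_THREAD_LIMIT=${SLURM_CPUS_PER_TASK:-2}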

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. velvet.swarm). For example:

velvetg auto_39 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
velvetg auto_41 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
velvetg auto_43 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes

Submit this job using the swarm command.

swarm -f velvet.swarm -g 10 -t 4 --module velvet/1.2.10
where
  -g #             Number of Gigabytes of memory required for each process (1 line in the swarm command file)
  -t #             Number of threads/CPUs required for each process (1 line in the swarm command file)
  --module velvet  Loads the velvet module for each subjob in the swarm
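
Once the swarm has finished, the assemblies can be compared by N50 and total size. Each velvetg run writes a Log file into its output directory; assuming it records the final 'Final graph ...' summary line (the same line the interactive runs above print to the terminal), a quick comparison can be done with a short loop like this sketch:

for d in auto_*; do
    printf '%s: ' "$d"
    grep 'Final graph' "$d/Log" | tail -n 1
done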