Velvet on Biowulf & Helix

Description

Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454.

Velvet takes in short read sequences, removes errors and produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

There may be multiple versions of velvet available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail velvet 

To select a module use

module load velvet/[version]

where [version] is the version of choice.

Velvet is a multithreaded application. Make sure to match the number of CPUs requested with the number of threads.

Environment variables set

    PATH

References

    Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821-829.

Documentation

    Velvet manual: https://www.ebi.ac.uk/~zerbino/velvet/

Memory usage

Simon Gladman has a memory usage estimator for Velvet:

velvetg RAM [GB] = (-109635 + 18977*RS + 86326*GS + 233353*NR - 51092*K) / 1048576

where the parenthesized sum is in kB (1 GB = 1048576 kB) and

   RS = read size in nts
   GS = genome size in megabases (Mb)
   NR = number of reads in millions
   K  = kmer hash value used in velveth

Gladman reports that the estimate is accurate to within 0.5-0.8 GB on his system (64-bit Fedora 10, quad core, 16 GB RAM).

For example, for an assembly with k = 31 using 100 million 50nt reads and a genome size of 5 megabases:

                 -109635
  18977 x  50 =   948850
  86326 x   5 =   431630
 233353 x 100 = 23335300
 -51092 x  31 = -1583852
                --------
                23022293 kB ≈ 22 GB
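
The same estimate can be scripted. Below is a minimal bash sketch (the helper name velvet_mem_gb is ours, not part of Velvet; bash integer arithmetic truncates, so the result rounds down):

# Gladman's velvetg memory estimator.
# Arguments: read size [nt], genome size [Mb], reads [millions], kmer size.
velvet_mem_gb() {
    local RS=$1 GS=$2 NR=$3 K=$4
    # the sum is in kB; dividing by 1048576 converts to GB (truncated)
    echo "$(( (-109635 + 18977*RS + 86326*GS + 233353*NR - 51092*K) / 1048576 )) GB"
}

velvet_mem_gb 50 5 100 31    # prints "21 GB", i.e. the ~22 GB worked out above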

Running velvet on Helix

The memory usage for a single user process is limited to 32 GB on Helix. If you estimate that your velvet run requires more than 32 GB, or if you need to run several simultaneous velvet jobs, you should run on Biowulf (see below). In addition, while Velvet has been compiled with OpenMP-based parallelization enabled, the environment module for velvet sets OMP_NUM_THREADS to 2 on Helix, limiting Velvet to 3 threads total (Velvet uses one thread more than OMP_NUM_THREADS).

Load the module file to set up the environment:

helix$ module avail velvet

----------------------------- /usr/local/lmod/modulefiles -----------------------------
   velvet/1.2.10

helix$ module load velvet/1.2.10
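
The thread limit described above can be verified after loading the module:

helix$ echo $OMP_NUM_THREADS
2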

The following example session will make use of data from Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4. The data contains short insert paired-end reads (SRR292678) and long insert mate-pair reads (SRR292770); the mate-pair reads have to be reverse complemented since Velvet expects reads in FR (inwards) orientation. The data can be obtained and preprocessed as follows:

helix$ cd /data/$USER/test_data
helix$ module load sratoolkit
helix$ # mate pair data with nominal insert size of 6kb
helix$ fastq-dump --gzip --split-files -O fastq SRR292770
helix$ # paired end data with nominal insert size of 470nts 
helix$ fastq-dump -O fastq --gzip --split-files SRR292678
helix$ cd fastq
helix$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC  \
           --minimum-length 30 -q 20,20 \
           -o SRR292770_filt_1.fastq.gz -p SRR292770_filt_2.fastq.gz \
           SRR292770_1.fastq.gz SRR292770_2.fastq.gz
helix$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC  \
           --minimum-length 30 -q 20,20 \
           -o SRR292678_filt_1.fastq.gz -p SRR292678_filt_2.fastq.gz \
           SRR292678_1.fastq.gz SRR292678_2.fastq.gz
helix$ zcat SRR292770_filt_1.fastq.gz \
               | fastx_reverse_complement -z -o SRR292770_filt_rc_1.fastq.gz
helix$ zcat SRR292770_filt_2.fastq.gz \
               | fastx_reverse_complement -z -o SRR292770_filt_rc_2.fastq.gz
helix$ cd ..

Assembly with Velvet is a two step process: first, velveth builds the hash indexes. One important parameter is the kmer size. In this example, we will try kmer sizes from 37 to 45 (the range 37,47,2 passed to velveth below excludes the upper bound). Note that kmer sizes have to be odd and that libraries should be ordered from smallest to largest insert size. Note also that each library goes into its own channel, as indicated by the number appended to the read type option:

helix$ mkdir velvet && cd velvet
helix$ velveth auto 37,47,2 \
    -shortPaired -fastq.gz -separate \
        ../fastq/SRR292678_filt_1.fastq.gz ../fastq/SRR292678_filt_2.fastq.gz \
    -shortPaired2 -fastq.gz -separate \
        ../fastq/SRR292770_filt_rc_1.fastq.gz ../fastq/SRR292770_filt_rc_2.fastq.gz

Next, velvetg builds the actual graph and does the assembly. Note that the values for insert length and standard deviation (sd) were taken from the publication:

helix$ velvetg auto_37 -exp_cov auto -cov_cutoff auto \
           -ins_length 468 -ins_length_sd 31 -ins_length2 6193 \
           -ins_length2_sd 566 -shortMatePaired2 yes
....
[565.945619] Estimated Coverage = 136.638609
[565.945642] Estimated Coverage cutoff = 68.319305
Final graph has 611 nodes and n50 of 1451863, max 1806834, total 5455514, using 20748350/22424018 reads
helix$ velvetg auto_39 -exp_cov auto -cov_cutoff auto \
           -ins_length 468 -ins_length_sd 31 -ins_length2 6193 \
           -ins_length2_sd 566 -shortMatePaired2 yes
...
[521.462552] Estimated Coverage = 128.491151
[521.462575] Estimated Coverage cutoff = 64.245575
Final graph has 562 nodes and n50 of 3226047, max 3226047, total 5467489, using 20530065/22424018 reads

Running velvetg on the other kmer lengths results in the following N50s:

kmer        N50
  37  1,451,863
  39  3,226,047
  41  1,073,453
  43    963,264
  45    587,887

This suggests that a kmer length of 39 may be optimal.
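
A sweep like this can also be scripted serially. Here is a minimal sketch, assuming the auto_* directories produced by velveth above (see the swarm section below for running the kmer values in parallel):

for d in auto_{37,39,41,43,45}; do
    velvetg $d -exp_cov auto -cov_cutoff auto \
        -ins_length 468 -ins_length_sd 31 \
        -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes \
        > $d.log
    # velvetg reports the N50 on its "Final graph" line
    grep "^Final graph" $d.log
done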

Running a single velvet batch job on Biowulf

Below is an example script for running a velvet batch job on Biowulf. Note that the job parameters in this example are set inside the script; this is equivalent to setting them with sbatch command line options.

#! /bin/bash
#SBATCH -J velvet_test
#SBATCH --mem=12g
#SBATCH -c 8

set -e 

cd /data/$USER/test_data/velvet
module load velvet
velveth k39 39 \
    -shortPaired -fastq.gz -separate \
        ../fastq/SRR292678_filt_1.fastq.gz ../fastq/SRR292678_filt_2.fastq.gz \
    -shortPaired2 -fastq.gz -separate \
        ../fastq/SRR292770_filt_rc_1.fastq.gz ../fastq/SRR292770_filt_rc_2.fastq.gz
velvetg k39 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes

This job can then be submitted with

biowulf$ sbatch batchscript
biowulf$ jobload -u $USER
     JOBID      RUNTIME     NODES   CPUS    AVG CPU%            MEMORY
                                                              Used/Alloc
     17513     00:02:34      p999      8       50.00      1.7 GB/12.0 GB
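
Since #SBATCH directives are equivalent to command line options, the same resources could instead be requested at submission time (options given on the command line take precedence over directives in the script):

biowulf$ sbatch --mem=12g -c 8 batchscript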

Velvet will try to use as many threads as possible. For a job submitted to the Slurm batch queue or for an interactive job, this will be the number of allocated cores (-c / --cpus-per-task), so the number of threads does not need to be set explicitly in the batch script. To use fewer threads, define the OMP_NUM_THREADS environment variable in the script:

#! /bin/bash
#SBATCH -J velvet_test
#SBATCH --mem=12g
#SBATCH -c 8

set -e 
export OMP_NUM_THREADS=6
cd /data/$USER/test_data/velvet
module load velvet
velveth k39 39 \
...

There is a limit to the amount of time that can be saved by adding more cores to the calculations:

cores  time [s]
  2      639
  4      411
  6      348
  8      319

This is likely a combination of diminishing returns with increasing thread counts and the fact that only parts of the algorithm are parallelized.
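
These timings are roughly consistent with Amdahl's law, T(n) = s + p/n. A quick back-of-the-envelope check (our own calculation, not from the Velvet documentation) using the 2- and 8-core timings:

awk 'BEGIN {
    # solve s + p/2 = 639 and s + p/8 = 319 for the serial (s)
    # and parallelizable (p) portions of the runtime
    p = (639 - 319) / (1/2 - 1/8)   # ~853 s parallelizable
    s = 639 - p/2                   # ~212 s serial
    printf "serial %.0f s, parallel %.0f s\n", s, p
    printf "predicted T(4)=%.0f s (observed 411)\n", s + p/4
    printf "predicted T(6)=%.0f s (observed 348)\n", s + p/6
}'

By this estimate roughly 20% of the single-threaded runtime is serial, so adding cores beyond 8 would yield little further speedup.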

Running a swarm of velvet batch jobs on Biowulf

To run a number of velvet jobs, set up a swarm command file with one command per line (line continuations are allowed). For example, the following command file will run velvetg on different kmer sizes in parallel:

velvetg auto_39 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
velvetg auto_41 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes
velvetg auto_43 -exp_cov auto -cov_cutoff auto \
    -ins_length 468 -ins_length_sd 31 \
    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes

The swarm file is submitted with

biowulf$ swarm -f swarm_cmd_file -t 4 -g 10 --module velvet

and swarm will package each command as a subjob and submit them all as a single job array.
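
For larger sweeps, the command file itself can be generated with a small loop (a sketch; swarm_cmd_file as above):

for k in 39 41 43; do
    printf 'velvetg auto_%s -exp_cov auto -cov_cutoff auto \\\n' "$k"
    printf '    -ins_length 468 -ins_length_sd 31 \\\n'
    printf '    -ins_length2 6193 -ins_length2_sd 566 -shortMatePaired2 yes\n'
done > swarm_cmd_file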

Running an interactive job on Biowulf

After allocating an interactive session with sinteractive, velvet can be used essentially as described above for Helix, except that the thread limitation does not apply:

biowulf$ sinteractive -c4 --mem=10g
salloc.exe: Granted job allocation 17520

node$ cd /data/$USER/test_data/velvet
node$ velveth k39 39 \
    -shortPaired -fastq.gz -separate \
        ../fastq/SRR292678_filt_1.fastq.gz ../fastq/SRR292678_filt_2.fastq.gz \
    -shortPaired2 -fastq.gz -separate \
        ../fastq/SRR292770_filt_rc_1.fastq.gz ../fastq/SRR292770_filt_rc_2.fastq.gz
node$ exit
biowulf$