High-Performance Computing at the NIH
Seqtk on Biowulf & Helix

Seqtk is a fast, lightweight tool for processing sequences in FASTA or FASTQ format. It parses both formats transparently, and input files may optionally be gzip-compressed.

Running on Helix

$ module load seqtk
$ cd /data/$USER/dir
$ seqtk
Usage:   seqtk <command> <arguments>
Version: 1.0-r75-dirty

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         trimfq    trim FASTQ using the Phred algorithm

         hety      regional heterozygosity
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         dropse    drop unpaired from interleaved PE FASTA/Q
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het

$ seqtk seq -a in.fq.gz > out.fa
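To make the conversion above concrete, the following is a minimal sketch that builds a tiny gzipped FASTQ file and converts it to FASTA. The read names, sequences, and the temporary WORKDIR are made up for illustration; only the `seqtk seq -a` call comes from this page. The script skips the conversion if seqtk is not on the PATH.

```shell
#!/bin/bash
# Sketch: build a tiny gzipped FASTQ file and convert it to FASTA.
# The read names, sequences, and WORKDIR are hypothetical, for illustration only.
WORKDIR="$(mktemp -d)"
cd "$WORKDIR" || exit 1

# Two FASTQ records: name line, sequence, '+' separator, quality string.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\nIIIIIIII\n' | gzip > in.fq.gz

if command -v seqtk >/dev/null 2>&1; then
    # -a forces FASTA output, so the quality lines are dropped.
    seqtk seq -a in.fq.gz > out.fa
    cat out.fa
else
    echo "seqtk not found; run 'module load seqtk' first" >&2
fi
```

Each FASTQ record should become one FASTA record (a `>` header line followed by the sequence), so counting `>` lines in out.fa is a quick sanity check that no reads were lost.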

Running a single batch job on Biowulf

1. Create a script file containing lines similar to those below.

#!/bin/bash


module load seqtk
cd /data/$USER/dir
seqtk seq -a in.fq.gz > out.fa

2. Submit the script on Biowulf:

$ sbatch jobscript

If more memory is required (the default is 4 GB), specify --mem=Mg, where M is the number of gigabytes. For example, to request 10 GB:

$ sbatch --mem=10g jobscript

Running a swarm of jobs on Biowulf

Set up a swarm command file:

  cd /data/$USER/dir1; seqtk seq -a in.fq.gz > out.fa
  cd /data/$USER/dir2; seqtk seq -a in.fq.gz > out.fa
  cd /data/$USER/dir3; seqtk seq -a in.fq.gz > out.fa
	[......]
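When there are many directories, the swarm command file can be generated rather than typed by hand. The sketch below writes one seqtk line per directory. The BASE path, the directory names dir1 to dir3, and the output name swarmfile are assumptions that mirror the example above and should be adjusted to the real layout.

```shell
#!/bin/bash
# Sketch: build a swarm command file with one seqtk line per directory.
# BASE, the dir1..dir3 names, and the file name "swarmfile" are assumptions.
BASE="${BASE:-/data/$USER}"
SWARMFILE="${SWARMFILE:-swarmfile}"

: > "$SWARMFILE"    # start with an empty file
for d in dir1 dir2 dir3; do
    # Each line is a self-contained command for swarm to run.
    echo "cd $BASE/$d; seqtk seq -a in.fq.gz > out.fa" >> "$SWARMFILE"
done

cat "$SWARMFILE"
```

Writing the `cd` and the seqtk call on the same line keeps every command independent of the others, which matches the hand-written file shown above.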
  

Submit the swarm file. The -f option specifies the swarm file name, and --module loads the required module for each command line in the file:

  $ swarm -f swarmfile --module seqtk

If each command needs more memory, use -g. The example below allocates 10 GB for each command:

  $ swarm -f swarmfile -g 10 --module seqtk

For more information regarding running swarm, see swarm.html

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive 
salloc.exe: Granted job allocation 16535

cn999$ module load seqtk
cn999$ cd /data/$USER/dir
cn999$ seqtk seq -a in.fq.gz > out.fa
[...etc...]

cn999$ exit
exit

biowulf$

Make sure to exit the job once finished.

If more memory is needed, use --mem. For example:

biowulf$ sinteractive --mem=8g

Documentation

https://github.com/lh3/seqtk