Biowulf High Performance Computing at the NIH
seqkit on Biowulf

Seqkit is a tool for manipulating fasta and fastq files. It includes a number of different tools:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load seqkit
[user@cn3144]$ seqkit help 
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.10.2

Author: Wei Shen 

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Usage:
  seqkit [command]

Available Commands:
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete generate shell autocompletion script
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  help            Help about any command
  locate          locate subsequences/motifs, mismatch allowed
  mutate          edit sequence (point mutation, insertion, deletion)
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by id/name/sequence
  sample          sample sequences by number or proportion
  seq             transform sequences (revserse, complement, extract ID...)
  shuffle         shuffle sequences
  sliding         sliding sequences, circular genome supported
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

Use "seqkit [command] --help" for more information about a command.

[user@cn3144]$ cp $SEQKIT_TEST_DATA/* .
[user@cn3144]$ ls -lh
total 13M
-rw-rw-r-- 1 user group 1.5M Mar 11  2018 hairpin.fa.gz
-rw-rw-r-- 1 user group  12M Sep 10 14:51 read1_250k.fastq.gz

[user@cn3144]$ seqkit stat hairpin.fa.gz
file           format  type  num_seqs    sum_len  min_len  avg_len  max_len
hairpin.fa.gz  FASTA   RNA     38,589  3,729,811       39     96.7    2,354

[user@cn3144]$ gunzip hairpin.fa.gz
[user@cn3144]$ seqkit sort --by-length --reverse --two-pass hairpin.fa > hairpin_sorted.fa
[user@cn3144]$ seqkit fx2tab --length hairpin_sorted.fa --name --only-id | head
atr-MIR8591			2354
eun-MIR10219			1545
atr-MIR8598			1411
aly-MIR858			938
mdm-MIR858			921
mtr-MIR7700			910
cre-MIR914			894
eun-MIR10211a			820
eun-MIR10211b			820
atr-MIR8612			805

[user@cn3144]$ seqkit rmdup -s -i hairpin.fa -o clean.fa.gz -D dups.txt
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. seqkit.sh), which uses the input file 'seqkit.in'. For example:

#!/bin/bash
module load seqkit || exit 1
seqkit split $SEQKIR_TEST_DATA/hairpin.fa.gz -i --id-regexp "^([\w]+)\-" --two-pass

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=2 --mem=4g seqkit.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. seqkit.swarm). For example:

seqkit -B -g -G -l -n file1.fa
seqkit -B -g -G -l -n file2.fa
seqkit -B -g -G -l -n file3.fa

Submit this job using the swarm command.

swarm -f seqkit.swarm -g 4 -t 1 --module seqkit
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module seqkit Loads the seqkit module for each subjob in the swarm