High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
segemehl on Biowulf & Helix

Description

Segemehl is a short read aligner that allows local alignment and can align reads obtained after bisulfite treatment.

There are no centrally maintained indices for segemehl. If you use segemehl, please build your own indices.

References

Web sites

On Helix

Segemehl is a multithreaded program ( --threads,-t). Please do not use more than two threads on helix. Any serious work should be done in an interactive allocation or a batch/swarm job.

helix$ module load segemehl/0.2.0
helix$ segemehl.x -h
usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>] 
                  [-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>]  
                  [-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>] [-M <n>] [-r <n>] [-S]
                  [--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>]       
                  [-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>]       
                                                                                                  
  Heuristic mapping of short sequences                                                            
                                                                                                  
 [INPUT]                                                                                          
 -d, --database <file> [<file>]  list of path/filename(s) of database sequence(s)                 
 -q, --query <file>              path/filename of query sequences (default:none)                  
[...snip...]

Get some example data and build a index of an e. coli genome:

helix$ cd /data/$USER/test    
helix$ cp /usr/local/apps/segemehl/TEST_DATA/* .
helix$ segemehl.x -x ecoli_mg1655.idx -d ecoli_mg1655.fa.gz

Multiple fasta filles can be provided as a space separated list. Note that this process take a lot of memory. Building a single human chromosome can can take up to 8GB of memory and 4GB of disk space. With 32GB a single index containing half the human chromosomes can be built.

A simple alignment with default parameters is done with

helix$ segemehl.x -i ecoli_mg1655.idx -d ecoli_mg1655.fa.gz \
  -q SRR1536586.fastq.gz -t 2 > aln.sam
[...snip...]
Batch job on Biowulf

Set up a batch script similar to this:

#! /bin/bash
set -e

module load segemehl/0.2.0 || exit 80
module load samtools/1.2 || exit 81
segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa.gz \
    -q SRR1536586.fastq.gz -t 8 \
 | samtools view -b -@2 - \
 > aln.bam

Submit from biowulf with

b2$ sbatch --cpus-per-task=10 --mem=4G segemehl_batch.sh
3006951
b2$ jobload -j 3006951
 JOBID   TIME   NODES   CPUS  THREADS   LOAD             MEMORY
                       Alloc  Running                Used/Alloc
3006951   1:36  cn0104     10        9    90%    226.8 MB/4.0 GB
b2$ jobhist 3006951

JobId              : 3006951
User               : wresch
Submitted          : 20150929 14:36:41
Submission Path    : /spin1/users/wresch/test_data/segemehl
Submission Command : sbatch --cpus-per-task=10 --mem=4G segemehl_batch.sh

 Partition       State  Nodes  CPUs      Walltime       Runtime         MemReq  MemUsed  Nodelist
      norm   COMPLETED      1    10      04:00:00      00:07:26     4.0GB/node    0.2GB  cn0104

Note that 10 CPUs are allocated to account for the 8 threads used by the aligner plus the 2 threads used by samtools view. Note also that without --silent slurm output files will be very large.

Swarm of jobs on Biowulf

Create a swarm command file similar to

segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \
  -q sample1.fastq.gz -t 8 > aln1.sam
segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \
  -q sample2.fastq.gz -t 8 > aln2.sam
segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \
  -q sample3.fastq.gz -t 8 > aln3.sam

And submit a swarm of parallel jobs with

swarm -f segemehl_swarm.cmd -g 2 -t 8
Interactive job on Biowulf

Allocate an interactive job with

b2$ sinteractive --mem=4G --cpus-per-task=10
salloc.exe: Pending job allocation 3008082
salloc.exe: job 3008082 queued and waiting for resources
salloc.exe: job 3008082 has been allocated resources
salloc.exe: Granted job allocation 3008082
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn0094 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
cn0094$ module load segemehl/0.2.0

and use as described above for interactive use on helix.

Documentation

Command line help

segemehl.x: required option 'database' (d) missing
usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>] 
                  [-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>] 
                  [-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>] [-M <n>] [-r <n>] [-S] 
                  [--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>] 
                  [-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>] 
                  
  Heuristic mapping of short sequences

 [INPUT]                         
 -d, --database <file> [<file>]  list of path/filename(s) of database sequence(s)
 -q, --query <file>              path/filename of query sequences (default:none)
 -p, --mate <file>               path/filename of mate pair sequences (default:none)
 -i, --index <file>              path/filename of db index (default:none)
 -j, --index2 <file>             path/filename of second db index (default:none)
 -x, --generate <file>           generate db index and store to disk (default:none)
 -y, --generate2 <file>          generate second db index and store to disk (default:none)
 -B, --filebins <string>         file bins with basename <string> for easier data handling 
                                 (default:none)
 -F, --bisulfite <n>             bisulfite mapping with methylC-seq/Lister et al. (=1) or 
                                 bs-seq/Cokus et al. protocol (=2) (default:0)
 [GENERAL]                       
 -m, --minsize <n>               minimum size of queries (default:12)
 -s, --silent                    shut up!
 -b, --brief                     brief output
 -c, --checkidx                  check index
 -t, --threads <n>               start <n> threads (default:1)
 -o, --outfile <string>          outputfile (default:none)
 -u, --nomatchfilename <file>    filename for unmatched reads (default:none)
 [SEEDPARAMS]                    
 -D, --differences <n>           search seeds initially with <n> differences (default:1)
 -J, --jump <n>                  search seeds with jump size <n> (0=automatic) (default:0)
 -E, --evalue <double>           max evalue (default:5.000000)
 -w, --maxsplitevalue <double>   max evalue for splits (default:50.000000)
 -M, --maxinterval <n>           maximum width of a suffix array interval, i.e. a query seed will be 
                                 omitted if it matches more than <n> times (default:100)
 -r, --maxout <n>                maximum number of alignments that will be reported. If set to zero, 
                                 all alignments will be reported (default:0)
 -S, --splits                    detect split/spliced reads (default:none)
 -K, --SEGEMEHL                  output SEGEMEHL format (needs to be selected for brief)
 -V, --MEOP                      output MEOP field for easier variance calling in SAM (XE:Z:)
 --nohead                        do not output header
 [SEEDEXTENSIONPARAMS]           
 -e, --extensionscore <n>        score of a match during extension (default:2)
 -n, --extensionpenalty <n>      penalty for a mismatch during extension (default:4)
 -X, --dropoff <n>               dropoff parameter for extension (default:8)
 [ALIGNPARAMS]                   
 -A, --accuracy <n>              min percentage of matches per read in semi-global alignment 
                                 (default:90)
 -W, --minsplicecover <n>        min coverage for spliced transcripts (default:80)
 -U, --minfragscore <n>          min score of a spliced fragment (default:18)
 -Z, --minfraglen <n>            min length of a spliced fragment (default:20)
 -l, --splicescorescale <f>      report spliced alignment with score s only if <f>*s is larger than 
                                 next best spliced alignment (default:1.000000)
 -H, --hitstrategy               report only best scoring hits (=1) or all (=0) (default:1)
 --showalign                     show alignments
 -P, --prime5 <string>           add 5' adapter (default:none)
 -Q, --prime3 <string>           add 3' adapter (default:none)
 -R, --clipacc <n>               clipping accuracy (default:70)
 -T, --polyA                     clip polyA tail
 -Y, --autoclip                  autoclip unknown 3prime adapter
 -C, --hardclip                  enable hard clipping
 -O, --order                     sorts the output by chromsome and position (might take a while!)
 -I, --maxinsertsize <n>         maximum size of the inserts (paired end) (default:5000)
 [VERSION]
  0.2.0-418 (2015-01-05 05:17:35 -0500 (Mon, 05 Jan 2015))
 [BUGS]
  Please report bugs to steve@bioinf.uni-leipzig.de
 [REFERENCES]
  SEGEMEHL is free software for non-commercial use 
  (C) 2008 Bioinformatik Leipzig

Web sites