Segemehl is a short read aligner that allows local alignment and can align reads obtained after bisulfite treatment.
There are no centrally maintained indices for segemehl. If you use segemehl, please build your own indices.
Allocate an interactive session and run the program. Sample session:
Segemehl is a multithreaded program ( --threads,-t).
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load segemehl/0.2.0
[user@cn3144 ~]$ segemehl.x -h
usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>]
[-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>]
[-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>] [-M <n>] [-r <n>] [-S]
[--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>]
[-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>]
Heuristic mapping of short sequences
[INPUT]
-d, --database <file> [<file>] list of path/filename(s) of database sequence(s)
-q, --query <file> path/filename of query sequences (default:none)
[...snip...]
Get some example data and build a index of an e. coli genome:
[user@cn3144 ~]$ cd /data/$USER/test [user@cn3144 ~]$ cp /usr/local/apps/segemehl/TEST_DATA/* . [user@cn3144 ~]$ segemehl.x -x ecoli_mg1655.idx -d ecoli_mg1655.fa.gz
Multiple fasta filles can be provided as a space separated list. Note that this process take a lot of memory. Building a single human chromosome can can take up to 8GB of memory and 4GB of disk space. With 32GB a single index containing half the human chromosomes can be built.
A simple alignment with default parameters is done with
[user@cn3144 ~]$ segemehl.x -i ecoli_mg1655.idx -d ecoli_mg1655.fa.gz \ -q SRR1536586.fastq.gz -t 2 > aln.sam [...snip...] [user@cn3144 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Set up a batch script similar to this:
#! /bin/bash
set -e
module load segemehl/0.2.0 || exit 80
module load samtools/1.2 || exit 81
segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa.gz \
-q SRR1536586.fastq.gz -t 8 \
| samtools view -b -@2 - \
> aln.bam
Submit from biowulf with
b2$ sbatch --cpus-per-task=10 --mem=4G segemehl_batch.sh
3006951
b2$ jobload -j 3006951
JOBID TIME NODES CPUS THREADS LOAD MEMORY
Alloc Running Used/Alloc
3006951 1:36 cn0104 10 9 90% 226.8 MB/4.0 GB
b2$ jobhist 3006951
JobId : 3006951
User : user
Submitted : 20150929 14:36:41
Submission Path : /spin1/users/user/test_data/segemehl
Submission Command : sbatch --cpus-per-task=10 --mem=4G segemehl_batch.sh
Partition State Nodes CPUs Walltime Runtime MemReq MemUsed Nodelist
norm COMPLETED 1 10 04:00:00 00:07:26 4.0GB/node 0.2GB cn0104
Note that 10 CPUs are allocated to account for the 8 threads used by
the aligner plus the 2 threads used by samtools view. Note
also that without --silent slurm output files will be
very large.
Create a swarm command file similar to
segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \ -q sample1.fastq.gz -t 8 > aln1.sam segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \ -q sample2.fastq.gz -t 8 > aln2.sam segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \ -q sample3.fastq.gz -t 8 > aln3.sam
And submit a swarm of parallel jobs with
swarm -f segemehl_swarm.cmd -g 2 -t 8
segemehl.x: required option 'database' (d) missing
usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>]
[-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>]
[-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>] [-M <n>] [-r <n>] [-S]
[--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>]
[-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>]
Heuristic mapping of short sequences
[INPUT]
-d, --database <file> [<file>] list of path/filename(s) of database sequence(s)
-q, --query <file> path/filename of query sequences (default:none)
-p, --mate <file> path/filename of mate pair sequences (default:none)
-i, --index <file> path/filename of db index (default:none)
-j, --index2 <file> path/filename of second db index (default:none)
-x, --generate <file> generate db index and store to disk (default:none)
-y, --generate2 <file> generate second db index and store to disk (default:none)
-B, --filebins <string> file bins with basename <string> for easier data handling
(default:none)
-F, --bisulfite <n> bisulfite mapping with methylC-seq/Lister et al. (=1) or
bs-seq/Cokus et al. protocol (=2) (default:0)
[GENERAL]
-m, --minsize <n> minimum size of queries (default:12)
-s, --silent shut up!
-b, --brief brief output
-c, --checkidx check index
-t, --threads <n> start <n> threads (default:1)
-o, --outfile <string> outputfile (default:none)
-u, --nomatchfilename <file> filename for unmatched reads (default:none)
[SEEDPARAMS]
-D, --differences <n> search seeds initially with <n> differences (default:1)
-J, --jump <n> search seeds with jump size <n> (0=automatic) (default:0)
-E, --evalue <double> max evalue (default:5.000000)
-w, --maxsplitevalue <double> max evalue for splits (default:50.000000)
-M, --maxinterval <n> maximum width of a suffix array interval, i.e. a query seed will be
omitted if it matches more than <n> times (default:100)
-r, --maxout <n> maximum number of alignments that will be reported. If set to zero,
all alignments will be reported (default:0)
-S, --splits detect split/spliced reads (default:none)
-K, --SEGEMEHL output SEGEMEHL format (needs to be selected for brief)
-V, --MEOP output MEOP field for easier variance calling in SAM (XE:Z:)
--nohead do not output header
[SEEDEXTENSIONPARAMS]
-e, --extensionscore <n> score of a match during extension (default:2)
-n, --extensionpenalty <n> penalty for a mismatch during extension (default:4)
-X, --dropoff <n> dropoff parameter for extension (default:8)
[ALIGNPARAMS]
-A, --accuracy <n> min percentage of matches per read in semi-global alignment
(default:90)
-W, --minsplicecover <n> min coverage for spliced transcripts (default:80)
-U, --minfragscore <n> min score of a spliced fragment (default:18)
-Z, --minfraglen <n> min length of a spliced fragment (default:20)
-l, --splicescorescale <f> report spliced alignment with score s only if <f>*s is larger than
next best spliced alignment (default:1.000000)
-H, --hitstrategy report only best scoring hits (=1) or all (=0) (default:1)
--showalign show alignments
-P, --prime5 <string> add 5' adapter (default:none)
-Q, --prime3 <string> add 3' adapter (default:none)
-R, --clipacc <n> clipping accuracy (default:70)
-T, --polyA clip polyA tail
-Y, --autoclip autoclip unknown 3prime adapter
-C, --hardclip enable hard clipping
-O, --order sorts the output by chromsome and position (might take a while!)
-I, --maxinsertsize <n> maximum size of the inserts (paired end) (default:5000)
[VERSION]
0.2.0-418 (2015-01-05 05:17:35 -0500 (Mon, 05 Jan 2015))
[REFERENCES]
SEGEMEHL is free software for non-commercial use
(C) 2008 Bioinformatik Leipzig