Segemehl is a short read aligner that allows local alignment and can align reads obtained after bisulfite treatment.
There are no centrally maintained indices for segemehl. If you use segemehl, please build your own indices.
Allocate an interactive session and run the program. Sample session:
Segemehl is a multithreaded program ( --threads,-t
).
[user@biowulf]$ sinteractive salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144 ~]$ module load segemehl/0.2.0 [user@cn3144 ~]$ segemehl.x -h usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>] [-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>] [-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>] [-M <n>] [-r <n>] [-S] [--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>] [-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>] Heuristic mapping of short sequences [INPUT] -d, --database <file> [<file>] list of path/filename(s) of database sequence(s) -q, --query <file> path/filename of query sequences (default:none) [...snip...]
Get some example data and build a index of an e. coli genome:
[user@cn3144 ~]$ cd /data/$USER/test [user@cn3144 ~]$ cp /usr/local/apps/segemehl/TEST_DATA/* . [user@cn3144 ~]$ segemehl.x -x ecoli_mg1655.idx -d ecoli_mg1655.fa.gz
Multiple fasta filles can be provided as a space separated list. Note that this process take a lot of memory. Building a single human chromosome can can take up to 8GB of memory and 4GB of disk space. With 32GB a single index containing half the human chromosomes can be built.
A simple alignment with default parameters is done with
[user@cn3144 ~]$ segemehl.x -i ecoli_mg1655.idx -d ecoli_mg1655.fa.gz \ -q SRR1536586.fastq.gz -t 2 > aln.sam [...snip...] [user@cn3144 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Set up a batch script similar to this:
#! /bin/bash set -e module load segemehl/0.2.0 || exit 80 module load samtools/1.2 || exit 81 segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa.gz \ -q SRR1536586.fastq.gz -t 8 \ | samtools view -b -@2 - \ > aln.bam
Submit from biowulf with
b2$ sbatch --cpus-per-task=10 --mem=4G segemehl_batch.sh 3006951 b2$ jobload -j 3006951 JOBID TIME NODES CPUS THREADS LOAD MEMORY Alloc Running Used/Alloc 3006951 1:36 cn0104 10 9 90% 226.8 MB/4.0 GB b2$ jobhist 3006951 JobId : 3006951 User : user Submitted : 20150929 14:36:41 Submission Path : /spin1/users/user/test_data/segemehl Submission Command : sbatch --cpus-per-task=10 --mem=4G segemehl_batch.sh Partition State Nodes CPUs Walltime Runtime MemReq MemUsed Nodelist norm COMPLETED 1 10 04:00:00 00:07:26 4.0GB/node 0.2GB cn0104
Note that 10 CPUs are allocated to account for the 8 threads used by
the aligner plus the 2 threads used by samtools view
. Note
also that without --silent
slurm output files will be
very large.
Create a swarm command file similar to
segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \ -q sample1.fastq.gz -t 8 > aln1.sam segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \ -q sample2.fastq.gz -t 8 > aln2.sam segemehl.x --silent -i ecoli_mg1655.idx -d ecoli_mg1655.fa \ -q sample3.fastq.gz -t 8 > aln3.sam
And submit a swarm of parallel jobs with
swarm -f segemehl_swarm.cmd -g 2 -t 8
segemehl.x: required option 'database' (d) missing usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>] [-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>] [-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>] [-M <n>] [-r <n>] [-S] [--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>] [-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>] Heuristic mapping of short sequences [INPUT] -d, --database <file> [<file>] list of path/filename(s) of database sequence(s) -q, --query <file> path/filename of query sequences (default:none) -p, --mate <file> path/filename of mate pair sequences (default:none) -i, --index <file> path/filename of db index (default:none) -j, --index2 <file> path/filename of second db index (default:none) -x, --generate <file> generate db index and store to disk (default:none) -y, --generate2 <file> generate second db index and store to disk (default:none) -B, --filebins <string> file bins with basename <string> for easier data handling (default:none) -F, --bisulfite <n> bisulfite mapping with methylC-seq/Lister et al. (=1) or bs-seq/Cokus et al. protocol (=2) (default:0) [GENERAL] -m, --minsize <n> minimum size of queries (default:12) -s, --silent shut up! -b, --brief brief output -c, --checkidx check index -t, --threads <n> start <n> threads (default:1) -o, --outfile <string> outputfile (default:none) -u, --nomatchfilename <file> filename for unmatched reads (default:none) [SEEDPARAMS] -D, --differences <n> search seeds initially with <n> differences (default:1) -J, --jump <n> search seeds with jump size <n> (0=automatic) (default:0) -E, --evalue <double> max evalue (default:5.000000) -w, --maxsplitevalue <double> max evalue for splits (default:50.000000) -M, --maxinterval <n> maximum width of a suffix array interval, i.e. a query seed will be omitted if it matches more than <n> times (default:100) -r, --maxout <n> maximum number of alignments that will be reported. If set to zero, all alignments will be reported (default:0) -S, --splits detect split/spliced reads (default:none) -K, --SEGEMEHL output SEGEMEHL format (needs to be selected for brief) -V, --MEOP output MEOP field for easier variance calling in SAM (XE:Z:) --nohead do not output header [SEEDEXTENSIONPARAMS] -e, --extensionscore <n> score of a match during extension (default:2) -n, --extensionpenalty <n> penalty for a mismatch during extension (default:4) -X, --dropoff <n> dropoff parameter for extension (default:8) [ALIGNPARAMS] -A, --accuracy <n> min percentage of matches per read in semi-global alignment (default:90) -W, --minsplicecover <n> min coverage for spliced transcripts (default:80) -U, --minfragscore <n> min score of a spliced fragment (default:18) -Z, --minfraglen <n> min length of a spliced fragment (default:20) -l, --splicescorescale <f> report spliced alignment with score s only if <f>*s is larger than next best spliced alignment (default:1.000000) -H, --hitstrategy report only best scoring hits (=1) or all (=0) (default:1) --showalign show alignments -P, --prime5 <string> add 5' adapter (default:none) -Q, --prime3 <string> add 3' adapter (default:none) -R, --clipacc <n> clipping accuracy (default:70) -T, --polyA clip polyA tail -Y, --autoclip autoclip unknown 3prime adapter -C, --hardclip enable hard clipping -O, --order sorts the output by chromsome and position (might take a while!) -I, --maxinsertsize <n> maximum size of the inserts (paired end) (default:5000) [VERSION] 0.2.0-418 (2015-01-05 05:17:35 -0500 (Mon, 05 Jan 2015)) [REFERENCES] SEGEMEHL is free software for non-commercial use (C) 2008 Bioinformatik Leipzig