KmerGenie estimates the best k-mer length for de novo genome assembly. Given a set of reads, it first computes the k-mer abundance histogram for many values of k. Then, for each value of k, it predicts the number of distinct genomic k-mers in the dataset and returns the k-mer length that maximizes this number.
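To make the term "k-mer abundance histogram" concrete, here is a toy sketch in shell and awk. It is only an illustration on a single short string, not KmerGenie's implementation: the real program works on full read sets and fits a statistical model to each histogram. The function name `kmer_histogram` and the example sequence are assumptions made for this sketch.

```shell
# Toy illustration only (not how KmerGenie is implemented).
# kmer_histogram SEQ K prints lines of "abundance number_of_distinct_kmers".
kmer_histogram() {
    seq=$1
    k=$2
    printf '%s\n' "$seq" | awk -v k="$k" '
        {
            # slide a window of length k across the sequence
            for (i = 1; i + k - 1 <= length($0); i++)
                count[substr($0, i, k)]++
        }
        END {
            # histogram: how many distinct k-mers occur with each abundance
            for (kmer in count)
                hist[count[kmer]]++
            for (a in hist)
                print a, hist[a]
        }' | sort -n
}

# Print the histogram for a few values of k on a made-up sequence.
for k in 2 3 4; do
    echo "k=$k"
    kmer_histogram ACGTACGTAC "$k"
done
```

For each k, the histogram changes shape; KmerGenie uses histograms like these, computed over all reads, to predict how many distinct k-mers are genomic rather than sequencing errors.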
Allocate an interactive session and run the program. Sample session:
[teacher@biowulf ~]$ sinteractive --gres lscratch:1
salloc.exe: Pending job allocation 63311518
salloc.exe: job 63311518 queued and waiting for resources
salloc.exe: job 63311518 has been allocated resources
salloc.exe: Granted job allocation 63311518
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3128 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[teacher@cn3128 ~]$ cd /lscratch/$SLURM_JOB_ID
[teacher@cn3128 63311518]$ # Use GAGE data (http://gage.cbcb.umd.edu/data/index.html) as an example.
[teacher@cn3128 63311518]$ wget \
> http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_{1,2}.fastq.gz \
> http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/shortjump_{1,2}.fastq.gz
--2018-03-07 17:39:23-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz
Resolving dtn07-e0... 10.1.200.243
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 64204080 (61M) [application/x-gzip]
Saving to: “frag_1.fastq.gz”
100%[===================================================================================>] 64,204,080 45.5M/s in 1.3s
2018-03-07 17:39:24 (45.5 MB/s) - “frag_1.fastq.gz” saved [64204080/64204080]
--2018-03-07 17:39:24-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_2.fastq.gz
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 68386818 (65M) [application/x-gzip]
Saving to: “frag_2.fastq.gz”
100%[===================================================================================>] 68,386,818 51.5M/s in 1.3s
2018-03-07 17:39:25 (51.5 MB/s) - “frag_2.fastq.gz” saved [68386818/68386818]
--2018-03-07 17:39:25-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/shortjump_1.fastq.gz
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 74810444 (71M) [application/x-gzip]
Saving to: “shortjump_1.fastq.gz”
100%[===================================================================================>] 74,810,444 84.5M/s in 0.8s
2018-03-07 17:39:26 (84.5 MB/s) - “shortjump_1.fastq.gz” saved [74810444/74810444]
--2018-03-07 17:39:26-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/shortjump_2.fastq.gz
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 74594473 (71M) [application/x-gzip]
Saving to: “shortjump_2.fastq.gz”
100%[===================================================================================>] 74,594,473 85.2M/s in 0.8s
2018-03-07 17:39:27 (85.2 MB/s) - “shortjump_2.fastq.gz” saved [74594473/74594473]
FINISHED --2018-03-07 17:39:27--
Downloaded: 4 files, 269M in 4.3s (62.7 MB/s)
[teacher@cn3128 63311518]$ ls *.fastq.gz > inputs
[teacher@cn3128 63311518]$ module load kmergenie
[+] Loading GSL 2.2.1 ...
[+] Loading Graphviz v2.38.0 ...
[+] Loading gdal 2.0 ...
[+] Loading proj 4.9.2 ...
[+] Loading gcc 4.9.1 ...
[+] Loading openmpi 1.10.0 for GCC 4.9.1
[+] Loading tcl_tk 8.6.3
[+] Loading Zlib 1.2.8 ...
[+] Loading Bzip2 1.0.6 ...
[+] Loading pcre 8.38 ...
[+] Loading liblzma 5.2.2 ...
[-] Unloading Zlib 1.2.8 ...
[+] Loading Zlib 1.2.8 ...
[-] Unloading liblzma 5.2.2 ...
[+] Loading liblzma 5.2.2 ...
[+] Loading libjpeg-turbo 1.5.1 ...
[+] Loading tiff 4.0.7 ...
[+] Loading curl 7.46.0 ...
[+] Loading boost libraries v1.65 ...
[+] Loading R 3.4.0 on cn3128
[-] Unloading Zlib 1.2.8 ...
[+] Loading Zlib 1.2.8 ...
[+] Loading kmergenie, version 1.7044...
[teacher@cn3128 63311518]$ kmergenie inputs -t2
running histogram estimation
list of reads:
frag_1.fastq.gz
frag_2.fastq.gz
shortjump_1.fastq.gz
shortjump_2.fastq.gz
Setting maximum kmer length to: 101 bp
computing histograms (from k=21 to k=101): 41 21 51 31 61 81 71 91 101
ntCard wall-clock time over all k values: 88 seconds
fitting model to histograms to estimate best k
could not fit histograms-k71.histo
could not fit histograms-k81.histo
estimation of the best k so far: 21
refining estimation around [15; 27], with a step of 2
running histogram estimation
list of reads:
frag_1.fastq.gz
frag_2.fastq.gz
shortjump_1.fastq.gz
shortjump_2.fastq.gz
Setting maximum kmer length to: 101 bp
computing histograms (from k=17 to k=27): 19 17 23 21 27 25
ntCard wall-clock time over all k values: 66 seconds
fitting model to histograms to estimate best k
could not fit histograms-k71.histo
could not fit histograms-k81.histo
table of predicted num. of genomic k-mers: histograms.dat
recommended coverage cut-off for best k: 3
best k: 19
[teacher@cn3128 63311518]$ exit
salloc.exe: Relinquishing job allocation 63311518
[teacher@biowulf ~]$
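If the screen output of a run has been saved to a file, the final recommendation can be pulled out for use in a later assembly step. The following is a minimal sketch that assumes the report contains a line of the form `best k: N`, as in the sample session above; the helper name `extract_best_k` and the log file name are hypothetical.

```shell
# extract_best_k LOGFILE: hypothetical helper.
# Prints the value from the last line that starts with "best k:".
# Lines such as "estimation of the best k so far: 21" are ignored
# because they do not start with "best k:".
extract_best_k() {
    sed -n 's/^best k: *\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Example (assumes the output was saved to kmergenie.log):
#   k=$(extract_best_k kmergenie.log)
#   echo "assembling with k=$k"
```

For a batch job, the same helper could be pointed at the Slurm output file (by default `slurm-<jobid>.out`), provided the report format matches the session shown here.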
Create a batch input file (e.g. kmergenie.sh). For example:
#!/bin/sh
set -e
module load kmergenie
test -n "$SLURM_CPUS_PER_TASK" || SLURM_CPUS_PER_TASK=2
kmergenie inputs.fofn -t $SLURM_CPUS_PER_TASK
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] kmergenie.sh
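Before submitting, it can be worth confirming that every read file named in the list file is actually readable, so the job does not fail after waiting in the queue. This is a minimal sketch, assuming the list file holds one path per line as in the interactive example; `check_fofn` is a hypothetical helper, not part of KmerGenie.

```shell
# check_fofn LISTFILE: hypothetical helper.
# Returns 0 if every non-empty line names a readable file, 1 otherwise.
check_fofn() {
    fofn=$1
    status=0
    # "|| [ -n "$path" ]" also handles a last line without a trailing newline
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue          # skip blank lines
        if [ ! -r "$path" ]; then
            echo "missing or unreadable: $path" >&2
            status=1
        fi
    done < "$fofn"
    return "$status"
}

# Example:
#   check_fofn inputs.fofn && sbatch --cpus-per-task=2 kmergenie.sh
```

Relative paths in the list are resolved against the current directory, so the check should be run from the directory in which the job will run.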
Create a swarmfile (e.g. kmergenie.swarm). For example:
kmergenie input1.fofn -t $SLURM_CPUS_PER_TASK
kmergenie input2.fofn -t $SLURM_CPUS_PER_TASK
kmergenie input3.fofn -t $SLURM_CPUS_PER_TASK
kmergenie input4.fofn -t $SLURM_CPUS_PER_TASK
Submit this job using the swarm command.
swarm -f kmergenie.swarm -t 2 [-g #] --module kmergenie
where
| -g # | Number of gigabytes of memory required for each process (1 line in the swarm command file). |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
| --module kmergenie | Loads the kmergenie module for each subjob in the swarm |
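When there are many list files, the swarmfile can be generated rather than typed by hand. The sketch below writes one `kmergenie` line per list file, in the same form as the swarmfile example above; `make_swarmfile` is a hypothetical helper and the `input*.fofn` naming is an assumption.

```shell
# make_swarmfile FILE...: hypothetical helper.
# Prints one kmergenie command per list file given as an argument.
make_swarmfile() {
    for fofn in "$@"; do
        # Single quotes keep $SLURM_CPUS_PER_TASK literal, so it is expanded
        # inside each subjob and not when the swarmfile is written.
        printf 'kmergenie %s -t $SLURM_CPUS_PER_TASK\n' "$fofn"
    done
}

# Example:
#   make_swarmfile input*.fofn > kmergenie.swarm
#   swarm -f kmergenie.swarm -t 2 --module kmergenie
```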