KmerGenie estimates the best k-mer length for de novo genome assembly. Given a set of reads, it first computes the k-mer abundance histogram for many values of k. Then, for each value of k, it predicts the number of distinct genomic k-mers in the dataset and returns the k-mer length that maximizes this number.
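The final selection step described above amounts to picking the k with the largest predicted count. The following is a minimal sketch of that step only, assuming a hypothetical two-column, whitespace-separated table (k, predicted number of distinct genomic k-mers) with one header line. The column names and all values are invented for illustration and are not KmerGenie output.

```shell
# Hypothetical table: column 1 = k, column 2 = predicted number of
# distinct genomic k-mers. All values are made up for illustration.
table=$(mktemp)
cat > "$table" <<'EOF'
k genomic_kmers
17 2500000
19 2900000
21 2850000
23 2700000
EOF

# Skip the header line, then keep the k whose predicted count is largest.
best_k=$(awk 'NR > 1 && $2 + 0 > max { max = $2 + 0; best = $1 } END { print best }' "$table")
echo "best k: $best_k"
rm -f "$table"
```

With the toy values above, the sketch reports k=19 because that row has the largest count.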
Allocate an interactive session and run the program. Sample session:
[teacher@biowulf ~]$ sinteractive --gres lscratch:1
salloc.exe: Pending job allocation 63311518
salloc.exe: job 63311518 queued and waiting for resources
salloc.exe: job 63311518 has been allocated resources
salloc.exe: Granted job allocation 63311518
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3128 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[teacher@cn3128 ~]$ cd /lscratch/$SLURM_JOB_ID
[teacher@cn3128 63311518]$ # Use GAGE data (http://gage.cbcb.umd.edu/data/index.html) as an example.
[teacher@cn3128 63311518]$ wget \
> http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_{1,2}.fastq.gz \
> http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/shortjump_{1,2}.fastq.gz
--2018-03-07 17:39:23-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz
Resolving dtn07-e0... 10.1.200.243
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 64204080 (61M) [application/x-gzip]
Saving to: “frag_1.fastq.gz”

100%[===================================================================================>] 64,204,080 45.5M/s in 1.3s

2018-03-07 17:39:24 (45.5 MB/s) - “frag_1.fastq.gz” saved [64204080/64204080]

--2018-03-07 17:39:24-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_2.fastq.gz
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 68386818 (65M) [application/x-gzip]
Saving to: “frag_2.fastq.gz”

100%[===================================================================================>] 68,386,818 51.5M/s in 1.3s

2018-03-07 17:39:25 (51.5 MB/s) - “frag_2.fastq.gz” saved [68386818/68386818]

--2018-03-07 17:39:25-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/shortjump_1.fastq.gz
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 74810444 (71M) [application/x-gzip]
Saving to: “shortjump_1.fastq.gz”

100%[===================================================================================>] 74,810,444 84.5M/s in 0.8s

2018-03-07 17:39:26 (84.5 MB/s) - “shortjump_1.fastq.gz” saved [74810444/74810444]

--2018-03-07 17:39:26-- http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/shortjump_2.fastq.gz
Connecting to dtn07-e0|10.1.200.243|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 74594473 (71M) [application/x-gzip]
Saving to: “shortjump_2.fastq.gz”

100%[===================================================================================>] 74,594,473 85.2M/s in 0.8s

2018-03-07 17:39:27 (85.2 MB/s) - “shortjump_2.fastq.gz” saved [74594473/74594473]

FINISHED --2018-03-07 17:39:27--
Downloaded: 4 files, 269M in 4.3s (62.7 MB/s)
[teacher@cn3128 63311518]$ ls *.fastq.gz > inputs
[teacher@cn3128 63311518]$ module load kmergenie
[+] Loading GSL 2.2.1 ...
[+] Loading Graphviz v2.38.0 ...
[+] Loading gdal 2.0 ...
[+] Loading proj 4.9.2 ...
[+] Loading gcc 4.9.1 ...
[+] Loading openmpi 1.10.0 for GCC 4.9.1
[+] Loading tcl_tk 8.6.3
[+] Loading Zlib 1.2.8 ...
[+] Loading Bzip2 1.0.6 ...
[+] Loading pcre 8.38 ...
[+] Loading liblzma 5.2.2 ...
[-] Unloading Zlib 1.2.8 ...
[+] Loading Zlib 1.2.8 ...
[-] Unloading liblzma 5.2.2 ...
[+] Loading liblzma 5.2.2 ...
[+] Loading libjpeg-turbo 1.5.1 ...
[+] Loading tiff 4.0.7 ...
[+] Loading curl 7.46.0 ...
[+] Loading boost libraries v1.65 ...
[+] Loading R 3.4.0 on cn3128
[-] Unloading Zlib 1.2.8 ...
[+] Loading Zlib 1.2.8 ...
[+] Loading kmergenie, version 1.7044...
[teacher@cn3128 63311518]$ kmergenie inputs -t2
running histogram estimation
list of reads: frag_1.fastq.gz frag_2.fastq.gz shortjump_1.fastq.gz shortjump_2.fastq.gz
Setting maximum kmer length to: 101 bp
computing histograms (from k=21 to k=101): 41 21 51 31 61 81 71 91 101
ntCard wall-clock time over all k values: 88 seconds
fitting model to histograms to estimate best k
could not fit histograms-k71.histo
could not fit histograms-k81.histo
estimation of the best k so far: 21
refining estimation around [15; 27], with a step of 2
running histogram estimation
list of reads: frag_1.fastq.gz frag_2.fastq.gz shortjump_1.fastq.gz shortjump_2.fastq.gz
Setting maximum kmer length to: 101 bp
computing histograms (from k=17 to k=27): 19 17 23 21 27 25
ntCard wall-clock time over all k values: 66 seconds
fitting model to histograms to estimate best k
could not fit histograms-k71.histo
could not fit histograms-k81.histo
table of predicted num. of genomic k-mers: histograms.dat
recommended coverage cut-off for best k: 3
best k: 19
[teacher@cn3128 63311518]$ exit
salloc.exe: Relinquishing job allocation 63311518
[teacher@biowulf ~]$
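The session above ends with a recommended coverage cut-off and a best k. If a later step needs those values, they could be pulled out of a saved copy of the output. This is a rough sketch, assuming each message is printed on its own line and that the output was captured to a file (for instance with tee); the log here is a stand-in built from the last three messages of the sample session, and the variable names are illustrative.

```shell
# Stand-in for a saved log. In a real run it might be captured with, e.g.:
#   kmergenie inputs -t2 | tee kmergenie.log
# Assumption: each message appears on its own line.
log=$(mktemp)
cat > "$log" <<'EOF'
table of predicted num. of genomic k-mers: histograms.dat
recommended coverage cut-off for best k: 3
best k: 19
EOF

# Split each line on ": " and print the value that follows the matching label.
best_k=$(awk -F': ' '/^best k:/ { print $2 }' "$log")
cutoff=$(awk -F': ' '/^recommended coverage cut-off/ { print $2 }' "$log")
echo "k=$best_k cutoff=$cutoff"
rm -f "$log"
```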
Create a batch input file (e.g. kmergenie.sh). For example:
#!/bin/sh
set -e
module load kmergenie
test -n "$SLURM_CPUS_PER_TASK" || SLURM_CPUS_PER_TASK=2
kmergenie inputs.fofn -t $SLURM_CPUS_PER_TASK
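The batch script reads its input from a list of read files, one path per line, much like the inputs file created with ls in the interactive session. Below is a small sketch of building and sanity-checking such a list before submitting. It runs in a temporary directory with empty placeholder files whose names are hypothetical; in practice the list would be built in the directory that holds the real reads.

```shell
# Scratch directory with empty placeholder files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir"
touch sample_1.fastq.gz sample_2.fastq.gz

# Write one read file per line; printf avoids parsing ls output.
printf '%s\n' *.fastq.gz > inputs.fofn

# Fail early if the list is empty or names a file that does not exist
# (an unmatched glob would leave the literal pattern in the list).
test -s inputs.fofn || { echo "empty file list" >&2; exit 1; }
while IFS= read -r f; do
    test -e "$f" || { echo "missing: $f" >&2; exit 1; }
done < inputs.fofn

n_files=$(wc -l < inputs.fofn | tr -d ' ')
echo "listed $n_files read files"
```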
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] kmergenie.sh
Create a swarmfile (e.g. kmergenie.swarm). For example:
kmergenie input1.fofn -t $SLURM_CPUS_PER_TASK
kmergenie input2.fofn -t $SLURM_CPUS_PER_TASK
kmergenie input3.fofn -t $SLURM_CPUS_PER_TASK
kmergenie input4.fofn -t $SLURM_CPUS_PER_TASK
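When there are many input lists, a swarmfile like the one above can be generated with a loop rather than written by hand. This is one possible sketch, assuming every *.fofn file in a directory should become one swarm line; the temporary directory and placeholder files are hypothetical.

```shell
# Scratch directory with empty placeholder lists (hypothetical names).
swarm_dir=$(mktemp -d)
touch "$swarm_dir/input1.fofn" "$swarm_dir/input2.fofn"

# One kmergenie command per list. The single-quoted format string keeps
# $SLURM_CPUS_PER_TASK literal, so it is expanded inside each subjob
# rather than when the swarmfile is generated.
for fofn in "$swarm_dir"/*.fofn; do
    printf 'kmergenie %s -t $SLURM_CPUS_PER_TASK\n' "$fofn"
done > "$swarm_dir/kmergenie.swarm"

cat "$swarm_dir/kmergenie.swarm"
```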
Submit this job using the swarm command.
swarm -f kmergenie.swarm -t 2 [-g #] --module kmergenie
where
-g # | Number of gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file) |
--module kmergenie | Loads the kmergenie module for each subjob in the swarm |