MashMap is an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, it computes the desired alignment boundaries and identity estimates using kmer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=8g --cpus-per-task=16 [user@cn3200 ~]$module load MashMap [+] Loading GSL 2.5 for GCC 7.2.0 ... [+] Loading MashMap 2.0 [user@cn3200 ~]$ mashmap -h ----------------- Mashmap is an approximate long read or contig mapper based on Jaccard similarity ----------------- Example usage: $ mashmap -r ref.fa -q seq.fq [OPTIONS] $ mashmap --rl reference_files_list.txt -q seq.fq [OPTIONS] Available options ----------------- -h, --help Print this help page -rDownload testing data from the PacBio's release of Human 54x long-read coverage dataset, --ref an input reference file (fasta/fastq)[.gz] --refList , --rl a file containing list of reference files, one per line -q , --query an input query file (fasta/fastq)[.gz] --ql , --queryList a file containing list of query files, one per line ... -t , --threads count of threads for parallel execution [default : 1] -o , --output output file name [default : mashmap.out] ...
[user@cn3200 ~]$ wget http://datasets.pacb.com/2013/Human10x/READS/2530572/0001/Analysis_Results/m130929_024849_42213_c100518541910000001823079209281311_s1_p0.1.subreads.fastq -O test_seq.fq --2019-01-10 12:08:07-- http://datasets.pacb.com/2013/Human10x/READS/2530572/0001/Analysis_Results/m130929_024849_42213_c100518541910000001823079209281311_s1_p0.1.subreads.fastq Resolving dtn04-e0 (dtn04-e0)... 10.1.200.240 Connecting to dtn04-e0 (dtn04-e0)|10.1.200.240|:3128... connected. Proxy request sent, awaiting response... 200 OK Length: 361764079 (345M) [application/octet-stream] Saving to: ‘test_seq.fq’ 100%[===============================================================>] 361,764,079 96.8MB/s in 3.6s 2019-01-10 12:08:11 (96.2 MB/s) - ‘test_seq.fq’ saved [361764079/361764079]Specify a human reference GRCh38 sequence:
[user@cn3200 ~]$ ln -s /fdb/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa human_GRCh38.faRun the software on these data with 16 threads:
[user@cn3200 ~]$ mashmap -r human_GRCh38.fa -q test_seq.fq -t 16 >>>>>>>>>>>>>>>>>> Reference = [human_GRCh38.fa] Query = [test_seq.fq] Kmer size = 16 Window size = 111 Segment length = 5000 (read split allowed) Alphabet = DNA Percentage identity threshold = 85% Mapping output file = mashmap.out Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none) Execution threads = 16 >>>>>>>>>>>>>>>>>> INFO, skch::Sketch::build, minimizers picked from reference = 52513331 INFO, skch::Sketch::index, unique minimizers = 17607584 INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 9361370) ... (116827, 1) INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 5534 times during lookup. INFO, skch::main, Time spent computing the reference index: 58.0907 sec INFO, skch::Map::mapQuery, [count of mapped reads, reads qualified for mapping, total input reads] = [11978, 13497, 25249] INFO, skch::main, Time spent mapping the query : 21.6042 sec INFO, skch::main, mapping results saved in : mashmap.outThe result will be stored in the file 'mashmap.out'.
[user@cn3200 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$