mbg on Biowulf
From the repository:
Minimizer based sparse de Bruijn Graph constructor. Homopolymer compress input sequences, pick syncmers from hpc-compressed sequences, connect syncmers with an edge if they are adjacent in a read, unitigify and homopolymer decompress. Suggested input is PacBio HiFi/CCS reads, or ONT duplex reads. May or may not work with Illumina reads. Not suggested for PacBio CLR or regular ONT reads
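As a rough illustration of the first step named in the description above, homopolymer compression collapses every run of identical bases into a single base. The sketch below imitates that with `tr -s` on a made-up sequence. It is not how MBG implements the step internally, and the function name `hpc` is hypothetical.

```shell
#!/bin/bash
# Homopolymer-compress a sequence read from stdin:
# squeeze each run of a repeated base down to one base.
# Illustration only; MBG performs this step internally.
hpc() {
    tr -s 'ACGTacgt'
}

# Made-up example sequence, not taken from the test data.
compressed=$(printf 'GAAATTTCCGGGGA\n' | hpc)
echo "$compressed"   # prints GATCGA
```

The compressed sequence is shorter and no longer sensitive to homopolymer length errors. According to the description above, MBG decompresses the sequences again once the unitigs are built.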
References:
- M. Rautiainen, T. Marschall. MBG: Minimizer-based Sparse de Bruijn Graph Construction. Genome Biology (2020).
Documentation
- mbg on GitHub
Important Notes
- Module Name: mbg (see the modules page for more information)
- MBG is a multithreaded tool. Set the thread count (-t) to the number of CPUs allocated to the job, for example with -t $SLURM_CPUS_PER_TASK
- Example files in $MBG_TEST_DATA
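One way to keep the thread count in step with the allocation is to read it from the environment variable that Slurm sets. This is a minimal sketch: the fallback to a single thread outside a Slurm job is an assumption added for illustration, and the MBG options are the ones used in the sample session on this page.

```shell
#!/bin/bash
# Take the thread count from the Slurm allocation.
# Assumption: fall back to 1 thread when not running inside a Slurm job.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "MBG will use $threads thread(s)"

# Run MBG only if it is on the PATH and the example input is present.
if command -v MBG >/dev/null 2>&1 && [ -f SRR10971019.fasta ]; then
    MBG -t "$threads" -i SRR10971019.fasta -o SRR10971019_graph.gfa \
        -k 1501 -w 1450 -a 1 -u 3
fi
```

Deriving `-t` from the allocation rather than hard-coding it means the same script stays correct when the job is resubmitted with a different `--cpus-per-task` value.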
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --gres=lscratch:10 --cpus-per-task=2 --mem=3g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load mbg
[user@cn3144]$ cp ${MBG_TEST_DATA:-none}/SRR10971019.fasta .
[user@cn3144]$ MBG -t $SLURM_CPUS_PER_TASK -i SRR10971019.fasta -o SRR10971019_graph.gfa -k 1501 -w 1450 -a 1 -u 3
Parameters: k=1501,w=1450,a=1,u=3,t=2,r=0,R=0,hpcvariantcov=0,errormasking=hpc,endkmers=no,blunt=no,keepgaps=no,guesswork=no,cache=no
Collecting selected k-mers
Reading sequences from SRR10971019.fasta
1210730 total selected k-mers in reads
265228 distinct selected k-mers in reads
Unitigifying
Filtering by unitig coverage
3513 distinct selected k-mers in unitigs after filtering
Getting read paths
Reading sequences from SRR10971019.fasta
Building unitig sequences
Reading sequences from SRR10971019.fasta
Writing graph to SRR10971019_graph.gfa
selecting k-mers and building graph topology took 19,594 s
unitigifying took 0,81 s
filtering unitigs took 0,4 s
getting read paths took 19,186 s
building unitig sequences took 36,835 s
forcing edge consistency took 0,24 s
writing the graph and calculating stats took 0,94 s
nodes: 567
edges: 730
assembly size 5346906 bp, N50 29122
approximate number of k-mers ~ 4495839

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
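After a run like the session above, a quick sanity check of the output graph can be useful. The sketch below counts segment (S) and link (L) records in a GFA file. It runs on a tiny made-up GFA so that it is self-contained, and the helper name `gfa_summary` is hypothetical. Whether these counts agree exactly with the "nodes" and "edges" figures MBG prints is an assumption worth checking on your own output.

```shell
#!/bin/bash
# Count segment (S) and link (L) records in a GFA file.
gfa_summary() {
    awk -F'\t' '
        $1 == "S" { s++ }
        $1 == "L" { l++ }
        END { printf "segments=%d links=%d\n", s, l }
    ' "$1"
}

# Tiny made-up GFA for demonstration; not MBG output.
demo=$(mktemp)
printf 'H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tGTCA\nL\t1\t+\t2\t+\t2M\n' > "$demo"

summary=$(gfa_summary "$demo")
echo "$summary"   # prints segments=2 links=1
rm -f "$demo"
```

To inspect a real result, call `gfa_summary SRR10971019_graph.gfa` in the directory where MBG wrote the graph.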
Batch job
Most jobs should be run as batch jobs.
Create a batch script (e.g. mbg.sh). For example:
#!/bin/bash
module load mbg/1.0.16
cp ${MBG_TEST_DATA:-none}/SRR10971019.fasta .
MBG -t $SLURM_CPUS_PER_TASK -i SRR10971019.fasta -o SRR10971019_graph.gfa -k 1501 -w 1450 -a 1 -u 3
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] mbg.sh
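As a concrete instance of the template above, the sketch below fills in the resource values used in the interactive sample session (2 CPUs, 3 GB of memory). These values are only a starting point taken from that small test dataset, not a recommendation for larger inputs. The command is printed first and submitted only when `sbatch` and `mbg.sh` are actually available.

```shell
#!/bin/bash
# Resource values copied from the interactive sample session; adjust for your data.
cpus=2
mem=3g

submit_cmd="sbatch --cpus-per-task=$cpus --mem=$mem mbg.sh"
echo "$submit_cmd"

# Submit only where Slurm is available and the batch script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f mbg.sh ]; then
    $submit_cmd
fi
```

Because the batch script passes `$SLURM_CPUS_PER_TASK` to `-t`, changing `cpus` here is enough to change the number of MBG threads.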