megalodon on Biowulf
Megalodon is a research command-line tool that extracts high-accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.
References:
Documentation
Important Notes
- Module Name: megalodon (see the modules page for more information)
- This is a GPU-only application and will only run on Pascal or newer GPU architectures. It will NOT run on Kepler (k20 or k80) GPUs.
- Example files in /usr/local/apps/megalodon/<ver>/example
- Use --guppy-server-path ${MEGALODON_GUPPY_PATH} in your megalodon call. It points to the correct executable within the container (see the snippet below).
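A quick way to check these settings after loading the module (the exact output will vary; the version directory shown here matches the 2.5.0 examples below):

module load megalodon
echo ${MEGALODON_GUPPY_PATH}         # path to the guppy server executable inside the container
ls /usr/local/apps/megalodon/2.5.0/  # example/test data shipped with this version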
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. The data in this example are for illustrative purposes only and do not produce useful results.
Sample session (user input in bold):
[user@biowulf ~]$ sinteractive -c12 --mem=32g --gres=lscratch:10,gpu:p100:1
salloc.exe: Pending job allocation 5091510
salloc.exe: job 5091510 queued and waiting for resources
salloc.exe: job 5091510 has been allocated resources
salloc.exe: Granted job allocation 5091510
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn2369 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn2369 ~]$ cd /lscratch/$SLURM_JOB_ID

[user@cn2369 5091510]$ module load megalodon
[+] Loading megalodon 2.5.0 on cn2369
[+] Loading singularity 3.8.1-1 on cn2369

[user@cn2369 5091510]$ cp -r /usr/local/apps/megalodon/2.5.0/TEST_DATA/ .

[user@cn2369 5091510]$ cd TEST_DATA/

[user@cn2369 example]$ megalodon fast5/ \
    --guppy-server-path ${MEGALODON_GUPPY_PATH} \
    --outputs basecalls mappings mod_mappings mods \
    --output-directory mega_out \
    --reference /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    --devices ${CUDA_VISIBLE_DEVICES} \
    --processes ${SLURM_CPUS_PER_TASK} \
    --overwrite \
    --guppy-config res_dna_r941_min_modbases_5mC_CpG_v001.cfg
********************
WARNING: "mods" output requested, so "per_read_mods" will be added to outputs.
********************
[12:27:45] Loading guppy basecalling backend
[2022-07-12 12:27:49.969485] [0x00002aaaf57a0700] [info] Connecting to server as ''
[2022-07-12 12:27:49.972970] [0x00002aaaf57a0700] [info] Connected to server as ''. Connection id: e96656bc-1a2a-4e39-8ea3-0dccc1f0757a
[12:27:50] Loading reference
[12:29:32] Loaded model calls canonical alphabet ACGT and modified bases m=5mC (alt to C)
[12:29:33] Preparing workers to process reads
[12:29:34] Processing reads
Full output or empty input queues indicate I/O bottleneck
3 most common unsuccessful processing stages:
-----
[2022-07-12 12:29:34.642849] [0x00002aaaed59f700] [info] Connecting to server as ''
[2022-07-12 12:29:34.646135] [0x00002aaaed59f700] [info] Connected to server as ''. Connection id: 4b6bae96-7baa-44a2-84fa-a8ab743d4efa, ?reads/s]
...
[2022-07-12 12:29:37.269726] [0x00002aaaed59f700] [info] Connecting to server as ''
[2022-07-12 12:29:37.286846] [0x00002aaaed59f700] [info] Connected to server as ''. Connection id: 03633509-8c2e-45a1-b4a9-feda7170a5ca
    99.6% ( 7871 reads) : No alignment
-----
-----
Read Processing: 100%|██████████████████████████████| 8000/8000 [02:03<00:00, 64.97reads/s, samples/s=2.4e+6]
input queue capacity extract_signal : 0%| | 0/10000
output queue capacity basecalls     : 0%| | 0/10000
output queue capacity mappings      : 0%| | 0/10000
output queue capacity per_read_mods : 0%| | 0/10000
[12:31:37] Unsuccessful processing types:
    99.7% ( 7972 reads) : No alignment
[12:31:38] Spawning modified base aggregation processes
[12:31:40] Aggregating 5946 per-read modified base statistics
[12:31:40] NOTE: If this step is very slow, ensure the output directory is located on a fast read disk (e.g. local SSD). Aggregation can be restarted using the `megalodon_extras aggregate run` command
Mods: 100%|███████████████████████████████████████████████| 5946/5946 [00:00<00:00, 27328.23 per-read calls/s]
[12:31:41] Mega Done

[user@cn2369 example]$ exit
exit
salloc.exe: Relinquishing job allocation 5091510
[user@biowulf ~]$
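Note that /lscratch/$SLURM_JOB_ID is removed when the interactive allocation ends, so copy any results you want to keep back to shared storage before exiting; for example (the /data/$USER destination is an assumption, substitute your own directory):

[user@cn2369 example]$ cp -r mega_out /data/$USER/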
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. megalodon.sh). For example:
#!/bin/bash
set -e
module load megalodon
megalodon fast5/ \
    --guppy-server-path ${MEGALODON_GUPPY_PATH} \
    --outputs basecalls mappings mod_mappings mods \
    --output-directory mega_out \
    --reference /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    --devices ${CUDA_VISIBLE_DEVICES} \
    --processes ${SLURM_CPUS_PER_TASK} \
    --overwrite
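The megalodon log above notes that aggregation is much faster when the output directory sits on fast local disk. Below is a sketch of the same script that writes to /lscratch and copies results back at the end; it assumes your fast5 files and final destination live under /data/$USER (adjust the paths to your own layout) and requires requesting local scratch with --gres=lscratch:# at submission:

#!/bin/bash
set -e
module load megalodon
# work in local scratch for fast I/O
cd /lscratch/$SLURM_JOB_ID
megalodon /data/$USER/fast5/ \
    --guppy-server-path ${MEGALODON_GUPPY_PATH} \
    --outputs basecalls mappings mod_mappings mods \
    --output-directory mega_out \
    --reference /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    --devices ${CUDA_VISIBLE_DEVICES} \
    --processes ${SLURM_CPUS_PER_TASK} \
    --overwrite
# copy results back before the job (and /lscratch) ends
cp -r mega_out /data/$USER/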
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] [--gres=gpu:#] megalodon.sh
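For example, requesting 12 CPUs, 32 GB of memory, and one P100 GPU, mirroring the interactive session above (add lscratch:# to --gres if you use the /lscratch variant of the script):

sbatch --cpus-per-task=12 --mem=32g --gres=gpu:p100:1 megalodon.sh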
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. megalodon.swarm). For example:
megalodon dir1/ \
    --guppy-server-path ${MEGALODON_GUPPY_PATH} \
    --outputs basecalls mappings mod_mappings mods \
    --output-directory out1 \
    --reference /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    --devices ${CUDA_VISIBLE_DEVICES} \
    --processes ${SLURM_CPUS_PER_TASK}
megalodon dir2/ \
    --guppy-server-path ${MEGALODON_GUPPY_PATH} \
    --outputs basecalls mappings mod_mappings mods \
    --output-directory out2 \
    --reference /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    --devices ${CUDA_VISIBLE_DEVICES} \
    --processes ${SLURM_CPUS_PER_TASK}
megalodon dir3/ \
    --guppy-server-path ${MEGALODON_GUPPY_PATH} \
    --outputs basecalls mappings mod_mappings mods \
    --output-directory out3 \
    --reference /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    --devices ${CUDA_VISIBLE_DEVICES} \
    --processes ${SLURM_CPUS_PER_TASK}
Submit this job using the swarm command.
swarm -f megalodon.swarm [-g #] [-t #] [--gres=gpu:#] --module megalodon
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file)
--gres=gpu:# | GPU type and number to use for each subjob
--module megalodon | Loads the megalodon module for each subjob in the swarm
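For example, a submission giving each subjob 32 GB of memory, 12 CPUs, and one P100 GPU (values are illustrative and mirror the interactive session above):

swarm -f megalodon.swarm -g 32 -t 12 --gres=gpu:p100:1 --module megalodon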