From the deepsignal home page
A deep-learning method for detecting DNA methylation state from Oxford Nanopore sequencing reads.
$DEEPSIGNAL_TEST_DATA
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=10g -c6 --gres=lscratch:50,gpu:p100:1 salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144]$ cd /lscratch/$SLURM_JOB_ID [user@cn3144]$ module load guppy deepsignal/2-0.1.3 [+] Loading singularity 3.7.1 on cn2384 [+] Loading guppy 4.2.2 ... [+] Loading deepsignal 0.1.8 [user@cn3144]$ tar -xzf ${DEEPSIGNAL_TEST_DATA:-none}/fast5s.sample.tar.gz [user@cn3144]$ ll total 13M drwxr-xr-x 3 user group 548K Dec 17 2018 fast5s.al -rw-r--r-- 1 user group 12M Dec 17 2018 GCF_000146045.2_R64_genomic.fna
To call modifications, the raw fast5 files should be basecalled (Guppy or Albacore) and then be re-squiggled by tombo. At last, modifications of specified motifs can be called by deepsignal. The following are commands to call 5mC in CG contexts from the example data:
[user@cn3144]$ guppy_basecaller -x cuda:all \ -i fast5s.al -r -s fast5s.al.guppy --config dna_r9.4.1_450bps_hac_prom.cfg ONT Guppy basecalling software version 4.2.2+effbaf8 config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_hac_prom.cfg model file: /opt/ont/guppy/data/template_r9.4.1_450bps_hac_prom.jsn input path: fast5s.al save path: fast5s.al.guppy chunk size: 2000 chunks per runner: 1248 records per file: 4000 num basecallers: 4 gpu device: cuda:all kernel path: runners per device: 12 Found 3621 fast5 files to process. Init time: 6689 ms 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| *************************************************** Caller time: 23257 ms, Samples called: 53658471, samples/s: 2.3072e+06 Finishing up any open output files. Basecalling completed successfully. [user@cn3144]$ tombo preprocess annotate_raw_with_fastqs \ --fast5-basedir fast5s.al \ --fastq-filenames fast5s.al.guppy/{pass,fail}/*.fastq \ --overwrite \ --processes 6 \ --sequencing-summary-filenames fast5s.al.guppy/sequencing_summary.txt [14:45:09] Getting read filenames. [14:45:09] Parsing sequencing summary files. [14:45:09] Annotating FAST5s with sequence from FASTQs. 100%|███████████████████████████████████████████████████████████████| 3621/3621 [14:45:11] Added sequences to a total of 3621 reads. [user@cn3144]$ tombo resquiggle fast5s.al \ GCF_000146045.2_R64_genomic.fna \ --processes 6 \ --corrected-group RawGenomeCorrected_001 \ --overwrite [14:47:15] Loading minimap2 reference. [14:47:16] Getting file list. [14:47:16] Loading default canonical ***** DNA ***** model. [14:47:16] Re-squiggling reads (raw signal to genomic sequence alignment). 100%|█████████████████████████████████████████████████████████████████| 3621/3621 [14:49:08] Final unsuccessful reads summary (4.7% reads unsuccessfully processed; 169 total reads): 3.2% (115 reads) : Alignment not produced 1.2% ( 44 reads) : Poor raw to expected signal matching (revert with `tombo filter clear_filters`) 0.3% ( 10 reads) : Read event to sequence alignment extends beyond bandwidth [14:49:08] Saving Tombo reads index to file.
Notes:
The available models can be checked with ls /usr/local/apps/deepsignal/VER/model
or /usr/local/apps/deepsignal/VER/share/model
. Because deepsignal is installed as a container the path used for deepsignal is /model/NAME
.
[user@cn3144]$ deepsignal2 call_mods \ --nproc $SLURM_CPUS_PER_TASK \ --input_path fast5s.al/ \ --model_path /model/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt \ --result_file fast5s.al.CpG.call_mods.tsv \ --reference_path GCF_000146045.2_R64_genomic.fna \ --corrected_group RawGenomeCorrected_001 [...snip...] [user@cn3144]$ call_modification_frequency.py \ --input_path fast5s.al.CpG.call_mods.tsv \ --result_file fast5s.al.CpG.call_mods.frequency.tsv get 1 input file(s).. reading the input files.. 162916 of 162916 samples used.. writing the result.. [user@cn3144]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf]$
Create a batch input file (e.g. deepsignal.sh). In this example we assume that the base calling and resquiggle has already been completed and we just run the call_mods step
#!/bin/bash module load deepsignal/2-0.1.3 deepsignal2 call_mods \ --nproc $SLURM_CPUS_PER_TASK \ --input_path fast5s.al/ \ --model_path /model/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt \ --result_file fast5s.al.CpG.call_mods.tsv \ --reference_path GCF_000146045.2_R64_genomic.fna \ --corrected_group RawGenomeCorrected_001
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=6 --mem=20g --gres=gpu:p100:1,lscratch:10 deepsignal.sh