From the deepsignal home page
A deep-learning method for detecting DNA methylation state from Oxford Nanopore sequencing reads.
$DEEPSIGNAL_TEST_DATAAllocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=10g -c6 --gres=lscratch:50,gpu:p100:1
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load guppy deepsignal/2-0.1.3
[+] Loading singularity 3.7.1 on cn2384
[+] Loading guppy 4.2.2 ...
[+] Loading deepsignal 0.1.8
[user@cn3144]$ tar -xzf ${DEEPSIGNAL_TEST_DATA:-none}/fast5s.sample.tar.gz
[user@cn3144]$ ll
total 13M
drwxr-xr-x 3 user group 548K Dec 17 2018 fast5s.al
-rw-r--r-- 1 user group 12M Dec 17 2018 GCF_000146045.2_R64_genomic.fna
To call modifications, the raw fast5 files should be basecalled (Guppy or Albacore) and then be re-squiggled by tombo. At last, modifications of specified motifs can be called by deepsignal. The following are commands to call 5mC in CG contexts from the example data:
[user@cn3144]$ guppy_basecaller -x cuda:all \
-i fast5s.al -r -s fast5s.al.guppy --config dna_r9.4.1_450bps_hac_prom.cfg
ONT Guppy basecalling software version 4.2.2+effbaf8
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_hac_prom.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_hac_prom.jsn
input path: fast5s.al
save path: fast5s.al.guppy
chunk size: 2000
chunks per runner: 1248
records per file: 4000
num basecallers: 4
gpu device: cuda:all
kernel path:
runners per device: 12
Found 3621 fast5 files to process.
Init time: 6689 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 23257 ms, Samples called: 53658471, samples/s: 2.3072e+06
Finishing up any open output files.
Basecalling completed successfully.
[user@cn3144]$ tombo preprocess annotate_raw_with_fastqs \
--fast5-basedir fast5s.al \
--fastq-filenames fast5s.al.guppy/{pass,fail}/*.fastq \
--overwrite \
--processes 6 \
--sequencing-summary-filenames fast5s.al.guppy/sequencing_summary.txt
[14:45:09] Getting read filenames.
[14:45:09] Parsing sequencing summary files.
[14:45:09] Annotating FAST5s with sequence from FASTQs.
100%|███████████████████████████████████████████████████████████████| 3621/3621
[14:45:11] Added sequences to a total of 3621 reads.
[user@cn3144]$ tombo resquiggle fast5s.al \
GCF_000146045.2_R64_genomic.fna \
--processes 6 \
--corrected-group RawGenomeCorrected_001 \
--overwrite
[14:47:15] Loading minimap2 reference.
[14:47:16] Getting file list.
[14:47:16] Loading default canonical ***** DNA ***** model.
[14:47:16] Re-squiggling reads (raw signal to genomic sequence alignment).
100%|█████████████████████████████████████████████████████████████████| 3621/3621
[14:49:08] Final unsuccessful reads summary (4.7% reads unsuccessfully processed; 169 total reads):
3.2% (115 reads) : Alignment not produced
1.2% ( 44 reads) : Poor raw to expected signal matching (revert with `tombo filter clear_filters`)
0.3% ( 10 reads) : Read event to sequence alignment extends beyond bandwidth
[14:49:08] Saving Tombo reads index to file.
Notes:
The available models can be checked with ls /usr/local/apps/deepsignal/VER/model or /usr/local/apps/deepsignal/VER/share/model. Because deepsignal is installed as a container the path used for deepsignal is /model/NAME.
[user@cn3144]$ deepsignal2 call_mods \
--nproc $SLURM_CPUS_PER_TASK \
--input_path fast5s.al/ \
--model_path /model/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt \
--result_file fast5s.al.CpG.call_mods.tsv \
--reference_path GCF_000146045.2_R64_genomic.fna \
--corrected_group RawGenomeCorrected_001
[...snip...]
[user@cn3144]$ call_modification_frequency.py \
--input_path fast5s.al.CpG.call_mods.tsv \
--result_file fast5s.al.CpG.call_mods.frequency.tsv
get 1 input file(s)..
reading the input files..
162916 of 162916 samples used..
writing the result..
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
Create a batch input file (e.g. deepsignal.sh). In this example we assume that the base calling and resquiggle has already been completed and we just run the call_mods step
#!/bin/bash
module load deepsignal/2-0.1.3
deepsignal2 call_mods \
--nproc $SLURM_CPUS_PER_TASK \
--input_path fast5s.al/ \
--model_path /model/model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt \
--result_file fast5s.al.CpG.call_mods.tsv \
--reference_path GCF_000146045.2_R64_genomic.fna \
--corrected_group RawGenomeCorrected_001
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=6 --mem=20g --gres=gpu:p100:1,lscratch:10 deepsignal.sh