Pangolin on Biowulf

Pangolin is a deep-learning-based method for predicting splice site strengths. It is available as a command-line tool that runs on a VCF or CSV file of variants of interest: Pangolin predicts the change in splice site strength caused by each variant and returns a file in the same format. Pangolin's models can also be used with custom sequences.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows each shell prompt):

[user@biowulf ~]$ sinteractive --cpus-per-task=4 --mem=16G --gres=lscratch:10,gpu:1
salloc.exe: Pending job allocation 41538656
salloc.exe: job 41538656 queued and waiting for resources
salloc.exe: job 41538656 has been allocated resources
salloc.exe: Granted job allocation 41538656
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3114 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn3114 ~]$ cd /data/$USER

[user@cn3114 user]$ git clone https://github.com/tkzeng/Pangolin.git
Cloning into 'Pangolin'...
remote: Enumerating objects: 198, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 198 (delta 20), reused 18 (delta 18), pack-reused 172
Receiving objects: 100% (198/198), 190.04 MiB | 19.93 MiB/s, done.
Resolving deltas: 100% (54/54), done.

[user@cn3114 user]$ cd Pangolin/examples

[user@cn3114 examples]$ module load pangolin-splice
[+] Loading pangolin-splice  1.0.1  on cn4273
[+] Loading singularity  3.10.5  on cn4273

[user@cn3114 examples]$ cp /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.annotation.gtf.gz .

[user@cn3114 examples]$ create_db.py gencode.v38lift37.annotation.gtf.gz
Database created: gencode.v38lift37.annotation.db

[user@cn3114 examples]$ pangolin brca.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db brca_pangolin
Using GPU

[user@cn3114 examples]$ tail brca_pangolin.vcf
chr17   41276127        .       A       C       .       .       Pangolin=ENSG00000012048.23_1|-8:0.0|5:-0.06|Warnings:
chr17   41276126        .       C       T       .       .       Pangolin=ENSG00000012048.23_1|-2:0.13|6:-0.07|Warnings:
chr17   41276126        .       C       G       .       .       Pangolin=ENSG00000012048.23_1|-7:0.23|6:-0.05|Warnings:
chr17   41276126        .       C       A       .       .       Pangolin=ENSG00000012048.23_1|-7:0.48|6:-0.16|Warnings:
chr17   41276125        .       C       T       .       .       Pangolin=ENSG00000012048.23_1|-6:0.04|7:-0.05|Warnings:
chr17   41276125        .       C       G       .       .       Pangolin=ENSG00000012048.23_1|-6:0.08|7:-0.03|Warnings:
chr17   41276125        .       C       A       .       .       Pangolin=ENSG00000012048.23_1|-6:0.24|7:-0.26|Warnings:
chr17   41276124        .       T       G       .       .       Pangolin=ENSG00000012048.23_1|-5:0.02|8:-0.05|Warnings:
chr17   41276124        .       T       C       .       .       Pangolin=ENSG00000012048.23_1|4:0.0|8:-0.04|Warnings:
chr17   41276124        .       T       A       .       .       Pangolin=ENSG00000012048.23_1|-5:0.12|8:-0.12|Warnings:
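
To pick out the variants with the largest predicted effects, the annotated VCF can be filtered on the scores in the Pangolin field. The following is a minimal sketch, not part of Pangolin itself: filter_pangolin is a hypothetical helper, and it assumes each entry has the form Pangolin=GENE|POS:SCORE|POS:SCORE|Warnings: with the first score being the largest increase and the second the largest decrease, as the sample output above suggests. Variants annotated with more than one gene are not handled.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pangolin): read an annotated VCF on stdin
# and print variants whose largest increase or largest decrease in splice
# score is at least CUTOFF in magnitude.
# Assumed INFO entry layout: Pangolin=GENE|POS:INCREASE|POS:DECREASE|Warnings:
filter_pangolin() {
    local cutoff="$1"
    awk -v cutoff="$cutoff" -v OFS='\t' '
        /^#/ { next }                          # skip VCF header lines
        {
            n = split($8, keys, ";")           # INFO may hold several keys
            for (i = 1; i <= n; i++) {
                if (keys[i] !~ /^Pangolin=/) continue
                entry = substr(keys[i], 10)    # drop the "Pangolin=" prefix
                if (split(entry, f, "|") < 3) continue
                split(f[2], inc, ":")          # inc[2]: largest increase
                split(f[3], dec, ":")          # dec[2]: largest decrease (<= 0)
                if (inc[2] + 0 >= cutoff || -(dec[2] + 0) >= cutoff)
                    print $1, $2, $4, $5, f[1], inc[2], dec[2]
            }
        }'
}

# Example: filter_pangolin 0.2 < brca_pangolin.vcf
```

The output is tab-separated: chromosome, position, reference allele, alternate allele, gene, largest increase, largest decrease. The cutoff of 0.2 in the example is arbitrary and should be chosen to suit the analysis.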

Batch job
Most jobs should be run as batch jobs.

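Create a batch input file (e.g. pangolin.sh). The example below assumes that the repository was cloned and the annotation database gencode.v38lift37.annotation.db was created with create_db.py, as in the interactive session above. For example: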

#!/bin/bash
set -e
module load pangolin-splice
cd /data/$USER/Pangolin/examples/
pangolin brca.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db brca_pangolin

Submit this job using the Slurm sbatch command.

sbatch --gres=gpu:1[,lscratch:#] [--cpus-per-task=#] [--mem=#] pangolin.sh

Swarm of jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. pangolin.swarm). For example:

pangolin brca.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db brca_pangolin
pangolin p53.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db p53_pangolin
pangolin kras.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db kras_pangolin
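
When there are many input files, a swarmfile like the one above can be generated rather than typed by hand. The following is a small sketch: make_swarm is a hypothetical helper, and it assumes every input name ends in .vcf and that all inputs share the reference FASTA and annotation database used in the examples above.

```shell
#!/bin/bash
# Hypothetical helper: print one pangolin command per VCF file given as an
# argument, in the same form as the swarmfile lines above.
make_swarm() {
    local ref=/fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz
    local db=gencode.v38lift37.annotation.db
    local vcf
    for vcf in "$@"; do
        # Output prefix: the input name without its .vcf suffix, plus _pangolin
        printf 'pangolin %s %s %s %s_pangolin\n' "$vcf" "$ref" "$db" "${vcf%.vcf}"
    done
}

# Example: make_swarm brca.vcf p53.vcf kras.vcf > pangolin.swarm
```

Each output line is an independent command, so the resulting file can be submitted with swarm unchanged.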

Submit this job using the swarm command.

swarm -f pangolin.swarm [-g #] [-t #] --gres=gpu:1 --module pangolin-splice
where
-g # Number of gigabytes of memory required for each process (one line in the swarm command file)
-t # Number of threads/CPUs required for each process (one line in the swarm command file)
--gres=gpu:1 Allocates a GPU for each subjob
--module pangolin-splice Loads the pangolin-splice module for each subjob in the swarm