Pangolin on Biowulf

Pangolin is a deep-learning-based method for predicting splice site strengths. It is available as a command-line tool that runs on a VCF or CSV file of variants of interest: Pangolin predicts the change in splice site strength caused by each variant and returns a file in the same format. Pangolin's models can also be applied to custom sequences.

References:

Zeng, T., Li, Y.I. "Predicting RNA splicing from DNA sequence using Pangolin." Genome Biol 23, 103 (2022)

Documentation

Pangolin GitHub repo: https://github.com/tkzeng/Pangolin

Important Notes

Module Name: pangolin-splice (see the modules page for more information). Note: this module is distinct from the similarly named Pangolin application.
Pangolin can run without a GPU, but it will be much slower. Please allocate a GPU for your Pangolin jobs.
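
Because CPU-only runs are so much slower, it can be worth confirming that a GPU is actually visible before starting a long run. The following is a minimal sketch, assuming the NVIDIA `nvidia-smi` utility is on the PATH of GPU nodes; the `gpu_status` function name is hypothetical and is not part of Pangolin or the module.

```shell
#!/bin/bash
# Report whether an NVIDIA GPU is visible in the current session.
# gpu_status is an illustrative helper, not a Pangolin command.
gpu_status() {
    # "nvidia-smi -L" prints one line per visible GPU, each starting with "GPU ".
    if command -v nvidia-smi >/dev/null 2>&1 \
        && nvidia-smi -L 2>/dev/null | grep -q '^GPU '; then
        echo "gpu"
    else
        echo "no-gpu"
    fi
}

if [ "$(gpu_status)" = "gpu" ]; then
    echo "GPU visible: ready to start pangolin"
else
    echo "No GPU visible: request one with --gres=gpu:1 before running pangolin" >&2
fi
```

Pangolin itself also prints `Using GPU` at startup when it finds one, so the first line of its output is a second place to check.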

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the `$` prompt):

[user@biowulf ~]$ sinteractive --cpus-per-task=4 --mem=16G --gres=lscratch:10,gpu:1
salloc.exe: Pending job allocation 41538656
salloc.exe: job 41538656 queued and waiting for resources
salloc.exe: job 41538656 has been allocated resources
salloc.exe: Granted job allocation 41538656
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3114 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn3114 ~]$ cd /data/$USER

[user@cn3114 user]$ git clone https://github.com/tkzeng/Pangolin.git
Cloning into 'Pangolin'...
remote: Enumerating objects: 198, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 198 (delta 20), reused 18 (delta 18), pack-reused 172
Receiving objects: 100% (198/198), 190.04 MiB | 19.93 MiB/s, done.
Resolving deltas: 100% (54/54), done.

[user@cn3114 user]$ cd Pangolin/examples

[user@cn3114 examples]$ module load pangolin-splice
[+] Loading pangolin-splice  1.0.1  on cn4273
[+] Loading singularity  3.10.5  on cn4273

[user@cn3114 examples]$ cp /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.annotation.gtf.gz .

[user@cn3114 examples]$ create_db.py gencode.v38lift37.annotation.gtf.gz
Database created: gencode.v38lift37.annotation.db

[user@cn3114 examples]$ pangolin brca.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db brca_pangolin
Using GPU

[user@cn3114 examples]$ tail brca_pangolin.vcf
chr17   41276127        .       A       C       .       .       Pangolin=ENSG00000012048.23_1|-8:0.0|5:-0.06|Warnings:
chr17   41276126        .       C       T       .       .       Pangolin=ENSG00000012048.23_1|-2:0.13|6:-0.07|Warnings:
chr17   41276126        .       C       G       .       .       Pangolin=ENSG00000012048.23_1|-7:0.23|6:-0.05|Warnings:
chr17   41276126        .       C       A       .       .       Pangolin=ENSG00000012048.23_1|-7:0.48|6:-0.16|Warnings:
chr17   41276125        .       C       T       .       .       Pangolin=ENSG00000012048.23_1|-6:0.04|7:-0.05|Warnings:
chr17   41276125        .       C       G       .       .       Pangolin=ENSG00000012048.23_1|-6:0.08|7:-0.03|Warnings:
chr17   41276125        .       C       A       .       .       Pangolin=ENSG00000012048.23_1|-6:0.24|7:-0.26|Warnings:
chr17   41276124        .       T       G       .       .       Pangolin=ENSG00000012048.23_1|-5:0.02|8:-0.05|Warnings:
chr17   41276124        .       T       C       .       .       Pangolin=ENSG00000012048.23_1|4:0.0|8:-0.04|Warnings:
chr17   41276124        .       T       A       .       .       Pangolin=ENSG00000012048.23_1|-5:0.12|8:-0.12|Warnings:
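
Each output record carries a `Pangolin=` entry in the INFO column. In the records above it has the form `gene|position:score|position:score|Warnings:`. Per the Pangolin documentation, the two scores are the largest predicted increase and the largest predicted decrease in splice score, and the positions are relative to the variant. A minimal sketch for pulling out the variants with the largest predicted effects follows. The `filter_pangolin` function and the 0.2 cutoff are illustrative assumptions, and the sketch handles only records with a single gene entry, as in the sample output.

```shell
#!/bin/bash
# Four records copied from the sample output above (columns separated by
# spaces here; awk's default field splitting handles tabs equally well).
cat > sample_pangolin.vcf <<'EOF'
chr17 41276126 . C T . . Pangolin=ENSG00000012048.23_1|-2:0.13|6:-0.07|Warnings:
chr17 41276126 . C A . . Pangolin=ENSG00000012048.23_1|-7:0.48|6:-0.16|Warnings:
chr17 41276125 . C A . . Pangolin=ENSG00000012048.23_1|-6:0.24|7:-0.26|Warnings:
chr17 41276124 . T C . . Pangolin=ENSG00000012048.23_1|4:0.0|8:-0.04|Warnings:
EOF

# filter_pangolin MIN FILE (hypothetical helper)
# Prints chrom, pos, ref, alt, gene, increase, decrease for every record
# whose increase or decrease has an absolute value of at least MIN.
filter_pangolin() {
    awk -v min="$1" '
        /^#/ { next }                          # skip VCF header lines
        {
            n = split($8, info, ";")           # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^Pangolin=/) continue
                sub(/^Pangolin=/, "", info[i])
                split(info[i], part, "|")      # gene | pos:increase | pos:decrease | Warnings:
                split(part[2], gain, ":")
                split(part[3], loss, ":")
                if (gain[2] + 0 >= min + 0 || -(loss[2] + 0) >= min + 0)
                    print $1, $2, $4, $5, part[1], gain[2], loss[2]
            }
        }' "$2"
}

filter_pangolin 0.2 sample_pangolin.vcf
```

With the four sample records and a cutoff of 0.2, this prints the two C>A variants at positions 41276126 and 41276125. On real output, pass `brca_pangolin.vcf` as the second argument.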

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. pangolin.sh). For example:

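#!/bin/bash
# Stop at the first command that fails.
set -e
module load pangolin-splice
# Assumes the repository was cloned and the annotation database was created as in the interactive session.
cd /data/$USER/Pangolin/examples/
pangolin brca.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz gencode.v38lift37.annotation.db brca_pangolin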

Submit this job using the Slurm sbatch command.

sbatch --gres=gpu:1[,lscratch:#] [--cpus-per-task=#] [--mem=#] pangolin.sh
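
Slurm also accepts resource requests as `#SBATCH` directives inside the script, which keeps them alongside the command they apply to. The sketch below writes such a script and syntax-checks it without running it. The values mirror the interactive example (4 CPUs, 16 GB of memory, 10 GB of lscratch, one GPU) and are assumptions to adjust for your data; the file name `pangolin_gpu.sh` is hypothetical.

```shell
#!/bin/bash
# Write a batch script whose resource requests are embedded as #SBATCH lines.
# The quoted 'EOF' keeps $USER unexpanded until the job actually runs.
cat > pangolin_gpu.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=pangolin
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=lscratch:10,gpu:1
set -e
module load pangolin-splice
cd /data/$USER/Pangolin/examples/
pangolin brca.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz gencode.v38lift37.annotation.db brca_pangolin
EOF

# Parse the script without executing it to catch shell syntax errors.
bash -n pangolin_gpu.sh && echo "pangolin_gpu.sh: syntax OK"
```

With the directives in place, the job can be submitted as `sbatch pangolin_gpu.sh`. Options given on the sbatch command line take precedence over the directives in the script.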

Swarm of Jobs

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. pangolin.swarm). For example:

pangolin brca.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db brca_pangolin
pangolin p53.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db p53_pangolin
pangolin kras.vcf /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz  gencode.v38lift37.annotation.db kras_pangolin

Submit this job using the swarm command.

swarm -f pangolin.swarm [-g #] [-t #] --gres=gpu:1 --module pangolin-splice

where

`-g #`	Gigabytes of memory required for each process (one line in the swarmfile)
`-t #`	Number of threads/CPUs required for each process (one line in the swarmfile)
`--gres=gpu:1`	Allocates one GPU for each subjob
`--module pangolin-splice`	Loads the pangolin-splice module for each subjob in the swarm
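
For more than a handful of inputs, the swarmfile can be generated with a loop instead of written by hand. This is a minimal sketch, assuming all input VCFs sit in one directory and that paths contain no spaces. The `make_swarmfile` function name and the temporary demo directory are hypothetical.

```shell
#!/bin/bash
# make_swarmfile VCF_DIR FASTA DB OUT (hypothetical helper)
# Writes one pangolin command per input VCF in VCF_DIR to the swarmfile OUT.
make_swarmfile() {
    local vcf_dir=$1 fasta=$2 db=$3 out=$4 vcf
    : > "$out"                                  # start with an empty swarmfile
    for vcf in "$vcf_dir"/*.vcf; do
        [ -e "$vcf" ] || continue               # no matches: the glob stays literal
        case "$vcf" in
            *_pangolin.vcf) continue ;;         # skip outputs of earlier runs
        esac
        # The output prefix is the input path with .vcf replaced by _pangolin.
        printf 'pangolin %s %s %s %s\n' \
            "$vcf" "$fasta" "$db" "${vcf%.vcf}_pangolin" >> "$out"
    done
}

# Demo with empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/brca.vcf" "$demo_dir/p53.vcf" "$demo_dir/kras.vcf" \
      "$demo_dir/brca_pangolin.vcf"
make_swarmfile "$demo_dir" \
    /fdb/GENCODE/Gencode_human/release_38/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz \
    gencode.v38lift37.annotation.db \
    pangolin.swarm
cat pangolin.swarm
```

The resulting `pangolin.swarm` has one line per input VCF and can be submitted with the swarm command shown above.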