SpliceAI: predicting splicing from primary sequence with deep learning
SpliceAI is a deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing.
References:
- Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535–548, January 24, 2019.
Documentation
Important Notes
- Module Name: SpliceAI (see the modules page for more information)
- Single-threaded
- Unusual environment variables set
  - SPLICEAI_HOME: installation directory
  - SPLICEAI_BIN: executable directory
  - SPLICEAI_DATA: sample data directory
  - SPLICEAI_SRC: source code directory
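To confirm that the variables listed above are set after `module load SpliceAI`, a small check along these lines may help. This is only a sketch: the function name `show_spliceai_env` is hypothetical, and it assumes the module exports the variables into the environment.

```shell
# Print each SpliceAI-related variable named in the notes above.
# Assumes the variables are exported by `module load SpliceAI`.
show_spliceai_env() {
    for name in SPLICEAI_HOME SPLICEAI_BIN SPLICEAI_DATA SPLICEAI_SRC; do
        # printenv exits non-zero when the variable is not in the environment
        value=$(printenv "$name") || value="(not set)"
        printf '%s=%s\n' "$name" "$value"
    done
}

show_spliceai_env
```

Any line ending in "(not set)" suggests the module was not loaded in the current shell.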
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=8g --gres=gpu:v100:1
[user@cn4466 ~]$ module load SpliceAI
[+] Loading python 3.6 ...
[+] Loading cuDNN 7.0 libraries...
[+] Loading CUDA Toolkit 9.0.176 ...
[user@cn4466 ~]$ spliceai -h
...
usage: spliceai [-h] [-I [input]] [-O [output]] -R reference -A annotation
                [-D [distance]] [-M [mask]]

Version: 1.3

optional arguments:
  -h, --help     show this help message and exit
  -I [input]     path to the input VCF file, defaults to standard in
  -O [output]    path to the output VCF file, defaults to standard out
  -R reference   path to the reference genome fasta file
  -A annotation  "grch37" (GENCODE V24lift37 canonical annotation file in
                 package), "grch38" (GENCODE V24 canonical annotation file in
                 package), or path to a similar custom gene annotation file
  -D [distance]  maximum distance between the variant and gained/lost splice
                 site, defaults to 50
  -M [mask]      mask scores representing annotated acceptor/donor gain and
                 unannotated acceptor/donor loss, defaults to 0

Download sample data:
[user@cn4466 ~]$ cp $SPLICEAI_DATA/* .

Specify a reference sequence:
[user@cn4466 ~]$ ln -s /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa hg38.fa

Run the spliceai executable on the sample data:
[user@cn4466 ~]$ spliceai -I input.vcf -O output.vcf -R hg38.fa -A grch37 &
...
[user@cn4466 ~]$ nvidia-smi
...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:84:00.0 Off |                  Off |
| N/A   35C    P0    71W / 149W |  11621MiB / 12206MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     45210      C   /usr/bin/python                 11616MiB |
+-----------------------------------------------------------------------------+
...
WARNING:root:Skipping record (ref issue): 2 152389953 . T A,C,G . . .
WARNING:root:Skipping record (ref issue): 2 179415988 . C CA . . .
WARNING:root:Skipping record (ref issue): 2 179446218 . ATACT A . . .
WARNING:root:Skipping record (ref issue): 2 179446218 . ATACT AT,ATA . . .
WARNING:root:Skipping record (ref issue): 2 179642185 . G A . . .
WARNING:root:Skipping record (ref issue): 19 38958362 . C T . . .
WARNING:root:Skipping record (ref issue): 21 47406854 . CCA C . . .
WARNING:root:Skipping record (ref issue): 21 47406856 . A AT . . .
WARNING:root:Skipping record (ref issue): X 129274636 . A C,G,T . . .

A "ref issue" warning means that the REF allele of the record does not match the reference FASTA at that position, which typically happens when the genome build of the reference differs from that of the input VCF. Skipped records are written to the output without scores.

An output file output.vcf will be produced:
[user@cn4466 ~]$ cat output.vcf
...
#CHROM  POS        ID  REF    ALT     QUAL  FILTER  INFO
1       25000      .   A      C,G,T   .     .       .
2       152389953  .   T      A,C,G   .     .       .
2       179415988  .   C      CA      .     .       .
2       179446218  .   ATACT  A       .     .       .
2       179446218  .   ATACT  AT,ATA  .     .       .
2       179642185  .   G      A       .     .       .
19      38958362   .   C      T       .     .       .
21      47406854   .   CCA    C       .     .       .
21      47406856   .   A      AT      .     .       .
X       129274636  .   A      C,G,T   .     .       .

Exit the application:
[user@cn4466 ~]$ exit
[user@biowulf ~]$
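After a run like the one above, it can be useful to summarize how many records were skipped and how many actually received scores. The sketch below rests on two assumptions that do not come from this page: that the warnings (written to standard error) were captured in a file, here given the hypothetical name spliceai.log, and that scored records carry a `SpliceAI=` entry in the INFO column, as described in the upstream SpliceAI README. The function names are illustrative only.

```shell
# Hypothetical capture of the warnings during the run:
#   spliceai -I input.vcf -O output.vcf -R hg38.fa -A grch37 2> spliceai.log

# Count records skipped because REF did not match the reference FASTA.
# $1: file holding the captured warnings
count_ref_issues() {
    awk '/Skipping record \(ref issue\)/ {n++} END {print n+0}' "$1"
}

# Count VCF data lines whose INFO column (field 8) contains a SpliceAI= entry.
# Assumes the tab-delimited layout required by the VCF format.
# $1: output VCF written by spliceai
count_annotated() {
    awk -F '\t' '!/^#/ && $8 ~ /(^|;)SpliceAI=/ {n++} END {print n+0}' "$1"
}

# Report only when both files are present in the working directory.
if [ -f spliceai.log ] && [ -f output.vcf ]; then
    printf 'skipped (ref issue): %s\n' "$(count_ref_issues spliceai.log)"
    printf 'annotated records:   %s\n' "$(count_annotated output.vcf)"
fi
```

A high skipped count together with few or no annotated records is a hint to re-check that the reference FASTA and the `-A` annotation match the genome build of the input VCF.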
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. spliceai.sh). For example:
#!/bin/bash
module load SpliceAI
cp $SPLICEAI_DATA/* .
spliceai -I input.vcf -O output.vcf -R /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -A grch37
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] spliceai.sh
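Because a batch job runs unattended, it may be worth checking the inputs before the spliceai call so that a missing file fails the job immediately. The helper below is a minimal sketch of that idea; `require_file` is a hypothetical name and is not part of SpliceAI or the module.

```shell
# Return non-zero, with a message on stderr, if a required input
# is missing or empty.
# $1: path to check
require_file() {
    if [ ! -s "$1" ]; then
        printf 'ERROR: missing or empty file: %s\n' "$1" >&2
        return 1
    fi
}

# Intended use inside spliceai.sh, before the spliceai command:
#   require_file input.vcf || exit 1
#   require_file /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa || exit 1
```

With these checks in place, Slurm reports a failed job with a clear message in the job's error output instead of a Python traceback later in the run.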