spliceai-wrapper: wrapper for Illumina SpliceAI that caches results

spliceai-wrapper wraps Illumina SpliceAI and caches its results: scores found in a precomputed database or in a local cache are looked up instead of being recomputed.


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf ~]$ sinteractive --gres=gpu:p100:1 --mem=8g
[user@cn4466 ~]$ module load spliceai-wrapper
[+] Loading spliceai-wrapper  0.1.0
[user@cn4466 ~]$ spliceai-wrapper -h
usage: spliceai-wrapper [-h] [--version] {prepare,annotate} ...

Caching wrapper for Illumina SpliceAI

positional arguments:
  {prepare,annotate}
    prepare           Construct SQLite database from precomputed data
    annotate          Annotate VCF file with SpliceAI using cache for the
                      scores

optional arguments:
  -h, --help          show this help message and exit
  --version           show program's version number and exit

Copy the sample data to the current directory:
[user@cn4466 ~]$ cp $SAIWR_DATA/whole_genome_filtered_spliceai_scores.vcf.gz . 
Import the precomputed scores into a SQLite3 database:
[user@cn4466 ~]$ spliceai-wrapper prepare -h
usage: spliceai-wrapper prepare [-h] [--release RELEASE]
                                [--precomputed-db-path PRECOMPUTED_DB_PATH]
                                --precomputed-vcf-path PRECOMPUTED_VCF_PATH

optional arguments:
  -h, --help            show this help message and exit
  --release RELEASE     Release to use.
  --precomputed-db-path PRECOMPUTED_DB_PATH
  --precomputed-vcf-path PRECOMPUTED_VCF_PATH
                        Path to VCF file for loading precomputed data from
[user@cn4466 ~]$ spliceai-wrapper prepare  --release GRCh37  \
          --precomputed-db-path precomputed.sqlite3     \
          --precomputed-vcf-path whole_genome_filtered_spliceai_scores.vcf.gz


[I 191011 12:49:37 __main__:148] Running 'prepare' with args = {'action': 'prepare', 'release': 'GRCh37', 'precomputed_db_path': './precomputed.sqlite3', 'precomputed_vcf_path': 'spliceai_wrapper/whole_genome_filtered_spliceai_scores.vcf.gz'}
[I 191011 12:49:37 __main__:153] Opening database file ./precomputed.sqlite3
[I 191011 12:49:37 __main__:158] Executing
    CREATE TABLE IF NOT EXISTS GRCh37_spliceai_scores
    (
        var_desc TEXT PRIMARY KEY,

        chromosome VARCHAR(64),
        position INTEGER,
        reference TEXT,
        alternative TEXT,

        symbol TEXT,
        strand CHARACTER,
        var_type CHARACTER,
        distance INTEGER,
        delta_score_acceptor_gain FLOAT,
        delta_score_acceptor_loss FLOAT,
        delta_score_donor_gain FLOAT,
        delta_score_donor_loss FLOAT,
        delta_position_acceptor_gain INTEGER,
        delta_position_acceptor_loss INTEGER,
        delta_position_donor_gain INTEGER,
        delta_position_donor_loss INTEGER
    );
     to create table...
[I 191011 12:49:37 __main__:161] Opening VCF for import: spliceai_wrapper/whole_genome_filtered_spliceai_scores.vcf.gz...
...
The prepare command takes more than 20 minutes to complete and produces the database file ./precomputed.sqlite3. Alternatively, copy the ready-made database file:
 
[user@cn4466 ~]$ cp $SAIWR_DATA/precomputed.sqlite3 .
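Because the import is slow, a script can check for an existing database before deciding whether to run prepare. The following is a minimal sketch, assuming the database file name used above; the helper db_ready is hypothetical and is not part of spliceai-wrapper. It only tests that the file exists and is non-empty, so it will not detect a database left incomplete by an interrupted import.

```shell
# Hypothetical helper (not part of spliceai-wrapper):
# succeeds only if the given path is a regular, non-empty file.
db_ready() {
    [ -f "$1" ] && [ -s "$1" ]
}

DB=precomputed.sqlite3   # file name used in the session above

if db_ready "$DB"; then
    echo "Reusing existing $DB"
else
    echo "$DB is missing or empty: run 'spliceai-wrapper prepare' or copy it from \$SAIWR_DATA"
fi
```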
Now annotate the variants from the database:
[user@cn4466 ~]$ spliceai-wrapper annotate -h
usage: spliceai-wrapper annotate [-h] --genes-tsv GENES_TSV
                                 [--release RELEASE]
                                 [--precomputed-db-path PRECOMPUTED_DB_PATH]
                                 [--cache-db-path CACHE_DB_PATH] --input-vcf
                                 INPUT_VCF --output-vcf OUTPUT_VCF
                                 [--min-score MIN_SCORE] [--head HEAD]
                                 --path-reference PATH_REFERENCE

optional arguments:
  -h, --help            show this help message and exit
  --genes-tsv GENES_TSV
                        Path to grch3[78].txt from SpliceAI
  --release RELEASE     Release to use.
  --precomputed-db-path PRECOMPUTED_DB_PATH
  --cache-db-path CACHE_DB_PATH
                        Path to SQLite3 file for the cache (to be updated)
  --input-vcf INPUT_VCF
                        Path to VCF file to annotate
  --output-vcf OUTPUT_VCF
                        Path to write annotated VCF to
  --min-score MIN_SCORE
                        Minimal score to consider (report as 0 if smaller).
  --head HEAD           Optional; only consider top N records.
  --path-reference PATH_REFERENCE
                        Path to reference FASTA file.
[user@cn4466 ~]$ cp $SAIWR_DATA/20190804.freebayes.filtered.vcf.gz .
[user@cn4466 ~]$ cp $SAIWR_DATA/grch37.txt . 
[user@cn4466 ~]$ ln -s /fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa
[user@cn4466 ~]$ spliceai-wrapper annotate     \
    --input-vcf ./20190804.freebayes.filtered.vcf.gz    \
    --output-vcf OUTPUT.vcf.gz     \
    --precomputed-db-path ./precomputed.sqlite3     \
    --release GRCh37     \
    --path-reference genome.fa \
    --genes-tsv ./grch37.txt 
...
2th': PosixPath('/home/staff/.cache/spliceai-wrapper/cache.sqlite3'), 'input_vcf': './20190804.freebayes.filtered.vcf.gz', 'output_vcf': 'OUTPUT.vcf.gz', 'min_score': 0.1, 'head': None, 'path_reference': '/fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa'}
[I 191015 09:22:29 __main__:282] Opening ./precomputed.sqlite3 (read-only)
[I 191015 09:22:29 __main__:283] URL = file:./precomputed.sqlite3?mode=ro
[I 191015 09:22:29 __main__:292] Opening /home/staff/.cache/spliceai-wrapper/cache.sqlite3 (cache; writeable)
[I 191015 09:22:29 __main__:297] Executing
    CREATE TABLE IF NOT EXISTS GRCh37_spliceai_scores
    (
        var_desc TEXT PRIMARY KEY,

        chromosome VARCHAR(64),
        position INTEGER,
        reference TEXT,
        alternative TEXT,

        symbol TEXT,
        strand CHARACTER,
        var_type CHARACTER,
        distance INTEGER,
        delta_score_acceptor_gain FLOAT,
        delta_score_acceptor_loss FLOAT,
        delta_score_donor_gain FLOAT,
        delta_score_donor_loss FLOAT,
        delta_position_acceptor_gain INTEGER,
        delta_position_acceptor_loss INTEGER,
        delta_position_donor_gain INTEGER,
        delta_position_donor_loss INTEGER
    );
    ...
[I 191015 09:22:29 __main__:300] Creating temporary directory...
[I 191015 09:22:29 __main__:303]  => /tmp/tmpcfre5h6w
[I 191015 09:22:29 __main__:307] Splitting ./20190804.freebayes.filtered.vcf.gz
[I 191015 09:22:29 __main__:308]   into: /tmp/tmpcfre5h6w/cache_hit.vcf
[I 191015 09:22:29 __main__:309]   and:  /tmp/tmpcfre5h6w/cache_nohit.vcf
18857records [00:22, 831.14records/s]
[I 191015 09:22:52 __main__:249] Hits: 2448/15024 (16.3%), pre hits 12576/15024 (83.7%), pre low-score 0/12576 (0.0%), cache hits 2448/2448 (100.0%), no gene: 3833, cache misses: 0
[I 191015 09:22:52 __main__:347] No cache misses, no need to run spliceai
[I 191015 09:22:52 __main__:359] Converting result with bcftools view -O z -o OUTPUT.vcf.gz /tmp/tmpcfre5h6w/cache_hit.vcf
[I 191015 09:22:53 __main__:412] Done running 'annotate'.

The annotated variants are written to the output file OUTPUT.vcf.gz.
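The counts in the annotate log above can be cross-checked with a little arithmetic. The sketch below copies the numbers from that log; the variable names and the pct helper are illustrative, and reading "pre hits" as hits in the precomputed database and "cache hits" as hits in the local cache is an interpretation of the log rather than something stated by the tool.

```shell
# Counts copied from the 'annotate' log above (variable names are illustrative).
pre_hits=12576      # "pre hits"
cache_hits=2448     # "Hits" / "cache hits"
no_gene=3833        # "no gene"
records_read=18857  # progress line: "18857records"

hits_total=$((pre_hits + cache_hits))
echo "pre hits + cache hits: $hits_total"               # denominator used in the log
echo "plus 'no gene':        $((hits_total + no_gene))" # compare with records_read

# Illustrative helper: percentage with one decimal place.
pct() {
    LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f", 100 * n / d }'
}
echo "pre hits:   $(pct "$pre_hits" "$hits_total")%"
echo "cache hits: $(pct "$cache_hits" "$hits_total")%"
```

The two hit counts sum to the 15024 used as the denominator in the log, and adding the 3833 records reported as "no gene" gives the 18857 records read.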
[user@cn4466 ~]$ exit
[user@biowulf ~]$
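As a quick sanity check, the number of variant records in the compressed output can be counted with standard tools. This is a sketch, assuming the output file name from the session above; count_records is a hypothetical helper, not a spliceai-wrapper command.

```shell
# Hypothetical helper: count non-header lines (variant records)
# in a gzip- or bgzip-compressed VCF file.
count_records() {
    gzip -dc "$1" | grep -vc '^#'
}

OUT=OUTPUT.vcf.gz   # output file name used in the session above

if [ -f "$OUT" ]; then
    echo "$OUT: $(count_records "$OUT") records"
else
    echo "$OUT not found"
fi
```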

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. spliceai-wrapper.sh). For example:

#!/bin/bash
module load spliceai-wrapper
cp $SAIWR_DATA/* .

spliceai-wrapper prepare  --release GRCh37   --precomputed-db-path ./precomputed.sqlite3  \
     --precomputed-vcf-path whole_genome_filtered_spliceai_scores.vcf.gz
 
spliceai-wrapper annotate     \
    --input-vcf ./20190804.freebayes.filtered.vcf.gz    \
    --output-vcf OUTPUT.vcf.gz     \
    --precomputed-db-path ./precomputed.sqlite3     \
    --release GRCh37     \
    --path-reference /fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa     \
    --genes-tsv ./grch37.txt

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] spliceai-wrapper.sh