OptiType on Biowulf

OptiType is a novel HLA genotyping algorithm based on integer linear programming. It produces accurate 4-digit HLA typing predictions from NGS data by simultaneously selecting all major and minor HLA Class I alleles.

Important Notes
Module Name: optitype
Example input files are available in $OPTITYPE_HOME/test, and the default configuration file is $OPTITYPE_HOME/config.ini.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session:

[user@biowulf]$ sinteractive --cpus-per-task 4 --gres lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load optitype
[+] Loading glpk  4.65 
[+] Loading HDF5  1.10.1 
[+] Loading samtools 1.9  ... 
[+] Loading optitype, version 1.3.2...
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 46116226]$ cp -a $OPTITYPE_HOME/test . 
[user@cn3144 46116226]$ cp $OPTITYPE_HOME/config.ini . # copy the default configuration and edit it to set the number of threads you want to use (a sketch follows this session).
[user@cn3144 46116226]$ OptiTypePipeline.py -i ./test/rna/CRC_81_N_1_fished.fastq ./test/rna/CRC_81_N_2_fished.fastq --rna -v -o ./test/rna/ -c config.ini

    mapping with 4 threads...

    0:00:03.77 Mapping CRC_81_N_1_fished.fastq to NUC reference...

    0:00:08.84 Mapping CRC_81_N_2_fished.fastq to NUC reference...

    0:00:14.42 Generating binary hit matrix.
    0:00:14.45 Loading ./test/rna/2019_02_01_16_43_08/2019_02_01_16_43_08_1.bam started. Number of HLA reads loaded (updated every thousand):

    0:00:14.60 301 reads loaded. Creating dataframe...
    0:00:14.62 Dataframes created. Shape: 301 x 7339, hits: 67072 (111698), sparsity: 1 in 19.78
    0:00:14.70 Loading ./test/rna/2019_02_01_16_43_08/2019_02_01_16_43_08_2.bam started. Number of HLA reads loaded (updated every thousand):

    0:00:14.89 291 reads loaded. Creating dataframe...
    0:00:14.91 Dataframes created. Shape: 291 x 7339, hits: 59422 (104782), sparsity: 1 in 20.38
    0:00:14.95 Alignment pairing completed. 152 paired, 268 unpaired, 10 discordant 

    0:00:17.07 temporary pruning of identical rows and columns

    0:00:17.11 Size of mtx with unique rows and columns: (124, 250)
    0:00:17.11 determining minimal set of non-overshadowed alleles

    0:00:17.58 Keeping only the minimal number of required alleles (14,)

    0:00:17.58 Creating compact model...

    starting ilp solver with 1 threads...

    0:00:17.60 Initializing OptiType model...
    GLPSOL: GLPK LP/MIP Solver, v4.65
    Parameter(s) specified in the command line:
    --write /tmp/tmpHIKpQd.glpk.raw --wglp /tmp/tmpM6Df7g.glpk.glp --cpxlp /tmp/tmpkcqGUm.pyomo.lp
    Reading problem data from '/tmp/tmpkcqGUm.pyomo.lp'...
    /tmp/tmpkcqGUm.pyomo.lp:715: warning: lower bound of variable 'x1' redefined
    /tmp/tmpkcqGUm.pyomo.lp:715: warning: upper bound of variable 'x1' redefined
    104 rows, 64 columns, 283 non-zeros
    38 integer variables, all of which are binary
    753 lines were read
    Writing problem data to '/tmp/tmpM6Df7g.glpk.glp'...
    632 lines were written
    GLPK Integer Optimizer, v4.65
    104 rows, 64 columns, 283 non-zeros
    38 integer variables, all of which are binary
    Preprocessing...
    14 hidden covering inequaliti(es) were detected
    103 rows, 63 columns, 282 non-zeros
    38 integer variables, all of which are binary
    Scaling...
    A: min|aij| =  1.000e+00  max|aij| =  3.000e+00  ratio =  3.000e+00
    Problem data seem to be well scaled
    Constructing initial basis...
    Size of triangular part is 103
    Solving LP relaxation...
    GLPK Simplex Optimizer, v4.65
    103 rows, 63 columns, 282 non-zeros
    0: obj =  -0.000000000e+00 inf =   3.000e+00 (3)
    3: obj =  -0.000000000e+00 inf =   0.000e+00 (0)
    *    65: obj =   1.284360000e+02 inf =   6.291e-16 (0)
    OPTIMAL LP SOLUTION FOUND
    Integer optimization begins...
    Long-step dual simplex will be used
    +    65: mip =     not found yet <=              +inf        (1; 0)
    +    65: >>>>>   1.284360000e+02 <=   1.284360000e+02   0.0% (1; 0)
    +    65: mip =   1.284360000e+02 <=     tree is empty   0.0% (0; 1)
    INTEGER OPTIMAL SOLUTION FOUND
    Time used:   0.0 secs
    Memory used: 0.2 Mb (193401 bytes)
    Writing MIP solution to '/tmp/tmpHIKpQd.glpk.raw'...
    177 lines were written

    0:00:17.78 Result dataframe has been constructed...
[user@cn3144 46116226]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
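The config.ini copied above is where OptiType's thread counts and solver settings live, and the pipeline writes its output into a timestamped subdirectory of the directory given with -o. The sketch below assumes the [mapping] threads key and the <timestamp>_result.tsv / <timestamp>_coverage_plot.pdf output names used by the bundled template and recent OptiType releases; verify the key names against your copy of config.ini before relying on it.

# Sketch only: key names and output file names are assumptions based on the
# bundled config.ini template and recent OptiType releases -- check your copy.
cd /lscratch/$SLURM_JOB_ID
# Set the read-mapping thread count to match the CPUs allocated by sinteractive.
sed -i '/^\[mapping\]/,/^\[/ s/^threads[[:space:]]*=.*/threads='"$SLURM_CPUS_PER_TASK"'/' config.ini
# OptiType creates a timestamped subdirectory under the -o directory containing
# the typing result table and a coverage plot.
ls ./test/rna/*/
cat ./test/rna/*/*_result.tsv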

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. optitype.sh). For example:

#!/bin/bash
set -e
module load optitype
OptiTypePipeline.py -i ./test/rna/CRC_81_N_1_fished.fastq ./test/rna/CRC_81_N_2_fished.fastq --rna -v -o ./test/rna/ -c config.ini

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] optitype.sh
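For real data it is usually worth staging the run in local scratch from within the batch script as well. The script below is a sketch rather than a drop-in replacement: it reuses the bundled test data and assumes an edited config.ini sits in the submission directory, so adjust paths, input files, and resource requests for your own samples.

#!/bin/bash
# Sketch only: input paths and resource sizes are placeholders.
#SBATCH --cpus-per-task=4
#SBATCH --mem=8g
#SBATCH --gres=lscratch:10
set -e
module load optitype

# Stage the run in local scratch, then copy the results back.
cd /lscratch/$SLURM_JOB_ID
cp -a $OPTITYPE_HOME/test .
cp $SLURM_SUBMIT_DIR/config.ini .   # assumes a config.ini edited as described above

mkdir -p ./results
OptiTypePipeline.py -i ./test/rna/CRC_81_N_1_fished.fastq ./test/rna/CRC_81_N_2_fished.fastq \
    --rna -v -o ./results -c config.ini

cp -r ./results $SLURM_SUBMIT_DIR/

Submit it with sbatch optitype.sh; the #SBATCH directives take the place of the corresponding command-line options shown above.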
Swarm of jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. optitype.swarm). For example:

OptiTypePipeline.py -i ./test/exome/NA11995_SRR766010_1_fished.fastq ./test/exome/NA11995_SRR766010_2_fished.fastq --dna -v -o ./test/exome/ -c config.ini
OptiTypePipeline.py -i ./test/rna/CRC_81_N_1_fished.fastq ./test/rna/CRC_81_N_2_fished.fastq --rna -v -o ./test/rna/ -c config.ini

Submit this job using the swarm command.

swarm -f optitype.swarm [-g #] [-t #] --module optitype
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module optitype Loads the OptiType module for each subjob in the swarm
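For more than a handful of samples it is easier to generate the swarmfile than to write it by hand. The loop below is a sketch that assumes paired files named <sample>_1_fished.fastq and <sample>_2_fished.fastq in a fastq/ directory and RNA-seq input; adjust the glob pattern, the --rna/--dna flag, and the output layout for your data.

# Sketch only: the fastq/ layout and the _1_fished/_2_fished naming are assumptions.
for r1 in fastq/*_1_fished.fastq; do
    sample=$(basename "$r1" _1_fished.fastq)
    r2=fastq/${sample}_2_fished.fastq
    mkdir -p results/$sample
    echo "OptiTypePipeline.py -i $r1 $r2 --rna -v -o results/$sample -c config.ini"
done > optitype.swarm

A submission such as swarm -f optitype.swarm -g 8 -t 4 --module optitype would then run one sample per subjob; the -g and -t values here are illustrative and should match the memory your samples need and the thread count set in config.ini.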