OptiType on Biowulf
OptiType is a novel HLA genotyping algorithm based on integer linear programming, capable of producing accurate 4-digit HLA genotyping predictions from NGS data by simultaneously selecting all major and minor HLA Class I alleles.
References:
- Szolek A, Schubert B, Mohr C, Sturm M, Feldhahn M, Kohlbacher O. OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics 2014;30(23):3310-3316.
Documentation
- OptiType on GitHub: https://github.com/FRED-2/OptiType
Important Notes
- Module Name: optitype (see the modules page for more information)
- Environment variables set
  - OPTITYPE_HOME
- Default configuration file: $OPTITYPE_HOME/config.ini. Copy and edit this to set the number of threads you want to use, matching $SLURM_CPUS_PER_TASK (see the example after these notes).
- Example files in $OPTITYPE_HOME/test
- Reference data in $OPTITYPE_HOME/data
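The thread count in config.ini must be edited to match the CPUs allocated to your job. A minimal sketch of doing this non-interactively, assuming the first "threads=" line in the file is the mapping thread setting (GNU sed syntax):
cp $OPTITYPE_HOME/config.ini .
# replace the first "threads=..." line with the CPU count Slurm allocated to the job
sed -i "0,/^threads=.*/s//threads=${SLURM_CPUS_PER_TASK}/" config.ini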
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive --cpus-per-task 4 --gres lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load optitype
[+] Loading glpk 4.65
[+] Loading HDF5 1.10.1
[+] Loading samtools 1.9
...
[+] Loading optitype, version 1.3.2...

[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 46116226]$ cp -a $OPTITYPE_HOME/test .
[user@cn3144 46116226]$ cp $OPTITYPE_HOME/config.ini .   # copy the default configuration and edit it to set the number of threads you want to use.
[user@cn3144 46116226]$ OptiTypePipeline.py -i ./test/rna/CRC_81_N_1_fished.fastq ./test/rna/CRC_81_N_2_fished.fastq --rna -v -o ./test/rna/ -c config.ini
mapping with 4 threads...
0:00:03.77 Mapping CRC_81_N_1_fished.fastq to NUC reference...
0:00:08.84 Mapping CRC_81_N_2_fished.fastq to NUC reference...
0:00:14.42 Generating binary hit matrix.
0:00:14.45 Loading ./test/rna/2019_02_01_16_43_08/2019_02_01_16_43_08_1.bam started. Number of HLA reads loaded (updated every thousand):
0:00:14.60 301 reads loaded. Creating dataframe...
0:00:14.62 Dataframes created. Shape: 301 x 7339, hits: 67072 (111698), sparsity: 1 in 19.78
0:00:14.70 Loading ./test/rna/2019_02_01_16_43_08/2019_02_01_16_43_08_2.bam started. Number of HLA reads loaded (updated every thousand):
0:00:14.89 291 reads loaded. Creating dataframe...
0:00:14.91 Dataframes created. Shape: 291 x 7339, hits: 59422 (104782), sparsity: 1 in 20.38
0:00:14.95 Alignment pairing completed. 152 paired, 268 unpaired, 10 discordant
0:00:17.07 temporary pruning of identical rows and columns
0:00:17.11 Size of mtx with unique rows and columns: (124, 250)
0:00:17.11 determining minimal set of non-overshadowed alleles
0:00:17.58 Keeping only the minimal number of required alleles (14,)
0:00:17.58 Creating compact model...
starting ilp solver with 1 threads...
0:00:17.60 Initializing OptiType model...
GLPSOL: GLPK LP/MIP Solver, v4.65
Parameter(s) specified in the command line:
 --write /tmp/tmpHIKpQd.glpk.raw --wglp /tmp/tmpM6Df7g.glpk.glp --cpxlp /tmp/tmpkcqGUm.pyomo.lp
Reading problem data from '/tmp/tmpkcqGUm.pyomo.lp'...
/tmp/tmpkcqGUm.pyomo.lp:715: warning: lower bound of variable 'x1' redefined
/tmp/tmpkcqGUm.pyomo.lp:715: warning: upper bound of variable 'x1' redefined
104 rows, 64 columns, 283 non-zeros
38 integer variables, all of which are binary
753 lines were read
Writing problem data to '/tmp/tmpM6Df7g.glpk.glp'...
632 lines were written
GLPK Integer Optimizer, v4.65
104 rows, 64 columns, 283 non-zeros
38 integer variables, all of which are binary
Preprocessing...
14 hidden covering inequaliti(es) were detected
103 rows, 63 columns, 282 non-zeros
38 integer variables, all of which are binary
Scaling...
 A: min|aij| = 1.000e+00  max|aij| = 3.000e+00  ratio = 3.000e+00
Problem data seem to be well scaled
Constructing initial basis...
Size of triangular part is 103
Solving LP relaxation...
GLPK Simplex Optimizer, v4.65
103 rows, 63 columns, 282 non-zeros
      0: obj = -0.000000000e+00 inf =  3.000e+00 (3)
      3: obj = -0.000000000e+00 inf =  0.000e+00 (0)
*    65: obj =  1.284360000e+02 inf =  6.291e-16 (0)
OPTIMAL LP SOLUTION FOUND
Integer optimization begins...
Long-step dual simplex will be used
+    65: mip =     not found yet <=              +inf        (1; 0)
+    65: >>>>>   1.284360000e+02 <=   1.284360000e+02   0.0% (1; 0)
+    65: mip =   1.284360000e+02 <=     tree is empty   0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
Time used:   0.0 secs
Memory used: 0.2 Mb (193401 bytes)
Writing MIP solution to '/tmp/tmpHIKpQd.glpk.raw'...
177 lines were written
0:00:17.78 Result dataframe has been constructed...

[user@cn3144 46116226]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
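Note that /lscratch/$SLURM_JOB_ID is deleted when the allocation ends, so copy any results you want to keep back to your data area before exiting. OptiType writes its output to a timestamped subdirectory of the -o path (2019_02_01_16_43_08 in the session above); for example (destination path is illustrative):
cp -r ./test/rna/2019_02_01_16_43_08 /data/$USER/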
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. optitype.sh). For example:
#!/bin/bash
set -e
module load optitype
OptiTypePipeline.py -i ./test/rna/CRC_81_N_1_fished.fastq ./test/rna/CRC_81_N_2_fished.fastq --rna -v -o ./test/rna/ -c config.ini
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] optitype.sh
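For example, to allocate 4 CPUs to match a config.ini that sets threads=4 (the memory figure is illustrative, not a measured requirement):
sbatch --cpus-per-task=4 --mem=8g optitype.sh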
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. optitype.swarm). For example:
OptiTypePipeline.py -i ./test/exome/NA11995_SRR766010_1_fished.fastq ./test/exome/NA11995_SRR766010_2_fished.fastq --dna -v -o ./test/exome/ -c config.ini
OptiTypePipeline.py -i ./test/rna/CRC_81_N_1_fished.fastq ./test/rna/CRC_81_N_2_fished.fastq --rna -v -o ./test/rna/ -c config.ini
Submit this job using the swarm command.
swarm -f optitype.swarm [-g #] [-t #] --module optitype
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module optitype | Loads the OptiType module for each subjob in the swarm
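For example, giving each subjob 4 CPUs and 8 GB of memory (illustrative values; adjust to your data):
swarm -f optitype.swarm -g 8 -t 4 --module optitype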