High-Performance Computing at the NIH
interproscan on Biowulf & Helix

Description

From the InterProScan home page:

InterProScan is the software package that allows sequences (protein and nucleic) to be scanned against InterPro's signatures. Signatures are predictive models, provided by several different databases, that make up the InterPro consortium.

Multiple versions of interproscan may be available. Modules are an easy way to select a version. To see the available modules, type:

module avail interproscan 

To select a module, use:

module load interproscan/[version]

where [version] is the version of choice.

Environment variables set

Dependencies

Loaded automatically

Documentation

Batch job on Biowulf

InterProScan on our systems is run through a wrapper script that accepts a subset of the native InterProScan options. The wrapper splits the input into chunks (if necessary) and sets up a Slurm job array script that runs interproscan on the chunks in parallel. The chunks are sized to run on the quick partition and make extensive use of the node-local lscratch disks.
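To illustrate the splitting step, here is a minimal sketch of how a FASTA file could be cut into fixed-size chunks. It is not the wrapper's actual code: the function name `split_fasta` and its arguments are hypothetical, and only the `<infile>.chunkNNNN` naming is taken from the example run on this page.

```shell
#!/bin/bash
# split_fasta INFILE CHUNKSIZE OUTDIR
# Writes OUTDIR/<basename of INFILE>.chunk0001, .chunk0002, ... with at
# most CHUNKSIZE sequences per file.
# Hypothetical helper for illustration; not the wrapper's own code.
split_fasta() {
    local infile=$1 chunksize=$2 outdir=$3
    mkdir -p "$outdir"
    awk -v size="$chunksize" -v prefix="$outdir/$(basename "$infile")" '
        /^>/ {
            # open a new chunk file every `size` header lines
            if (nseq % size == 0) {
                if (out != "") close(out)
                out = sprintf("%s.chunk%04d", prefix, nseq / size + 1)
            }
            nseq++
        }
        # copy header and sequence lines to the current chunk
        out != "" { print > out }
    ' "$infile"
}

# Usage (assumed paths): split_fasta ecoli_benchmark.fa 600 test
```

Each chunk then becomes one task of the job array, so the chunk size directly controls how many subjobs are submitted.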

Set up the environment and copy the example data (4303 proteins from the E. coli genome):

biowulf$ module load interproscan
[+] Loading java 1.8.0_92 ...
[+] Loading interproscan 5.22-61.0
biowulf$ cp $INTERPROSCAN_TEST_DATA/ecoli_benchmark.fa .
biowulf$ grep -c '^>' ecoli_benchmark.fa
4303
biowulf$ interproscan --help

NAME
    interproscan - set up and run an interproscan job on the cluster

SYNOPSIS
    interproscan [interproscan options] <infile> <runname> <chunksize>

DESCRIPTION
    This is a biowulf specific wrapper around interproscan.
    It takes an input file and splits it up into chunks
    that can run on the quick partition. It takes a subset of
    interproscan options along with some arguments.

    The <runname> is used to create a folder structure holding
    potentially split input files and results for each chunk.
    <chunksize> determines how many sequences are processed
    by each subjob. This must be between 200 and 1000.

    The following interpro options are the only options 
    supported by this wrapper:

    -t,--seqtype <SEQUENCE-TYPE>
        (n)ucleotide or (p)rotein. Default: p
    -appl,--applications
        Comma separated list of applications. Default: ALL
    -f,--formats <FORMAT-LIST>
        Case-insensitive, comma separated list of output formats.
        Supported formats are TSV, XML, GFF3, HTML, and SVG.
    -iprlookup,--iprlookup
        Include InterPro annotation in output.
    -goterms,--goterms
        Include GO terms in output. Implies -iprlookup.
    -pa,--pathways
        Include pathway annotation in output. Implies -iprlookup.
    -ms,--minsize <MINIMUM-SIZE>
        Minimum nt size of ORF to report. Will only be considered
        if n is specified as a sequence type. Please be aware that
        small values will lead to long runtimes.

    Use --interproscan-help to see a full description of all the options
    interproscan supports.

AUTHOR
    Wolfgang Resch <wresch@helix.nih.gov>

Set up and start the run

biowulf$ interproscan --goterms --pathways -f svg,tsv,html ecoli_benchmark.fa test 600
Called from:       /data/wresch/test_data/interproscan
Interproscan root: /usr/local/apps/interproscan/5.22-61.0
Interproscan opts:  --goterms --pathways -f svg,tsv,html
Infile:            ecoli_benchmark.fa
Chunksize:         600
Run name:          test
Output dir         /data/wresch/test_data/interproscan/test
sequences:   4303
jobs to run: 8
To submit your interproscan jobs do:

$ cd /data/wresch/test_data/interproscan/test
$ sbatch interproscan.batch

biowulf$ ls -lh test
total 1.9M
-rw-rw-r-- 1 wresch staff 253K Mar 14 10:05 ecoli_benchmark.fa.chunk0001
-rw-rw-r-- 1 wresch staff 213K Mar 14 10:05 ecoli_benchmark.fa.chunk0002
-rw-rw-r-- 1 wresch staff 247K Mar 14 10:05 ecoli_benchmark.fa.chunk0003
-rw-rw-r-- 1 wresch staff 296K Mar 14 10:06 ecoli_benchmark.fa.chunk0004
-rw-rw-r-- 1 wresch staff 275K Mar 14 10:06 ecoli_benchmark.fa.chunk0005
-rw-rw-r-- 1 wresch staff 223K Mar 14 10:06 ecoli_benchmark.fa.chunk0006
-rw-rw-r-- 1 wresch staff 260K Mar 14 10:06 ecoli_benchmark.fa.chunk0007
-rw-rw-r-- 1 wresch staff  36K Mar 14 10:06 ecoli_benchmark.fa.chunk0008
-rw-rw-r-- 1 wresch staff  950 Mar 14 10:06 interproscan.batch
drwxrwxr-x 2 wresch staff 4.0K Mar 14 10:06 slurm_logs
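The job count reported above follows from the sequence count and the chunk size. The sketch below shows the arithmetic. The function name `count_jobs` is hypothetical, and the range check simply mirrors the limit stated in the wrapper's help text.

```shell
#!/bin/bash
# count_jobs NSEQ CHUNKSIZE
# Prints the number of array subjobs needed for NSEQ sequences.
# Hypothetical helper for illustration; not part of the wrapper.
count_jobs() {
    local nseq=$1 chunksize=$2
    # the wrapper's help text restricts <chunksize> to 200..1000
    if [ "$chunksize" -lt 200 ] || [ "$chunksize" -gt 1000 ]; then
        echo "chunksize must be between 200 and 1000" >&2
        return 1
    fi
    # ceiling division: a final partial chunk still needs its own subjob
    echo $(( (nseq + chunksize - 1) / chunksize ))
}

count_jobs 4303 600   # prints 8
```

With 4303 sequences and a chunk size of 600, seven chunks are full and the eighth holds the remaining 103 sequences, which is consistent with the much smaller size of ecoli_benchmark.fa.chunk0008 in the listing.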

biowulf$ cd test
biowulf$ cat interproscan.batch

#! /bin/bash
#SBATCH --mem=10g
#SBATCH --cpus-per-task=8
#SBATCH --gres=lscratch:50
#SBATCH --output=slurm_logs/slurm-%A_%a.out
#SBATCH --array=1-8
#SBATCH --time=4:00:00
#SBATCH --partition=quick

fn=$(printf "%s.chunk%04i" "ecoli_benchmark.fa" $SLURM_ARRAY_TASK_ID)

cd /lscratch/$SLURM_JOB_ID
cp "/data/wresch/test_data/interproscan/test/${fn}" .
cp -r /usr/local/apps/interproscan/5.22-61.0/interproscan_app .
# make panther use a local tempdir
sed -i 's:panther.temporary.file.directory=.*:panther.temporary.file.directory=./ptemp:' $(basename /usr/local/apps/interproscan/5.22-61.0/interproscan_app)/interproscan.properties
mkdir temp
mkdir ptemp
export PATH=$PWD/interproscan_app:$PATH
echo interproscan.sh  --goterms --pathways -f svg,tsv,html --highmem -T ./temp --disable-precalc -i "${fn}"
interproscan.sh  --goterms --pathways -f svg,tsv,html --highmem -T ./temp --disable-precalc -i "${fn}"
cp ${fn}.* /data/wresch/test_data/interproscan/test
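The batch script picks its input through the array index: each task formats `$SLURM_ARRAY_TASK_ID` into a zero-padded chunk suffix. The following sketch isolates that mapping so it can be tried outside of a job. The function name `chunk_for_task` is hypothetical, and the loop only simulates the variable that Slurm sets inside a real array task.

```shell
#!/bin/bash
# chunk_for_task INFILE TASK_ID
# Prints the chunk file name handled by the given array task, using the
# same <infile>.chunkNNNN pattern as the generated batch script.
# Hypothetical helper for illustration.
chunk_for_task() {
    local infile=$1 task_id=$2
    printf '%s.chunk%04d\n' "$infile" "$task_id"
}

# Simulate two array tasks; inside a job Slurm sets this variable itself.
for SLURM_ARRAY_TASK_ID in 1 8; do
    chunk_for_task ecoli_benchmark.fa "$SLURM_ARRAY_TASK_ID"
done
```

Because the mapping depends only on the task index, a single chunk can be rerun by overriding the array range on the command line, for example `sbatch --array=3 interproscan.batch`.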


biowulf$ sbatch interproscan.batch
biowulf$ jobload

     JOBID            TIME            NODES  CPUS  THREADS   LOAD       MEMORY
               Elapsed / Wall               Alloc   Active           Used /     Alloc
35660753_8    00:01:12 /    04:00:00 cn2232     8        9   112%     2.0 /   10.0 GB
35660753_1    00:01:12 /    04:00:00 cn2311     8        9   112%     1.7 /   10.0 GB
35660753_2    00:01:12 /    04:00:00 cn2314     8        8   100%     1.6 /   10.0 GB
35660753_3    00:01:12 /    04:00:00 cn2276     8        7    88%     1.6 /   10.0 GB
35660753_4    00:01:12 /    04:00:00 cn2277     8       11   138%     1.6 /   10.0 GB
35660753_5    00:01:12 /    04:00:00 cn2477     8        9   112%     1.5 /   10.0 GB
35660753_6    00:01:12 /    04:00:00 cn2478     8        8   100%     1.5 /   10.0 GB
35660753_7    00:01:12 /    04:00:00 cn2231     8        9   112%     1.7 /   10.0 GB
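Once all array tasks have finished, the per-chunk results sit next to the chunk files in the run directory. The sketch below checks that every chunk produced a TSV file and concatenates them in chunk order. It assumes that the TSV output of each chunk is named `<chunk>.tsv` (the batch script copies `${fn}.*` back, but the exact file names are an assumption here), and the function name `merge_chunk_tsv` is hypothetical.

```shell
#!/bin/bash
# merge_chunk_tsv RUNDIR INFILE MERGED
# Concatenates RUNDIR/INFILE.chunkNNNN.tsv files into MERGED, in chunk
# order. Returns non-zero if any chunk has no TSV result.
# Hypothetical helper; the <chunk>.tsv naming is an assumption.
merge_chunk_tsv() {
    local rundir=$1 infile=$2 merged=$3
    local chunk missing=0
    : > "$merged"   # start with an empty output file
    for chunk in "$rundir/$infile".chunk[0-9][0-9][0-9][0-9]; do
        [ -e "$chunk" ] || continue   # glob matched nothing
        if [ -e "$chunk.tsv" ]; then
            cat "$chunk.tsv" >> "$merged"
        else
            echo "no TSV result for $chunk" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage (assumed paths): merge_chunk_tsv test ecoli_benchmark.fa ecoli_benchmark.tsv
```

A non-zero return value points to chunks whose subjob failed or timed out; the corresponding logs are in the slurm_logs directory.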

In this particular example the run takes about 1 hour. The following image shows an example of the SVG output:

[Image: example SVG output]