The three-dimensional coordinates of each protein are used to calculate residue-residue distance matrices.
#!/bin/bash
#SBATCH -J dali_test --ntasks=4 --nodes=1
rm -rf test
ml dali
$DALI_HOME/test.csh
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive --ntasks=4 --nodes=1
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load dali
[user@cn3144 ~]$ cp /pdb/pdb/pp/pdb1ppt.ent.gz .
[user@cn3144 ~]$ cp /pdb/pdb/bb/pdb1bba.ent.gz .
[user@cn3144 ~]$ import.pl --pdbfile pdb1ppt.ent.gz --pdbid 1ppt --dat ./
[user@cn3144 ~]$ import.pl --pdbfile pdb1bba.ent.gz --pdbid 1bba --dat ./
[user@cn3144 ~]$ dali.pl --pdbfile1 pdb1ppt.ent.gz --pdbfile2 pdb1bba.ent.gz --dat1 ./ --dat2 ./ --outfmt "summary,alignments"
[user@cn3144 ~]$ cat mol1A.txt
# Job: test
# Query: mol1A
# No:  Chain   Z    rmsd lali nres  %id PDB  Description
   1:  mol2-A  3.6  1.8   33   36   39       MOLECULE: BOVINE PANCREATIC POLYPEPTIDE;

# Pairwise alignments
No 1: Query=mol1A Sbjct=mol2A Z-score=3.6

DSSP  LLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHLLlll
Query GPSQPTYPGDDAPVEDLIRFYDNLQQYLNVVTRhry 36
ident | | |||| | | | | | ||
Sbjct APLEPEYPGDNATPEQMAQYAAELRRYINMLTRpry 36
DSSP  LLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHLLlll

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. dali.sh). For example:
#!/bin/bash
module load dali
import.pl --pdbfile pdb1ppt.ent.gz --pdbid 1ppt --dat ./
import.pl --pdbfile pdb1bba.ent.gz --pdbid 1bba --dat ./
dali.pl --pdbfile1 pdb1ppt.ent.gz --pdbfile2 pdb1bba.ent.gz --dat1 ./ --dat2 ./ --outfmt "summary,alignments"
Submit this job using the Slurm sbatch command.
sbatch dali.sh
In certain circumstances, dali can be accelerated using MPI. To do so, include --np $SLURM_NTASKS in the dali.pl command, and submit the job with --ntasks=# --nodes=1, where # is the number of MPI tasks requested. MPI parallelism is limited to a single node, so # must not exceed the number of CPUs available on one node. At present the maximum is 128; however, most nodes have only 56 CPUs, so jobs requesting more than 56 CPUs may wait a considerable time in the queue.
...
dali.pl --np $SLURM_NTASKS ...
...
Submit this job using the Slurm sbatch command.
sbatch --ntasks=32 --nodes=1 dali.sh
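Putting the pieces together, an MPI-enabled version of the dali.sh batch script above (a sketch combining the earlier example with the --np option) would look like:

```shell
#!/bin/bash
# dali.sh -- MPI-parallel Dali run
# submit with: sbatch --ntasks=32 --nodes=1 dali.sh
module load dali
import.pl --pdbfile pdb1ppt.ent.gz --pdbid 1ppt --dat ./
import.pl --pdbfile pdb1bba.ent.gz --pdbid 1bba --dat ./
# $SLURM_NTASKS is set by Slurm to the value of --ntasks
dali.pl --np $SLURM_NTASKS \
    --pdbfile1 pdb1ppt.ent.gz --pdbfile2 pdb1bba.ent.gz \
    --dat1 ./ --dat2 ./ --outfmt "summary,alignments"
```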
Running with the AlphaFold database:
#!/bin/bash
module load dali
zcat /pdb/pdb/fd/pdb1fd3.ent.gz > 1fd3.pdb
import.pl --pdbfile 1fd3.pdb --pdbid 1fd3 --dat ./ --clean
dali.pl \
    --title "my search" \
    --cd1 1fd3B \
    --dat1 ./ \
    --db ${DALI_AF}/Digest/HUMAN.list \
    --BLAST_DB ${DALI_AF}/Digest/AF.fasta \
    --repset ${DALI_AF}/Digest/HUMAN_70.list \
    --dat2 ${DALI_AF}/DAT/ \
    --clean \
    --hierarchical \
    --oneway \
    --np ${SLURM_NTASKS}
Type ls ${DALI_AF}/Digest to see all the lists.
NOTES:
import.pl writes one .dat file per chain. After the import step above, the working directory contains: 1fd3A.dat 1fd3B.dat 1fd3C.dat 1fd3D.dat 1fd3.pdb
Running dali in a swarm presents some complications because dali creates temporary files with identical names in the current working directory. Running multiple dali instances simultaneously in the same directory will cause file clobbering.
To prevent this, each step must be run within its own unique directory. To simplify this, a convenience script auto.sh should be used, rather than calling dali directly in a swarm. Here is the code:
#!/bin/bash
# this script is 'auto.sh'

# require a 5 character pdb string
pdb5c=$1
[[ -n ${pdb5c} ]] || { echo "No 5 character pdb string given"; exit 1; }

# base directory where everything is stored
here=$(pwd)

# Check if a .dat file exists, and if not run import.pl
if ls ${here}/DAT/${pdb5c:0:4}/${pdb5c:0:4}*.dat 1> /dev/null 2>&1; then
    echo import.pl ${pdb5c:0:4} done
else
    # run import.pl in a unique temporary subdirectory to prevent file clobbering
    mkdir -p ${here}/DAT/${pdb5c:0:4}
    t=$(mktemp -d) && cd ${t}
    # use an absolute path to the pdb file, since we have cd'd into the temp dir
    import.pl --pdbfile ${here}/${pdb5c:0:4}.pdb --pdbid ${pdb5c:0:4} --dat ${here}/DAT/${pdb5c:0:4}/ --clean
    cd ${here} && rm -rf ${t}
fi

# check if dali has already been run
if ls ${here}/OUT/${pdb5c:0:5}/*.blast 1> /dev/null 2>&1; then
    echo dali.pl ${pdb5c:0:5} done
    exit
fi

# check if the .dat file actually exists
if [[ ! -f ${here}/DAT/${pdb5c:0:4}/${pdb5c:0:5}.dat ]]; then
    echo "${here}/DAT/${pdb5c:0:4}/${pdb5c:0:5}.dat does not exist"
    exit 1
fi

# run dali.pl in a unique subdirectory to prevent file clobbering
rm -f ${here}/OUT/${pdb5c:0:5}.txt
mkdir -p ${here}/OUT/${pdb5c:0:5} && cd $_
dali.pl \
    --title "mysearch_${pdb5c:0:5}" \
    --cd1 ${pdb5c:0:5} \
    --dat1 ${here}/DAT/${pdb5c:0:4}/ \
    --db ${DALI_AF}/Digest/AF.list \
    --BLAST_DB ${DALI_AF}/Digest/AF.fasta \
    --repset ${DALI_AF}/Digest/AF.list \
    --dat2 ${DALI_AF}/DAT/ \
    --clean \
    --hierarchical \
    --oneway \
    --np ${SLURM_NTASKS:-2}
This script assumes that the pdb file (e.g. 1efe.pdb) exists in the current working directory, and can be run like so:
module load dali; bash auto.sh 1efeA
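The script relies on bash substring expansion, ${var:offset:length}, to split the 5-character argument into a 4-character PDB entry and a 1-character chain identifier. A quick illustration:

```shell
#!/bin/bash
# Bash substring expansion: ${var:offset:length}
pdb5c="1efeA"
pdbid=${pdb5c:0:4}   # first 4 characters: PDB entry
chain=${pdb5c:4:1}   # 5th character: chain identifier
echo "entry=${pdbid} chain=${chain}"   # -> entry=1efe chain=A
```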
The script will create files in a distinct hierarchical layout in the current working directory:
.
├── 1efe.pdb
├── DAT
│   └── 1efe
│       └── 1efeA.dat
└── OUT
    └── 1efeA
        ├── 1840494.blast
        ├── 1840494.fasta
        └── 1efeA.txt
This script can be used to run a swarm job. A swarm file swarm.txt is created:
#SWARM --sbatch='--ntasks 16' --module dali --logdir LOG --time 120
bash auto.sh 1efeA
bash auto.sh 1fd3B
bash auto.sh 1fldA
bash auto.sh 1aieA
bash auto.sh 1c26A
bash auto.sh 1hgvA
bash auto.sh 1hgzA
bash auto.sh 1hh0A
bash auto.sh 1ifpA
bash auto.sh 1oiaA
Again, this assumes all pdb files exist in the current working directory. The swarm can be launched as follows:
swarm swarm.txt
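For a large number of structures, the swarm file can be generated rather than written by hand. A minimal sketch, assuming a hypothetical file chains.txt listing one 5-character chain ID per line:

```shell
#!/bin/bash
# Sketch: generate swarm.txt from a list of 5-character chain IDs.
# chains.txt is a hypothetical input file; here we create an example one.
printf '1efeA\n1fd3B\n1fldA\n' > chains.txt

# header line with the swarm options, then one auto.sh call per chain
echo "#SWARM --sbatch='--ntasks 16' --module dali --logdir LOG --time 120" > swarm.txt
while read -r id; do
    echo "bash auto.sh ${id}" >> swarm.txt
done < chains.txt
```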
NOTES:
The #SWARM directive allocates 16 MPI tasks (--ntasks=16) and 120 minutes (--time 120) for each run. Swarm log files are written to the LOG subdirectory.