High-Performance Computing at the NIH
CD-HIT on Biowulf & Helix

CD-HIT is a widely used program for clustering and comparing protein or nucleotide sequences. It is very fast and can handle extremely large databases. CD-HIT significantly reduces the computational and manual effort in many sequence analysis tasks, and it aids in understanding the structure of a dataset and correcting the bias within it.

CD-HIT was originally developed by Dr. Weizhong Li in Dr. Adam Godzik's lab at the Burnham Institute (now Sanford-Burnham Medical Research Institute).

On Helix

Sample session:

helix% module load cd-hit

helix% cd-hit -i /fdb/fastadb/swiss.aa.fas  -o swiss100 -c 1.00 -n 5 -M 16000 -d 0  -T 4

On Helix, the number of threads (the -T 4 flag in the command above) should be set to no more than 4.
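
If the thread count comes from a variable or a wrapper script, the Helix limit can be enforced before cd-hit is called. The following is a minimal sketch, not part of CD-HIT or the Helix environment: the variable HELIX_MAX_THREADS and the function cap_threads are hypothetical names chosen for illustration.

```bash
#!/bin/bash
# Sketch only: cap a requested thread count at the Helix limit of 4.

HELIX_MAX_THREADS=4   # hypothetical variable name; the limit itself is from the text above

# Hypothetical helper: print the requested count, or the limit if it is exceeded.
cap_threads() {
    local requested=$1
    if [ "$requested" -gt "$HELIX_MAX_THREADS" ]; then
        echo "$HELIX_MAX_THREADS"
    else
        echo "$requested"
    fi
}

threads=$(cap_threads 8)

# Print the command for inspection; remove "echo" to run it.
echo cd-hit -i /fdb/fastadb/swiss.aa.fas -o swiss100 -c 1.00 -n 5 -M 16000 -d 0 -T "$threads"
```

Here a request for 8 threads is reduced to 4, while a request for 2 threads would be passed through unchanged.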

Batch job on Biowulf

Sample batch script:

#!/bin/bash

cd /data/$USER/mydir 

module load cd-hit

cd-hit -i /fdb/fastadb/swiss.aa.fas  -o swiss100 -c 1.00 -n 5 -M 16000 -d 0  -T $SLURM_CPUS_PER_TASK

Submit with:

sbatch --cpus-per-task=8   jobscript

This job will run on 8 CPUs with the default allocation of 16 GB of memory (2 GB per CPU). CD-HIT can be memory-intensive. If your job fails, type 'jobhist jobnumber' and check the memory used. You may need to allocate more memory for your job by adding something like '--mem=20g' to the sbatch command line. For example, if the input file is the entire NCBI nr database (/fdb/fastadb/nr.aa.fas), the job requires about 45 GB of memory.
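
Note that the -M option gives CD-HIT its own memory limit in MB, so the sample script's -M 16000 stays at roughly 16 GB even if the job is given more memory with --mem. One way to keep the two in step is to derive -M from the allocation. The following is a minimal sketch under stated assumptions: it assumes Slurm sets SLURM_MEM_PER_NODE (in MB) when --mem is given, the helper cdhit_mem_mb is a hypothetical name, and the 10% headroom is an illustrative choice rather than a documented requirement.

```bash
#!/bin/bash
# Sketch only: derive CD-HIT's -M value (MB) from the Slurm memory allocation.

# Hypothetical helper: print 90% of an allocation given in MB.
# The 10% headroom is an illustrative choice, not a CD-HIT requirement.
cdhit_mem_mb() {
    local alloc_mb=$1
    echo $(( alloc_mb * 90 / 100 ))
}

# Assumption: SLURM_MEM_PER_NODE holds the allocation in MB when --mem is used.
# Outside a job, fall back to the 16000 MB and 4 threads used in the samples.
mem_mb=$(cdhit_mem_mb "${SLURM_MEM_PER_NODE:-16000}")
threads=${SLURM_CPUS_PER_TASK:-4}

cmd=(cd-hit -i /fdb/fastadb/swiss.aa.fas -o swiss100 -c 1.00 -n 5
     -M "$mem_mb" -d 0 -T "$threads")

# Print the command for inspection; replace this line with "${cmd[@]}" to run it.
echo "${cmd[*]}"
```

Submitted with 'sbatch --cpus-per-task=8 --mem=20g jobscript', and assuming Slurm reports the allocation as 20480 MB, this sketch would pass -M 18432 and -T 8 to cd-hit.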

Interactive job on Biowulf

Sample session:

biowulf % sinteractive --cpus-per-task=8
salloc.exe: Granted job allocation 171619

[susanc@cn0074 ~]$ module load cd-hit

[susanc@cn0074 ~]$ cd-hit -i /fdb/fastadb/mito.aa.fas  -o mito.out -c 1.00 -n 5 -M 16000 -d 0  -T 8
================================================================
Program: CD-HIT, V4.5.4 (+OpenMP), Jun 26 2015, 12:51:19
Command: cd-hit -i /fdb/fastadb/mito.aa.fas -o mito.out -c
         1.00 -n 5 -M 16000 -d 0 -T 8

Started: Mon Jun 29 11:10:58 2015
================================================================
                            Output
----------------------------------------------------------------
total seq: 74981
longest and shortest : 2640 and 13
Total letters: 21853654
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 32M
Buffer          : 8 X 11M = 95M
Table           : 2 X 66M = 133M
Miscellaneous   : 1M
Total           : 261M

[...etc...]

[susanc@cn0074 ~]$ exit
salloc.exe: Relinquishing job allocation 171619
salloc.exe: Job allocation 171619 has been revoked.
[susanc@biowulf ~]$

Documentation

CD-HIT User's Guide