CD-HIT is a widely used program for clustering and comparing protein or nucleotide sequences. It is fast and can handle extremely large databases. CD-HIT significantly reduces the computational and manual effort in many sequence analysis tasks, and helps in understanding the data structure and correcting bias within a dataset.
Allocate an interactive session and run the program.
Sample session (user input follows the `$` prompt):
    [user@biowulf]$ sinteractive --cpus-per-task=4
    salloc.exe: Pending job allocation 46116226
    salloc.exe: job 46116226 queued and waiting for resources
    salloc.exe: job 46116226 has been allocated resources
    salloc.exe: Granted job allocation 46116226
    salloc.exe: Waiting for resource configuration
    salloc.exe: Nodes cn3144 are ready for job

    [user@cn3144 ~]$ module load cd-hit
    [user@cn3144 ~]$ cd-hit -i /fdb/fastadb/drosoph.aa.fas -o drosoph100 -c 1.00 -n 5 -M 16000 -d 0 -T $SLURM_CPUS_PER_TASK
    ================================================================
    Program: CD-HIT, V4.7 (+OpenMP), May 10 2018, 13:51:01
    Command: cd-hit -i /fdb/fastadb/drosoph.aa.fas -o drosoph100 -c 1.00 -n 5 -M 16000 -d 0 -T 4

    Started: Thu May 10 13:55:50 2018
    ================================================================
    Output
    ----------------------------------------------------------------
    Warning: total number of CPUs in the system is 2
    Actual number of CPUs to be used: 2

    total seq: 14329
    longest and shortest : 8805 and 11
    Total letters: 7178839
    Sequences have been sorted

    Approximated minimal memory consumption:
    Sequence : 9M
    Buffer : 2 X 12M = 25M
    Table : 2 X 65M = 131M
    Miscellaneous : 0M
    Total : 165M

    Table limit with the given memory limit:
    Max number of representatives: 4000000
    Max number of word counting entries: 1979299596

    # comparing sequences from 0 to 3582
    ...---------- new table with 3409 representatives
    # comparing sequences from 3582 to 6268
    99.9%---------- new table with 2559 representatives
    # comparing sequences from 6268 to 8283
    ---------- 265 remaining sequences to the next cycle
    ---------- new table with 1680 representatives
    # comparing sequences from 8018 to 9595
    ---------- 317 remaining sequences to the next cycle
    ---------- new table with 1215 representatives
    # comparing sequences from 9278 to 10540
    .......... 10000 finished 9566 clusters
    ---------- 446 remaining sequences to the next cycle
    ---------- new table with 794 representatives
    # comparing sequences from 10094 to 11152
    ---------- 269 remaining sequences to the next cycle
    ---------- new table with 769 representatives
    # comparing sequences from 10883 to 11744
    ---------- 207 remaining sequences to the next cycle
    ---------- new table with 625 representatives
    # comparing sequences from 11537 to 12235
    ---------- 169 remaining sequences to the next cycle
    ---------- new table with 516 representatives
    # comparing sequences from 12066 to 12631
    ---------- 121 remaining sequences to the next cycle
    ---------- new table with 430 representatives
    # comparing sequences from 12510 to 12964
    ---------- 103 remaining sequences to the next cycle
    ---------- new table with 346 representatives
    # comparing sequences from 12861 to 13228
    ---------- 76 remaining sequences to the next cycle
    ---------- new table with 284 representatives
    # comparing sequences from 13152 to 13446
    ---------- 60 remaining sequences to the next cycle
    ---------- new table with 227 representatives
    # comparing sequences from 13386 to 14329
    .....................---------- new table with 924 representatives

    14329 finished 13778 clusters

    Apprixmated maximum memory consumption: 205M
    writing new database
    writing clustering information
    program completed !

    Total CPU time 3.83

    [user@cn3144 ~]$ exit
    salloc.exe: Relinquishing job allocation 46116226
    [user@biowulf ~]$
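With `-o drosoph100`, CD-HIT writes the representative sequences to `drosoph100` and the cluster membership to `drosoph100.clstr`. The helpers below are a minimal sketch for checking that second file. The function names `count_clusters` and `cluster_sizes` are hypothetical, and the sketch assumes the usual `.clstr` layout: a `>Cluster N` header line followed by one line per member.

```shell
# Hypothetical helpers for inspecting a CD-HIT .clstr file.
# Assumes each cluster starts with a ">Cluster N" header line
# and that every following non-empty line is one member sequence.

# Print the number of clusters in the file given as $1.
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Print "cluster_id<TAB>member_count" for every cluster in $1.
cluster_sizes() {
    awk '/^>Cluster/ { if (id != "") print id "\t" n; id = $2; n = 0; next }
         NF          { n++ }
         END         { if (id != "") print id "\t" n }' "$1"
}
```

For the session above, `count_clusters drosoph100.clstr` should agree with the cluster total reported in the log (13778), and `cluster_sizes drosoph100.clstr | sort -k2,2nr | head` would list the largest clusters first.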
Create a batch input file (e.g. cd-hit.sh). For example:
    #!/bin/bash
    set -e
    module load cd-hit
    cd-hit -i /fdb/fastadb/drosoph.aa.fas -o drosoph100 -c 1.00 -n 5 -M 16000 -d 0 -T $SLURM_CPUS_PER_TASK
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] cd-hit.sh
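Because `--cpus-per-task` and `--mem` are optional on the `sbatch` line, the batch script can end up with an empty `-T` argument or a `-M` limit (in MB) larger than the job's allocation. The sketch below shows one way to derive both values from the Slurm environment. The function names are hypothetical. The 90% headroom and the 16000 MB fallback are assumptions, not site recommendations. It also assumes `SLURM_MEM_PER_NODE` is reported in MB and is set only when `--mem` was requested.

```shell
# Hypothetical helpers for a cd-hit batch script.

# Thread count for -T: the Slurm allocation if set, otherwise 1.
cdhit_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Memory limit for -M in MB: 90% of the Slurm allocation (assumed
# to be in MB), or an assumed default of 16000 when --mem was not given.
cdhit_mem_mb() {
    if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
        echo $(( SLURM_MEM_PER_NODE * 9 / 10 ))
    else
        echo 16000
    fi
}
```

In `cd-hit.sh` these would replace the fixed values, for example `-M "$(cdhit_mem_mb)" -T "$(cdhit_threads)"`, and the job could then be submitted with something like `sbatch --cpus-per-task=4 --mem=20g cd-hit.sh`.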