checkm2 on Biowulf

Rapid assessment of genome bin quality using machine learning. From the documentation:

Unlike CheckM1, CheckM2 has universally trained machine learning models it applies regardless of taxonomic lineage to predict the completeness and contamination of genomic bins. This allows it to incorporate many lineages in its training set that have few - or even just one - high-quality genomic representatives, by putting it in the context of all other organisms in the training set. As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.
Important Notes

CheckM2 requires that all input and output be located in lscratch. Allocate it with --gres=lscratch:N and work in /lscratch/$SLURM_JOB_ID, as in the examples below.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

[user@biowulf]$ sinteractive --mem=12g --cpus-per-task=16 --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load checkm2
[user@cn3144 ~]$ wd="$PWD"
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ mkdir tmp
[user@cn3144 ~]$ cp -Lr $CHECKM2_TEST_DATA/fasta .
[user@cn3144 ~]$ checkm2 predict --threads=$SLURM_CPUS_PER_TASK \
    --input ./fasta \
    --output-directory ./checkm2-results
[user@cn3144 ~]$ ls -lh checkm2-results
total 24K
-rw-r--r-- 1 user group  678 Mar 26 17:57 checkm2.log
drwxr-xr-x 2 user group   41 Mar 26 17:47 diamond_output
drwxr-xr-x 2 user group 8.0K Mar 26 17:47 protein_files
-rw-r--r-- 1 user group 5.7K Mar 26 17:57 quality_report.tsv
[user@cn3144 ~]$ mv checkm2-results "$wd"
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Timing and efficiency with different numbers of CPUs for the 36 genomic bins in the test data set:

CPUs  Memory [GiB]  Runtime [minutes]  Est. efficiency
   2             9               29.9             100%
   4             9               15.3              98%
   8             9               10.5              72%
  16             9                6.2              60%
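
The page does not say how the efficiency estimate was derived. The figures are consistent with parallel efficiency measured against the 2-CPU run, that is, (baseline CPUs x baseline runtime) / (CPUs x runtime). The following is a minimal sketch under that assumption; the function name efficiency is hypothetical and not part of CheckM2 or Slurm.

```shell
#!/bin/bash
# efficiency BASE_CPUS BASE_MINUTES CPUS MINUTES
# Prints the parallel efficiency in percent relative to the baseline run.
# Assumed definition: (BASE_CPUS * BASE_MINUTES) / (CPUS * MINUTES).
efficiency() {
    awk -v bc="$1" -v bt="$2" -v c="$3" -v t="$4" \
        'BEGIN { printf "%.0f\n", 100 * (bc * bt) / (c * t) }'
}

# Runtimes from the table above, with the 2-CPU run as the baseline
for run in "4 15.3" "8 10.5" "16 6.2"; do
    set -- $run
    echo "$1 CPUs: $(efficiency 2 29.9 "$1" "$2")%"
done
```

Because the published runtimes are rounded, the recomputed values can differ from the table by about one percentage point. The drop in efficiency beyond 4 CPUs suggests that 4 to 8 CPUs is a reasonable request for a data set of this size.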

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. checkm2.sh). For example:

#!/bin/bash
set -e
module load checkm2/1.0.2
# uses lscratch automatically as tmp dir
wd="$PWD"
cd /lscratch/$SLURM_JOB_ID || exit 1
# -L dereferences symlinks so the copied input really resides in lscratch
cp -Lr "$CHECKM2_TEST_DATA/fasta" .
checkm2 predict \
    --threads=$SLURM_CPUS_PER_TASK \
    --input ./fasta \
    --output-directory ./checkm2-results
mv checkm2-results "$wd"

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=8 --mem=12g --gres=lscratch:20 checkm2.sh
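
Once the job has moved checkm2-results back to the submission directory, quality_report.tsv can be screened for bins that pass quality thresholds. The following is a minimal sketch, assuming the report is tab-separated with a header row that contains columns named Name, Completeness and Contamination. The function name filter_bins and the default thresholds (at least 90% completeness, at most 5% contamination) are illustrative choices, not CheckM2 settings.

```shell
#!/bin/bash
# filter_bins REPORT [MIN_COMPLETENESS] [MAX_CONTAMINATION]
# Prints the names of bins that satisfy both thresholds.
# Assumption: REPORT is a tab-separated file whose header row has columns
# named Name, Completeness and Contamination (looked up by name, not position).
filter_bins() {
    awk -F'\t' -v minc="${2:-90}" -v maxc="${3:-5}" '
        NR == 1 {
            # Map each header name to its column index
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["Completeness"]) + 0 >= minc && $(col["Contamination"]) + 0 <= maxc {
            print $(col["Name"])
        }
    ' "$1"
}

# Example call, run from the directory the job was submitted from:
# filter_bins checkm2-results/quality_report.tsv 90 5
```

Looking columns up by header name rather than by position keeps the filter working if the column order differs from what is assumed here.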