checkm2

checkm2 on Biowulf

Rapid assessment of genome bin quality using machine learning. From the documentation:

Unlike CheckM1, CheckM2 has universally trained machine learning models it applies regardless of taxonomic lineage to predict the completeness and contamination of genomic bins. This allows it to incorporate many lineages in its training set that have few - or even just one - high-quality genomic representatives, by putting it in the context of all other organisms in the training set. As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.

Documentation

CheckM2 on GitHub

Important Notes

Module Name: checkm2 (see the modules page for more information)
This application is multithreaded. Please match the number of allocated CPUs to the number of threads.
Example files in $CHECKM2_TEST_DATA
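Benchmarking suggests that CheckM2 may not scale efficiently to more than 16 CPUs; in the timings further down this page, estimated efficiency has already dropped to about 60% at 16 CPUs.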

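CheckM2 requires that all input and output be located in lscratch (that is, under /lscratch/$SLURM_JOB_ID), so allocate lscratch with --gres=lscratch:N when requesting the job.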
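
The note about matching threads to allocated CPUs can be sketched as a small shell helper. This is a minimal sketch, not part of the checkm2 module: the function name checkm2_threads, the fallback of 2 threads outside a Slurm job, and the cap at 16 are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: choose a thread count that matches the Slurm allocation.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested.
# The fallback of 2 (used when the variable is unset) is an arbitrary assumption.
checkm2_threads() {
    local threads="${SLURM_CPUS_PER_TASK:-2}"
    # Cap at 16 (an assumption based on the benchmarking note above) and
    # warn on stderr so the allocation can be reduced next time.
    if [ "$threads" -gt 16 ]; then
        echo "warning: $threads CPUs allocated; using 16 threads, consider requesting fewer CPUs" >&2
        threads=16
    fi
    echo "$threads"
}

# Intended use inside a job:
#   checkm2 predict --threads="$(checkm2_threads)" \
#       --input ./fasta --output-directory ./checkm2-results
```

Requesting no more CPUs than will be used is still preferable to capping afterwards, since the extra allocated CPUs would sit idle.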

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the shell prompt):

[user@biowulf]$ sinteractive --mem=12g --cpus-per-task=16 --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load checkm2
[user@cn3144 ~]$ wd="$PWD"
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ mkdir tmp
[user@cn3144 ~]$ cp -Lr $CHECKM2_TEST_DATA/fasta .
[user@cn3144 ~]$ checkm2 predict --threads=$SLURM_CPUS_PER_TASK \
    --input ./fasta \
    --output-directory ./checkm2-results
[user@cn3144 ~]$ ls -lh checkm2-results
total 24K
-rw-r--r-- 1 user group  678 Mar 26 17:57 checkm2.log
drwxr-xr-x 2 user group   41 Mar 26 17:47 diamond_output
drwxr-xr-x 2 user group 8.0K Mar 26 17:47 protein_files
-rw-r--r-- 1 user group 5.7K Mar 26 17:57 quality_report.tsv
[user@cn3144 ~]$ mv checkm2-results "$wd"
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
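
Once quality_report.tsv has been moved back, it can be screened from the shell. The following is a minimal sketch under stated assumptions: it assumes the report is tab-separated with header columns named Name, Completeness, and Contamination (check the header line of your own file), and the function name filter_bins and the default cutoffs of 90 and 5 are illustrative choices, not CheckM2 settings.

```shell
#!/bin/bash
# Hypothetical helper: print the names of bins that pass quality cutoffs.
# Usage: filter_bins REPORT [MIN_COMPLETENESS] [MAX_CONTAMINATION]
# Assumes a tab-separated file whose header contains the columns
# "Name", "Completeness", and "Contamination".
filter_bins() {
    local report="$1" min_comp="${2:-90}" max_cont="${3:-5}"
    awk -F'\t' -v min="$min_comp" -v max="$max_cont" '
        NR == 1 {
            # Map column names to positions so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Name" in col) || !("Completeness" in col) || !("Contamination" in col)) {
                print "expected columns not found in header" > "/dev/stderr"
                exit 1
            }
            next
        }
        ($(col["Completeness"]) + 0) >= (min + 0) &&
        ($(col["Contamination"]) + 0) <= (max + 0) {
            print $(col["Name"])
        }
    ' "$report"
}

# Example: filter_bins checkm2-results/quality_report.tsv 90 5
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs from what is assumed here.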

Timing and efficiency with different numbers of CPUs for the 36 genomic bins in the test data set:

CPUs	Memory [GiB]	Runtime [minutes]	Est. efficiency
2	9	29.9	100%
4	9	15.3	98%
8	9	10.5	72%
16	9	6.2	60%
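
The efficiency column can be approximately reproduced from the runtimes, taking the 2-CPU run as the baseline. This is a sketch of one common definition of parallel efficiency, (baseline CPUs x baseline runtime) / (CPUs x runtime); whether the table was computed exactly this way is an assumption, and the function name efficiency is hypothetical.

```shell
#!/bin/bash
# Parallel efficiency relative to a baseline run, in percent:
#   efficiency = (base_cpus * base_minutes) / (cpus * minutes) * 100
# Usage: efficiency BASE_CPUS BASE_MINUTES CPUS MINUTES
efficiency() {
    awk -v bc="$1" -v bt="$2" -v c="$3" -v t="$4" \
        'BEGIN { printf "%.0f\n", (bc * bt) / (c * t) * 100 }'
}

efficiency 2 29.9 4 15.3    # prints 98
efficiency 2 29.9 16 6.2    # prints 60
```

With the rounded runtimes shown in the table, the same formula gives about 71% for the 8-CPU row rather than 72%; the small difference presumably comes from rounding of the reported runtimes.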

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. checkm2.sh). For example:

#!/bin/bash
set -e
module load checkm2/1.0.2
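# CheckM2 uses lscratch automatically as its temporary directory
wd="$PWD"  # remember the submission directory for the results
cd "/lscratch/$SLURM_JOB_ID" || exit 1
# copy the test data to lscratch, following symlinks as in the interactive example
cp -Lr "$CHECKM2_TEST_DATA/fasta" .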
checkm2 predict \
    --threads=$SLURM_CPUS_PER_TASK \
    --input ./fasta \
    --output-directory ./checkm2-results
mv checkm2-results "$wd"

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=8 --mem=12g --gres=lscratch:20 checkm2.sh
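
To run on your own bins instead of the test data, the same copy-in, run, copy-out pattern applies, because input and output must live in lscratch. A minimal sketch of the staging steps follows; the function names stage_in and stage_out and the directory names in the usage comments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for staging data through a scratch directory.

# stage_in SRC_DIR SCRATCH_DIR
# Copies SRC_DIR (following symlinks) to SCRATCH_DIR/<basename of SRC_DIR>.
stage_in() {
    mkdir -p "$2" && cp -Lr "$1" "$2/"
}

# stage_out RESULT_DIR DEST_DIR
# Moves RESULT_DIR to DEST_DIR/<basename of RESULT_DIR>.
stage_out() {
    mkdir -p "$2" && mv "$1" "$2/"
}

# Intended use inside a batch script (paths are placeholders):
#   stage_in "$wd/my_bins" "/lscratch/$SLURM_JOB_ID"
#   cd "/lscratch/$SLURM_JOB_ID"
#   checkm2 predict --threads=$SLURM_CPUS_PER_TASK \
#       --input ./my_bins --output-directory ./checkm2-results
#   stage_out ./checkm2-results "$wd"
```

Moving the results out before the job ends matters because lscratch is tied to the job allocation.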