Rapid assessment of genome bin quality using machine learning. From the documentation:
Unlike CheckM1, CheckM2 has universally trained machine learning models it applies regardless of taxonomic lineage to predict the completeness and contamination of genomic bins. This allows it to incorporate many lineages in its training set that have few - or even just one - high-quality genomic representatives, by putting it in the context of all other organisms in the training set. As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.
CheckM2 requires that all input and output be located in lscratch.
Allocate an interactive session and run the program.
Sample session (user input follows the `$` prompt):
    [user@biowulf]$ sinteractive --mem=12g --cpus-per-task=16 --gres=lscratch:20
    salloc.exe: Pending job allocation 46116226
    salloc.exe: job 46116226 queued and waiting for resources
    salloc.exe: job 46116226 has been allocated resources
    salloc.exe: Granted job allocation 46116226
    salloc.exe: Waiting for resource configuration
    salloc.exe: Nodes cn3144 are ready for job

    [user@cn3144 ~]$ module load checkm2
    [user@cn3144 ~]$ wd="$PWD"
    [user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
    [user@cn3144 ~]$ mkdir tmp
    [user@cn3144 ~]$ cp -Lr $CHECKM2_TEST_DATA/fasta .
    [user@cn3144 ~]$ checkm2 predict --threads=$SLURM_CPUS_PER_TASK \
                         --input ./fasta \
                         --output-directory ./checkm2-results
    [user@cn3144 ~]$ ls -lh checkm2-results
    total 24K
    -rw-r--r-- 1 user group  678 Mar 26 17:57 checkm2.log
    drwxr-xr-x 2 user group   41 Mar 26 17:47 diamond_output
    drwxr-xr-x 2 user group 8.0K Mar 26 17:47 protein_files
    -rw-r--r-- 1 user group 5.7K Mar 26 17:57 quality_report.tsv
    [user@cn3144 ~]$ mv checkm2-results "$wd"
    [user@cn3144 ~]$ exit
    salloc.exe: Relinquishing job allocation 46116226
    [user@biowulf ~]$
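The session above leaves a quality_report.tsv in the results directory. As a minimal sketch of how that report might be screened, the hypothetical helper below lists bins that meet caller-supplied completeness and contamination thresholds. It assumes the report is tab-separated with header columns named `Name`, `Completeness` and `Contamination`; check the header of your own report before relying on it.

```shell
#!/bin/bash
# filter_bins REPORT MIN_COMPLETENESS MAX_CONTAMINATION
# Hypothetical helper, not part of CheckM2.
# Assumption: REPORT is tab-separated and its header row contains columns
# named "Name", "Completeness" and "Contamination".
filter_bins() {
    local report="$1" min_comp="$2" max_cont="$3"
    awk -F'\t' -v min_comp="$min_comp" -v max_cont="$max_cont" '
        NR == 1 {
            # Map header names to column positions instead of hard-coding them.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Name" in col) || !("Completeness" in col) || !("Contamination" in col)) {
                print "unexpected header in report" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Print the bin name when both thresholds are satisfied.
        $(col["Completeness"]) + 0 >= min_comp + 0 &&
        $(col["Contamination"]) + 0 <= max_cont + 0 {
            print $(col["Name"])
        }
    ' "$report"
}
```

For example, `filter_bins checkm2-results/quality_report.tsv 90 5` would print the names of bins with at least 90% predicted completeness and at most 5% predicted contamination. The thresholds here are illustrative only and should be chosen to suit the analysis.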
Timing and efficiency with different numbers of CPUs for the 36 genomic bins in the test data set:
CPUs | Memory [GiB] | Runtime [minutes] | Est. efficiency |
---|---|---|---|
2 | 9 | 29.9 | 100% |
4 | 9 | 15.3 | 98% |
8 | 9 | 10.5 | 72% |
16 | 9 | 6.2 | 60% |
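The efficiency column can be approximated from the CPU counts and runtimes alone. The sketch below assumes efficiency is defined relative to the 2-CPU run as (baseline CPUs × baseline runtime) / (CPUs × runtime); that definition is an assumption, not something stated by the source. Because it works from the rounded runtimes in the table, a recomputed value can differ from the listed estimate by about a percentage point.

```shell
#!/bin/bash
# Parallel efficiency relative to the first (baseline) run.
# Input: one "CPUS MINUTES" pair per line on stdin.
# Assumed formula: (base_cpus * base_minutes) / (cpus * minutes).
efficiency() {
    awk '
        NR == 1 { base = $1 * $2 }   # CPU-minutes consumed by the baseline run
        { printf "%d CPUs: %.0f%%\n", $1, 100 * base / ($1 * $2) }
    '
}

# CPU counts and runtimes taken from the table above.
printf '%s\n' "2 29.9" "4 15.3" "8 10.5" "16 6.2" | efficiency
```

Doubling from 8 to 16 CPUs shortens the runtime by well under half, so the smaller allocations use their CPU time more efficiently for a data set of this size.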
Create a batch input file (e.g. checkm2.sh). For example:
    #!/bin/bash
    set -e
    module load checkm2/1.0.2
    # uses lscratch automatically as tmp dir
    wd="$PWD"
    cd /lscratch/$SLURM_JOB_ID || exit 1
    cp -r $CHECKM2_TEST_DATA/fasta .
    checkm2 predict \
        --threads=$SLURM_CPUS_PER_TASK \
        --input ./fasta \
        --output-directory ./checkm2-results
    mv checkm2-results "$wd"
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=8 --mem=12g --gres=lscratch:20 checkm2.sh