Clair3: Integrating pileup and full-alignment for high-performance long-read variant calling

Clair3 is a small variant caller for Illumina short reads and PacBio and ONT long reads. Compared to PEPPER (r0.4), Clair3 (v0.1) shows a better SNP F1-score with ≤30-fold coverage of ONT data (precisionFDA Truth Challenge V2) and a better Indel F1-score, while generally running four times faster.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session on a GPU node:

[user@biowulf ~]$ sinteractive --gres=gpu:p100:1,lscratch:10 --mem=16g -c4
[user@cn2379 ~]$ module load clair3
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/* .

[user@cn2379 ~]$ THREADS=4
[user@cn2379 ~]$ OUTPUT_VCF_FILE_PATH=merge_output.vcf.gz
Processing sample Illumina data
 
[user@cn2379 ~]$ PLATFORM='ilmn' 
[user@cn2379 ~]$ INPUT_DIR="Illumina"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \
  --model_path=${CLAIR3_MODELS}/${PLATFORM} \
  --threads=${THREADS} \
  --platform=${PLATFORM} \
  --output=./ \
  --bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH} 
...
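
The examples in this section all pass --output=./, so a later run would likely overwrite the result files of an earlier one. A minimal sketch of one way to keep the results apart, assuming a per-platform subdirectory is acceptable; platform_output_dir and the clair3_ prefix are hypothetical names, not part of Clair3 or the module:

```shell
# Hypothetical helper (not part of Clair3): create a per-platform output
# directory under an optional base directory and print its path.
platform_output_dir() {
  local platform="$1"
  local base="${2:-.}"                 # default to the current directory
  local dir="${base}/clair3_${platform}"
  mkdir -p "$dir"                      # no error if it already exists
  echo "$dir"
}
```

With PLATFORM set as above, the path could then be passed as --output="$(platform_output_dir "${PLATFORM}")" in place of --output=./.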
Processing sample PacBio HiFi data
[user@cn2379 ~]$ PLATFORM='hifi' 
[user@cn2379 ~]$ INPUT_DIR="PacBio"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \
  --threads=${THREADS} \
  --platform=${PLATFORM} \
  --model_path=${CLAIR3_MODELS}/${PLATFORM} \
  --output=./ \
  --bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH} 
...
Processing sample ONT data
[user@cn2379 ~]$ PLATFORM='ont' 
[user@cn2379 ~]$ INPUT_DIR="ONT"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \
  --threads=${THREADS} \
  --platform=${PLATFORM} \
  --model_path=${CLAIR3_MODELS}/${PLATFORM} \
  --output=./ \
  --vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH} 
...
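
After a run finishes, a quick sanity check is to count the variant records in the output VCF. The sketch below assumes the run wrote the file named in OUTPUT_VCF_FILE_PATH (merge_output.vcf.gz) into the output directory; count_vcf_records is a hypothetical helper, not a Clair3 command:

```shell
# Hypothetical helper (not part of Clair3): count variant records in a
# gzip- or bgzip-compressed VCF by skipping header lines (those starting
# with '#').
count_vcf_records() {
  local vcf="$1"
  if [ ! -s "$vcf" ]; then
    echo "missing or empty file: $vcf" >&2
    return 1
  fi
  # awk prints 0 (not an empty string) when there are no records
  gzip -dc "$vcf" | awk '!/^#/ { n++ } END { print n + 0 }'
}
```

For example, count_vcf_records "${OUTPUT_VCF_FILE_PATH}" prints the number of calls in the merged VCF, or an error message if the file is not there.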

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. clair3.sh). For example:

#!/bin/bash
set -e
module load clair3
...               
clair3 \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \
  --threads=${THREADS} \
  --platform=${PLATFORM} \
  --model_path=${CLAIR3_MODELS}/${PLATFORM} \
  --output=./ \
  --vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH}

Submit this job using the Slurm sbatch command.

sbatch clair3.sh
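
The batch script uses ${THREADS} but leaves its definition to the elided part. One option, sketched here as an assumption rather than as the recommended setup, is to take the value from SLURM_CPUS_PER_TASK, which Slurm sets when the job is submitted with --cpus-per-task, and to fall back to the 4 threads used in the interactive session. resolve_threads is a hypothetical helper name:

```shell
# Hypothetical helper: choose a thread count for --threads.
# Uses Slurm's CPU allocation when available, otherwise defaults to 4.
resolve_threads() {
  echo "${SLURM_CPUS_PER_TASK:-4}"
}

THREADS=$(resolve_threads)
```

With this in clair3.sh, a submission such as sbatch --cpus-per-task=4 --mem=16g clair3.sh would mirror the -c4 and --mem=16g settings of the interactive example.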