Clair3: Integrating pileup and full-alignment for high-performance long-read variant calling

Clair3 is a small variant caller for Illumina, PacBio and ONT long reads. Compare to PEPPER (r0.4), Clair3 (v0.1) shows a better SNP F1-score with ≤30-fold of ONT data (precisionFDA Truth Challenge V2), and a better Indel F1-score, while runs generally four times faster.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session on a GPU node:

[user@biowulf ~]$ sinteractive --gres=gpu:p100:1,lscratch:10 --mem=16g -c4
[user@cn2379 ~]$ module load clair3
[+] Loading singularity  4.0.1  on cn0795
[+] Loading cuDNN/8.1.0.77/CUDA-11.2.2 libraries...
[+] Loading CUDA Toolkit  11.2.2  ...
[+] Loading clair3 1.0.4  ...
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/* .

[user@cn2379 ~]$ THREADS=4
[user@cn2379 ~]$ OUTPUT_VCF_FILE_PATH=merge_output.vcf.gz
Processing sample Illumina data
 
[user@cn2379 ~]$ PLATFORM='ilmn' 
[user@cn2379 ~]$ INPUT_DIR="Illumina"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \
  --model_path=${CLAIR3_MODELS}/${PLATFORM} \
  --threads=${THREADS} \
  --platform=${PLATFORM} \
  --output=./ \
  --bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH} 
...
Processing sample PacBio Hifi data
[user@cn2379 ~]$ PLATFORM='hifi' 
[user@cn2379 ~]$ INPUT_DIR="PacBio"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \
  --threads=${THREADS} \
  --platform=${PLATFORM} \
  --model_path=${CLAIR3_MODELS}/${PLATFORM} \
  --output=./ \
  --bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH} 
...
Processing sample ONT data using a custom model
[user@cn2379 ~]$ PLATFORM='ont' 
[user@cn2379 ~]$ INPUT_DIR="ONT"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
Download a custom model:
[user@cn2379 ~]$ wget http://www.bio8.cs.hku.hk/clair3/clair3_models/r1041_e82_400bps_sup_v420.tar.gz 
[user@cn2379 ~]$ tar -zxf r1041_e82_400bps_sup_v420.tar.gz && rm -f r1041_e82_400bps_sup_v420.tar.gz
The last command creates a model folder r1041_e82_400bps_sup_v420.
Now run the custom model on the data:
[user@cn2379 ~]$ clair3 \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \
  --threads=${THREADS} \
  --platform=${PLATFORM} \                                                                            
  --model_path=${PWD}/r1041_e82_400bps_sup_v420 \
  --output=./ \
  --vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH} 
...
[INFO] CLAIR3 VERSION: v1.0.4
[INFO] BAM FILE PATH: /vf/users/denisovga/Clair3/ONT/HG003_chr20_demo.bam
[INFO] REFERENCE FILE PATH: /vf/users/denisovga/Clair3/ONT/GRCh38_no_alt_chr20.fa
[INFO] MODEL PATH: /data/denisovga/Clair3/r1041_e82_400bps_sup_v420
[INFO] OUTPUT FOLDER: /vf/users/denisovga/Clair3/.
[INFO] PLATFORM: ont
[INFO] THREADS: 4
[INFO] BED FILE PATH: EMPTY
[INFO] VCF FILE PATH: /vf/users/denisovga/Clair3/ONT/HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz
[INFO] CONTIGS: EMPTY
[INFO] CONDA PREFIX:
[INFO] SAMTOOLS PATH: samtools
[INFO] PYTHON PATH: python3
[INFO] PYPY PATH: pypy3
[INFO] PARALLEL PATH: parallel
[INFO] WHATSHAP PATH: whatshap
[INFO] LONGPHASE PATH: EMPTY
[INFO] CHUNK SIZE: 5000000
[INFO] FULL ALIGN PROPORTION: 0.7
[INFO] FULL ALIGN REFERENCE PROPORTION: 0.1
[INFO] PHASING PROPORTION: 0.7
...
[INFO] 1/7 Call variants using pileup model
Calling variants ...
...
[INFO] 2/7 Select heterozygous SNP variants for Whatshap phasing and haplotagging
[INFO] Select heterozygous pileup variants exceeding phasing quality cutoff 14
[INFO] Total heterozygous SNP positions selected: chr20: 307

real    0m0.345s
user    0m0.263s
sys     0m0.069s

[INFO] 3/7 Phase VCF file using Whatshap
This is WhatsHap 1.7 running under Python 3.9.0
Working on 1 sample from 1 family

# Working on contig chr20 in individual SAMPLE
Found 307 usable heterozygous variants (0 skipped due to missing genotypes)
...
[INFO] 5/7 Select candidates for full-alignment calling
[INFO] Set variants quality cutoff 19.0
[INFO] Set reference calls quality cutoff 5.0
[INFO] Low quality reference calls to be processed in chr20: 6
[INFO] Low quality variants to be processed in chr20: 449

real    0m0.340s
user    0m0.250s
sys     0m0.083s

[INFO] 6/7 Call low-quality variants using full-alignment model
Calling variants ...
Total processed positions in chr20 (chunk 1/1) : 455
Total time elapsed: 5.21 s

real    0m8.084s
user    0m7.120s
sys     0m0.750s

[INFO] 7/7 Merge pileup VCF and full-alignment VCF
[INFO] Pileup variants processed in chr20: 195
[INFO] Full-alignment variants processed in chr20: 445
...
[INFO] Finish calling, output file: /vf/users/denisovga/Clair3/./merge_output.vcf.gz

real    0m59.203s
user    1m55.518s
sys     0m13.950s

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. clair3.sh). For example:

#!/bin/bash
set -e
module load Clair3 
...               
clair3 \ 
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${INPUT_DIR}/${REF} \                                                                      --threads=${THREADS} \
  --platform=${PLATFORM} \                                                                            --model_path=${CLAIR3_MODELS}/${PLATFORM} \
  --output=./ \
  --vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH}

Submit this job using the Slurm sbatch command.

sbatch clair3.sh