Clair3 is a small variant caller for Illumina short reads and PacBio and ONT long reads. Compared to PEPPER (r0.4), Clair3 (v0.1) shows a better SNP F1-score with ≤30-fold of ONT data (precisionFDA Truth Challenge V2) and a better Indel F1-score, while generally running four times faster.
Allocate an interactive session and run the program.
Sample session on a GPU node:
[user@biowulf ~]$ sinteractive --gres=gpu:p100:1,lscratch:10 --mem=16g -c4
[user@cn2379 ~]$ module load clair3
[+] Loading singularity 4.0.1 on cn0795
[+] Loading cuDNN/8.1.0.77/CUDA-11.2.2 libraries...
[+] Loading CUDA Toolkit 11.2.2 ...
[+] Loading clair3 1.0.4 ...
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/* .
[user@cn2379 ~]$ THREADS=4
[user@cn2379 ~]$ OUTPUT_VCF_FILE_PATH=merge_output.vcf.gz

Processing sample Illumina data
[user@cn2379 ~]$ PLATFORM='ilmn'
[user@cn2379 ~]$ INPUT_DIR="Illumina"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
    --bam_fn=${INPUT_DIR}/${BAM} \
    --ref_fn=${INPUT_DIR}/${REF} \
    --model_path=${CLAIR3_MODELS}/${PLATFORM} \
    --threads=${THREADS} \
    --platform=${PLATFORM} \
    --output=./ \
    --bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH}
...

Processing sample PacBio HiFi data
[user@cn2379 ~]$ PLATFORM='hifi'
[user@cn2379 ~]$ INPUT_DIR="PacBio"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
    --bam_fn=${INPUT_DIR}/${BAM} \
    --ref_fn=${INPUT_DIR}/${REF} \
    --threads=${THREADS} \
    --platform=${PLATFORM} \
    --model_path=${CLAIR3_MODELS}/${PLATFORM} \
    --output=./ \
    --bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH}
...

Processing sample ONT data using a custom model
[user@cn2379 ~]$ PLATFORM='ont'
[user@cn2379 ~]$ INPUT_DIR="ONT"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"

Download a custom model:
[user@cn2379 ~]$ wget http://www.bio8.cs.hku.hk/clair3/clair3_models/r1041_e82_400bps_sup_v420.tar.gz
[user@cn2379 ~]$ tar -zxf r1041_e82_400bps_sup_v420.tar.gz && rm -f r1041_e82_400bps_sup_v420.tar.gz

The last command creates a model folder r1041_e82_400bps_sup_v420.
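When the session is repeated, the download step can be skipped if the model folder is already present. The following is a minimal sketch of that idea, not part of the Clair3 distribution: `model_name_from_url` and `fetch_model` are hypothetical helper names, and the sketch assumes the tarball unpacks to a folder named after the file without its `.tar.gz` suffix, as is the case for the model above.

```shell
#!/bin/bash
# Hypothetical helper: folder name a model tarball is assumed to unpack to
# (the tarball file name without ".tar.gz").
model_name_from_url() {
    basename "$1" .tar.gz
}

# Hypothetical helper: download and unpack a model unless its folder already
# exists in the current directory, then print the absolute model path.
fetch_model() {
    local url="$1" name
    name="$(model_name_from_url "$url")"
    if [ ! -d "$name" ]; then
        wget -q "$url" && tar -zxf "${name}.tar.gz" && rm -f "${name}.tar.gz" || return 1
    fi
    echo "${PWD}/${name}"
}

# Example use (commented out so that sourcing this file does not hit the network):
# MODEL_PATH="$(fetch_model http://www.bio8.cs.hku.hk/clair3/clair3_models/r1041_e82_400bps_sup_v420.tar.gz)"
```

The printed path could then be passed to clair3 as `--model_path=${MODEL_PATH}` in place of `${PWD}/r1041_e82_400bps_sup_v420`.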
[user@cn2379 ~]$ clair3 \
    --bam_fn=${INPUT_DIR}/${BAM} \
    --ref_fn=${INPUT_DIR}/${REF} \
    --threads=${THREADS} \
    --platform=${PLATFORM} \
    --model_path=${PWD}/r1041_e82_400bps_sup_v420 \
    --output=./ \
    --vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH}
...
[INFO] CLAIR3 VERSION: v1.0.4
[INFO] BAM FILE PATH: /vf/users/denisovga/Clair3/ONT/HG003_chr20_demo.bam
[INFO] REFERENCE FILE PATH: /vf/users/denisovga/Clair3/ONT/GRCh38_no_alt_chr20.fa
[INFO] MODEL PATH: /data/denisovga/Clair3/r1041_e82_400bps_sup_v420
[INFO] OUTPUT FOLDER: /vf/users/denisovga/Clair3/.
[INFO] PLATFORM: ont
[INFO] THREADS: 4
[INFO] BED FILE PATH: EMPTY
[INFO] VCF FILE PATH: /vf/users/denisovga/Clair3/ONT/HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz
[INFO] CONTIGS: EMPTY
[INFO] CONDA PREFIX:
[INFO] SAMTOOLS PATH: samtools
[INFO] PYTHON PATH: python3
[INFO] PYPY PATH: pypy3
[INFO] PARALLEL PATH: parallel
[INFO] WHATSHAP PATH: whatshap
[INFO] LONGPHASE PATH: EMPTY
[INFO] CHUNK SIZE: 5000000
[INFO] FULL ALIGN PROPORTION: 0.7
[INFO] FULL ALIGN REFERENCE PROPORTION: 0.1
[INFO] PHASING PROPORTION: 0.7
...
[INFO] 1/7 Call variants using pileup model
Calling variants ...
...
[INFO] 2/7 Select heterozygous SNP variants for Whatshap phasing and haplotagging
[INFO] Select heterozygous pileup variants exceeding phasing quality cutoff 14
[INFO] Total heterozygous SNP positions selected: chr20: 307
real 0m0.345s
user 0m0.263s
sys 0m0.069s
[INFO] 3/7 Phase VCF file using Whatshap
This is WhatsHap 1.7 running under Python 3.9.0
Working on 1 sample from 1 family
# Working on contig chr20 in individual SAMPLE
Found 307 usable heterozygous variants (0 skipped due to missing genotypes)
...
[INFO] 5/7 Select candidates for full-alignment calling
[INFO] Set variants quality cutoff 19.0
[INFO] Set reference calls quality cutoff 5.0
[INFO] Low quality reference calls to be processed in chr20: 6
[INFO] Low quality variants to be processed in chr20: 449
real 0m0.340s
user 0m0.250s
sys 0m0.083s
[INFO] 6/7 Call low-quality variants using full-alignment model
Calling variants ...
Total processed positions in chr20 (chunk 1/1) : 455
Total time elapsed: 5.21 s
real 0m8.084s
user 0m7.120s
sys 0m0.750s
[INFO] 7/7 Merge pileup VCF and full-alignment VCF
[INFO] Pileup variants processed in chr20: 195
[INFO] Full-alignment variants processed in chr20: 445
...
[INFO] Finish calling, output file: /vf/users/denisovga/Clair3/./merge_output.vcf.gz
real 0m59.203s
user 1m55.518s
sys 0m13.950s
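Once the run has finished, a quick sanity check is to count the records in merge_output.vcf.gz. The sketch below tallies records by the FILTER column (column 7 of a VCF record) using only `gzip`, `awk` and `sort`; `vcf_filter_counts` is a hypothetical helper name introduced here for illustration, not a Clair3 command.

```shell
#!/bin/bash
# Hypothetical helper: print "<FILTER value><TAB><count>" for every FILTER
# value in a gzip- or bgzip-compressed VCF. Header lines (starting with '#')
# are skipped.
vcf_filter_counts() {
    gzip -dc "$1" \
        | awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' \
        | sort
}

# OUTPUT_VCF_FILE_PATH was set at the start of the session; fall back to the
# file name reported in the log if it is not defined.
vcf="${OUTPUT_VCF_FILE_PATH:-merge_output.vcf.gz}"
if [ -f "$vcf" ]; then
    vcf_filter_counts "$vcf"
else
    echo "No ${vcf} in ${PWD}; run clair3 first." >&2
fi
```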
Create a batch input file (e.g. clair3.sh). For example:
#!/bin/bash
set -e
module load clair3
...
clair3 \
    --bam_fn=${INPUT_DIR}/${BAM} \
    --ref_fn=${INPUT_DIR}/${REF} \
    --threads=${THREADS} \
    --platform=${PLATFORM} \
    --model_path=${CLAIR3_MODELS}/${PLATFORM} \
    --output=./ \
    --vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH}
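In a batch job there is nobody watching the terminal, so it can help to check the inputs before clair3 starts. The sketch below assumes that clair3 needs an index next to the BAM (`.bai` or `.csi`) and a `.fai` index next to the reference, as produced by `samtools index` and `samtools faidx`; `check_inputs` is a hypothetical helper name, not part of Clair3.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the BAM, the reference and their index
# files are present, 1 otherwise. All problems are reported, not just the first.
check_inputs() {
    local bam="$1" ref="$2" status=0
    [ -s "$bam" ] || { echo "Missing or empty BAM: $bam" >&2; status=1; }
    [ -s "$ref" ] || { echo "Missing or empty reference: $ref" >&2; status=1; }
    # Accept <name>.bam.bai, <name>.bai or <name>.bam.csi as the BAM index.
    [ -e "${bam}.bai" ] || [ -e "${bam%.bam}.bai" ] || [ -e "${bam}.csi" ] \
        || { echo "Missing BAM index for: $bam" >&2; status=1; }
    [ -e "${ref}.fai" ] \
        || { echo "Missing FASTA index: ${ref}.fai" >&2; status=1; }
    return "$status"
}

# Example use inside the batch script, before the clair3 command:
# check_inputs "${INPUT_DIR}/${BAM}" "${INPUT_DIR}/${REF}" || exit 1
```

Because the call is followed by `|| exit 1`, it behaves the same with or without `set -e` in the script.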
Submit this job using the Slurm sbatch command.
sbatch clair3.sh
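A plain `sbatch clair3.sh` uses the scheduler's default resources, whereas the interactive example above requested a GPU, local scratch, 16 GB of memory and 4 CPUs. One way to request the same resources for the batch job is to put `#SBATCH` directives at the top of clair3.sh. This is a sketch under assumptions: the partition name `gpu` is not given in the text above and may differ on your cluster.

```shell
#!/bin/bash
# Assumption: the GPU partition is called "gpu".
#SBATCH --partition=gpu
# Same GPU, local scratch, memory and CPU count as the interactive session.
#SBATCH --gres=gpu:p100:1,lscratch:10
#SBATCH --mem=16g
#SBATCH --cpus-per-task=4
set -e

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back to 4
# so the script also works when run outside of a job.
THREADS="${SLURM_CPUS_PER_TASK:-4}"
echo "clair3 will be run with --threads=${THREADS}"
```

Deriving `THREADS` from `SLURM_CPUS_PER_TASK` keeps the `--threads` value passed to clair3 in step with the CPU allocation if the directive is changed later.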