Eagle: reference-based haplotype phasing
Eagle performs reference-based haplotype phasing. It attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform.
- Po-Ru Loh, Petr Danecek, Pier Francesco Palamara, Christian Fuchsberger, Yakir A Reshef, Hilary K Finucane, Sebastian Schoenherr, Lukas Forer, Shane McCarthy, Goncalo R Abecasis, Richard Durbin and Alkes L Price, "Reference-based phasing using the Haplotype Reference Consortium panel", Nature Genetics, 2016, 48(11), 1443-1450.
- Po-Ru Loh, Pier Francesco Palamara and Alkes L Price, "Fast and accurate long-range phasing in a UK Biobank cohort", Nature Genetics, 2016, 48(7), 811-819.
Important Notes
- Module Name: Eagle (see the modules page for more information)
- Unusual environment variables set
  - EAGLE_DIR: Eagle installation directory
  - EAGLE_BIN: Eagle executable folder
  - EAGLE_DATA: sample data for running Eagle
  - EAGLE_TABLES: genetic map tables for running Eagle (e.g. genetic_map_hg19_withX.txt.gz)
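To confirm that the module set these variables in the current shell, a small check along the following lines can help. This is only a sketch: the variable names are taken from the list above, while the helper names report_var and check_eagle_env are hypothetical and not part of Eagle or the module system.

```bash
#!/bin/bash
# Hypothetical helpers (not shipped with Eagle): report whether the
# environment variables listed above are set in the current shell.

# Print NAME=VALUE, or a warning (and return non-zero) if VALUE is empty.
report_var() {
    if [ -n "$2" ]; then
        printf '%s=%s\n' "$1" "$2"
    else
        printf '%s is not set (was the module loaded?)\n' "$1"
        return 1
    fi
}

# Check all four variables; return non-zero if any of them is missing.
check_eagle_env() {
    eagle_env_status=0
    report_var EAGLE_DIR    "${EAGLE_DIR:-}"    || eagle_env_status=1
    report_var EAGLE_BIN    "${EAGLE_BIN:-}"    || eagle_env_status=1
    report_var EAGLE_DATA   "${EAGLE_DATA:-}"   || eagle_env_status=1
    report_var EAGLE_TABLES "${EAGLE_TABLES:-}" || eagle_env_status=1
    return "$eagle_env_status"
}

# Intended to be run after "module load Eagle"; "|| true" keeps the
# check informational so it does not abort a larger script.
check_eagle_env || true
```

Run before the module is loaded, the check prints a warning for each missing variable instead of a path.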
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive
[user@cn3144 ~]$ module load Eagle
Copy sample data from the application folder to the current folder:
[user@cn3144 ~]$ cp $EAGLE_DATA/* .
Run Eagle on the sample data:
[user@cn3144 ~]$ eagle --bfile=EUR_test --geneticMapFile=USE_BIM --chrom=21 --outPrefix=phased --numThreads=4 2>&1 | tee example.log
+-----------------------------+
|                             |
|   Eagle v2.4                |
|   December 13, 2017         |
|   Po-Ru Loh                 |
|                             |
+-----------------------------+

Copyright (C) 2015-2017 Harvard University.
Distributed under the GNU GPLv3+ open source license.

Command line options:

eagle \
    --bfile=EUR_test \
    --geneticMapFile=USE_BIM \
    --chrom=21 \
    --outPrefix=phased \
    --numThreads=4

Setting number of threads to 4

=== Reading genotype data ===

Reading fam file: EUR_test.fam
Total indivs in PLINK data: Nbed = 379
Total indivs stored in memory: NpreQC = 379
Reading bim file: EUR_test.bim
Total snps in PLINK data: Mbed = 2000
Restricting to 1813 SNPs on chrom 21 in region [bpStart,bpEnd] = [0,1e+09]
Total SNPs stored in memory: MpreQC = 1813
Allocating 1813 x 379 bytes to temporarily store genotypes
Reading genotypes and performing QC filtering on snps and indivs...
Reading bed file: EUR_test.bed
    Expecting 190000 (+3) bytes for 379 indivs, 2000 snps
Total post-QC indivs: N = 379
Total post-QC SNPs: M = 1813
MAF spectrum:
     0- 5%: 495
     5-10%: 290
    10-20%: 332
    20-30%: 248
    30-40%: 234
    40-50%: 214
Physical distance range: 9752235 base pairs
Genetic distance range: 23.0881 cM
Average # SNPs per cM: 79
Auto-selecting --maxBlockLen: 0.25 cM
Number of <=(64-SNP, 0.25cM) segments: 68
Average # SNPs per segment: 26
Estimating LD scores using 379 indivs
Fraction of heterozygous genotypes: 0.246308
Typical span of default 100-het history length: 5.17 cM
Setting --histFactor=1.00

BEGINNING STEP 1

Time for step 1: 0.867686
Time for step 1 MN^2: 0.0521836

Making hard calls (time: 0.0207999)

BEGINNING STEP 2

BATCH 1 OF 1

Building hash tables
..................................................................
(time: 0.136335)

Phasing samples 1-379
Time for phasing batch: 1.03954

Making hard calls (time: 0.020123)

Time for step 2: 1.19602
Time for step 2 MN^2: 0.158607

BEGINNING STEP 3 (PBWT ITERS)

Auto-selecting number of PBWT iterations: setting --pbwtIters to 2

BEGINNING PBWT ITER 1

BATCH 1 OF 10

Phasing samples 1-37
Time for phasing batch: 3.31806

BATCH 2 OF 10

Phasing samples 38-75
Time for phasing batch: 3.23385
...
BATCH 10 OF 10

Phasing samples 342-379
Time for phasing batch: 3.21097

Time for PBWT iter 1: 31.8771

BEGINNING PBWT ITER 2

BATCH 1 OF 10

Phasing samples 1-37
Time for phasing batch: 5.23776

BATCH 2 OF 10

Phasing samples 38-75
Time for phasing batch: 5.15485
...
BATCH 9 OF 10

Phasing samples 304-341
Time for phasing batch: 5.06871

BATCH 10 OF 10

Phasing samples 342-379
Time for phasing batch: 5.19495

Time for PBWT iter 2: 51.1316

Writing .haps.gz and .sample output
Time for writing output: 0.23035
Total elapsed time for analysis = 85.4332 sec

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
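Because the sample command pipes Eagle's output through tee into example.log, the sample counts and timings can be pulled out of the log afterwards. The sketch below assumes the log keeps the line formats shown in the session above (for example "Time for step 1: ..." and "Total elapsed time for analysis = ..."). The helper name summarize_eagle_log is hypothetical and not part of Eagle.

```bash
#!/bin/bash
# Hypothetical helper (not part of Eagle): print the post-QC counts,
# the per-step and per-PBWT-iteration timings, and the total elapsed
# time from an Eagle log file. Assumes the log line formats shown in
# the sample session above.
summarize_eagle_log() {
    grep -E '^(Total post-QC|Time for (step [0-9]+|PBWT iter [0-9]+):|Total elapsed time)' "$1"
}

# Summarize the log written by the sample session, if it exists.
if [ -f example.log ]; then
    summarize_eagle_log example.log
fi
```

The pattern deliberately skips the per-batch lines ("Time for phasing batch") and the "MN^2" lines, so the summary stays short even for runs with many batches.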
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. Eagle.sh). For example:
#!/bin/bash
module load Eagle
eagle \
    --vcf=EUR_test.vcf.gz \
    --geneticMapFile=$EAGLE_TABLES/genetic_map_hg19_withX.txt.gz \
    --chrom=21 \
    --outPrefix=phased \
    --numThreads=4 \
    2>&1 | tee example_vcf.log
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] Eagle.sh
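The sample batch script passes a fixed --numThreads=4, while the CPU count is requested separately through --cpus-per-task. One way to keep the two in step is to read the thread count from SLURM_CPUS_PER_TASK, which Slurm sets when --cpus-per-task is given. The following is a sketch of such a variant, not the site's recommended script. The helper name eagle_threads and the fallback of 1 thread are assumptions, and the two command -v guards are there only so the sketch also runs harmlessly outside the cluster.

```bash
#!/bin/bash
# Sketch of a variant of Eagle.sh that takes the thread count from the
# Slurm allocation instead of hard-coding it.

# Hypothetical helper: print the number of threads to use.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# falling back to 1 thread is an assumption of this sketch.
eagle_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Load the module when the module command is available.
if command -v module >/dev/null 2>&1; then
    module load Eagle
fi

# Same options as the sample batch script, except for --numThreads.
if command -v eagle >/dev/null 2>&1; then
    eagle \
        --vcf=EUR_test.vcf.gz \
        --geneticMapFile="$EAGLE_TABLES/genetic_map_hg19_withX.txt.gz" \
        --chrom=21 \
        --outPrefix=phased \
        --numThreads="$(eagle_threads)" \
        2>&1 | tee example_vcf.log
fi
```

With this variant, submitting with sbatch --cpus-per-task=4 Eagle.sh would run Eagle with four threads, matching the allocation.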