High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
Impute, qctool, gtool and snptest on Biowulf & Helix

IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009:

QCTOOL is a command-line utility program for basic quality control of GWAS datasets. It supports the same file formats used by the WTCCC studies, as well as the binary file format described on the qctool webpage and the Variant Call Format, and is designed to work seamlessly with SNPTEST and related tools

GTOOL is a program for transforming sets of genotype data for use with the programs SNPTEST and IMPUTE. GTOOL can be used to:

SNPTEST is a program for the analysis of single SNP association in genome-wide studies. The tests implemented include

Impute, qctool, gtool and snptest were all developed at the University of Oxford.

To add all these executables to your path, use

module load impute qctool gtool snptest  
Of course, you can use 'module avail programname' to see available versions of any of these programs.
[susanc@biowulf ~]$ module avail impute snptest

---------------------------- /usr/local/lmod/modulefiles ----------------------------------------
   impute/2.3.2    snptest/2.5

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

On Helix

Sample impute job:

[user@helix mydir]$ cp -R /usr/local/apps/impute/Example .

[user@helix mydir]$ module load impute

[user@helix mydir]$ impute2 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2

 IMPUTE version 2.3.2

Copyright 2008 Bryan Howie, Peter Donnelly, and Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.

The seed for the random number generator is 74894808.

Command-line input: impute2 -m ./Example/example.chr22.map -h ./Example/example.chr22.1kG.haps -l ./Example/example.chr22.1kG.legend -g ./Example/example.chr22.study.gens -strand_g ./Example/example.chr22.study.strand -int 20.4e6 20.5e6 -Ne 20000 -o ./Example/example.chr22.one.phased.impute2

 Nomenclature and data structure

     Panel 0: phased reference haplotypes
     Panel 2: unphased study genotypes

For optimal results, each successive panel (0,1,2) should contain a subset of the SNPs in the previous panel. When the data structure deviates from this ideal configuration, IMPUTE2 tries to use as much of the available information as possible; see documentation for details.

 Input files

         Panel 0 haplotypes: ./Example/example.chr22.1kG.haps
         Panel 0 hap legend: ./Example/example.chr22.1kG.legend
          Panel 2 genotypes: ./Example/example.chr22.study.gens
        Panel 2 strand info: ./Example/example.chr22.study.strand
                genetic map: ./Example/example.chr22.map

 Output files

                main output: ./Example/example.chr22.one.phased.impute2
                SNP QC info: ./Example/example.chr22.one.phased.impute2_info
             sample QC info: ./Example/example.chr22.one.phased.impute2_info_by_sample
                run summary: ./Example/example.chr22.one.phased.impute2_summary
                warning log: ./Example/example.chr22.one.phased.impute2_warnings

 Data processing

-reading genetic map from -m file
 --read 262 SNPs in the analysis interval+buffer region

-reading strand info for Panel 2 from -strand_g file
 --read strand info for 33 SNPs in the analysis region

-reading Panel 2 genotypes from -g file
 --detected 250 individuals
 --read 33 SNPs in the analysis interval+buffer region

-using -strand_g file to align Panel 2 allele labels
 --flipped strand at 14 out of 33 SNPs

-reading Panel 0 haplotypes from -h and -l files
 --detected 112 haplotypes
 --read 790 SNPs in the analysis interval+buffer region

-removing SNPs that violate the hierarchical data requirements
 --no SNPs removed

-removing reference-only SNPs from buffer region
 --removed 538 SNPs

-checking strand alignment between Panel 2 and Panel 0 by allele labels
 --flipped strand due to allele mismatch at 0 out of 33 SNPs in Panel 2

-aligning allele labels between panels

-removing non-aligned genotyped SNPs
 --removed 0 out of 27 SNPs with data in multiple panels

 Data summary

[type 0 = SNP in Panel 0 only]
[type 1 = SNP in Panel 1]
[type 2 = SNP in Panel 2 and all ref panels]
[type 3 = SNP in Panel 2 only]

-Upstream buffer region
 --0 type 0 SNPs
 --0 type 1 SNPs
 --10 type 2 SNPs
 --2 type 3 SNPs
 --12 total SNPs

-Downstream buffer region
 --0 type 0 SNPs
 --0 type 1 SNPs
 --5 type 2 SNPs
 --1 type 3 SNPs
 --6 total SNPs

-Analysis region (as defined by -int argument)
 --225 type 0 SNPs
 --0 type 1 SNPs
 --12 type 2 SNPs
 --3 type 3 SNPs
 --240 total SNPs

-Output file
 --225 type 0 SNPs
 --0 type 1 SNPs
 --12 type 2 SNPs
 --3 type 3 SNPs

-In total, 258 SNPs will be used in the analysis, including 27 Panel 2 SNPs

-making initial haplotype guesses for Panel 2 by phasing hets at random and imputing missing genotypes from allele freqs

-setting storage space
-setting mutation matrices
-setting switch rates

 Run parameters

        reference haplotypes: 112 [Panel 0]
           study individuals: 250 [Panel 2]
           sequence interval: [20400000,20500000]
                      buffer: 250 kb
                          Ne: 20000
           input call thresh: 0.900
     burn-in MCMC iterations: 10
       total MCMC iterations: 30 (20 used for inference)
      HMM states for phasing: 80 [Panel 2]
   HMM states for imputation: 112 [Panel 0->2]

 Run log

MCMC iteration [1/30]

MCMC iteration [2/30]

MCMC iteration [3/30]

-setting mutation matrices
-setting switch rates

MCMC iteration [4/30]

MCMC iteration [5/30]

MCMC iteration [6/30]

MCMC iteration [7/30]

MCMC iteration [8/30]

MCMC iteration [9/30]

MCMC iteration [10/30]

MCMC iteration [11/30]

MCMC iteration [12/30]

MCMC iteration [13/30]

MCMC iteration [14/30]

MCMC iteration [15/30]

MCMC iteration [16/30]

MCMC iteration [17/30]

MCMC iteration [18/30]

MCMC iteration [19/30]

MCMC iteration [20/30]

MCMC iteration [21/30]

MCMC iteration [22/30]

MCMC iteration [23/30]

MCMC iteration [24/30]

MCMC iteration [25/30]

MCMC iteration [26/30]

MCMC iteration [27/30]

MCMC iteration [28/30]

MCMC iteration [29/30]

MCMC iteration [30/30]

diploid sampling success rate: 0.988

haploid sampling success rate: (no haploid sampling performed)

 Imputation accuracy assessment

The table below is based on an internal cross-validation that is performed during each IMPUTE2 run. For this analysis, the program masks the genotypes of one variant at a time in the study data (Panel 2) and imputes the masked genotypes by using the remaining study and reference data. The imputed genotypes are then compared with the original genotypes to produce the concordance statistics shown in the table. You can learn more about this procedure and the contents of the table at http://mathgen.stats.ox.ac.uk/impute/concordance_table_description.html.

In the current analysis, IMPUTE2 masked, imputed, and evaluated 2985 genotypes that were called with high confidence (maximum probability >= 0.90) in the Panel 2 input file (-g or -known_haps_g).

When the masked study genotypes were imputed with reference data from Panel 0, the concordance between original and imputed genotypes was as follows:

  Interval  #Genotypes %Concordance         Interval  %Called %Concordance
  [0.0-0.1]          0          0.0         [ >= 0.0]   100.0         97.5
  [0.1-0.2]          0          0.0         [ >= 0.1]   100.0         97.5
  [0.2-0.3]          0          0.0         [ >= 0.2]   100.0         97.5
  [0.3-0.4]          0          0.0         [ >= 0.3]   100.0         97.5
  [0.4-0.5]          0          0.0         [ >= 0.4]   100.0         97.5
  [0.5-0.6]         15         86.7         [ >= 0.5]   100.0         97.5
  [0.6-0.7]          2        100.0         [ >= 0.6]    99.5         97.5
  [0.7-0.8]          5        100.0         [ >= 0.7]    99.4         97.5
  [0.8-0.9]         43         83.7         [ >= 0.8]    99.3         97.5
  [0.9-1.0]       2920         97.7         [ >= 0.9]    97.8         97.7

Have a nice day!

[user@helix mydir]$
Batch job on Biowulf

Set up a batch script along the following lines:


cd /data/user/dir1

module load impute

impute2 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2

Submit this job with:

sbatch --mem=4g  myjob.bat
In the job above, you are requesting 4 GB of memory. You should modify this number to suit your own jobs.
Swarm of jobs on Biowulf

Set up a swarm commmand file with one line for each Impute run. Example:

# this file is impute_swarm
cd /data/user/dir1; impute2 -ref_samp_out -m chr16.map -h chr16.haps  -l chr16.legend -g gtypes -s refstrand1  -Ne 11418 -int 5000000 5500000 -buffer 250 -k 10 -iter 10 -burnin 3  -o out1  -i info1  -r summary1
cd /data/user/dir2; impute2 -ref_samp_out -m chr26.map -h chr26.haps  -l chr26.legend -g gtypes -s refstrand2  -Ne 22428 -int 5000000 5500000 -buffer 250 -k 20 -iter 20 -burnin 3  -o out2  -i info2  -r summary2
cd /data/user/dir3; impute2 -ref_samp_out -m chr36.map -h chr36.haps  -l chr36.legend -g gtypes -s refstrand3  -Ne 33438 -int 5000000 5500000 -buffer 250 -k 30 -iter 30 -burnin 3  -o out3  -i info3  -r summary3
If each Impute process requires less than the default 4 GB of memory, submit this to the batch system with the command:
swarm -f cmdfile --module impute
If each Impute process requires more than 4 GB of memory, use
swarm -g # -f cmdfile --module impute
where '#' is the number of Gigabytes of memory required by each Impute process.

Note about parallelization

In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.

We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.

See this informative snippet from the Impute website for more details about dealing with whole chromosomes.

Interactive job on Biowulf

Allocate an appropriate number of interactive cores and run the job there. Example:

[susanc@biowulf ~]$sinteractive
salloc.exe: Granted job allocation 17356
slurm stepprolog here!
srun: error: x11: unable to connect node p20
                                            Begin slurm taskprolog!
End slurm taskprolog!

[susanc@p20 ~]$ cd /lscratch

[susanc@p20 lscratch]$ cp -r /usr/local/apps/gtool/0.7.5/example/ .

[susanc@p20 lscratch]$ cd example

[susanc@p20 example]$ module load gtool

[susanc@p20 example]$ gtool -S --g example.gen --s example.sample --og out.gen --os out.sample --sample_id sample_id.txt 
Number of input samples: 5
Number of output samples: 3
Number of input SNPs: 11
Number of output SNPs: 11
[susanc@p20 example]$ exit
slurm stepepilog here!
salloc.exe: Relinquishing job allocation 17356

Impute2 website
gtool website
Snptest website
qctool website