High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
Impute, qctool, gtool and snptest on Biowulf & Helix

IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009:

QCTOOL is a command-line utility program for basic quality control of GWAS datasets. It supports the same file formats used by the WTCCC studies, as well as the binary file format described on the qctool webpage and the Variant Call Format, and is designed to work seamlessly with SNPTEST and related tools

GTOOL is a program for transforming sets of genotype data for use with the programs SNPTEST and IMPUTE. GTOOL can be used to:

SNPTEST is a program for the analysis of single SNP association in genome-wide studies. The tests implemented include

Impute, qctool, gtool and snptest were all developed at the University of Oxford.

To add all these executables to your path, use

module load impute qctool gtool snptest  
Of course, you can use 'module avail programname' to see available versions of any of these programs.
[susanc@biowulf ~]$ module avail impute snptest

---------------------------- /usr/local/lmod/modulefiles ----------------------------------------
   impute/2.3.2    snptest/2.5

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

On Helix

Sample impute job:

[user@helix mydir]$ cp -R /usr/local/apps/impute/Example .

[user@helix mydir]$ cd Example

[user@helix mydir]$ module load impute

[user@helix mydir]$ impute2 -ref_samp_out -m ./chr16.map -h ./chr16.haps \
  -l ./chr16.legend -g ./chr16.reference.gtypes -s ./chr16.reference.strand \
  -Ne 11418 -int 5000000 5500000 -buffer 250 -k 10 -iter 10 -burnin 3 \
  -o ./Results/chr16.multi_panel.ref_gtypes.impute2 \
  -i ./Results/chr16.multi_panel.ref_gtypes.impute2.info \
  -r ./Results/chr16.multi_panel.ref_gtypes.impute2.summary

The seed for the random number generator is 1115038504.

Command-line input: impute2 -ref_samp_out -m ./chr16.map -h ./chr16.haps -l ./chr16.legend -g ./chr16.reference.gtypes -s ./chr16.reference.strand -Ne 11418 -int 5000000 5500000 -buffer 250 -k 10 -iter 10 -burnin 3 -o ./Results/chr16.multi_panel.ref_gtypes.impute2 -i ./Results/chr16.multi_panel.ref_gtypes.impute2.info -r ./Results/chr16.multi_panel.ref_gtypes.impute2.summary


======================
 IMPUTE version 2.0.3 
======================

Copyright 2008 Bryan Howie, Peter Donnelly, and Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.

    haplotypes file : ./chr16.haps
        legend file : ./chr16.legend
 ref genotypes file : NULL
ref gen strand file : NULL
     genotypes file : ./chr16.reference.gtypes
        strand file : ./chr16.reference.strand
           map file : ./chr16.map
 excluded SNPs file : NULL
 included SNPs file : NULL
    ref samp infile : NULL
        output file : ./Results/chr16.multi_panel.ref_gtypes.impute2
          info file : ./Results/chr16.multi_panel.ref_gtypes.impute2.info
       summary file : ./Results/chr16.multi_panel.ref_gtypes.impute2.summary

imputation interval : [5000000,5500000]

reading genetic map...done

reading inference panel genotypes
 # inference panel individuals = 30
 # SNPs with genotypes read in = 430

reading haplotypes
 # haplotypes = 120
 # SNPs read in = 1202

No initial haplotype guesses file was provided for the inference panel genotypes; now phasing hets at random and imputing missing genotypes from allele freqs.

Summary :
139 SNPs in left-hand buffer region
356 SNPs in right-hand buffer region
277 type 0 SNPs will be in output file (type 0 = SNP in reference haps file only)
0 type 1 SNPs will be in output file (type 1 = SNP in reference gens file)
0 type 2 SNPs will be in output file (type 2 = SNP in inference gens and all reference files)
0 type 3 SNPs will be in output file (type 3 = SNP in inference gens file only)
277 SNPs will be in output file in total
1202 SNPs in total

-using strand file to orientate strand in inference genotype panel
 --flipped strand at 237 genotyped SNPs in inference panel out of a total of 430
-aligning allele labels of haplotypes, reference genotypes, and inference genotypes
-removing non-aligned genotyped SNPs
 --removing 0 genotyped SNPs out of a total of 430

setting weights...done
setting storage space...done
setting mutation matrices...done
setting switch rates...done

          haps in -h file : 120
     indiv in -g_ref file : 0
         indiv in -g file : 30
                 interval : [5000000, 5500000]
                   buffer : 250
                       Ne : 11418
              call thresh : 0.900
          MCMC iterations : 10
       burn-in iterations : 3
   states for phasing (k) : 10


 MCMC iteration [1/10]
 updating inf indiv [30/30] [dip]
 MCMC iteration [2/10]
 --- RESETTING MODEL PARAMETERS FOR INFORMED CONDITIONING --- 
setting mutation matrices...done
setting switch rates...done
 updating inf indiv [30/30] [dip]
 MCMC iteration [3/10]
 updating inf indiv [30/30] [dip]
 MCMC iteration [4/10]
 updating inf indiv [30/30] [dip] [hap 0] [hap 0]
 MCMC iteration [5/10]
 updating inf indiv [30/30] [dip] [hap 0] [hap 0]
 MCMC iteration [6/10]
 updating inf indiv [30/30] [dip] [hap 0] [hap 0]
 MCMC iteration [7/10]
 updating inf indiv [30/30] [dip] [hap 0] [hap 0]
 MCMC iteration [8/10]
 updating inf indiv [30/30] [dip] [hap 0] [hap 0]
 MCMC iteration [9/10]
 updating inf indiv [30/30] [dip] [hap 0] [hap 0]
 MCMC iteration [10/10]
 updating inf indiv [30/30] [dip] [hap 0] [hap 0]

dip sampling success rate: 0.949

hap sampling success rate: (no haploid sampling performed)


--------------------------------
 Imputation accuracy assessment 
--------------------------------

This breakdown is based on an internal leave-one-out validation at SNPs with genotypes in the -g input file, using only the input genotypes with maximum call probabilities exceeding a threshold of 0.90. There are 5106 such genotypes in the current input file.

Accuracy assessment for imputation of type 0 SNPs (those with data in the haploid reference panel only) .The maximum imputed genotype calls are distributed as follows:
  Interval  #Genotypes %Concordance         Interval  %Called %Concordance
  [0.0-0.1]          0          0.0         [ >= 0.0]   100.0         96.7
  [0.1-0.2]          0          0.0         [ >= 0.1]   100.0         96.7
  [0.2-0.3]          0          0.0         [ >= 0.2]   100.0         96.7
  [0.3-0.4]          0          0.0         [ >= 0.3]   100.0         96.7
  [0.4-0.5]         10         40.0         [ >= 0.4]   100.0         96.7

[user@helix mydir]$ ls Results
chr16.multi_panel.ref_gtypes.impute2
chr16.multi_panel.ref_gtypes.impute2.info
chr16.multi_panel.ref_gtypes.impute2.summary
chr16.multi_panel.ref_gtypes.impute2_refsamp1.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp10.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp2.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp3.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp4.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp5.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp6.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp7.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp8.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp9.gz
[user@helix mydir]$
Batch job on Biowulf

Set up a batch script along the following lines:

#!/bin/bash

cd /data/user/dir1

module load impute

impute2 -ref_samp_out -m chr16.map -h chr16.haps  -l chr16.legend \
	-g gtypes -s refstrand1  -Ne 11418 -int 5000000 5500000 -buffer 250 \
	-k 10 -iter 10 -burnin 3  -o out1  -i info1  -r summary1

Submit this job with:

sbatch --mem=4g  myjob.bat
In the job above, you are requesting 4 GB of memory. You should modify this number to suit your own jobs.
Swarm of jobs on Biowulf

Set up a swarm commmand file with one line for each Impute run. Example:

# this file is impute_swarm
cd /data/user/dir1; impute2 -ref_samp_out -m chr16.map -h chr16.haps  -l chr16.legend -g gtypes -s refstrand1  -Ne 11418 -int 5000000 5500000 -buffer 250 -k 10 -iter 10 -burnin 3  -o out1  -i info1  -r summary1
cd /data/user/dir2; impute2 -ref_samp_out -m chr26.map -h chr26.haps  -l chr26.legend -g gtypes -s refstrand2  -Ne 22428 -int 5000000 5500000 -buffer 250 -k 20 -iter 20 -burnin 3  -o out2  -i info2  -r summary2
cd /data/user/dir3; impute2 -ref_samp_out -m chr36.map -h chr36.haps  -l chr36.legend -g gtypes -s refstrand3  -Ne 33438 -int 5000000 5500000 -buffer 250 -k 30 -iter 30 -burnin 3  -o out3  -i info3  -r summary3
[...]
If each Impute process requires less than the default 4 GB of memory, submit this to the batch system with the command:
swarm -f cmdfile --module impute
If each Impute process requires more than 4 GB of memory, use
swarm -g # -f cmdfile --module impute
where '#' is the number of Gigabytes of memory required by each Impute process.

Note about parallelization

In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.

We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.

See this informative snippet from the Impute website for more details about dealing with whole chromosomes.

Interactive job on Biowulf

Allocate an appropriate number of interactive cores and run the job there. Example:

[susanc@biowulf ~]$sinteractive
salloc.exe: Granted job allocation 17356
slurm stepprolog here!
srun: error: x11: unable to connect node p20
                                            Begin slurm taskprolog!
End slurm taskprolog!

[susanc@p20 ~]$ cd /lscratch

[susanc@p20 lscratch]$ cp -r /usr/local/apps/gtool/0.7.5/example/ .

[susanc@p20 lscratch]$ cd example

[susanc@p20 example]$ module load gtool

[susanc@p20 example]$ gtool -S --g example.gen --s example.sample --og out.gen --os out.sample --sample_id sample_id.txt 
Number of input samples: 5
Samples...
Number of output samples: 3
Gen...
Number of input SNPs: 11
Number of output SNPs: 11
[susanc@p20 example]$ exit
exit
slurm stepepilog here!
salloc.exe: Relinquishing job allocation 17356
Documentation

Impute2 website
gtool website
Snptest website
qctool website