High-Performance Computing at the NIH
GQT on Biowulf & Helix

Description

From the GQT documentation:

Genotype Query Tools (GQT) is command line software and a C API for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduce computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard "variant-centric" indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries to data sets with millions of individuals.

There may be multiple versions of GQT available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail gqt 

To select a module use

module load gqt/[version]

where [version] is the version of choice.

Environment variables set

References

Documentation

Interactive job on Biowulf

Allocate an interactive session with sinteractive and use GQT as described below:

biowulf$ sinteractive --mem=5g
node$ module load gqt bcftools
node$ cp $GQT_TEST_DATA/* .
node$ ls -lh
total 21M
-rw-rw-r-- 1 wresch staff 116K Mar  7 09:30 1kg.phase3.ped
-rw-rw-r-- 1 wresch staff  20M Mar  7 09:30 chr11.11q14.3.bcf
node$ # index the bcf file
node$ bcftools index chr11.11q14.3.bcf

Now create the GQT index

node$ gqt convert bcf -i chr11.11q14.3.bcf
Attempting to autodetect number of variants and samples from chr11.11q14.3.bcf
Number of variants:129852       Number of samples:2504
Extracting genotypes and metadata...........Done
Sorting genotypes and metadata...........Done
Compressing metadata...........Done
Rotating genotypes...........Done
Compressing genotypes...........Done

node$ ls -lh
total 42M
-rw-rw-r-- 1 wresch staff 116K Mar  7 09:30 1kg.phase3.ped
-rw-rw-r-- 1 wresch staff  20M Mar  7 09:30 chr11.11q14.3.bcf
-rw-rw-r-- 1 wresch staff 4.2M Mar  7 09:31 chr11.11q14.3.bcf.bim
-rw-rw-r-- 1 wresch staff 3.5K Mar  7 09:30 chr11.11q14.3.bcf.csi
-rw-rw-r-- 1 wresch staff  17M Mar  7 09:32 chr11.11q14.3.bcf.gqt
-rw-rw-r-- 1 wresch staff 508K Mar  7 09:31 chr11.11q14.3.bcf.vid
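
Before moving on, it can be useful to confirm that the conversion produced the three index files shown in the listing above (.gqt, .bim and .vid). The following is a minimal sketch; check_gqt_index is a hypothetical helper and not part of GQT, and it only tests that each file exists and is non-empty.

```shell
# Hypothetical helper (not part of GQT): verify that the index files
# created by "gqt convert bcf" exist next to the given BCF file.
check_gqt_index() {
    bcf="$1"
    missing=0
    for ext in gqt bim vid; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "${bcf}.${ext}" ]; then
            echo "missing or empty: ${bcf}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_gqt_index chr11.11q14.3.bcf returns 0 when all three files are present and 1 otherwise.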

Now create a GQT sample database with an (optional) .ped file. In this example, the database will be named 1kg.phase3.ped.db

node$ gqt convert ped -i chr11.11q14.3.bcf -p 1kg.phase3.ped
Creating sample database 1kg.phase3.ped
Adding the following fields from 1kg.phase3.ped
Family_ID       TEXT
Individual_ID   TEXT
Paternal_ID     TEXT
[...snip...]

Now the GQT index can be queried. For example, find all variants with a minor allele frequency above 10% among the individuals of the GBR population with Gender = 1 (male in the standard PED coding):

node$ gqt query \
    -i chr11.11q14.3.bcf.gqt \
    -d 1kg.phase3.ped.db \
    -p "Gender = 1 and Population ='GBR'" \
    -g "maf()>0.1" \
    > GBR.vcf
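
A quick sanity check on the result is to count the variant records that the query returned. This sketch assumes the output is plain-text VCF in which header lines start with #; count_vcf_records is a hypothetical helper name.

```shell
# Hypothetical helper: count the non-header, non-empty lines of a VCF file.
# Assumes uncompressed VCF where every header line begins with '#'.
count_vcf_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}
```

For example, count_vcf_records GBR.vcf prints the number of variants that passed the filter.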

End the interactive session

node$ exit

biowulf$

For more details see the GitHub page and the paper.

Batch job on Biowulf

Create a batch script similar to the following example, assuming that the GQT index and the sample database already exist:

#! /bin/bash
# this file is gqt.batch

module load gqt || exit 1
for pop in GWD YRI IBS TSI CHS JPT PUR CHB GIH ITU STU; do
    gqt query \
        -i chr11.11q14.3.bcf.gqt \
        -d 1kg.phase3.ped.db \
        -p "Gender = 1 and Population ='${pop}'" \
        -g "maf()>0.1" \
        > "${pop}.vcf"
done

Submit to the queue with sbatch:

biowulf$ sbatch --mem=4g gqt.batch
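
If the populations should run as independent jobs rather than sequentially in one loop, a Slurm job array is one possible alternative. The sketch below is an assumption about how this could be arranged, not part of the original example: pop_for_task is a hypothetical helper that maps a zero-based array index to a population code, and the script would be submitted with sbatch --array=0-10 --mem=4g gqt.batch.

```shell
# Populations from the batch example above, in a fixed order.
POPS="GWD YRI IBS TSI CHS JPT PUR CHB GIH ITU STU"

# Hypothetical helper: print the population for a zero-based task index.
# cut numbers fields from 1, so the index is shifted by one.
# An index past the end of the list prints an empty line.
pop_for_task() {
    echo "$POPS" | cut -d' ' -f"$(( $1 + 1 ))"
}

# Inside the batch script this would be used as follows
# (left commented out so the sketch runs without GQT or Slurm):
#   pop=$(pop_for_task "$SLURM_ARRAY_TASK_ID")
#   gqt query -i chr11.11q14.3.bcf.gqt -d 1kg.phase3.ped.db \
#       -p "Gender = 1 and Population ='${pop}'" -g "maf()>0.1" > "${pop}.vcf"
```

Each array task then writes one VCF file, and Slurm sets SLURM_ARRAY_TASK_ID to a different value from the --array range for every task.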