High-Performance Computing at the NIH
Plink on Biowulf & Helix

Plink is a whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.

PLINK (one syllable) is being developed by Shaun Purcell at the Center for Human Genetic Research (CHGR), Massachusetts General Hospital (MGH), and the Broad Institute of Harvard & MIT, with the support of others.

PLINK is not a parallel program. Single PLINK jobs can be run interactively on the Biowulf interactive nodes or on Helix, or as batch jobs on Biowulf. If you have multiple PLINK jobs to run, the swarm utility is the easiest way to run them.

Available versions of PLINK can be seen and loaded using the module commands, as in the example below:

[user@biowulf]$ module avail plink

---------------------- /usr/local/lmod/modulefiles ------------------------------
   plink/1.06    plink/1.07 (D)
  Where:
   (D):  Default Module


[user@biowulf]$ module load plink

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) plink/1.07
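
To load a version other than the default, give the full module name shown by 'module avail'; for example, to use the older build:

[user@biowulf]$ module load plink/1.06
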

The utility FCgene, a format-conversion tool for genotyped data (e.g. PLINK to MACH, MACH to PLINK), is also available. Type 'module load fcgene' to add the binary to your path, and then 'fcgene' to run it.
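
For example, to make it available in the current session (see the FCgene documentation for the conversion options to pass):

[user@biowulf]$ module load fcgene
[user@biowulf]$ fcgene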

Running Plink on Helix

Sample session

First, place the *.ped and *.map input files (and any other required inputs) in the current directory.
% plink --file test
% plink --file test --freq
% plink --file test --assoc
% plink --file test --make-bed

Sample output (from a run in /home/user/plink/t1):

@----------------------------------------------------------@
|          PLINK!        |     v1.03     |   04/Jun/2008    |
|----------------------------------------------------------|
|  (C) 2008 Shaun Purcell, GNU General Public License, v2   |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:   |
|        http://pngu.mgh.harvard.edu/purcell/plink/         |
@----------------------------------------------------------@

Skipping web check... [ --noweb ]
Writing this text to log file [ plink.log ]
Analysis started: Wed Jul 9 13:43:13 2008

Options in effect:
        --noweb
        --file test1

2 (of 2) markers to be included from [ test1.map ]
6 individuals read from [ test1.ped ]
6 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
3 cases, 3 controls and 0 missing
6 males, 0 females, and 0 of unspecified sex
Before frequency and genotyping pruning, there are 2 SNPs
6 founders and 0 non-founders found
0 of 6 individuals removed for low genotyping ( MIND > 0.1 )
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 0.1 )
0 SNPs failed frequency test ( MAF < 0.01 )
After frequency and genotyping pruning, there are 2 SNPs
After filtering, 3 cases, 3 controls and 0 missing
After filtering, 6 males, 0 females, and 0 of unspecified sex

Analysis finished: Wed Jul 9 13:43:13 2008
[...etc...]
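
For reference, the two text inputs look roughly like this (an illustrative sketch with hypothetical filenames, not the actual test1 files used above). Each .map line gives chromosome, SNP identifier, genetic distance, and base-pair position; each .ped line gives family ID, individual ID, paternal ID, maternal ID, sex, and phenotype, followed by two alleles per SNP:

example.map (one line per SNP):
1  rs0001  0  1000
1  rs0002  0  2000

example.ped (one line per individual):
FAM1  IND1  0  0  1  1   A A  G T
FAM1  IND2  0  0  1  2   A C  T T
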
Submitting a swarm of Plink jobs on Biowulf

The swarm program is a convenient way to submit large numbers of jobs all at once instead of manually submitting them one by one.

Create a swarm command file along the lines of the one below:

cd /data/$USER/myseqs; plink --noweb --ped file1.ped --map file1.map --assoc
cd /data/$USER/myseqs; plink --noweb --ped file2.ped --map file2.map --assoc
cd /data/$USER/myseqs; plink --noweb --ped file3.ped --map file3.map --assoc
[...etc...]
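
If the input files follow a regular naming pattern, the command file can also be generated with a short shell loop rather than typed by hand (a sketch, assuming matching file1.ped/file1.map, file2.ped/file2.map, ... pairs in /data/$USER/myseqs; the command file cmdfile is written into that directory):

#!/bin/bash
# Write one plink command per .ped/.map pair into the swarm command file
cd /data/$USER/myseqs
for ped in file*.ped; do
    base=${ped%.ped}
    echo "cd /data/$USER/myseqs; plink --noweb --ped ${base}.ped --map ${base}.map --assoc"
done > cmdfile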

Submit this swarm with:

swarm -f cmdfile --module plink/1.07

By default, each command line in the swarm file is executed on 1 core (2 CPUs) of a node and can use up to 4 GB of memory. If a plink command requires more than 4 GB of memory, you must specify the required memory with the -g # flag to swarm. For example, if each command requires 10 GB of memory, submit the swarm with:

swarm -g 10 -f cmdfile --module plink/1.07

For more information about running swarm, see the swarm documentation (swarm.html).

Submitting a single Plink job on Biowulf

Single plink jobs would typically be submitted only for debugging purposes.

1. Create a script file containing the PLINK commands, as in the example below:

---------- /data/user/plink/run1/plink.bat --------------
#!/bin/bash -v
# (-v echoes each command into the job's output as it runs)

# Run a sequence of PLINK analyses on the test1 dataset
cd /data/$USER/plink/t1
plink --noweb --file test1                # read the data and report basic summary statistics
plink --noweb --file test1 --freq         # allele frequencies
plink --noweb --file test1 --assoc        # case/control association test
plink --noweb --file test1 --make-bed     # convert to the binary .bed/.bim/.fam fileset
----------------- end of script ----------------------

2. Now submit the script using the 'sbatch' command, e.g.

sbatch --mem=5g plink.bat
where, of course, the memory requirement should be appropriately set for your own job.
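
Other standard sbatch options can be combined in the same way; for example, to also request a specific walltime (the 8-hour value here is only an illustration, to be set to what your analysis actually needs):

sbatch --mem=5g --time=08:00:00 plink.bat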

Documentation

http://pngu.mgh.harvard.edu/~purcell/plink/