High-Performance Computing at the NIH
peddy on Biowulf & Helix


peddy compares the sex and familial relationships given in a PED file with those inferred from a VCF file. It does this by sampling 25000 sites plus chrX from the VCF file to estimate relatedness, heterozygosity, sex, and ancestry. It uses data from the 1000 Genomes Project.

There may be multiple versions of peddy available. An easy way to select a version is to use modules. To see the available modules, type

module avail peddy 

To select a module, use

module load peddy/[version]

where [version] is the version of choice.

peddy is a multithreaded application. Make sure the number of threads passed to peddy with -p matches the number of CPUs requested for the job.
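One way to keep the thread count in step with the allocation is to read it from $SLURM_CPUS_PER_TASK, as the examples on this page do. The sketch below adds a fallback to a single thread when that variable is unset, for example outside of a Slurm job. The function name peddy_threads is hypothetical and not part of peddy or Slurm.

```bash
#!/bin/bash
# peddy_threads is a hypothetical helper, not part of peddy or Slurm.
# It prints the number of CPUs Slurm allocated per task, or 1 when the
# variable is unset or empty (for example, outside of a Slurm job).
peddy_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Print the peddy command line rather than running it, so that the thread
# count can be checked first. Input paths are the test data names used on
# this page.
echo peddy -p "$(peddy_threads)" --plot --prefix ceph-1463 \
    data/ceph1463.vcf.gz data/ceph1463.ped
```

Dropping the leading echo would run peddy itself with the same arguments.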

Environment variables set



Batch job on Biowulf

This example uses test data in /usr/local/apps/peddy/TEST_DATA

Create a batch script similar to the following example:

#! /bin/bash
# this file is peddy.batch
module load peddy || exit 1

cp -r /usr/local/apps/peddy/TEST_DATA/data .
peddy -p $SLURM_CPUS_PER_TASK --plot --prefix ceph-1463 data/ceph1463.vcf.gz data/ceph1463.ped

Submit to the queue with sbatch:

biowulf$ sbatch --cpus-per-task=6 --mem=5g peddy.batch

The script will generate an HTML file, several QC plots, and several CSV files.
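After a batch job finishes, it can be useful to confirm that all outputs were written. The following is a minimal sketch that assumes the output file names match the directory listing in the interactive example on this page. check_peddy_outputs is a hypothetical helper, not a peddy feature, and other peddy versions or options may produce a different set of files.

```bash
#!/bin/bash
# check_peddy_outputs is a hypothetical helper, not part of peddy.
# Usage: check_peddy_outputs PREFIX
# Returns 0 if every expected output file exists and is non-empty,
# and 1 otherwise, naming each missing or empty file on stderr.
check_peddy_outputs() {
    local prefix="$1"
    local missing=0
    local suffix
    # Suffixes taken from the example listing for the ceph-1463 test run.
    for suffix in het_check.csv het_check.png html pca_check.png \
                  ped_check.csv ped_check.png peddy.ped \
                  sex_check.csv sex_check.png; do
        if [[ ! -s "${prefix}.${suffix}" ]]; then
            echo "missing or empty: ${prefix}.${suffix}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the test data run above, this could be called as check_peddy_outputs ceph-1463 at the end of the batch script.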

Swarm of jobs on Biowulf

Create a swarm command file similar to the following example:

# this file is peddy.swarm
peddy -p $SLURM_CPUS_PER_TASK --plot --prefix fam1 fam1/fam1.vcf.gz fam1/fam1.ped
peddy -p $SLURM_CPUS_PER_TASK --plot --prefix fam2 fam2/fam2.vcf.gz fam2/fam2.ped
peddy -p $SLURM_CPUS_PER_TASK --plot --prefix fam3 fam3/fam3.vcf.gz fam3/fam3.ped

Submit to the queue with swarm:

biowulf$ swarm -f peddy.swarm -g 5 -t 6 --module peddy
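For many families, the swarm command file can be generated rather than written by hand. The sketch below assumes the layout used in the example above, where each family directory NAME contains NAME.vcf.gz and NAME.ped. make_peddy_swarm is a hypothetical helper, not part of peddy or swarm.

```bash
#!/bin/bash
# make_peddy_swarm is a hypothetical helper, not part of peddy or swarm.
# Usage: make_peddy_swarm DIR [DIR ...]
# Prints one peddy command per family directory, assuming each directory
# NAME holds NAME.vcf.gz and NAME.ped.
make_peddy_swarm() {
    local dir fam
    for dir in "$@"; do
        dir="${dir%/}"              # drop a trailing slash, if any
        fam=$(basename "$dir")
        # Skip directories that lack either input file.
        [[ -f "$dir/$fam.vcf.gz" && -f "$dir/$fam.ped" ]] || continue
        # \$ leaves SLURM_CPUS_PER_TASK unexpanded here so that it is
        # resolved inside each job instead of when the file is written.
        echo "peddy -p \$SLURM_CPUS_PER_TASK --plot --prefix $fam $dir/$fam.vcf.gz $dir/$fam.ped"
    done
}
```

Running make_peddy_swarm fam1 fam2 fam3 > peddy.swarm would produce the three command lines of the example file, which can then be submitted with the swarm command shown above.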

Interactive job on Biowulf

Allocate an interactive session with sinteractive and run peddy as described above:

biowulf$ sinteractive 
node$ module load peddy
node$ cp -r /usr/local/apps/peddy/TEST_DATA/data .
node$ peddy -p $SLURM_CPUS_PER_TASK --plot --prefix ceph-1463 data/ceph1463.vcf.gz data/ceph1463.ped
ran in 6.6 seconds
loaded and subsetted thousand-genomes genotypes in 0.6 seconds
ran randomized PCA on thousand-genomes samples at 18984 sites in 2.5 seconds
Projected thousand-genomes genotypes and sample genotypes and predicted ancestry via SVM in 0.3 seconds
ran in 9.0 seconds
sex-check: 0 skipped / 814 kept
ran in 0.3 seconds
node$ ls -l
total 548K
-rw-rw---- 1 user group 2.0K Aug  8 09:57 ceph-1463.het_check.csv
-rw-rw---- 1 user group  22K Aug  8 09:57 ceph-1463.het_check.png
-rw-rw---- 1 user group 213K Aug  8 09:57 ceph-1463.html
-rw-rw---- 1 user group 116K Aug  8 09:57 ceph-1463.pca_check.png
-rw-rw---- 1 user group  13K Aug  8 09:56 ceph-1463.ped_check.csv
-rw-rw---- 1 user group 110K Aug  8 09:56 ceph-1463.ped_check.png
-rw-rw---- 1 user group 1.7K Aug  8 09:57 ceph-1463.peddy.ped
-rw-rw---- 1 user group  835 Aug  8 09:57 ceph-1463.sex_check.csv
-rw-rw---- 1 user group  27K Aug  8 09:57 ceph-1463.sex_check.png
drwxrwxr-x 2 user group 4.0K Aug  8 09:53 data
node$ exit