High-Performance Computing at the NIH
Platypus on Biowulf and Helix

Platypus (http://www.well.ox.ac.uk/platypus) is a tool for variant-detection in high-throughput sequencing data.

Module Environment for Platypus

Before using platypus, you must add the platypus environment module and the other modules it uses into your shell environment. This is most easily done by using the module commands, as in the example below:

[user@helix]$ module avail platypus                   (see what versions are available)

-------------------- /usr/local/Modules/3.2.9/modulefiles --------------------
platypus/ platypus/0.8.1

[user@helix]$ module load platypus                    (load the default version)
[user@helix]$ module list                             (see what versions are loaded)
Currently Loaded Modulefiles:
  1) python/2.7.9   2) platypus/0.8.1
[user@helix]$ module unload platypus                  (unload platypus)

Running program platypus

The program names "platypus" and "Platypus" are synonymous with "Platypus.py", which is a Python script. With the platypus environment module loaded, you can use any one of the three names to run the program "Platypus.py". In particular,

  1. The following six program invocations are equivalent to each other:

    $ Platypus.py callVariants [options]
    $ python Platypus.py callVariants [options]
    $ Platypus callVariants [options]
    $ python Platypus callVariants [options]
    $ platypus callVariants [options]
    $ python platypus callVariants [options]

  2. The following six program invocations are equivalent to each other:

    $ Platypus.py continueCalling [options]
    $ python Platypus.py continueCalling [options]
    $ Platypus continueCalling [options]
    $ python Platypus continueCalling [options]
    $ platypus continueCalling [options]
    $ python platypus continueCalling [options]

You can see a list of all the possible input options by means of the following commands:

[user@helix]$ module list
Currently Loaded Modulefiles:
  1) python/2.7.9   2) platypus/0.8.1
[user@helix]$ platypus callVariants --help
Usage: platypus [options]

  -h, --help            show this help message and exit
   :::: many, many options ::::
[user@helix]$ platypus continueCalling --help
Usage: platypus [options]

  -h, --help         show this help message and exit
  --vcfFile=VCFFILE  platypus will start again from the nearest possible co-
                     ordinate to the end of this VCF. This must be a VCF
                     produced by platypus

However, in most cases the default parameter values should be fine, and you will only need to specify the --bamFiles, --refFile, and --output arguments.

By default, if you do not specify a region or regions of interest, platypus will run through all the data in your BAM files. The --regions argument can be used to restrict the run to specific regions.
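As a minimal sketch of restricting a run to a few regions, the snippet below joins several region strings into the comma-separated form accepted by --regions and prints the resulting command instead of running it. The helper name join_regions, the region coordinates, and the file names DATA.bam, GENOME.fa and test.vcf are placeholders chosen for illustration.

```shell
#!/bin/bash
# Join region strings into the comma-separated list accepted by --regions.
# join_regions is a hypothetical helper, not part of platypus.
join_regions() {
    local IFS=,
    echo "$*"
}

# Placeholder regions: one whole chromosome and one interval.
regions=$(join_regions chr20 chrX:1000-100000)

# Print the command for inspection; remove "echo" to actually run it.
echo platypus callVariants --bamFiles=DATA.bam --refFile=GENOME.fa \
     --output=test.vcf --regions="$regions"
```

Printing the command first makes it easy to check the region list before committing cluster time to the run.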

Running in Variant-Calling Mode

The standard way of running platypus is to use it to detect variants in one or more BAM files. Variants are detected by comparing the BAM reads with a reference sequence. This can be done using the following command:

platypus callVariants --bamFiles=DATA.bam --regions=chr20 \
                      --output=test.vcf --refFile=GENOME.fa

where the input BAM files and the genome reference must be indexed using samtools, or a program that produces compatible index files.
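A quick pre-flight check can catch missing indexes before a job is submitted. The sketch below assumes the default samtools naming, that is, DATA.bam.bai (or DATA.bai) beside the BAM and GENOME.fa.fai beside the reference; the function name check_indexes and the file names are illustrative only.

```shell
#!/bin/bash
# Report whether the index files for a BAM and a FASTA reference are present.
# Assumes samtools-style names: FILE.bam.bai (or FILE.bai) and FILE.fa.fai.
# check_indexes is a hypothetical helper, not part of platypus or samtools.
check_indexes() {
    local bam=$1 ref=$2 status=0
    if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
        echo "missing BAM index: run 'samtools index $bam'"
        status=1
    fi
    if [ ! -e "$ref.fai" ]; then
        echo "missing FASTA index: run 'samtools faidx $ref'"
        status=1
    fi
    return $status
}

# Example call with placeholder file names; prints what is missing, if anything.
check_indexes DATA.bam GENOME.fa || echo "index the inputs before calling variants"
```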

Variant calling with additional Reference calling

platypus can also output reference calls. When a region is well covered by reads, and there is no evidence of variation from the reference, a 'REFCALL' block will be output. This can be useful if you want to exclude the possibility of any variation in a specific region. Each REFCALL block comes with an associated quality score, in the 'QUAL' column. If this is high, then there is good support for the reference sequence; if this score is low, then there is some evidence for variation, but not enough for platypus to make an explicit variant call.

To enable reference-calling, use the '--outputRefCalls=1' flag:

platypus callVariants --bamFiles=DATA.bam --regions=chr20 --output=test.vcf \
                      --refFile=GENOME.fa --outputRefCalls=1

Running in Genotyping Mode

platypus can take as input a compressed, indexed VCF file, and genotype the BAMs for all alleles in that VCF.

To run platypus in genotyping mode, use the following command:

platypus callVariants --bamFiles=DATA.bam --regions=chr20 --output=test.vcf \
                      --refFile=GENOME.fa --source=INPUT.vcf.gz \
                      --minPosterior=0 --getVariantsFromBAMs=0

The input VCF must be compressed with bgzip and indexed with tabix. To create such a file, do the following:

[user@helix]$ bgzip file.vcf             # Converts file.vcf into file.vcf.gz
[user@helix]$ tabix -p vcf file.vcf.gz   # Creates the index file.vcf.gz.tbi

Known issues with genotyping

Sometimes variants in the input VCF don't get genotypes in the output VCF. This generally only happens in complex regions, with many variant candidates, when the particular variant is not well supported by the data, typically when there are lots of indels close together.

Running in Combined Genotyping/Calling Mode

If required, platypus can detect variants from the input BAM files, as well as using an input allele list. This can be done as follows:

platypus callVariants --bamFiles=DATA.bam --regions=chr20 --output=test.vcf \
                      --refFile=GENOME.fa --source=INPUT.vcf.gz

If you want reference genotypes ('0/0') to be reported, set the argument --minPosterior=0; the output VCF will then also contain records for variant candidates that were genotyped '0/0'.

Main command-line arguments to platypus

Option Description of Flag or Parameter
General variant-calling arguments
--output=*FileName* Name of output VCF file
--logFileName=*FileName* Name of output log file (default is 'log.txt')
--refFile=*FileName* Name of indexed FASTA reference file
--regions=*RegionList* List of regions to search. The format of a region is e.g. 'chrX:1000-100000'. *RegionList* can be a comma-separated list of regions, or a text file of regions in the same format.
--bamFiles=*BAMfileList* List of BAM files. Currently must be either a comma-separated list of indexed BAMs, or the name of a text file with a list of input BAMs, one per line.
--bufferSize=*Kbytes* Specifies how much (as a genomic region) of the BAMs to buffer into memory at any time (default is 100kb)
--minReads=*Number* The minimum number of reads required to support a variant (default is 2)
--maxReads=*Number* The maximum allowed coverage in a region of --bufferSize *Kbytes*. If there are more reads than *Number*, the region will be skipped, and a warning issued (default is 5,000,000)
--maxVariants=*Number* The maximum number of variants to consider in a given window (default is 8)
--verbosity=*Number* Level of logging. Set to 3 for debug output (default is 2)
--source=*FileName* Name of input VCF file to use for genotyping (default is 'None')
--nCPU=*Number* Number of processors to use. If *Number* > 1, then multiple processes are run in parallel, and the output is combined at the end (default is 1)
--getVariantsFromBAMs=*flag* If set to 1, variant candidates will be generated from BAMs as well as any other inputs (default is 1)
--genSNPs=*flag* If set to 1, SNP candidates will be considered (default is 1)
--genIndels=*flag* If set to 1, Indel candidates will be considered (default is 1)
--minPosterior=*Number* Only variants with posterior >= *Number* will be output to the VCF. (default is 5; phred-scaled)
--maxSize=*BaseCount* Only variant candidates smaller than *BaseCount* will be considered. Anything larger is filtered out (default is 1500 bases)
--minFlank=*BaseCount* Variant candidates must be > *BaseCount* bases from the end of a read to be considered (default is 10 bases).
Arguments for BAM data filtering.
--minMapQual=*Number* Minimum mapping quality of reads to consider. Any reads with map qual below *Number* are ignored (default is 20)
--minBaseQual=*Number* Minimum allowed base-calling quality. Any bases with base qual below *Number* are ignored in SNP-calling (default is 20)
--minGoodQualBases=*Number* Minimum number of bases with quality above --minBaseQual *Number* that must be present for a read to be used (default is 20).
--filterDuplicates=*flag* If set to 1 then skip duplicate read-pairs based on the start and end position of both reads (default is 1).
--filterReadsWithUnmappedMates=*flag* If set to 1, filters reads whose mates are un-mapped (default is 1).
--filterReadsWithDistantMates=*flag* If set to 1, filters reads whose mates are mapped far away, using the IS_PROPER_PAIR flag in the BAM record (default is 1).
--filterReadPairsWithSmallInserts=*flag* If set to 1, filters read-pairs with insert sizes < one read length (default is 1).
Arguments for local assembly
--assemble=*flag* Set to 1 to turn on local assembly. (default is 0)
Arguments for reference calling
--outputRefCalls=*flag* If set to 1, will output reference call blocks (default is 0).

These are not all the possible arguments. To get a complete list, run "platypus callVariants --help". Anything not listed here should generally be left at its default value.
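To illustrate the text-file form of --bamFiles together with --nCPU, the sketch below collects every BAM in a directory into a list file, one path per line, and prints a matching command. The helper name make_bam_list, the list name bams.txt, and the choice of 4 processes are assumptions made for the example, not values recommended by the source.

```shell
#!/bin/bash
# Write the paths of all BAM files in a directory to a list file, one per line,
# which is the text-file form accepted by --bamFiles.
# make_bam_list is a hypothetical helper, not part of platypus.
make_bam_list() {
    local dir=$1 list=$2 bam
    : > "$list"                      # start with an empty list file
    for bam in "$dir"/*.bam; do
        if [ -e "$bam" ]; then       # skip the unexpanded pattern if no BAMs exist
            echo "$bam" >> "$list"
        fi
    done
}

# Placeholder paths; print the command for inspection instead of running it.
make_bam_list . bams.txt
echo platypus callVariants --bamFiles=bams.txt --refFile=GENOME.fa \
     --output=test.vcf --nCPU=4
```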

VCF output

The VCF files output by platypus contain a number of annotations, some of which are deprecated and will be removed in future releases.

INFO fields

FR Estimated haplotype population frequency
TC Total coverage at this locus
TCR Total reverse strand coverage at this locus
TCF Total forward strand coverage at this locus
NR Total number of reverse reads containing this variant
NF Total number of forward reads containing this variant
TR Total number of reads containing this variant
HP Homopolymer run length in 20 bases either side of variant position
PP Posterior probability (phred scaled) that this variant segregates in the data.
SC Genomic sequence 10 bases either side of variant position
MMLQ Median minimum base quality for bases around variant. If this is low (<=10) then the variant is only supported by low-quality reads.
WS Start position of window in which variant was called
WE End position of window in which variant was called
SOURCE Flag to say if the variant was found by Platypus, the Assembler, or an input file
END End position of reference call-block. Only used for reference calling
SbPval Binomial P-value for strand bias test
MQ Root mean square of mapping qualities of reads at the variant position
QD Variant-quality/read-depth for this variant
BRF Fraction of reads around this variant that failed filters
HapScore The number of haplotypes supported in the calling window
Size Size of reference call block. Only used for reference calling

FILTER fields

strandBias Variant fails strand-bias filter
alleleBias Variant fails allele-bias filter
badReads Variant is supported only by low-quality reads
Q20 Variant-call has low posterior score ( < phred-score of 20)
GOF Variant-call fails goodness-of-fit test
PASS Variant passes all filters
QD Ratio of variant quality to number of supporting reads is low
SC Sequence context surrounding variant has low complexity

For multi-allelic sites, the filter field is based on information from the best variant call at that site, so there may be variants which fail filters at these sites. VCF does not allow allele-specific filter values.
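Because FILTER is the seventh tab-separated column of a VCF record, the passing calls can be pulled out with a short awk filter. The sketch below keeps header lines and records whose FILTER value is exactly PASS; the function name keep_pass and the three-line VCF fragment are made up for illustration and are not real platypus output.

```shell
#!/bin/bash
# Keep header lines and records whose FILTER column (7th) is exactly PASS.
# Reads the named files, or standard input when no file is given.
# keep_pass is a hypothetical helper, not part of platypus.
keep_pass() {
    awk -F'\t' '/^#/ || $7 == "PASS"' "$@"
}

# Tiny made-up VCF fragment: one passing record and one failing two filters.
{
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr20\t100\t.\tA\tG\t50\tPASS\tTC=30\n'
    printf 'chr20\t200\t.\tC\tT\t12\tQ20;badReads\tTC=8\n'
} | keep_pass
```

On real data the same function can be called as "keep_pass test.vcf > test.pass.vcf". For multi-allelic sites, remember that PASS reflects only the best variant call at the site.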

FORMAT fields

GT Un-phased genotype calls
GL Genotype log-likelihoods (natural log) for AA, AB and BB genotypes, where A = ref and B = variant. Only applicable for bi-allelic sites
GOF Phred-scaled goodness-of-fit score for the genotype call
GQ Phred-scaled quality score for the genotype call
NR Number of reads covering variant position in this sample
NV Number of reads at variant position which support the called variant in this sample
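The NR and NV fields can be combined into a per-sample fraction of reads supporting the variant. The sketch below is one way to do this with awk for the first sample column: it looks up NR and NV by name in the FORMAT column rather than assuming a fixed order, and skips records where the values are comma-separated lists or NR is zero. The function name variant_fraction and the sample record are made up for illustration.

```shell
#!/bin/bash
# For the first sample (column 10), print CHROM, POS, NR, NV and NV/NR.
# NR and NV are located by name in the FORMAT column (column 9).
# Records with comma-separated NR/NV values or with NR = 0 are skipped.
# variant_fraction is a hypothetical helper, not part of platypus.
variant_fraction() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($9, key, ":"); split($10, val, ":")
            nr = ""; nv = ""
            for (i = 1; i <= n; i++) {
                if (key[i] == "NR") nr = val[i]
                if (key[i] == "NV") nv = val[i]
            }
            if (nr == "" || nv == "" || nr ~ /,/ || nv ~ /,/ || nr + 0 == 0) next
            printf "%s\t%s\t%d\t%d\t%.2f\n", $1, $2, nr, nv, nv / nr
        }' "$@"
}

# Made-up record for illustration (not real platypus output).
printf 'chr20\t100\t.\tA\tG\t50\tPASS\tTC=30\tGT:GL:GOF:GQ:NR:NV\t0/1:-10,0,-20:3:99:30:12\n' |
    variant_fraction
```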

Known Issues

When genotyping, if there are many variants close together, then not all will be reported in the output VCF. This is only true for variants which are not supported by the data. Variants that are well supported will have genotype calls.

Using program platypus on the biowulf cluster

When running the platypus program, or any other platypus command, on the biowulf cluster, you must have already put the platypus environment in place by running the command "module load platypus". In particular, this means that

  1. For an interactive node session, you must run "module load platypus" before attempting any platypus commands.
  2. For a single platypus batch job, you must include "module load platypus" in your job script before any line running a platypus job.
  3. For a swarm of platypus jobs, you must include "module load platypus" in your swarm command file before any line running a platypus job.
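As a rough sketch of item 3, the snippet below writes a swarm command file with one platypus command per region, each preceded on the same line by "module load platypus". Placing the module command on every line is one reading of the rule above and is an assumption; check swarm.html for the form your swarm version expects. The helper name make_swarm_file, the file name platypus.swarm, and the input file names are placeholders.

```shell
#!/bin/bash
# Write a swarm command file: one line per region, each line loading the
# platypus module before running platypus on that region.
# make_swarm_file is a hypothetical helper, not part of swarm or platypus.
make_swarm_file() {
    local out=$1 region
    shift
    : > "$out"                       # start with an empty command file
    for region in "$@"; do
        echo "module load platypus; platypus callVariants --bamFiles=DATA.bam --refFile=GENOME.fa --regions=$region --output=$region.vcf" >> "$out"
    done
}

# Placeholder regions; print the generated file for inspection.
make_swarm_file platypus.swarm chr20 chr21
cat platypus.swarm
```

Writing a separate output VCF per region keeps the jobs from overwriting each other's results.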

Running a single Platypus batch job on Biowulf

(See the section of the same name for application samtools).

Running a swarm of Platypus jobs

(See the section of the same name for application samtools).

For more information regarding running swarm, see swarm.html

Running an interactive Platypus job on Biowulf

(See the section of the same name for application samtools).