RTG Tools is a subset of RTG Core that includes several useful utilities for working with VCF files and sequence data. Notably, it includes the vcfeval command, which compares called variants against a baseline VCF for genotype agreement, irrespective of representational differences.
Allocate an interactive session and run the program. Sample session:
[user@biowulf ~]$ sinteractive
[user@cn0868 ~]$ module load rtg-tools
[+] Loading java 1.8.0_211 ...
[+] Loading Graphviz v 2.40.1 ...
[+] Loading rtg-tools 3.12.1
[user@cn0868 ~]$ rtg vcfeval -h
Usage: rtg vcfeval [OPTION]... -b FILE -c FILE -o DIR -t SDF
Evaluates called variants for genotype agreement with a baseline variant set irrespective of representational
differences. Outputs a weighted ROC file which can be viewed with rtg rocplot and VCF files containing false
positives (called variants not matched in the baseline), false negatives (baseline variants not matched in the
call set), and true positives (variants that match between the baseline and calls).
File Input/Output
  -b, --baseline=FILE           VCF file containing baseline variants
      --bed-regions=FILE        if set, only read VCF records that overlap the ranges contained in the
                                specified BED file
  -c, --calls=FILE              VCF file containing called variants
  -e, --evaluation-regions=FILE if set, evaluate within regions contained in the supplied BED file, allowing
                                transborder matches. To be used for truth-set high-confidence regions or other
                                regions of interest where region boundary effects should be minimized
  -o, --output=DIR              directory for output
      --region=REGION           if set, only read VCF records within the specified range. The format is one of
                                <sequence_name>, <sequence_name>:<start>-<end>, <sequence_name>:<pos>+<length>
                                or <sequence_name>:<pos>~<padding>
  -t, --template=SDF            SDF of the reference genome the variants are called against
Filtering
      --all-records             use all records regardless of filter status (Default is to only process
                                variants passing filters)
      --decompose               decompose complex variants into smaller constituents to allow partial credit
      --ref-overlap             allow alleles to overlap where bases of either allele are same-as-ref (Default
                                is to only allow VCF anchor base overlap)
      --sample=STRING           the name of the sample to select. Use <baseline_sample>,<calls_sample> to
                                select different sample names for baseline and calls. (Required when using
                                multi-sample VCF files)
      --sample-ploidy=INT       expected ploidy of samples (Default is 2)
      --squash-ploidy           treat heterozygous genotypes as homozygous ALT in both baseline and calls, to
                                allow matches that ignore zygosity differences
Reporting
      --at-precision=FLOAT      output summary statistics where precision >= supplied value (Default is to
                                summarize at maximum F-measure)
      --at-sensitivity=FLOAT    output summary statistics where sensitivity >= supplied value (Default is to
                                summarize at maximum F-measure)
      --no-roc                  do not produce ROCs
  -m, --output-mode=STRING      output reporting mode. Allowed values are [split, annotate, combine, ga4gh,
                                roc-only] (Default is split)
      --roc-expr=STRING         output ROC file for variants matching custom JavaScript expression. Use the
                                form <LABEL>=<EXPRESSION>. May be specified 0 or more times
      --roc-regions=STRING      output ROC file for variants overlapping custom regions supplied in BED file.
                                Use the form <LABEL>=<FILENAME>. May be specified 0 or more times
      --roc-subset=STRING       output ROC file for preset variant subset. Allowed values are [hom, het, snp,
                                non-snp, mnp, indel]. May be specified 0 or more times, or as a comma separated
                                list
  -O, --sort-order=STRING       the order in which to sort the ROC scores so that "good" scores come before
                                "bad" scores. Allowed values are [ascending, descending] (Default is
                                descending)
  -f, --vcf-score-field=STRING  the name of the VCF FORMAT field to use as the ROC score. Also valid are
                                "QUAL", "INFO.<name>" or "FORMAT.<name>" to select the named VCF FORMAT or INFO
                                field (Default is GQ)
Utility
  -h, --help                    print help on command-line flag usage
  -Z, --no-gzip                 do not gzip the output
  -T, --threads=INT             number of threads (Default is the number of available cores)
[user@cn0868 ~]$ demo-tools.sh rtg
Making directory for demo data: demo-tools
Checking RTG is executable
Checking if Graphviz is installed
RTG Tools Simulation and Variant Processing Demonstration
=========================================================
In this demo we will give you a taste of the capabilities of RTG with
a demonstration of simulated dataset generation and variant processing.
To start with we will use RTG simulation utilities to generate a
synthetic dataset from scratch:
* `genomesim` - simulate a reference genome
* `popsim` - simulate population variants
* `samplesim` - generate two founder individuals
* `childsim` - simulate offspring of the two founders
* `denovosim` - simulate de novo mutations in some of the offspring
* `readsim` - simulate next-gen sequencing of the individuals
We will also demonstrate RTG variant processing and other analysis with
the following commands:
* `mendelian` - check variants for Mendelian consistency
* `vcffilter` - VCF record filtering
* `vcfsubset` - Columnwise VCF alterations
* `vcfeval` - compare two VCF call sets for agreement
* `rocplot` - produce static or interactive ROC graphs
* `sdfstats` - output information about data stored in SDF
* `pedfilter` - convert pedigree information between PED and VCF
* `pedstats` - display summary pedigree information
Press enter to continue...
Genome Simulation
-----------------
First we simulate a reference genome by generating random DNA, in this
case 10 chromosomes with lengths between 40kb and 50kb. We will be
using fixed random number seeds during this demo in order to ensure we
have deterministic results. (We take reproducability seriously - so you
can be sure that you get repeatable results with RTG).
Press enter to continue...
...
[user@cn0868 ~]$ exit
exit
salloc.exe: Relinquishing job allocation 49998864
[user@biowulf ~]$
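The help text above lists four required arguments for vcfeval (-b, -c, -o, -t). As a minimal sketch of how they fit together, the function below assembles a vcfeval command from those options plus --vcf-score-field and --threads. The function name run_vcfeval, the DRY_RUN variable, and every file and directory name are hypothetical placeholders chosen for illustration; they do not come from RTG Tools or from this page. By default the function only prints the command, so it can be inspected before anything is run.

```shell
# Sketch only: run_vcfeval and DRY_RUN are illustrative names, not part of RTG Tools.
# Arguments: baseline VCF, calls VCF, reference SDF, output directory, [threads].
# The body runs in a subshell ( ... ) so its variables do not leak into the caller.
run_vcfeval() (
    baseline=$1
    calls=$2
    template=$3
    outdir=$4
    threads=${5:-2}   # assumed fallback of 2 threads when none is given

    # Rebuild the positional parameters as the full command line.
    # All flags used here appear in the `rtg vcfeval -h` output above.
    set -- rtg vcfeval \
        --baseline="$baseline" \
        --calls="$calls" \
        --template="$template" \
        --vcf-score-field=QUAL \
        --threads="$threads" \
        --output="$outdir"

    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"     # dry run (default): print the command only
    else
        "$@"          # DRY_RUN=0: actually execute rtg vcfeval
    fi
)

# Hypothetical file names; prints the command without running it.
run_vcfeval truth.vcf.gz calls.vcf.gz reference.sdf vcfeval_out 4
```

Setting DRY_RUN=0 before the call would execute the command inside an allocated session after module load rtg-tools. The sketch passes QUAL as the ROC score field only as an example; per the help text, the default is GQ, and multi-sample VCF files additionally require --sample.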