SynthDNM: a random-forest based classifier for robust de novo prediction of SNPs and indels.

SynthDNM is a random-forest based classifier that can be readily adapted to new sequencing or variant-calling pipelines by applying a flexible approach to constructing simulated training examples from real data. The optimized SynthDNM classifiers predict de novo SNPs and indels with robust accuracy across multiple methods of variant calling.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session:

[user@biowulf ~]$ sinteractive --mem=8g -c4 --gres=lscratch:20
[user@cn2379 ~]$ module load synthdnm
[+] Loading singularity  3.8.4  on cn2379
[+] Loading synthdnm  1.1.50
Basic usage:
[user@cn2379 ~]$ -h
usage: [-h] [--vcf_file VCF_FILE] --ped_file PED_FILE
                       [--region REGION] [--features_file FEATURES_FILE]
                       [--output_folder OUTPUT_FOLDER]
                       [--training_set_tsv TRAINING_SET_TSV]
                       {classify,make_training_set,train,grid_search} ...

SynthDNM: a de novo mutation classifier and training paradigm

positional arguments:
                        Available sub-commands
    classify            Classify DNMs using pre-trained classifiers.
    make_training_set   Make training set.
    train               Train classifiers
    grid_search         Randomized grid search across hyperparameters.

optional arguments:
  -h, --help            show this help message and exit
  --vcf_file VCF_FILE   VCF file input
  --ped_file PED_FILE   Pedigree file (.fam/.ped/.psam) input
  --region REGION       Interval ('{}' or '{}:{}-{}' in format of chr or
                        chr:start-end) on which to run training or
  --features_file FEATURES_FILE
                        Features file input
  --output_folder OUTPUT_FOLDER
                        Output folder for output files (if not used, then
                        output folder is set to 'synthdnm_output')
  --training_set_tsv TRAINING_SET_TSV
                        Training set file (created using make_training_set
[user@cn2379 ~]$ classify -h
usage: classify [-h] --clf_folder CLF_FOLDER

optional arguments:
  -h, --help            show this help message and exit
  --clf_folder CLF_FOLDER
                        Folder that contains the classifiers, which must be in
                        .pkl format (if not specified, will look for them in
                        the default data folder)
                        Only output the features file (without classifying
End the interactive session:
[user@cn2379 ~]$ exit
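
The help text above describes the --region interval as either a chromosome name ('chr') or 'chr:start-end'. As a minimal sketch, a region string could be checked against those two shapes before it is passed to --region. The helper name is_valid_region is hypothetical and not part of SynthDNM; it checks only the shape of the string, not whether the chromosome exists in the VCF.

```shell
# Hypothetical helper (not part of SynthDNM): succeeds if the argument
# looks like 'chr' or 'chr:start-end', the two forms named in the help text.
is_valid_region() {
    # One or more non-colon, non-space characters, optionally followed
    # by ':<digits>-<digits>'.
    printf '%s\n' "$1" | grep -Eq '^[^:[:space:]]+(:[0-9]+-[0-9]+)?$'
}
```

For example, is_valid_region chr1:100-200 succeeds, while is_valid_region chr1:100 fails because the end coordinate is missing.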

Batch job
Most jobs should be run as batch jobs.

Create a batch input file. For example:

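#!/bin/bash
# Stop at the first failing command.
set -e
module load synthdnm
# Copy the example data from $SDNM_DATA into the current directory.
cp "$SDNM_DATA"/* .
synthdnm -v tutorial.vcf -f tutorial.ped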

Submit this job using the Slurm sbatch command.
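
The submission step might look like the sketch below. The function submit_job is a hypothetical helper, not part of Slurm or SynthDNM. The resource values (4 CPUs, 8 GB of memory, 20 GB of local scratch) are assumptions copied from the interactive example and should be adjusted to the job. Where sbatch is not installed, the helper only prints the command it would have run.

```shell
# Hypothetical helper: check a batch script, then submit it with sbatch.
submit_job() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "batch script not found: $script" >&2
        return 1
    fi
    # bash -n parses the script for syntax errors without executing it.
    bash -n "$script" || return 1
    if command -v sbatch >/dev/null 2>&1; then
        # Assumed resource request, mirroring the interactive session.
        sbatch --cpus-per-task=4 --mem=8g --gres=lscratch:20 "$script"
    else
        # Dry run on machines without Slurm.
        echo "dry run: sbatch --cpus-per-task=4 --mem=8g --gres=lscratch:20 $script"
    fi
}
```

With the batch file saved under a name of your choice, call the helper with that file name as its only argument.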
