OncodriveCLUSTL: a sequence-based clustering method to identify cancer drivers

OncodriveCLUSTL is a sequence-based clustering algorithm that detects significant clustering signals across genomic regions. It is based on a local background model derived from simulated mutations, accounting for the composition of tri- or penta-nucleotide context substitutions observed in the cohort under study.


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=75g -c20 --gres=lscratch:20
[user@cn3335 ~]$ module load oncodriveCLUSTL
[+] Loading singularity  3.10.5  on cn4338
[+] Loading oncodriveCLUSTL  1.1.1
[user@cn3335 ~]$ oncodriveclustl -h
Usage: oncodriveclustl [OPTIONS]

  OncodriveCLUSTL is a sequence based clustering method to identify cancer
  drivers across the genome

  Args:     input_file (str): path to mutations file     regions_file (str):
  path to input genomic coordinates file     output_directory(str): path to
  output directory. Output files will be generated in it.
  input_signature (str): path to file containing input context based
  mutational probabilities.         By default (when no input signatures),
  OncodriveCLUSTL will calculate them from the mutations input file.
  elements_file (str): path to file containing one element per row
  (optional) to analyzed the listed elements.         By default,
  OncodriveCLUSTL analyzes all genomic elements contained in `regions_file`.
  elements (str): genomic element symbol (optional). The analysis will be
  performed only on the specified GEs.     genome (str): genome to use:
  'hg38', 'hg19', 'mm10', 'c3h', 'car', 'cast' and 'f344'
  element_mutations (int): minimum number of mutations per genomic element
  to undertake analysis     cluster_mutations (int): minimum number of
  mutations to define a cluster     smooth_window (int): Tukey kernel
  smoothing window length     cluster_window (int): clustering window length
  kmer (int): context nucleotides to calculate the mutational probabilities
  (trinucleotides or pentanucleotides)     n_simulations (int): number of
  simulations     simulation_mode (str): simulation mode
  simulation_window (int): window length to simulate mutations
  signature_calculation (str): signature calculation, mutation frequencies
  (default) or mutation counts         normalized by k-mer region counts
  signature_group (str): header of the column to group signatures. One
  signature will be computed for each group     cores (int): number of CPUs
  to use     seed (int): seed     log_level (str): verbosity of the logger
  concatenate (bool): flag to calculate clustering on collapsed genomic
  regions (e.g., coding regions in a gene)     clustplot (bool): flag to
  generate a needle plot with clusters for an element     qqplot (bool):
  flat to generate a quantile-quantile (QQ) plot for a dataset     gzip
  (bool): flag to generate GZIP compressed output files

  Returns:     None

Options:
  -i, --input-file PATH           File containing somatic mutations
                                  [required]
  -r, --regions-file PATH         File with the genomic regions to analyze
                                  [required]
  -o, --output-directory TEXT     Output directory to be created  [required]
  -sig, --input-signature PATH    File containing input context based
                                  mutational probabilities (signature)
  -ef, --elements-file PATH       File with the symbols of the elements to
                                  analyze
  -e, --elements TEXT             Symbol of the element(s) to analyze
  -g, --genome [hg38|hg19|mm10|c3h|car|cast|f344]
                                  Genome to use
  -emut, --element-mutations INTEGER
                                  Cutoff of element mutations. Default is 2
  -cmut, --cluster-mutations INTEGER
                                  Cutoff of cluster mutations. Default is 2
  -sw, --smooth-window INTEGER RANGE
                                  Smoothing window. Default is 11
  -cw, --cluster-window INTEGER RANGE
                                  Cluster window. Default is 11
  -kmer, --kmer [3|5]             K-mer nucleotide context
  -n, --n-simulations INTEGER     number of simulations. Default is 1000
  -sim, --simulation-mode [mutation_centered|region_restricted]
                                  Simulation mode
  -simw, --simulation-window INTEGER RANGE
                                  Simulation window. Default is 31
  -sigcalc, --signature-calculation [frequencies|region_normalized]
                                  Signature calculation: mutation frequencies
                                  (default) or k-mer mutation counts
                                  normalized by k-mer region counts
  -siggroup, --signature-group [SIGNATURE|SAMPLE|CANCER_TYPE]
                                  Header of the column to group signatures
                                  calculation
  -c, --cores INTEGER RANGE       Number of cores to use in the computation.
                                  By default it will use all the available
                                  cores.
  --seed INTEGER                  Seed to use in the simulations
  --log-level [debug|info|warning|error|critical]
                                  Verbosity of the logger
  --concatenate                   Calculate clustering on concatenated genomic
                                  regions (e.g., exons in coding sequences)
  --clustplot                     Generate a needle plot with clusters for an
                                  element
  --qqplot                        Generate a quantile-quantile (QQ) plot for a
                                  dataset
  --gzip                          Gzip compress files
  -h, --help                      Show this message and exit.
Copy sample data to the current folder:
[user@cn3335 ~]$ cp -r $ODCLUSTL_DATA/* .
Now let's run OncodriveCLUSTL on the sample data. According to the OncodriveCLUSTL documentation, "The first time that you run OncodriveCLUSTL with a given reference genome, it will download it from our servers. By default the downloaded datasets go to ~/.bgdata. If you want to move these datasets to another folder you have to define the system environment variable BGDATA_LOCAL with an export command."
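If the default ~/.bgdata location is not suitable (for example, because of a home directory quota), the variable can be set before the first run. The following is a minimal sketch, assuming a Bourne-style shell; the directory name bgdata_datasets is only a placeholder chosen for illustration, not a site default, and any writable location with enough space could be used instead:

```shell
# Placeholder location (an assumption for illustration); keeps any value
# of BGDATA_LOCAL that is already set in the environment.
BGDATA_LOCAL="${BGDATA_LOCAL:-$HOME/bgdata_datasets}"
export BGDATA_LOCAL

# Create the directory so the first download has somewhere to go.
mkdir -p "$BGDATA_LOCAL"
echo "Reference datasets will be stored in: $BGDATA_LOCAL"
```

The export only affects the current shell session, so it would need to be repeated (or added to a startup file) in later sessions. The run itself then proceeds as usual: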
[user@cn3335 ~]$ oncodriveclustl -i PAAD.tsv.gz -r cds.hg19.regions.gz -o test_output
2023-02-02 08:32:50,073 [110140] INFO     root: OncodriveCLUSTL
2023-02-02 08:32:50,073 [110140] INFO     root:
input_file: PAAD.tsv.gz
regions_file: cds.hg19.regions.gz
input_signature: None
output_directory: test_output
genome: hg19
element_mutations: 2
cluster_mutations: 2
concatenate: False
smooth_window: 11
cluster_window: 11
k-mer: 3
simulation_mode: mutation_centered
simulation_window: 31
n_simulations: 1000
signature_calculation: frequencies
signature_group: None
cores: 128
gzip: False
seed: None
2023-02-02 08:32:50,075 [110140] INFO     root: Initializing OncodriveCLUSTL...
2023-02-02 08:32:50,077 [110140] WARNING  root:
Running with default simulating, smoothing and clustering OncodriveCLUSTL parameters. Default parameters may not be optimal for your data.
Please, read Supplementary Methods to perform model selection for your data.
2023-02-02 08:32:50,079 [110140] WARNING  root:
Signatures will be calculated as mutation frequencies: # mutated ref>alt k-mer counts / # total substitutions
Please, read Supplementary Methods to perform a more accurate signatures calculation
2023-02-02 08:32:50,080 [110140] INFO     root: Parsing genomic regions and mutations...
2023-02-02 08:33:01,448 [110140] INFO     root: Regions parsed
2023-02-02 08:33:01,639 [110140] INFO     root: Mutations parsed
2023-02-02 08:33:01,714 [110140] INFO     root: Validated elements in genomic regions: 20169
2023-02-02 08:33:01,715 [110140] INFO     root: Validated elements with mutations: 5183
2023-02-02 08:33:01,716 [110140] INFO     root: Total substitution mutations: 7913
2023-02-02 08:33:01,717 [110140] INFO     root: Computing signature...
2023-02-02 08:33:05,327 [110140] INFO     root: Signature computed
2023-02-02 08:33:05,349 [110140] INFO     root: Calculating results 1456 elements...
2023-02-02 08:33:05,352 [110140] INFO     root: Iteration 1 of 15
                   simulations: 100%|█████████████████████████████████| 3/3 [00:14<00:00,  4.84s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:36<00:25,  1.04it/s]
2023-02-02 08:34:59,477 [110140] INFO     root: Iteration 2 of 15
                   simulations: 100%|█████████████████████████████████| 7/7 [00:13<00:00,  1.95s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:30<00:24,  1.11it/s]
2023-02-02 08:36:45,989 [110140] INFO     root: Iteration 3 of 15
                   simulations: 100%|█████████████████████████████████| 5/5 [00:22<00:00,  4.46s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:17<00:20,  1.30it/s]
2023-02-02 08:38:29,579 [110140] INFO     root: Iteration 4 of 15
                   simulations: 100%|█████████████████████████████████| 5/5 [00:18<00:00,  3.70s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:34<00:25,  1.07it/s]
2023-02-02 08:40:25,999 [110140] INFO     root: Iteration 5 of 15
                   simulations: 100%|█████████████████████████████████| 7/7 [00:30<00:00,  4.41s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:38<00:26,  1.02it/s]
2023-02-02 08:42:39,555 [110140] INFO     root: Iteration 6 of 15
                   simulations: 100%|███████████████████████████████| 12/12 [00:30<00:00,  2.54s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:16<00:20,  1.32it/s]
2023-02-02 08:44:30,511 [110140] INFO     root: Iteration 7 of 15
                   simulations: 100%|█████████████████████████████████| 7/7 [00:15<00:00,  2.23s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:25<00:22,  1.19it/s]
2023-02-02 08:46:15,403 [110140] INFO     root: Iteration 8 of 15
                   simulations: 100%|███████████████████████████████| 11/11 [00:17<00:00,  1.59s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:29<00:23,  1.13it/s]
2023-02-02 08:48:07,030 [110140] INFO     root: Iteration 9 of 15
                   simulations: 100%|███████████████████████████████| 13/13 [00:51<00:00,  3.95s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:34<00:25,  1.07it/s]
2023-02-02 08:50:36,900 [110140] INFO     root: Iteration 10 of 15
                   simulations: 100%|█████████████████████████████████| 9/9 [00:13<00:00,  1.53s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:41<00:27,  1.00s/it]
2023-02-02 08:52:35,327 [110140] INFO     root: Iteration 11 of 15
                   simulations: 100%|█████████████████████████████████| 5/5 [00:14<00:00,  2.96s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:28<00:23,  1.14it/s]
2023-02-02 08:54:22,365 [110140] INFO     root: Iteration 12 of 15
                   simulations: 100%|███████████████████████████████| 10/10 [00:20<00:00,  2.09s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:37<00:25,  1.04it/s]
2023-02-02 08:56:24,457 [110140] INFO     root: Iteration 13 of 15
                   simulations: 100%|█████████████████████████████████| 7/7 [00:38<00:00,  5.56s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:28<00:23,  1.14it/s]
2023-02-02 08:58:35,490 [110140] INFO     root: Iteration 14 of 15
                   simulations: 100%|███████████████████████████████| 10/10 [00:18<00:00,  1.85s/it]
               post processing:  79%|██████████████████████▉      | 101/128 [01:33<00:24,  1.09it/s]
2023-02-02 09:00:31,237 [110140] INFO     root: Iteration 15 of 15
                   simulations: 100%|█████████████████████████████████| 1/1 [00:05<00:00,  5.99s/it]
               post processing:  45%|█████████████▎                | 57/128 [00:56<01:10,  1.01it/s]
2023-02-02 09:01:40,325 [110140] INFO     root: Elements results calculated
2023-02-02 09:01:40,381 [110140] INFO     root: Clusters results calculated
2023-02-02 09:01:40,383 [110140] INFO     root: Finished
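Note that the log above reports "cores: 128" and "seed: None", although the interactive session was allocated with -c20. This is consistent with the help text, which states that all available cores are used by default. A hedged sketch of a run that limits the core count to the allocation and fixes the seed is shown below. It assumes that Slurm sets SLURM_CPUS_PER_TASK inside the session, and the seed value 123 is arbitrary; only the -c and --seed options listed in the help output are used:

```shell
# Use the CPU count granted by Slurm; fall back to 1 outside an allocation.
cores="${SLURM_CPUS_PER_TASK:-1}"

# Arbitrary illustrative seed, so that repeated runs should draw the same simulations.
seed=123

cmd="oncodriveclustl -i PAAD.tsv.gz -r cds.hg19.regions.gz -o test_output -c $cores --seed $seed"
echo "$cmd"

# Run only when the tool is on PATH (i.e. after 'module load oncodriveCLUSTL').
if command -v oncodriveclustl >/dev/null 2>&1; then
    $cmd
else
    echo "oncodriveclustl not found; load the module first" >&2
fi
```

Matching -c to the allocated CPUs avoids starting more worker processes than the job was granted, and a fixed seed makes it easier to compare runs made with different window or simulation settings.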