OncodriveCLUSTL is a sequence-based clustering algorithm to detect significant clustering signals across genomic regions. It is based based on a local background model derived from the simulation of mutations accounting for the composition of trior penta-nucleotide context substitutions observed in the cohort under study.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=75g -c20 --gres=lscratch:20 [user@cn3335 ~]$ module load oncodriveCLUSTL [+] Loading singularity 3.10.5 on cn4338 [+] Loading oncodriveCLUSTL 1.1.1 [user@cn3335 ~]$ oncodriveclustl -h Usage: oncodriveclustl [OPTIONS] OncodriveCLUSTL is a sequence based clustering method to identify cancer drivers across the genome Args: input_file (str): path to mutations file regions_file (str): path to input genomic coordinates file output_directory(str): path to output directory. Output files will be generated in it. input_signature (str): path to file containing input context based mutational probabilities. By default (when no input signatures), OncodriveCLUSTL will calculate them from the mutations input file. elements_file (str): path to file containing one element per row (optional) to analyzed the listed elements. By default, OncodriveCLUSTL analyzes all genomic elements contained in `regions_file`. elements (str): genomic element symbol (optional). The analysis will be performed only on the specified GEs. genome (str): genome to use: 'hg38', 'hg19', 'mm10', 'c3h', 'car', 'cast' and 'f344' element_mutations (int): minimum number of mutations per genomic element to undertake analysis cluster_mutations (int): minimum number of mutations to define a cluster smooth_window (int): Tukey kernel smoothing window length cluster_window (int): clustering window length kmer (int): context nucleotides to calculate the mutational probabilities (trinucleotides or pentanucleotides) n_simulations (int): number of simulations simulation_mode (str): simulation mode simulation_window (int): window length to simulate mutations signature_calculation (str): signature calculation, mutation frequencies (default) or mutation counts normalized by k-mer region counts signature_group (str): header of the column to group signatures. One signature will be computed for each group cores (int): number of CPUs to use seed (int): seed log_level (str): verbosity of the logger concatenate (bool): flag to calculate clustering on collapsed genomic regions (e.g., coding regions in a gene) clustplot (bool): flag to generate a needle plot with clusters for an element qqplot (bool): flat to generate a quantile-quantile (QQ) plot for a dataset gzip (bool): flag to generate GZIP compressed output files Returns: None Options: -i, --input-file PATH File containing somatic mutations [required] -r, --regions-file PATH File with the genomic regions to analyze [required] -o, --output-directory TEXT Output directory to be created [required] -sig, --input-signature PATH File containing input context based mutational probabilities (signature) -ef, --elements-file PATH File with the symbols of the elements to analyze -e, --elements TEXT Symbol of the element(s) to analyze -g, --genome [hg38|hg19|mm10|c3h|car|cast|f344] Genome to use -emut, --element-mutations INTEGER Cutoff of element mutations. Default is 2 -cmut, --cluster-mutations INTEGER Cutoff of cluster mutations. Default is 2 -sw, --smooth-window INTEGER RANGE Smoothing window. Default is 11 -cw, --cluster-window INTEGER RANGE Cluster window. Default is 11 -kmer, --kmer [3|5] K-mer nucleotide context -n, --n-simulations INTEGER number of simulations. Default is 1000 -sim, --simulation-mode [mutation_centered|region_restricted] Simulation mode -simw, --simulation-window INTEGER RANGE Simulation window. Default is 31 -sigcalc, --signature-calculation [frequencies|region_normalized] Signature calculation: mutation frequencies (default) or k-mer mutation counts normalized by k-mer region counts -siggroup, --signature-group [SIGNATURE|SAMPLE|CANCER_TYPE] Header of the column to group signatures calculation -c, --cores INTEGER RANGE Number of cores to use in the computation. By default it will use all the available cores. --seed INTEGER Seed to use in the simulations --log-level [debug|info|warning|error|critical] Verbosity of the logger --concatenate Calculate clustering on concatenated genomic regions (e.g., exons in coding sequences) --clustplot Generate a needle plot with clusters for an element --qqplot Generate a quantile-quantile (QQ) plot for a dataset --gzip Gzip compress files -h, --help Show this message and exit.Copy sample data to the current folder:
[user@cn3335 ~]$ cp -r $ODCLUSTL_DATA/* .Now let's run oncodriveCLUSTL on the sample data. According to the the oncodriveCLUSTL documentation, "The first time that you run OncodriveCLUSTL with a given reference genome, it will download it from our servers. By default the downloaded datasets go to ~/.bgdata. If you want to move these datasets to another folder you have to define the system environment variable BGDATA_LOCAL with an export command."
[user@cn3335 ~]$ oncodriveclustl -i PAAD.tsv.gz -r cds.hg19.regions.gz -o test_output 2023-02-02 08:32:50,073 [110140] INFO root: OncodriveCLUSTL 2023-02-02 08:32:50,073 [110140] INFO root: input_file: PAAD.tsv.gz regions_file: cds.hg19.regions.gz input_signature: None output_directory: test_output genome: hg19 element_mutations: 2 cluster_mutations: 2 concatenate: False smooth_window: 11 cluster_window: 11 k-mer: 3 simulation_mode: mutation_centered simulation_window: 31 n_simulations: 1000 signature_calculation: frequencies signature_group: None cores: 128 gzip: False seed: None 2023-02-02 08:32:50,075 [110140] INFO root: Initializing OncodriveCLUSTL... 2023-02-02 08:32:50,077 [110140] WARNING root: Running with default simulating, smoothing and clustering OncodriveCLUSTL parameters. Default parameters may not be optimal for your data. Please, read Supplementary Methods to perform model selection for your data. 2023-02-02 08:32:50,079 [110140] WARNING root: Signatures will be calculated as mutation frequencies: # mutated ref>alt k-mer counts / # total substitutions Please, read Supplementary Methods to perform a more accurate signatures calculation 2023-02-02 08:32:50,080 [110140] INFO root: Parsing genomic regions and mutations... 2023-02-02 08:33:01,448 [110140] INFO root: Regions parsed 2023-02-02 08:33:01,639 [110140] INFO root: Mutations parsed 2023-02-02 08:33:01,714 [110140] INFO root: Validated elements in genomic regions: 20169 2023-02-02 08:33:01,715 [110140] INFO root: Validated elements with mutations: 5183 2023-02-02 08:33:01,716 [110140] INFO root: Total substitution mutations: 7913 2023-02-02 08:33:01,717 [110140] INFO root: Computing signature... 2023-02-02 08:33:05,327 [110140] INFO root: Signature computed 2023-02-02 08:33:05,349 [110140] INFO root: Calculating results 1456 elements... 2023-02-02 08:33:05,352 [110140] INFO root: Iteration 1 of 15 simulations: 100%|█████████████████████████████████| 3/3 [00:14<00:00, 4.84s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:36<00:25, 1.04it/s] 2023-02-02 08:34:59,477 [110140] INFO root: Iteration 2 of 15 simulations: 100%|█████████████████████████████████| 7/7 [00:13<00:00, 1.95s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:30<00:24, 1.11it/s] 2023-02-02 08:36:45,989 [110140] INFO root: Iteration 3 of 15 simulations: 100%|█████████████████████████████████| 5/5 [00:22<00:00, 4.46s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:17<00:20, 1.30it/s] 2023-02-02 08:38:29,579 [110140] INFO root: Iteration 4 of 15 simulations: 100%|█████████████████████████████████| 5/5 [00:18<00:00, 3.70s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:34<00:25, 1.07it/s] 2023-02-02 08:40:25,999 [110140] INFO root: Iteration 5 of 15 simulations: 100%|█████████████████████████████████| 7/7 [00:30<00:00, 4.41s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:38<00:26, 1.02it/s] 2023-02-02 08:42:39,555 [110140] INFO root: Iteration 6 of 15 simulations: 100%|███████████████████████████████| 12/12 [00:30<00:00, 2.54s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:16<00:20, 1.32it/s] 2023-02-02 08:44:30,511 [110140] INFO root: Iteration 7 of 15 simulations: 100%|█████████████████████████████████| 7/7 [00:15<00:00, 2.23s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:25<00:22, 1.19it/s] 2023-02-02 08:46:15,403 [110140] INFO root: Iteration 8 of 15 simulations: 100%|███████████████████████████████| 11/11 [00:17<00:00, 1.59s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:29<00:23, 1.13it/s] 2023-02-02 08:48:07,030 [110140] INFO root: Iteration 9 of 15 simulations: 100%|███████████████████████████████| 13/13 [00:51<00:00, 3.95s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:34<00:25, 1.07it/s] 2023-02-02 08:50:36,900 [110140] INFO root: Iteration 10 of 15 simulations: 100%|█████████████████████████████████| 9/9 [00:13<00:00, 1.53s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:41<00:27, 1.00s/it] 2023-02-02 08:52:35,327 [110140] INFO root: Iteration 11 of 15 simulations: 100%|█████████████████████████████████| 5/5 [00:14<00:00, 2.96s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:28<00:23, 1.14it/s] 2023-02-02 08:54:22,365 [110140] INFO root: Iteration 12 of 15 simulations: 100%|███████████████████████████████| 10/10 [00:20<00:00, 2.09s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:37<00:25, 1.04it/s] 2023-02-02 08:56:24,457 [110140] INFO root: Iteration 13 of 15 simulations: 100%|█████████████████████████████████| 7/7 [00:38<00:00, 5.56s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:28<00:23, 1.14it/s] 2023-02-02 08:58:35,490 [110140] INFO root: Iteration 14 of 15 simulations: 100%|███████████████████████████████| 10/10 [00:18<00:00, 1.85s/it] post processing: 79%|██████████████████████▉ | 101/128 [01:33<00:24, 1.09it/s] 2023-02-02 09:00:31,237 [110140] INFO root: Iteration 15 of 15 simulations: 100%|█████████████████████████████████| 1/1 [00:05<00:00, 5.99s/it] post processing: 45%|█████████████▎ | 57/128 [00:56<01:10, 1.01it/s] 2023-02-02 09:01:40,325 [110140] INFO root: Elements results calculated 2023-02-02 09:01:40,381 [110140] INFO root: Clusters results calculated 2023-02-02 09:01:40,383 [110140] INFO root: Finished