Panaroo: An updated pipeline for pangenome investigation
Allocate an interactive session and run the program.
Sample session (user input follows each $ prompt):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load panaroo
[+] Loading panaroo 1.4.2 on cn3144
[+] Loading singularity 4.0.1 on cn3144

[user@cn3144 ~]$ panaroo -h
usage: panaroo [-h] -i INPUT_FILES [INPUT_FILES ...] -o OUTPUT_DIR
               --clean-mode {strict,moderate,sensitive}
               [--remove-invalid-genes] [-c ID] [-f FAMILY_THRESHOLD]
               [--len_dif_percent LEN_DIF_PERCENT] [--merge_paralogs]
               [--search_radius SEARCH_RADIUS]
               [--refind_prop_match REFIND_PROP_MATCH] [--refind_strict]
               [--min_trailing_support MIN_TRAILING_SUPPORT]
               [--trailing_recursive TRAILING_RECURSIVE]
               [--edge_support_threshold EDGE_SUPPORT_THRESHOLD]
               [--length_outlier_support_proportion LENGTH_OUTLIER_SUPPORT_PROPORTION]
               [--remove_by_consensus {True,False}]
               [--high_var_flag CYCLE_THRESHOLD_MIN]
               [--min_edge_support_sv MIN_EDGE_SUPPORT_SV]
               [--all_seq_in_graph] [--no_clean_edges] [-a {core,pan}]
               [--aligner {prank,clustal,mafft}] [--codons]
               [--core_threshold CORE] [--core_subset SUBSET]
               [--core_entropy_filter HC_THRESHOLD] [-t N_CPU]
               [--codon-table TABLE] [--quiet] [--version]

panaroo: an updated pipeline for pangenome investigation

options:
  -h, --help            show this help message and exit
  -t N_CPU, --threads N_CPU
                        number of threads to use (default=1)
  --codon-table TABLE   the codon table to use for translation (default=11)
  --quiet               suppress additional output
  --version             show program's version number and exit

Input/output:
  -i INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        input GFF3 files (usually output from running
                        Prokka). Can also take a file listing each gff file
                        line by line.
  -o OUTPUT_DIR, --out_dir OUTPUT_DIR
                        location of an output directory

Mode:
  --clean-mode {strict,moderate,sensitive}
                        The stringency mode at which to run panaroo. Must be
                        one of 'strict','moderate' or 'sensitive'. Each of
                        these modes can be fine tuned using the additional
                        parameters in the 'Graph correction' section.

                        strict:
                        Requires fairly strong evidence (present in at least
                        5% of genomes) to keep likely contaminant genes. Will
                        remove genes that are refound more often than they
                        were called originally.

                        moderate:
                        Requires moderate evidence (present in at least 1% of
                        genomes) to keep likely contaminant genes. Keeps genes
                        that are refound more often than they were called
                        originally.

                        sensitive:
                        Does not delete any genes and only performes merge and
                        refinding operations. Useful if rare plasmids are of
                        interest as these are often hard to disguish from
                        contamination. Results will likely include higher
                        number of spurious annotations.
  --remove-invalid-genes
                        removes annotations that do not conform to the
                        expected Prokka format such as those including
                        premature stop codons.

Matching:
  -c ID, --threshold ID
                        sequence identity threshold (default=0.98)
  -f FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD
                        protein family sequence identity threshold
                        (default=0.7)
  --len_dif_percent LEN_DIF_PERCENT
                        length difference cutoff (default=0.98)
  --merge_paralogs      don't split paralogs

Refind:
  --search_radius SEARCH_RADIUS
                        the distance in nucleotides surronding the neighbour
                        of an accessory gene in which to search for it
  --refind_prop_match REFIND_PROP_MATCH
                        the proportion of an accessory gene that must be found
                        in order to consider it a match
  --refind_strict       Prevent fragmented, misassembled, or potential
                        pseudogene sequences from being re-found.

Graph correction:
  --min_trailing_support MIN_TRAILING_SUPPORT
                        minimum cluster size to keep a gene called at the end
                        of a contig
  --trailing_recursive TRAILING_RECURSIVE
                        number of times to perform recursive trimming of low
                        support nodes near the end of contigs
  --edge_support_threshold EDGE_SUPPORT_THRESHOLD
                        minimum support required to keep an edge that has been
                        flagged as a possible mis-assembly
  --length_outlier_support_proportion LENGTH_OUTLIER_SUPPORT_PROPORTION
                        proportion of genomes supporting a gene with a length
                        more than 1.5x outside the interquatile range for
                        genes in the same cluster (default=0.01). Genes
                        failing this test will be re-annotated at the shorter
                        length
  --remove_by_consensus {True,False}
                        if a gene is called in the same region with similar
                        sequence a minority of the time, remove it. One of
                        'True' or 'False'
  --high_var_flag CYCLE_THRESHOLD_MIN
                        minimum number of nested cycles to call a highly
                        variable gene region (default = 5).
  --min_edge_support_sv MIN_EDGE_SUPPORT_SV
                        minimum edge support required to call structural
                        variants in the presence/absence sv file
  --all_seq_in_graph    Retains all DNA sequence for each gene cluster in the
                        graph output. Off by default as it uses a large amount
                        of space.
  --no_clean_edges      Turn off edge filtering in the final output graph.

Gene alignment:
  -a {core,pan}, --alignment {core,pan}
                        Output alignments of core genes or all genes. Options
                        are 'core' and 'pan'. Default: 'None'
  --aligner {prank,clustal,mafft}
                        Specify an aligner. Options:'prank', 'clustal', and
                        default: 'mafft'
  --codons              Generate codon alignments by aligning sequences at the
                        protein level
  --core_threshold CORE
                        Core-genome sample threshold (default=0.95)
  --core_subset SUBSET  Randomly subset the core genome to these many genes
                        (default=all)
  --core_entropy_filter HC_THRESHOLD
                        Manually set the Block Mapping and Gathering with
                        Entropy (BMGE) filter. Can be between 0.0 and 1.0. By
                        default this is set using the Tukey outlier method.
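The -i help text notes that panaroo can also take a file listing each GFF file line by line, which is convenient when there are many genomes. The following is a minimal sketch of building such a list. The directory, the sample file names, and the list name gff_list.txt are hypothetical; empty placeholder files stand in for real Prokka output so that the snippet runs on its own, and the panaroo call itself is left as a comment.

```shell
#!/bin/bash
set -e

# Hypothetical demo directory with empty placeholder files standing in for
# real Prokka GFF3 output; replace with your own annotation directory.
gff_dir=$(mktemp -d)
touch "$gff_dir/sampleA.gff" "$gff_dir/sampleB.gff" "$gff_dir/sampleC.gff"

# Write one GFF path per line, sorted for a reproducible order.
list_file="$gff_dir/gff_list.txt"
find "$gff_dir" -maxdepth 1 -name '*.gff' | sort > "$list_file"

n_files=$(wc -l < "$list_file" | tr -d ' ')
echo "Listed $n_files GFF files in $list_file"

# On a compute node with the module loaded, the list replaces the file names:
# panaroo -i "$list_file" -o results --clean-mode strict
```

Passing a list file also avoids very long command lines when the number of input genomes is large.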
Create a batch input file (e.g. panaroo.sh). For example:
#!/bin/bash
set -e
module load panaroo
panaroo -i input.gff -o results --clean-mode sensitive
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] panaroo.sh
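According to the help output, panaroo uses a single thread unless -t is given, so CPUs requested with --cpus-per-task are only used if the script passes the count on. The following variant of the batch script is a sketch of one way to do that. It assumes Slurm's SLURM_CPUS_PER_TASK environment variable, which is set inside the job when --cpus-per-task is requested; the fallback to 1 thread and the dry-run branch for machines without panaroo are illustrative additions, not part of the original script.

```shell
#!/bin/bash
set -e

# Load the module when the module command exists (as on the cluster).
if command -v module >/dev/null 2>&1; then
    module load panaroo
fi

# Use the CPU count granted by Slurm; fall back to panaroo's default of 1.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Store the command as positional parameters so it can be run or printed.
set -- panaroo -i input.gff -o results --clean-mode sensitive -t "$threads"

if command -v panaroo >/dev/null 2>&1; then
    "$@"
else
    # Dry run: panaroo is not available here, so only print the command.
    echo "Would run: $*"
fi
```

Submitted with, for example, sbatch --cpus-per-task=8 panaroo.sh, this would run panaroo with -t 8.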
Create a swarmfile (e.g. panaroo.swarm). For example:
panaroo -i *.gff -o results --clean-mode strict
panaroo -i *.gff -o results --clean-mode strict
panaroo -i *.gff -o results --clean-mode strict
panaroo -i *.gff -o results --clean-mode strict
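Each line of a swarmfile runs as an independent process, so in practice the lines would normally differ in their inputs and output directories. The following sketch writes one line per dataset. It assumes a hypothetical layout with one subdirectory of GFF files per dataset (speciesA and speciesB are made-up names, created here as empty placeholders so the snippet runs on its own).

```shell
#!/bin/bash
set -e

# Hypothetical layout: one subdirectory of GFF files per dataset.
# Empty placeholder directories are created so the sketch runs standalone.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/datasets/speciesA" "$work_dir/datasets/speciesB"

swarm_file="$work_dir/panaroo.swarm"
: > "$swarm_file"

# Emit one panaroo command per dataset, each with its own output directory.
# The glob stays literal inside the quotes; it is expanded when swarm runs it.
for dir in "$work_dir"/datasets/*/; do
    name=$(basename "$dir")
    echo "panaroo -i ${dir}*.gff -o results_${name} --clean-mode strict" >> "$swarm_file"
done

cat "$swarm_file"
n_lines=$(wc -l < "$swarm_file" | tr -d ' ')
```

Giving each line its own output directory keeps the subjobs from overwriting one another's results.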
Submit this job using the swarm command.
swarm -f panaroo.swarm [-g #] [-t #] --module panaroo
where
-g #             | Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t #             | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module panaroo | Loads the panaroo module for each subjob in the swarm