Panaroo on Biowulf

Panaroo: An updated pipeline for pangenome investigation

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load panaroo
[+] Loading panaroo  1.4.2  on cn3144
[+] Loading singularity  4.0.1  on cn3144
[user@cn3144 ~]$ panaroo -h
usage: panaroo [-h] -i INPUT_FILES [INPUT_FILES ...] -o OUTPUT_DIR --clean-mode
               {strict,moderate,sensitive} [--remove-invalid-genes] [-c ID]
               [-f FAMILY_THRESHOLD] [--len_dif_percent LEN_DIF_PERCENT]
               [--merge_paralogs] [--search_radius SEARCH_RADIUS]
               [--refind_prop_match REFIND_PROP_MATCH] [--refind_strict]
               [--min_trailing_support MIN_TRAILING_SUPPORT]
               [--trailing_recursive TRAILING_RECURSIVE]
               [--edge_support_threshold EDGE_SUPPORT_THRESHOLD]
               [--length_outlier_support_proportion LENGTH_OUTLIER_SUPPORT_PROPORTION]
               [--remove_by_consensus {True,False}] [--high_var_flag CYCLE_THRESHOLD_MIN]
               [--min_edge_support_sv MIN_EDGE_SUPPORT_SV] [--all_seq_in_graph]
               [--no_clean_edges] [-a {core,pan}] [--aligner {prank,clustal,mafft}]
               [--codons] [--core_threshold CORE] [--core_subset SUBSET]
               [--core_entropy_filter HC_THRESHOLD] [-t N_CPU] [--codon-table TABLE]
               [--quiet] [--version]

panaroo: an updated pipeline for pangenome investigation

options:
  -h, --help            show this help message and exit
  -t N_CPU, --threads N_CPU
                        number of threads to use (default=1)
  --codon-table TABLE   the codon table to use for translation (default=11)
  --quiet               suppress additional output
  --version             show program's version number and exit

Input/output:
  -i INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        input GFF3 files (usually output from running Prokka). Can also
                        take a file listing each gff file line by line.
  -o OUTPUT_DIR, --out_dir OUTPUT_DIR
                        location of an output directory

Mode:
  --clean-mode {strict,moderate,sensitive}
                        The stringency mode at which to run panaroo. Must be
                        one of 'strict','moderate' or 'sensitive'. Each of
                        these modes can be fine tuned using the additional
                        parameters in the 'Graph correction' section.

                        strict:
                        Requires fairly strong evidence (present in  at least
                        5% of genomes) to keep likely contaminant genes. Will
                        remove genes that are refound more often than they were
                        called originally.

                        moderate:
                        Requires moderate evidence (present in  at least 1% of
                        genomes) to keep likely contaminant genes. Keeps genes
                        that are refound more often than they were called
                        originally.

                        sensitive:
                        Does not delete any genes and only performes merge and
                        refinding operations. Useful if rare plasmids are of
                        interest as these are often hard to disguish from
                        contamination. Results will likely include  higher
                        number of spurious annotations.
  --remove-invalid-genes
                        removes annotations that do not conform to the expected Prokka
                        format such as those including premature stop codons.

Matching:
  -c ID, --threshold ID
                        sequence identity threshold (default=0.98)
  -f FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD
                        protein family sequence identity threshold (default=0.7)
  --len_dif_percent LEN_DIF_PERCENT
                        length difference cutoff (default=0.98)
  --merge_paralogs      don't split paralogs

Refind:
  --search_radius SEARCH_RADIUS
                        the distance in nucleotides surronding the neighbour of an
                        accessory gene in which to search for it
  --refind_prop_match REFIND_PROP_MATCH
                        the proportion of an accessory gene that must be found in order
                        to consider it a match
  --refind_strict       Prevent fragmented, misassembled, or potential pseudogene
                        sequences from being re-found.

Graph correction:
  --min_trailing_support MIN_TRAILING_SUPPORT
                        minimum cluster size to keep a gene called at the end of a contig
  --trailing_recursive TRAILING_RECURSIVE
                        number of times to perform recursive trimming of low support
                        nodes near the end of contigs
  --edge_support_threshold EDGE_SUPPORT_THRESHOLD
                        minimum support required to keep an edge that has been flagged as
                        a possible mis-assembly
  --length_outlier_support_proportion LENGTH_OUTLIER_SUPPORT_PROPORTION
                        proportion of genomes supporting a gene with a length more than
                        1.5x outside the interquatile range for genes in the same cluster
                        (default=0.01). Genes failing this test will be re-annotated at
                        the shorter length
  --remove_by_consensus {True,False}
                        if a gene is called in the same region with similar sequence a
                        minority of the time, remove it. One of 'True' or 'False'
  --high_var_flag CYCLE_THRESHOLD_MIN
                        minimum number of nested cycles to call a highly variable gene
                        region (default = 5).
  --min_edge_support_sv MIN_EDGE_SUPPORT_SV
                        minimum edge support required to call structural variants in the
                        presence/absence sv file
  --all_seq_in_graph    Retains all DNA sequence for each gene cluster in the graph
                        output. Off by default as it uses a large amount of space.
  --no_clean_edges      Turn off edge filtering in the final output graph.

Gene alignment:
  -a {core,pan}, --alignment {core,pan}
                        Output alignments of core genes or all genes. Options are 'core'
                        and 'pan'. Default: 'None'
  --aligner {prank,clustal,mafft}
                        Specify an aligner. Options:'prank', 'clustal', and default:
                        'mafft'
  --codons              Generate codon alignments by aligning sequences at the protein
                        level
  --core_threshold CORE
                        Core-genome sample threshold (default=0.95)
  --core_subset SUBSET  Randomly subset the core genome to these many genes (default=all)
  --core_entropy_filter HC_THRESHOLD
                        Manually set the Block Mapping and Gathering with Entropy (BMGE)
                        filter. Can be between 0.0 and 1.0. By default this is set using
                        the Tukey outlier method.


Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. panaroo.sh). For example:

#!/bin/bash
set -e
module load panaroo
panaroo -i input.gff -o results --clean-mode sensitive

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] panaroo.sh
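
Because panaroo defaults to a single thread, a batch script normally needs to pass the allocated CPU count through -t for the --cpus-per-task request to have any effect. The sketch below writes such a script. The script name panaroo_threads.sh, the gff input directory, and the resource values in the commented sbatch line are assumptions to adjust, not recommendations.

```shell
# Write a batch script (file and directory names are hypothetical).
cat > panaroo_threads.sh <<'EOF'
#!/bin/bash
set -e
module load panaroo
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# fall back to 1 thread if it is not set.
panaroo -i gff/*.gff -o results --clean-mode strict \
    -t "${SLURM_CPUS_PER_TASK:-1}"
EOF

# Submit with matching resources, for example (values are placeholders):
#   sbatch --cpus-per-task=8 --mem=16g panaroo_threads.sh
```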

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

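Create a swarmfile (e.g. panaroo.swarm) with one command per line. Each command should read its own set of GFF files and write to its own output directory, otherwise the subjobs overwrite one another. For example (directory names are placeholders):

panaroo -i set1/*.gff -o results_set1 --clean-mode strict
panaroo -i set2/*.gff -o results_set2 --clean-mode strict
panaroo -i set3/*.gff -o results_set3 --clean-mode strict
panaroo -i set4/*.gff -o results_set4 --clean-mode strict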

Submit this job using the swarm command.

swarm -f panaroo.swarm [-g #] [-t #] --module panaroo
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module panaroo Loads the panaroo module for each subjob in the swarm
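
When there are many input sets, the swarmfile can be generated rather than typed by hand. This is a minimal sketch under stated assumptions: write_swarmfile is a hypothetical helper, gff_sets is a hypothetical parent directory holding one subdirectory of GFF files per dataset, and the results_NAME output naming is only one possible convention. Each generated line passes $SLURM_CPUS_PER_TASK to -t so that a subjob uses the CPUs requested with swarm -t.

```shell
# Write one panaroo command per subdirectory of a parent directory.
# $1 = parent directory, $2 = swarm file to create
write_swarmfile() {
    : > "$2"                          # start with an empty swarm file
    for d in "$1"/*/; do
        [ -d "$d" ] || continue       # skip if the glob matched nothing
        name=$(basename "$d")
        # \$ keeps the variable unexpanded until the subjob runs.
        echo "panaroo -i $1/$name/*.gff -o results_$name --clean-mode strict -t \$SLURM_CPUS_PER_TASK" >> "$2"
    done
}

# gff_sets is a hypothetical directory with one subdirectory per dataset.
write_swarmfile gff_sets panaroo.swarm

# Then submit, for example (resource values are placeholders):
#   swarm -f panaroo.swarm -g 16 -t 8 --module panaroo
```

Giving every line its own output directory keeps the subjobs from writing over each other's results.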