funannotate: a pipeline for genome annotation

Funannotate is a genome prediction, annotation, and comparison software package. It was originally written to annotate fungal genomes (small eukaryotic genomes of roughly 30 Mb), but has evolved over time to accommodate larger genomes.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

[user@biowulf]$ sinteractive --gres=lscratch:10 --mem=10g -c4
[user@cn4279 ~]$ module load funannotate 
[+] Loading bamtools  2.5.2  on cn4279
[+] Loading singularity  4.1.5  on cn4279
[+] Loading bedtools  2.31.1
[+] Loading blast 2.15.0+  ...
[+] Loading eggnog-mapper  2.1.6
[+] Loading exonerate, version 2.4.0...
[+] Loading hmmer  3.3.2  on cn4279
[+] Loading hisat2  2.2.1  on cn4279
[+] Loading samtools 1.21  ...
[+] Loading kallisto, version 0.51.1...
[+] Loading Mafft 7.526  ...
[+] Loading minimap2, version 2.28...
[+] Loading mysql 8.0.34 on cn4279
[+] Loading gcc  8.5.0  ...
[+] Loading repeatmasker  4.1.7-p1  on cn4279
[+] Loading salmon  1.10.1
[+] Loading signalp 6.0g  ...
[+] Loading stringtie  2.2.3
[+] Loading Trinity 2.15.1  ...
[+] Loading tantan  40
[+] Loading trnascan-se  2.0.9  on cn4279
[+] Loading trimal  1.2rev59
[+] Loading trimmomatic  0.39  on cn4279
[+] Loading ucsc 472 on cn4279
[+] Loading funannotate  1.8.17  on cn4279
The funannotate data processing pipeline involves several core modules:
[user@cn4279 ~]$ funannotate -h 

Usage:       funannotate <command> <arguments>
version:     1.8.17

Description: Funannotate is a genome prediction, annotation, and comparison pipeline.

Commands:
  clean       Find/remove small repetitive contigs
  sort        Sort by size and rename contig headers
  mask        Repeatmask genome assembly

  train       RNA-seq mediated training of Augustus/GeneMark
  predict     Run gene prediction pipeline
  fix         Fix annotation errors (generate new GenBank file)
  update      RNA-seq/PASA mediated gene model refinement
  remote      Partial functional annotation using remote servers
  iprscan     InterProScan5 search (Docker or local)
  annotate    Assign functional annotation to gene predictions
  compare     Compare funannotated genomes

  util        Format conversion and misc utilities
  setup       Setup/Install databases
  test        Download/Run funannotate installation tests
  check       Check Python, Perl, and External dependencies [--show-versions]
  species     list pre-trained Augustus species
  database    Manage databases
  outgroups   Manage outgroups for funannotate compare
These modules are intended to be run in the following order:
clean → sort → mask → train → predict → update → annotate → compare
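As a minimal sketch of the first three stages in that order, the script below chains clean, sort, and mask so that each stage reads the output of the previous one. The file names are hypothetical placeholders. The -i/-o flags for clean and mask are those listed in their help output on this page; the -i/-o flags for sort are assumed to follow the same convention and should be confirmed with funannotate sort -h. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/bash
# Sketch: chain the first three funannotate stages (clean -> sort -> mask).
# All file names are hypothetical placeholders.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = actually run them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

preprocess_genome() {
    local genome="$1"            # input multi-FASTA assembly
    local base="${genome%.*}"    # strip the extension, e.g. genome.fa -> genome

    # Each stage consumes the file written by the stage before it;
    # && stops the chain as soon as one stage fails.
    run funannotate clean -i "$genome"            -o "${base}.cleaned.fa" &&
    run funannotate sort  -i "${base}.cleaned.fa" -o "${base}.sorted.fa"  &&
    run funannotate mask  -i "${base}.sorted.fa"  -o "${base}.masked.fa"
}

preprocess_genome "${1:-genome.fa}"
```

The later stages (train, predict, update, annotate, compare) take additional inputs such as RNA-seq reads and species names, so they are left out of this sketch.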
If you plan to use GeneMark, copy its licence key to your home folder:
[user@cn4279 ~]$ cp /usr/local/apps/funannotate/1.8.17/GeneMark/gm_key_64 ~/.gm_key
Now proceed with setting up:

[user@biowulf]$ funannotate clean -h

Usage:       funannotate clean <arguments>
version:     1.8.17

Description: The script sorts contigs by size, starting with shortest contigs it uses minimap2
             to find contigs duplicated elsewhere, and then removes duplicated contigs.

Arguments:
  -i, --input    Multi-fasta genome file (Required)
  -o, --out      Cleaned multi-fasta output file (Required)
  -p, --pident   Percent identity of overlap. Default = 95
  -c, --cov      Percent coverage of overlap. Default = 95
  -m, --minlen   Minimum length of contig to keep. Default = 500
  --exhaustive   Test every contig. Default is to stop at N50 value.

[user@biowulf]$ funannotate setup -i pfam -d ./databases --force 
-------------------------------------------------------
[Aug 15 07:50 PM]: OS: Debian GNU/Linux 10, 56 cores, ~ 264 GB RAM. Python: 3.8.13
[Aug 15 07:50 PM]: Running 1.8.13
[Aug 15 07:50 PM]: Database location: ./databases
[Aug 15 07:50 PM]: Retrieving download links from GitHub Repo
[Aug 15 07:50 PM]: Parsing Augustus pre-trained species and porting to funannotate
[Aug 15 07:50 PM]: Downloading UniProtKB/SwissProt database
[Aug 15 07:50 PM]: Downloading: http://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz Bytes: 91192930
[Aug 15 07:50 PM]: Downloading: http://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt Bytes: 151
[Aug 15 07:50 PM]: Building diamond database
[Aug 15 07:50 PM]: UniProtKB Database: version=2022_03 date=2022-08-03 records=568,002
[Aug 15 07:50 PM]: Funannoate setup complete. Add this to ~/.bash_profile or ~/.bash_aliases:

        export FUNANNOTATE_DB=/gpfs/gsfs7/users/user/funannotate/databases
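The setup log above suggests adding the export line to a shell startup file. One way to do that without creating duplicate entries is sketched below. The helper name add_funannotate_db and its arguments are hypothetical (not part of funannotate), and the database path should be replaced with the location printed by your own setup run.

```shell
#!/bin/bash
# Sketch: append the FUNANNOTATE_DB export to a startup file only once.
# add_funannotate_db is a hypothetical helper, not part of funannotate.

add_funannotate_db() {
    local db_dir="$1"                           # directory reported by 'funannotate setup'
    local profile="${2:-$HOME/.bash_profile}"   # startup file to edit
    local line="export FUNANNOTATE_DB=${db_dir}"

    # -x matches the whole line, -F treats the pattern as a fixed string.
    if grep -qxF "$line" "$profile" 2>/dev/null; then
        echo "already set in $profile"
    else
        echo "$line" >> "$profile"
        echo "added to $profile"
    fi
}

# Example (path copied from the setup output above; adjust to your own):
# add_funannotate_db /gpfs/gsfs7/users/user/funannotate/databases
```

The check only detects an identical line. If the file already exports FUNANNOTATE_DB with a different path, the new line is appended after it and takes precedence in new login shells.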

[user@biowulf]$ funannotate setup -h

Usage:       funannotate setup <arguments>
version:     1.8.17

Description: Script will download/format necessary databases for funannotate.

Options:
  -i, --install    Download format databases. Default: all
                     [merops,uniprot,dbCAN,pfam,repeats,go,
                      mibig,interpro,busco_outgroups,gene2product]
  -b, --busco_db   Busco Databases to install. Default: dikarya [all,fungi,aves,etc]
  -d, --database   Path to funannotate database
  -u, --update     Check remote md5 and update if newer version found
  -f, --force      Force overwriting database
  -w, --wget       Use wget to download instead of python requests
  -l, --local      Use local resource JSON file instead of current on github
[user@biowulf]$ funannotate setup -i pfam
[Aug 15 01:23 PM]: OS: CentOS Linux 7, 56 cores, ~ 264 GB RAM. Python: 3.7.11
[Aug 15 01:23 PM]: Running 1.8.17
[Aug 15 01:23 PM]: Database location: /fdb/funannotate/db
[Aug 15 01:23 PM]: Retrieving download links from GitHub Repo
[Aug 15 01:23 PM]: Parsing Augustus pre-trained species and porting to funannotate
[Aug 15 01:23 PM]: Pfam Database: version=34.0 date=2021-03 records=19,179
[user@biowulf]$ funannotate mask -h 

Usage:       funannotate mask <arguments>
version:     1.8.17

Description: This script is a wrapper for repeat masking. Default is to run very simple
             repeat masking with tantan. The script can also run RepeatMasker and/or
             RepeatModeler. It will generate a softmasked genome. Tantan is probably not
             sufficient for soft-masking an assembly, but with RepBase no longer being
             available RepeatMasker/Modeler may not be functional for many users.

Arguments:
  -i, --input                  Multi-FASTA genome file. (Required)
  -o, --out                    Output softmasked FASTA file. (Required)

Optional:
  -m, --method                 Method to use. Default: tantan [repeatmasker, repeatmodeler]
  -s, --repeatmasker_species   Species to use for RepeatMasker
  -l, --repeatmodeler_lib      Custom repeat database (FASTA format)
  --cpus                       Number of cpus to use. Default: 2
  --debug                      Keep intermediate files
etc.
[user@cn4279 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
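The interactive session above was allocated 4 CPUs (-c4), while funannotate mask defaults to 2 (--cpus), so it can be worth passing the allocation through explicitly. A minimal sketch, assuming a Slurm session in which SLURM_CPUS_PER_TASK is set. The helper name mask_cmd and the file names are hypothetical, and the command is only printed, not run.

```shell
#!/bin/bash
# Sketch: build a 'funannotate mask' command that uses the CPUs Slurm allocated.
# mask_cmd and the file names are hypothetical placeholders.

mask_cmd() {
    local genome="$1" out="$2"
    # Use the Slurm allocation if available, else funannotate's default of 2.
    local cpus="${SLURM_CPUS_PER_TASK:-2}"
    echo funannotate mask -i "$genome" -o "$out" --cpus "$cpus"
}

mask_cmd genome.sorted.fa genome.masked.fa
```

To run the command instead of printing it, drop the echo inside mask_cmd.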