geNomad on Biowulf

geNomad's primary goal is to identify viruses and plasmids in sequencing data (isolates, metagenomes, and metatranscriptomes).

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. In the sample session below, user input is the text that follows each shell prompt.
Note that geNomad is compute-intensive and needs a sufficient memory allocation; this example requests 20 GB.

[user@biowulf ~]$ sinteractive --mem=20g
salloc: Pending job allocation 13632637
salloc: job 13632637 queued and waiting for resources
salloc: job 13632637 has been allocated resources
salloc: Granted job allocation 13632637
salloc: Waiting for resource configuration
salloc: Nodes cn4308 are ready for job

[user@cn4308]$ module load genomad
[+] Loading genomad  1.7.6  on cn4308
[+] Loading singularity  4.0.1  on cn4308

[user@cn4308]$ genomad -h 

Usage: genomad [OPTIONS] COMMAND [ARGS]...

geNomad: Identification of mobile genetic elements
Read the documentation at: https://portal.nersc.gov/genomad/

╭─ Options ───────────────────────────────────────────────────
│
│  --version        Show the version and exit.
│  --help      -h   Show this message and exit.
│
╰─────────────────────────────────────────────────────────────
╭─ Database download ─────────────────────────────────────────
│
│   download-database                  Download the latest version of geNomad's database and save it in the DESTINATION directory.
│
╰─────────────────────────────────────────────────────────────
╭─ End-to-end execution ──────────────────────────────────────
│
│   end-to-end   Takes an INPUT file (FASTA format) and executes all modules of the geNomad pipeline for plasmid and virus identification. Output files are written in the OUTPUT directory. A local
│                copy of geNomad's database (DATABASE directory), which can be downloaded with the download-database command, is required. The end-to-end command omits some options. If you want to
│                have a more granular control over the execution parameters, please execute each module separately.
│
╰─────────────────────────────────────────────────────────────
╭─ Modules ───────────────────────────────────────────────────
│
│   annotate                    Predict the genes in the INPUT file (FASTA format), annotate them using geNomad's markers (located in the DATABASE directory), and write the results to the OUTPUT
│                               directory.
│
│   find-proviruses             Find integrated viruses within the sequences in INPUT file using the geNomad markers (located in the DATABASE directory) and write the results to the OUTPUT directory.
│                               This command depends on the data generated by the annotate module.
│
│   marker-classification       Classify the sequences in the INPUT file (FASTA format) based on the presence of geNomad markers (located in the DATABASE directory) and write the results to the
│                               OUTPUT directory. This command depends on the data generated by the annotate module.
│
│   nn-classification           Classify the sequences in the INPUT file (FASTA format) using the geNomad neural network and write the results to the OUTPUT directory.
│
│   aggregated-classification   Aggregate the results of the marker-classification and nn-classification modules to classify the sequences in the INPUT file (FASTA format) and write the results to
│                               the OUTPUT directory.
│
│   score-calibration           Performs score calibration of the sequences in the INPUT file (FASTA format) using the batch correction method and write the results to the OUTPUT directory. This
│                               module requires that at least one of the classification modules was executed previously (marker-classification, nn-classification, aggregated-classification).
│
│   summary                     Applies post-classification filters, generates classification reports for the sequences in the INPUT file (FASTA format), and writes them to the OUTPUT directory. This
│                               module requires that at least one of the base classification modules was executed previously (marker-classification, nn-classification).
│
╰─────────────────────────────────────────────────────────────

# Copy over example data
[user@cn4308]$ cp -a /usr/local/apps/genomad/1.7.6/test-genomes/ .
[user@cn4308]$ cd test-genomes/
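
The help text above notes that end-to-end omits some options and that each module can be executed separately for finer control. A minimal sketch of that sequence follows, written as a dry run that only prints the commands. The module order is the one shown in the end-to-end log on this page (score-calibration does not appear there and is left out). The function name genomad_steps is hypothetical, and the positional order (INPUT, OUTPUT, then DATABASE for the modules whose help text mentions it) is assumed to match end-to-end, so check each module's -h output before running anything.

```shell
# Sketch: print the per-module commands in pipeline order (dry run only).
# genomad_steps is a hypothetical helper, not part of geNomad.
genomad_steps() {
    # $1 = INPUT FASTA file, $2 = OUTPUT directory, $3 = DATABASE directory

    # Modules whose help text refers to the DATABASE directory
    for module in annotate find-proviruses marker-classification; do
        echo "genomad $module $1 $2 $3"
    done

    # Modules that work from INPUT and the results of earlier modules
    for module in nn-classification aggregated-classification summary; do
        echo "genomad $module $1 $2"
    done
}

# Example values taken from this page
genomad_steps GCF_009025895.1_ASM902589v1_genomic.fna genomad_output /fdb/genomad/genomad_db
```

Printing first makes it easy to review the commands, or to add module-specific options to individual lines, before running them inside an allocated session.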

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. genomad.sh). For example:

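#!/bin/bash

module load genomad

# Run the full pipeline; submit from the directory that holds the input FASTA file.
#   --cleanup   delete intermediate files after execution
#   --splits 8  split the marker search into 8 chunks to lower memory use
# Positional arguments: INPUT file, OUTPUT directory, DATABASE directory
genomad end-to-end --cleanup --splits 8 \
    GCF_009025895.1_ASM902589v1_genomic.fna \
    genomad_output /fdb/genomad/genomad_db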

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] genomad.sh
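
When --cpus-per-task is passed to sbatch, Slurm sets the SLURM_CPUS_PER_TASK variable inside the job. The sketch below shows one way a batch script might hand that value to geNomad. It assumes the end-to-end command accepts a --threads option, which the help excerpt on this page does not list, so confirm it with genomad end-to-end -h first. The command is only assembled and printed here, not executed.

```shell
#!/bin/bash
# Sketch: match geNomad's thread count to the Slurm CPU allocation.

# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested;
# fall back to 1 when it is absent (for example, outside a job).
threads=${SLURM_CPUS_PER_TASK:-1}

# Assumption: end-to-end accepts --threads (verify with `genomad end-to-end -h`).
cmd="genomad end-to-end --cleanup --splits 8 --threads $threads \
GCF_009025895.1_ASM902589v1_genomic.fna genomad_output /fdb/genomad/genomad_db"

# Print the assembled command for review instead of running it.
echo "$cmd"
```

Once the option has been confirmed, the printed line can replace the genomad command in genomad.sh.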

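Output file (this sample comes from a rerun over an existing genomad_output directory, so most steps are skipped because earlier results were found):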

  ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  │  Executing geNomad annotate (v1.7.6). This will perform gene calling in the input sequences and annotate the predicted proteins with geNomad's markers.                                                 
  │  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  │  Outputs:                                                                                                                                                                                               
  │    genomad_output/GCF_009025895.1_ASM902589v1_genomic_annotate                                                                                                                                          
  │    ├── GCF_009025895.1_ASM902589v1_genomic_annotate.json (execution parameters)                                                                                                                        
  │    ├── GCF_009025895.1_ASM902589v1_genomic_genes.tsv (gene annotation data)                                                                                                                             
  │    ├── GCF_009025895.1_ASM902589v1_genomic_taxonomy.tsv (taxonomic assignment)                                                                                                                          
  │    ├── GCF_009025895.1_ASM902589v1_genomic_mmseqs2.tsv (MMseqs2 output file)                                                                                                                            
  │    └── GCF_009025895.1_ASM902589v1_genomic_proteins.faa (protein FASTA file)                                                                                                                            
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  [15:25:49] Executing genomad annotate.
  [15:25:49] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
  [15:25:49] GCF_009025895.1_ASM902589v1_genomic_proteins.faa was found. Skipping gene prediction with pyrodigal-gv.
  [15:25:49] GCF_009025895.1_ASM902589v1_genomic_mmseqs2.tsv was found. Skipping protein annotation with MMseqs2.
  [15:25:50] Gene data was written to GCF_009025895.1_ASM902589v1_genomic_genes.tsv.
  [15:25:50] Taxonomic assignment data was written to GCF_009025895.1_ASM902589v1_genomic_taxonomy.tsv.
  [15:25:50] geNomad annotate finished!
  ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
  │  Executing geNomad find-proviruses (v1.7.6). This will find putative proviral regions within the input sequences.                                                                                       │
  │  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  │
  │  Outputs:                                                                                                                                                                                               │
  │    genomad_output/GCF_009025895.1_ASM902589v1_genomic_find_proviruses                                                                                                                                   │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_find_proviruses.json (execution parameters)                                                                                                                  │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_provirus.tsv (provirus data)                                                                                                                                 │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_provirus.fna (provirus nucleotide sequences)                                                                                                                 │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_provirus_proteins.faa (provirus protein sequences)                                                                                                           │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_provirus_genes.tsv (provirus gene annotation data)                                                                                                           │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_provirus_taxonomy.tsv (provirus taxonomic assignment)                                                                                                        │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_provirus_mmseqs2.tsv (MMseqs2 output file)                                                                                                                   │
  │    └── GCF_009025895.1_ASM902589v1_genomic_provirus_aragorn.tsv (Aragorn output file)                                                                                                                   │
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  [15:25:50] Executing genomad find-proviruses.
  [15:25:50] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
  [15:25:50] No potential provirus-carrying sequences were identified.
  [15:25:50] geNomad find-proviruses finished!
  ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
  │  Executing geNomad marker-classification (v1.7.6). This will classify the input sequences into chromosome, plasmid, or virus based on the presence of geNomad markers and other gene-related features.  │
  │  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  │
  │  Outputs:                                                                                                                                                                                               │
  │    genomad_output/GCF_009025895.1_ASM902589v1_genomic_marker_classification                                                                                                                             │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_marker_classification.json (execution parameters)                                                                                                            │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_features.tsv (sequence feature data: tabular format)                                                                                                         │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_features.npz (sequence feature data: binary format)                                                                                                          │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_marker_classification.tsv (sequence classification: tabular format)                                                                                          │
  │    └── GCF_009025895.1_ASM902589v1_genomic_marker_classification.npz (sequence classification: binary format)                                                                                           │
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  [15:25:50] Executing genomad marker-classification.
  [15:25:50] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
  [15:25:50] GCF_009025895.1_ASM902589v1_genomic_features.npz was found. Skipping feature computation.
  [15:25:50] Sequence features in tabular format written to GCF_009025895.1_ASM902589v1_genomic_features.tsv.
  [15:25:50] GCF_009025895.1_ASM902589v1_genomic_marker_classification.npz was found. Skipping sequence classification.
  [15:25:50] Sequence classification in tabular format written to GCF_009025895.1_ASM902589v1_genomic_marker_classification.tsv.
  [15:25:50] geNomad marker-classification finished!
  ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  │  Executing geNomad nn-classification (v1.7.6). This will classify the input sequences into chromosome, plasmid, or virus based on the nucleotide sequence.
  │  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  │  Outputs:
  │    genomad_output/GCF_009025895.1_ASM902589v1_genomic_nn_classification
  │    ├── GCF_009025895.1_ASM902589v1_genomic_nn_classification.json (execution parameters)
  │    ├── GCF_009025895.1_ASM902589v1_genomic_encoded_sequences (directory containing encoded sequence data)
  │    ├── GCF_009025895.1_ASM902589v1_genomic_nn_classification.tsv (contig classification: tabular format)
  │    └── GCF_009025895.1_ASM902589v1_genomic_nn_classification.npz (contig classification: binary format)
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  [15:25:53] Executing genomad nn-classification.
  [15:25:53] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
  [15:25:53] Creating the genomad_output/GCF_009025895.1_ASM902589v1_genomic_nn_classification/GCF_009025895.1_ASM902589v1_genomic_encoded_sequences directory.
  [15:25:55] Encoded sequence data written to GCF_009025895.1_ASM902589v1_genomic_encoded_sequences.
  [15:25:55] GCF_009025895.1_ASM902589v1_genomic_nn_classification.npz was found. Skipping sequence classification.
  [15:25:55] Deleting encoded sequence data.
  [15:25:55] Sequence classification in tabular format written to GCF_009025895.1_ASM902589v1_genomic_nn_classification.tsv.
  [15:25:55] geNomad nn-classification finished!
  ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
  │  Executing geNomad aggregated-classification (v1.7.6). This will aggregate the results of the marker-classification and nn-classification modules to classify the input sequences into chromosome,      │
  │  plasmid, or virus.                                                                                                                                                                                     │
  │  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  │
  │  Outputs:                                                                                                                                                                                               │
  │    genomad_output/GCF_009025895.1_ASM902589v1_genomic_aggregated_classification                                                                                                                         │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.json (execution parameters)                                                                                                        │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.tsv (sequence classification: tabular format)                                                                                      │
  │    └── GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.npz (sequence classification: binary format)                                                                                       │
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  [15:25:55] Executing genomad aggregated-classification.
  [15:25:55] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
  [15:25:55] The total marker frequencies of the input sequences were computed.
  [15:25:55] GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.npz was found. Skipping sequence classification.
  [15:25:55] Sequence classification in tabular format written to GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.tsv.
  [15:25:55] geNomad aggregated-classification finished!
  ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
  │  Executing geNomad summary (v1.7.6). This will summarize the results across modules into a classification report.                                                                                       │
  │  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  │
  │  Outputs:                                                                                                                                                                                               │
  │    genomad_output/GCF_009025895.1_ASM902589v1_genomic_summary                                                                                                                                           │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_summary.json (execution parameters)                                                                                                                          │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_virus_summary.tsv (virus classification summary)                                                                                                             │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_plasmid_summary.tsv (plasmid classification summary)                                                                                                         │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_virus.fna (virus nucleotide FASTA file)                                                                                                                      │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_plasmid.fna (plasmid nucleotide FASTA file)                                                                                                                  │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_virus_proteins.faa (virus protein FASTA file)                                                                                                                │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_plasmid_proteins.faa (plasmid protein FASTA file)                                                                                                            │
  │    ├── GCF_009025895.1_ASM902589v1_genomic_virus_genes.tsv (virus gene annotation data)                                                                                                                 │
  │    └── GCF_009025895.1_ASM902589v1_genomic_plasmid_genes.tsv (plasmid gene annotation data)                                                                                                             │
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  [15:25:55] Executing genomad summary.
  [15:25:55] Using scores from aggregated-classification.
  [15:25:55] 6 plasmid(s) and 0 virus(es) were identified.
  [15:25:55] Nucleotide sequences were written to GCF_009025895.1_ASM902589v1_genomic_plasmid.fna and GCF_009025895.1_ASM902589v1_genomic_virus.fna.
  [15:25:55] Protein sequences were written to GCF_009025895.1_ASM902589v1_genomic_plasmid_proteins.faa and GCF_009025895.1_ASM902589v1_genomic_virus_proteins.faa.
  [15:25:55] Gene annotation data was written to GCF_009025895.1_ASM902589v1_genomic_plasmid_genes.tsv and GCF_009025895.1_ASM902589v1_genomic_virus_genes.tsv.
  [15:25:55] Summary files were written to GCF_009025895.1_ASM902589v1_genomic_plasmid_summary.tsv and GCF_009025895.1_ASM902589v1_genomic_virus_summary.tsv.
  [15:25:55] geNomad summary finished!
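
The summary step reports how many plasmids and viruses were identified and writes one summary table per class. As a quick check after a run, the sketch below counts the data rows in each table. It assumes each TSV file starts with a single header row, which this page does not show; count_rows is a hypothetical helper, and the paths are the ones listed in the summary output above.

```shell
# Sketch: count data rows in a TSV file, assuming line 1 is a header row.
# count_rows is a hypothetical helper, not part of geNomad.
count_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

summary_dir=genomad_output/GCF_009025895.1_ASM902589v1_genomic_summary
prefix=GCF_009025895.1_ASM902589v1_genomic

for kind in plasmid virus; do
    file="$summary_dir/${prefix}_${kind}_summary.tsv"
    if [ -f "$file" ]; then
        echo "$kind: $(count_rows "$file") sequence(s)"
    else
        echo "$kind: $file not found"
    fi
done
```

If the header assumption holds, the counts should agree with the "plasmid(s) and virus(es) were identified" line in the log.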