geNomad's primary goal is to identify viruses and plasmids in sequencing data (isolates, metagenomes, and metatranscriptomes).
Allocate an interactive session and run the program. Note that geNomad is compute-intensive, so request enough memory for the session (the example below requests 20 GB).
Sample session (user input follows the $ prompt):
[user@biowulf ~]$ sinteractive --mem=20g
salloc: Pending job allocation 13632637
salloc: job 13632637 queued and waiting for resources
salloc: job 13632637 has been allocated resources
salloc: Granted job allocation 13632637
salloc: Waiting for resource configuration
salloc: Nodes cn4308 are ready for job
[user@cn4308]$ module load genomad
[+] Loading genomad 1.7.6 on cn4308
[+] Loading singularity 4.0.1 on cn4308
[user@cn4308]$ genomad -h
Usage: genomad [OPTIONS] COMMAND [ARGS]...
geNomad: Identification of mobile genetic elements
Read the documentation at: https://portal.nersc.gov/genomad/
╭─ Options ──────────────────────────────────────────────────
│
│   --version        Show the version and exit.
│   --help      -h   Show this message and exit.
│
╰────────────────────────────────────────────────────────────
╭─ Database download ────────────────────────────────────────
│
│   download-database   Download the latest version of geNomad's database and save it in the
│                       DESTINATION directory.
│
╰────────────────────────────────────────────────────────────
╭─ End-to-end execution ─────────────────────────────────────
│
│   end-to-end   Takes an INPUT file (FASTA format) and executes all modules of the geNomad pipeline
│                for plasmid and virus identification. Output files are written in the OUTPUT
│                directory. A local copy of geNomad's database (DATABASE directory), which can be
│                downloaded with the download-database command, is required. The end-to-end command
│                omits some options. If you want to have a more granular control over the execution
│                parameters, please execute each module separately.
│
╰────────────────────────────────────────────────────────────
╭─ Modules ──────────────────────────────────────────────────
│
│   annotate                    Predict the genes in the INPUT file (FASTA format), annotate them
│                               using geNomad's markers (located in the DATABASE directory), and
│                               write the results to the OUTPUT directory.
│
│   find-proviruses             Find integrated viruses within the sequences in INPUT file using the
│                               geNomad markers (located in the DATABASE directory) and write the
│                               results to the OUTPUT directory. This command depends on the data
│                               generated by the annotate module.
│
│   marker-classification       Classify the sequences in the INPUT file (FASTA format) based on the
│                               presence of geNomad markers (located in the DATABASE directory) and
│                               write the results to the OUTPUT directory. This command depends on
│                               the data generated by the annotate module.
│
│   nn-classification           Classify the sequences in the INPUT file (FASTA format) using the
│                               geNomad neural network and write the results to the OUTPUT directory.
│
│   aggregated-classification   Aggregate the results of the marker-classification and
│                               nn-classification modules to classify the sequences in the INPUT
│                               file (FASTA format) and write the results to the OUTPUT directory.
│
│   score-calibration           Performs score calibration of the sequences in the INPUT file (FASTA
│                               format) using the batch correction method and write the results to
│                               the OUTPUT directory. This module requires that at least one of the
│                               classification modules was executed previously
│                               (marker-classification, nn-classification, aggregated-classification).
│
│   summary                     Applies post-classification filters, generates classification
│                               reports for the sequences in the INPUT file (FASTA format), and
│                               writes them to the OUTPUT directory. This module requires that at
│                               least one of the base classification modules was executed previously
│                               (marker-classification, nn-classification).
│
╰────────────────────────────────────────────────────────────
# copy over example data
[user@cn4308]$ cp -a /usr/local/apps/genomad/1.7.6/test-genomes/ .
[user@cn4308]$ cd test-genomes/
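The help text for end-to-end suggests running each module separately when finer control is needed. A minimal sketch of what that could look like follows. It assumes every module takes its arguments in the same INPUT OUTPUT [DATABASE] order as end-to-end, with the database passed only to the modules whose help text mentions the DATABASE directory; check `genomad <module> --help` before relying on it. The `run` and `genomad_by_module` helpers are hypothetical names introduced for this example, and the module order mirrors the order in which end-to-end reports them.

```shell
#!/bin/bash
# Sketch: run the geNomad modules one at a time instead of end-to-end.
# Assumption: each module takes INPUT OUTPUT [DATABASE] in that order,
# mirroring the end-to-end call; confirm with `genomad <module> --help`.

# Hypothetical helper: print the command when DRY_RUN is set, else execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper; stops at the first module that fails.
genomad_by_module() {
    local input=$1 outdir=$2 db=$3
    run genomad annotate "$input" "$outdir" "$db" &&
    run genomad find-proviruses "$input" "$outdir" "$db" &&
    run genomad marker-classification "$input" "$outdir" "$db" &&
    run genomad nn-classification "$input" "$outdir" &&
    run genomad aggregated-classification "$input" "$outdir" &&
    run genomad summary "$input" "$outdir"
}

# Preview the commands without running anything:
#   DRY_RUN=1 genomad_by_module GCF_009025895.1_ASM902589v1_genomic.fna \
#       genomad_output /fdb/genomad/genomad_db
```

Setting DRY_RUN=1 prints the six commands so they can be reviewed, or edited with module-specific options, before anything is executed.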
Create a batch input file (e.g. genomad.sh). For example:
#!/bin/bash
module load genomad
genomad end-to-end --cleanup --splits 8 \
    GCF_009025895.1_ASM902589v1_genomic.fna \
    genomad_output /fdb/genomad/genomad_db
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] genomad.sh
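The CPU count requested from Slurm only helps if the program is told to use it. The sketch below is a hypothetical variant of genomad.sh that passes the allocation on to geNomad. It assumes end-to-end accepts a `--threads` option (it is not shown in the top-level help above, so confirm with `genomad end-to-end --help`); `pick_threads` is a helper name made up for this example.

```shell
#!/bin/bash
# Sketch of a thread-aware genomad.sh (hypothetical variant of the batch script).
# Assumption: `genomad end-to-end` accepts --threads; verify with --help.

# Use the CPU count granted by Slurm, or 1 when the variable is not set.
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Run the pipeline only inside a Slurm job, never on the login node.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    module load genomad
    genomad end-to-end --cleanup --splits 8 --threads "$(pick_threads)" \
        GCF_009025895.1_ASM902589v1_genomic.fna \
        genomad_output /fdb/genomad/genomad_db
fi
```

Such a script could be submitted with, for example, `sbatch --cpus-per-task=8 --mem=20g genomad.sh`. The 20 GB figure echoes the interactive example, and 8 CPUs is an arbitrary illustration rather than a recommendation.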
Output file:
╭────────────────────────────────────────────────────────────
│ Executing geNomad annotate (v1.7.6). This will perform gene calling in the input sequences and
│ annotate the predicted proteins with geNomad's markers.
│ ────────────────────────────────────────────────────────────
│ Outputs:
│   genomad_output/GCF_009025895.1_ASM902589v1_genomic_annotate
│   ├── GCF_009025895.1_ASM902589v1_genomic_annotate.json (execution parameters)
│   ├── GCF_009025895.1_ASM902589v1_genomic_genes.tsv (gene annotation data)
│   ├── GCF_009025895.1_ASM902589v1_genomic_taxonomy.tsv (taxonomic assignment)
│   ├── GCF_009025895.1_ASM902589v1_genomic_mmseqs2.tsv (MMseqs2 output file)
│   └── GCF_009025895.1_ASM902589v1_genomic_proteins.faa (protein FASTA file)
╰────────────────────────────────────────────────────────────
[15:25:49] Executing genomad annotate.
[15:25:49] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
[15:25:49] GCF_009025895.1_ASM902589v1_genomic_proteins.faa was found. Skipping gene prediction with pyrodigal-gv.
[15:25:49] GCF_009025895.1_ASM902589v1_genomic_mmseqs2.tsv was found. Skipping protein annotation with MMseqs2.
[15:25:50] Gene data was written to GCF_009025895.1_ASM902589v1_genomic_genes.tsv.
[15:25:50] Taxonomic assignment data was written to GCF_009025895.1_ASM902589v1_genomic_taxonomy.tsv.
[15:25:50] geNomad annotate finished!
╭────────────────────────────────────────────────────────────
│ Executing geNomad find-proviruses (v1.7.6). This will find putative proviral regions within the
│ input sequences.
│ ────────────────────────────────────────────────────────────
│ Outputs:
│   genomad_output/GCF_009025895.1_ASM902589v1_genomic_find_proviruses
│   ├── GCF_009025895.1_ASM902589v1_genomic_find_proviruses.json (execution parameters)
│   ├── GCF_009025895.1_ASM902589v1_genomic_provirus.tsv (provirus data)
│   ├── GCF_009025895.1_ASM902589v1_genomic_provirus.fna (provirus nucleotide sequences)
│   ├── GCF_009025895.1_ASM902589v1_genomic_provirus_proteins.faa (provirus protein sequences)
│   ├── GCF_009025895.1_ASM902589v1_genomic_provirus_genes.tsv (provirus gene annotation data)
│   ├── GCF_009025895.1_ASM902589v1_genomic_provirus_taxonomy.tsv (provirus taxonomic assignment)
│   ├── GCF_009025895.1_ASM902589v1_genomic_provirus_mmseqs2.tsv (MMseqs2 output file)
│   └── GCF_009025895.1_ASM902589v1_genomic_provirus_aragorn.tsv (Aragorn output file)
╰────────────────────────────────────────────────────────────
[15:25:50] Executing genomad find-proviruses.
[15:25:50] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
[15:25:50] No potential provirus-carrying sequences were identified.
[15:25:50] geNomad find-proviruses finished!
╭────────────────────────────────────────────────────────────
│ Executing geNomad marker-classification (v1.7.6). This will classify the input sequences into
│ chromosome, plasmid, or virus based on the presence of geNomad markers and other gene-related
│ features.
│ ────────────────────────────────────────────────────────────
│ Outputs:
│   genomad_output/GCF_009025895.1_ASM902589v1_genomic_marker_classification
│   ├── GCF_009025895.1_ASM902589v1_genomic_marker_classification.json (execution parameters)
│   ├── GCF_009025895.1_ASM902589v1_genomic_features.tsv (sequence feature data: tabular format)
│   ├── GCF_009025895.1_ASM902589v1_genomic_features.npz (sequence feature data: binary format)
│   ├── GCF_009025895.1_ASM902589v1_genomic_marker_classification.tsv (sequence classification: tabular format)
│   └── GCF_009025895.1_ASM902589v1_genomic_marker_classification.npz (sequence classification: binary format)
╰────────────────────────────────────────────────────────────
[15:25:50] Executing genomad marker-classification.
[15:25:50] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
[15:25:50] GCF_009025895.1_ASM902589v1_genomic_features.npz was found. Skipping feature computation.
[15:25:50] Sequence features in tabular format written to GCF_009025895.1_ASM902589v1_genomic_features.tsv.
[15:25:50] GCF_009025895.1_ASM902589v1_genomic_marker_classification.npz was found. Skipping sequence classification.
[15:25:50] Sequence classification in tabular format written to GCF_009025895.1_ASM902589v1_genomic_marker_classification.tsv.
[15:25:50] geNomad marker-classification finished!
╭────────────────────────────────────────────────────────────
│ Executing geNomad nn-classification (v1.7.6). This will classify the input sequences into
│ chromosome, plasmid, or virus based on the nucleotide sequence.
│ ────────────────────────────────────────────────────────────
│ Outputs:
│   genomad_output/GCF_009025895.1_ASM902589v1_genomic_nn_classification
│   ├── GCF_009025895.1_ASM902589v1_genomic_nn_classification.json (execution parameters)
│   ├── GCF_009025895.1_ASM902589v1_genomic_encoded_sequences (directory containing encoded sequence data)
│   ├── GCF_009025895.1_ASM902589v1_genomic_nn_classification.tsv (contig classification: tabular format)
│   └── GCF_009025895.1_ASM902589v1_genomic_nn_classification.npz (contig classification: binary format)
╰────────────────────────────────────────────────────────────
[15:25:53] Executing genomad nn-classification.
[15:25:53] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
[15:25:53] Creating the genomad_output/GCF_009025895.1_ASM902589v1_genomic_nn_classification/GCF_009025895.1_ASM902589v1_genomic_encoded_sequences directory.
[15:25:55] Encoded sequence data written to GCF_009025895.1_ASM902589v1_genomic_encoded_sequences.
[15:25:55] GCF_009025895.1_ASM902589v1_genomic_nn_classification.npz was found. Skipping sequence classification.
[15:25:55] Deleting encoded sequence data.
[15:25:55] Sequence classification in tabular format written to GCF_009025895.1_ASM902589v1_genomic_nn_classification.tsv.
[15:25:55] geNomad nn-classification finished!
╭────────────────────────────────────────────────────────────
│ Executing geNomad aggregated-classification (v1.7.6). This will aggregate the results of the
│ marker-classification and nn-classification modules to classify the input sequences into
│ chromosome, plasmid, or virus.
│ ────────────────────────────────────────────────────────────
│ Outputs:
│   genomad_output/GCF_009025895.1_ASM902589v1_genomic_aggregated_classification
│   ├── GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.json (execution parameters)
│   ├── GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.tsv (sequence classification: tabular format)
│   └── GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.npz (sequence classification: binary format)
╰────────────────────────────────────────────────────────────
[15:25:55] Executing genomad aggregated-classification.
[15:25:55] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
[15:25:55] The total marker frequencies of the input sequences were computed.
[15:25:55] GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.npz was found. Skipping sequence classification.
[15:25:55] Sequence classification in tabular format written to GCF_009025895.1_ASM902589v1_genomic_aggregated_classification.tsv.
[15:25:55] geNomad aggregated-classification finished!
╭────────────────────────────────────────────────────────────
│ Executing geNomad summary (v1.7.6). This will summarize the results across modules into a
│ classification report.
│ ────────────────────────────────────────────────────────────
│ Outputs:
│   genomad_output/GCF_009025895.1_ASM902589v1_genomic_summary
│   ├── GCF_009025895.1_ASM902589v1_genomic_summary.json (execution parameters)
│   ├── GCF_009025895.1_ASM902589v1_genomic_virus_summary.tsv (virus classification summary)
│   ├── GCF_009025895.1_ASM902589v1_genomic_plasmid_summary.tsv (plasmid classification summary)
│   ├── GCF_009025895.1_ASM902589v1_genomic_virus.fna (virus nucleotide FASTA file)
│   ├── GCF_009025895.1_ASM902589v1_genomic_plasmid.fna (plasmid nucleotide FASTA file)
│   ├── GCF_009025895.1_ASM902589v1_genomic_virus_proteins.faa (virus protein FASTA file)
│   ├── GCF_009025895.1_ASM902589v1_genomic_plasmid_proteins.faa (plasmid protein FASTA file)
│   ├── GCF_009025895.1_ASM902589v1_genomic_virus_genes.tsv (virus gene annotation data)
│   └── GCF_009025895.1_ASM902589v1_genomic_plasmid_genes.tsv (plasmid gene annotation data)
╰────────────────────────────────────────────────────────────
[15:25:55] Executing genomad summary.
[15:25:55] Using scores from aggregated-classification.
[15:25:55] 6 plasmid(s) and 0 virus(es) were identified.
[15:25:55] Nucleotide sequences were written to GCF_009025895.1_ASM902589v1_genomic_plasmid.fna and GCF_009025895.1_ASM902589v1_genomic_virus.fna.
[15:25:55] Protein sequences were written to GCF_009025895.1_ASM902589v1_genomic_plasmid_proteins.faa and GCF_009025895.1_ASM902589v1_genomic_virus_proteins.faa.
[15:25:55] Gene annotation data was written to GCF_009025895.1_ASM902589v1_genomic_plasmid_genes.tsv and GCF_009025895.1_ASM902589v1_genomic_virus_genes.tsv.
[15:25:55] Summary files were written to GCF_009025895.1_ASM902589v1_genomic_plasmid_summary.tsv and GCF_009025895.1_ASM902589v1_genomic_virus_summary.tsv.
[15:25:55] geNomad summary finished!
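Once the job finishes, a quick sanity check is to count the records in the two summary tables named in the log. The sketch below assumes each summary TSV starts with a single header line followed by one row per identified sequence; `count_records` is a hypothetical helper, and the paths are the ones reported in the output above.

```shell
#!/bin/bash
# Sketch: count identified plasmids and viruses from the summary tables.
# Assumption: each *_summary.tsv has exactly one header line.

# Count data rows (everything after the header line) in a TSV file.
count_records() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

summary_dir=genomad_output/GCF_009025895.1_ASM902589v1_genomic_summary
prefix=GCF_009025895.1_ASM902589v1_genomic

# Report a count for each summary table that exists.
for kind in plasmid virus; do
    file="$summary_dir/${prefix}_${kind}_summary.tsv"
    if [ -f "$file" ]; then
        echo "$kind: $(count_records "$file")"
    fi
done
```

If the header assumption holds, the counts should agree with the "plasmid(s) and virus(es) were identified" line in the summary log.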