viral-ngs: viral genomics analysis pipelines
viral-ngs comprises a set of viral genomics analysis pipelines developed by the Broad Institute
Documentation
Important Notes
- Module Name: viral-ngs (see the modules page for more information)
- Unusual environment variables set
- VIRAL_NGS_HOME installation directory
- VIRAL_NGS_BIN executable directory
- VIRAL_NGS_SRC source code directory
- VIRAL_NGS_DATA sample data directory
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=16g --cpus-per-task=8First, make sure you installed conda in the folder /usr/$USER/conda, as described in https://hpc.nih.gov/apps/python.html. Now enter the commands:
[user@cn3335 ~]$ source /data/$USER/conda/etc/profile.d/conda.sh [user@cn3335 ~]$ module load viral-ngs [+] Loading gcc 7.3.0 ... [+] Loading GSL 2.4 for GCC 7.2.0 ... [-] Unloading gcc 7.3.0 ... [+] Loading gcc 7.3.0 ... [+] Loading openmpi 3.0.2 for GCC 7.3.0 [+] Loading ImageMagick 7.0.8 on cn3613 [+] Loading HDF5 1.10.4 [+] Loading R 3.5.2 [+] Loading GATK 3.6 [+] Loading python 3.6 [+] Loading viral-ngs 1.22.1The viral-ngs processing

comprises the following 14 pipelines/commands:
[user@cn3335 ~]$ ls $VIRAL_NGS_BIN | grep .py assembly.py broad_utils.py file_utils.py illumina.py install_tools.py interhost.py intrahost.py kmer_utils.py metagenomics.py ncbi.py read_utils.py reports.py smtpd.py taxon_filter.pyEach the command comes with a set of subcommands. In order to view a list of subcommands for a given command, type the command name (optionally) followed by "--help".
For example:
[user@cn3335 ~]$assembly.py --help ... subcommands: trim_rmdup_subsamp assemble_trinity assemble_spades gapfill_gap2seq order_and_orient impute_from_reference refine_assembly filter_short_seqs modify_contig vcf_to_fasta trim_fastaThe general pattern of a command usage is as follows:
[user@cn3335 ~]$ command_name subcommand_name positional_arguments [ optional_arguments ]In order to see a list of valid arguments for a given command/subcommand combination, type
[user@cn3335 ~]$ command_name subcommand_name [ --help ]For example:
[user@cn3335 ~]$assembly.py assemble_spades usage: assembly.py subcommand assemble_spades [-h] [--contigsTrusted CONTIGS_TRUSTED] [--contigsUntrusted CONTIGS_UNTRUSTED] [--nReads N_READS] [--outReads OUTREADS] [--filterContigs] [--alwaysSucceed] [--minContigLen MIN_CONTIG_LEN] [--spadesOpts SPADES_OPTS] [--memLimitGb MEM_LIMIT_GB] [--threads THREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] in_bam clip_db out_fasta De novo RNA-seq assembly with the SPAdes assembler. positional arguments: in_bam Input unaligned reads, BAM format. May include both paired and unpaired reads. clip_db Trimmomatic clip db out_fasta Output assembled contigs. Note that, since this is RNA-seq assembly, for each assembled genomic region there may be several contigs representing different variants of that region. optional arguments: -h, --help show this help message and exit --contigsTrusted CONTIGS_TRUSTED Optional input contigs of high quality, previously assembled from the same sample (default: None) ...Let's practice with the viral-ngs commands on available sample data. First, download the data from a system folder to your current folder:
[user@cn3335 ~]$cp -r $VIRAL_NGS_DATA/* . [user@cn3335 ~]$ls 5kb_human_from_chr6.fasta TestBmtaggerDbBuild TestOrderOrientAndImputeFromReference almost-empty-2.bam TestDepleteBlastnBam TestPerSample almost-empty.bam TestDepleteHuman TestPurgeUnmated broken.bam TestDifficultSampleNames TestRefineAssembly ebola.fasta TestFastaFetch TestRmdupUnaligned ebola.fasta.gz TestFastqBam TestRunInfo ebola.fasta.lz4 TestFeatureReader TestSampleSheet ebov-makona.fasta TestFeatureTableFetch TestSnpEff empty.bam TestFeatureTransfer TestSplitReads empty.fasta TestFilterLastal TestTarballMerger G5012.3.fasta TestGap2Seq TestTaxonomy G5012.3.mini.bam TestGenbankRecordFetch TestToolKrakenExecute G5012.3.subset.bam TestIlluminaDir TestToolNovoalign G5012.3.testreads.bam TestImputeFromReference TestToolPicard one_gene.vcf.gz TestKmers TestToolSamtools one_gene.vcf.gz.tbi TestLastalDbBuild TestTrimmomatic README.md TestManualSnpCaller TestUtilMisc TestAssembleSpades TestMetagenomicsSimple TestVPhaser2 TestAssembleTrinity TestMetagenomicsViralMix viral-ngs TestBamFilter TestMiseqToBam WDL TestBlastnDbBuild TestMvicuna TestBmtagger TestOrderAndOrientNote that the name of each the sample data folder
The following example illustrates the usage of the assembly.py command with assemble_spades subcommand:
[user@cn3335 ~]$assembly.py assemble_spades G5012.3.testreads.bam TestAssembleSpades/clipDb.fasta myTest.fasta --tmp_dir . --threads=8 2019-02-19 10:48:30,429 - cmd:197:main_argparse - INFO - software version: 1.22.1, python version: 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:43:46) [GCC 7.3.0] ... Multiple cores found: Using 2 threads Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG' Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG' Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG' Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGCCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA' Using Long Clipping Sequence: 'TCTATAGTTTAGGTAACTTTGTGTTTGA' Using Long Clipping Sequence: 'GATCGGAAGAGCGGTTCAGCAGGAATGC' Using Long Clipping Sequence: 'AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG' Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCCTCTACGTCTCGTGGGCTCGG' Skipping duplicate Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT' Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTATATCTATCTCGTATGCCGTCTTCTGCTTG' Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG' Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC' Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT' Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA' Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG' Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTC' Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC' Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT' Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA' ILLUMINACLIP: Using 2 prefix pairs, 21 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Input Read Pairs: 9355 Both Surviving: 9280 (99.20%) Forward Only Surviving: 51 (0.55%) Reverse Only Surviving: 24 (0.26%) Dropped: 0 (0.00%) TrimmomaticPE: Completed successfully Estimate size of input data for status report (this might take a while for large files) done Parse and process input data done Clean up empty files done Input and filter stats: Input sequences (file 1): 9,280 Input bases (file 1): 934,697 Input mean length (file 1): 100.72 Input sequences (file 2): 9,280 Input bases (file 2): 928,905 Input mean length (file 2): 100.10 Good sequences (pairs): 9,274 Good bases (pairs): 1,862,558 Good mean length (pairs): 200.84 Good sequences (singletons file 1): 6 (0.06%) Good bases (singletons file 1): 606 Good mean length (singletons file 1): 101.00 Good sequences (singletons file 2): 0 (0.00%) Bad sequences (file 1): 0 (0.00%) Bad sequences (file 2): 6 (0.06%) Bad bases (file 2): 438 Bad mean length (file 2): 73.00 Sequences filtered by specified parameters: ns_max_n: 6 ... Estimate size of input data for status report (this might take a while for large files) done Parse and process input data done Clean up empty files done Input and filter stats: Input sequences: 81 Input bases: 8,036 Input mean length: 99.21 Good sequences: 76 (93.83%) Good bases: 7,531 Good mean length: 99.09 Bad sequences: 5 (6.17%) Bad bases: 505 Bad mean length: 101.00 Sequences filtered by specified parameters: derep: 5 ... System information: SPAdes version: 3.12.0 Python version: 3.6.6 OS: Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core ... ======= SPAdes pipeline finished. ... Thank you for using SPAdes!The output will stored in the file myTest.fasta.
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. viral-ngs.sh). For example:
#!/bin/bash module load viral-ngs[ optional_arguments ]
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] viral-ngs.sh