Biowulf High Performance Computing at the NIH
viral-ngs: viral genomics analysis pipelines

viral-ngs comprises a set of viral genomics analysis pipelines developed by the Broad Institute

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=16g --cpus-per-task=8
First, make sure you installed conda in the folder /usr/$USER/conda, as described in https://hpc.nih.gov/apps/python.html. Now enter the commands:
[user@cn3335 ~]$ source /data/$USER/conda/etc/profile.d/conda.sh
[user@cn3335 ~]$ module load viral-ngs 
[+] Loading gcc  7.3.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[-] Unloading gcc  7.3.0  ... 
[+] Loading gcc  7.3.0  ... 
[+] Loading openmpi 3.0.2  for GCC 7.3.0 
[+] Loading ImageMagick  7.0.8  on cn3613 
[+] Loading HDF5  1.10.4 
[+] Loading R 3.5.2 
[+] Loading GATK  3.6 
[+] Loading python 3.6
[+] Loading viral-ngs  1.22.1 
The viral-ngs processing
comprises the following 14 pipelines/commands:
[user@cn3335 ~]$ ls $VIRAL_NGS_BIN | grep .py
assembly.py
broad_utils.py
file_utils.py
illumina.py
install_tools.py
interhost.py
intrahost.py
kmer_utils.py
metagenomics.py
ncbi.py
read_utils.py
reports.py
smtpd.py
taxon_filter.py
Each the command comes with a set of subcommands. In order to view a list of subcommands for a given command, type the command name (optionally) followed by "--help".

For example:
[user@cn3335 ~]$assembly.py --help
...
subcommands:
    trim_rmdup_subsamp
    assemble_trinity
    assemble_spades
    gapfill_gap2seq
    order_and_orient
    impute_from_reference
    refine_assembly
    filter_short_seqs
    modify_contig
    vcf_to_fasta
    trim_fasta
The general pattern of a command usage is as follows:

[user@cn3335 ~]$ command_name subcommand_name positional_arguments [ optional_arguments ] 
In order to see a list of valid arguments for a given command/subcommand combination, type
[user@cn3335 ~]$ command_name subcommand_name [ --help ] 
For example:
[user@cn3335 ~]$assembly.py assemble_spades 
usage: assembly.py subcommand assemble_spades [-h]
                                              [--contigsTrusted CONTIGS_TRUSTED]
                                              [--contigsUntrusted CONTIGS_UNTRUSTED]
                                              [--nReads N_READS]
                                              [--outReads OUTREADS]
                                              [--filterContigs]
                                              [--alwaysSucceed]
                                              [--minContigLen MIN_CONTIG_LEN]
                                              [--spadesOpts SPADES_OPTS]
                                              [--memLimitGb MEM_LIMIT_GB]
                                              [--threads THREADS]
                                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                              [--version] [--tmp_dir TMP_DIR]
                                              [--tmp_dirKeep]
                                              in_bam clip_db out_fasta

De novo RNA-seq assembly with the SPAdes assembler.

positional arguments:
  in_bam                Input unaligned reads, BAM format. May include both
                        paired and unpaired reads.
  clip_db               Trimmomatic clip db
  out_fasta             Output assembled contigs. Note that, since this is
                        RNA-seq assembly, for each assembled genomic region
                        there may be several contigs representing different
                        variants of that region.

optional arguments:
  -h, --help            show this help message and exit
  --contigsTrusted CONTIGS_TRUSTED
                        Optional input contigs of high quality, previously
                        assembled from the same sample (default: None)
...
Let's practice with the viral-ngs commands on available sample data. First, download the data from a system folder to your current folder:
[user@cn3335 ~]$cp -r $VIRAL_NGS_DATA/* .
[user@cn3335 ~]$ls
5kb_human_from_chr6.fasta  TestBmtaggerDbBuild       TestOrderOrientAndImputeFromReference
almost-empty-2.bam         TestDepleteBlastnBam      TestPerSample
almost-empty.bam           TestDepleteHuman          TestPurgeUnmated
broken.bam                 TestDifficultSampleNames  TestRefineAssembly
ebola.fasta                TestFastaFetch            TestRmdupUnaligned
ebola.fasta.gz             TestFastqBam              TestRunInfo
ebola.fasta.lz4            TestFeatureReader         TestSampleSheet
ebov-makona.fasta          TestFeatureTableFetch     TestSnpEff
empty.bam                  TestFeatureTransfer       TestSplitReads
empty.fasta                TestFilterLastal          TestTarballMerger
G5012.3.fasta              TestGap2Seq               TestTaxonomy
G5012.3.mini.bam           TestGenbankRecordFetch    TestToolKrakenExecute
G5012.3.subset.bam         TestIlluminaDir           TestToolNovoalign
G5012.3.testreads.bam      TestImputeFromReference   TestToolPicard
one_gene.vcf.gz            TestKmers                 TestToolSamtools
one_gene.vcf.gz.tbi        TestLastalDbBuild         TestTrimmomatic
README.md                  TestManualSnpCaller       TestUtilMisc
TestAssembleSpades         TestMetagenomicsSimple    TestVPhaser2
TestAssembleTrinity        TestMetagenomicsViralMix  viral-ngs
TestBamFilter              TestMiseqToBam            WDL
TestBlastnDbBuild          TestMvicuna
TestBmtagger               TestOrderAndOrient
Note that the name of each the sample data folder matches the name of a certain command/subcommand pair.

The following example illustrates the usage of the assembly.py command with assemble_spades subcommand:
[user@cn3335 ~]$assembly.py assemble_spades G5012.3.testreads.bam TestAssembleSpades/clipDb.fasta myTest.fasta --tmp_dir  . --threads=8
2019-02-19 10:48:30,429 - cmd:197:main_argparse - INFO - software version: 1.22.1, python version: 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:43:46) 
[GCC 7.3.0]
...
Multiple cores found: Using 2 threads
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGCCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'TCTATAGTTTAGGTAACTTTGTGTTTGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCGGTTCAGCAGGAATGC'
Using Long Clipping Sequence: 'AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCCTCTACGTCTCGTGGGCTCGG'
Skipping duplicate Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTATATCTATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT'
Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
ILLUMINACLIP: Using 2 prefix pairs, 21 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 9355 Both Surviving: 9280 (99.20%) Forward Only Surviving: 51 (0.55%) Reverse Only Surviving: 24 (0.26%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done          
Clean up empty files
	done
Input and filter stats:
	Input sequences (file 1): 9,280
	Input bases (file 1): 934,697
	Input mean length (file 1): 100.72
	Input sequences (file 2): 9,280
	Input bases (file 2): 928,905
	Input mean length (file 2): 100.10
	Good sequences (pairs): 9,274
	Good bases (pairs): 1,862,558
	Good mean length (pairs): 200.84
	Good sequences (singletons file 1): 6 (0.06%)
	Good bases (singletons file 1): 606
	Good mean length (singletons file 1): 101.00
	Good sequences (singletons file 2): 0 (0.00%)
	Bad sequences (file 1): 0 (0.00%)
	Bad sequences (file 2): 6 (0.06%)
	Bad bases (file 2): 438
	Bad mean length (file 2): 73.00
	Sequences filtered by specified parameters:
	ns_max_n: 6
...
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done          
Clean up empty files
	done
Input and filter stats:
	Input sequences: 81
	Input bases: 8,036
	Input mean length: 99.21
	Good sequences: 76 (93.83%)
	Good bases: 7,531
	Good mean length: 99.09
	Bad sequences: 5 (6.17%)
	Bad bases: 505
	Bad mean length: 101.00
	Sequences filtered by specified parameters:
	derep: 5
...
System information:
  SPAdes version: 3.12.0
  Python version: 3.6.6
  OS: Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
...

======= SPAdes pipeline finished.
...
Thank you for using SPAdes!
The output will stored in the file myTest.fasta.

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. viral-ngs.sh). For example:

#!/bin/bash
module load viral-ngs
   [ optional_arguments ]

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] viral-ngs.sh