Biowulf High Performance Computing at the NIH
snippy: rapid haploid variant calling and core genome alignment

snippy is a tool for rapid haploid variant calling and core genome alignment.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=16g --gres=lscratch:10
[user@cn3200 ~]$module load snippy  
[+] Loading snippy 4.4.1  ...
Here is how snippy can be run on a test example:
[user@cn3200 ~]$ cp $SNIPPY_DATA/* .
[user@cn3200 ~]$ snippy --cpus 16 --outdir mysnps --ref example.gbk  --R1  test_R1.fastq.gz --R2 test_R2.fastq.gz
[12:59:37] This is snippy 4.4.0
[12:59:37] Written by Torsten Seemann
[12:59:37] Obtained from https://github.com/tseemann/snippy
[12:59:37] Detected operating system: linux
[12:59:37] Enabling bundled linux tools.
[12:59:37] Found bwa - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/bwa
[12:59:37] Found bcftools - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/bcftools
[12:59:37] Found samtools - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/samtools
[12:59:37] Found java - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/java
[12:59:37] Found snpEff - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/snpEff
[12:59:37] Found samclip - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/samclip
[12:59:37] Found seqtk - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/seqtk
[12:59:37] Found parallel - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/parallel
[12:59:37] Found freebayes - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/freebayes
[12:59:37] Found freebayes-parallel - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/freebayes-parallel
[12:59:37] Found fasta_generate_regions.py - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/fasta_generate_regions.py
[12:59:37] Found vcfstreamsort - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/vcfstreamsort
[12:59:37] Found vcfuniq - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/vcfuniq
[12:59:37] Found vcffirstheader - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/vcffirstheader
[12:59:37] Found gzip - /usr/bin/gzip
[12:59:37] Found vt - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/vt
[12:59:37] Found snippy-vcf_to_tab - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/snippy-vcf_to_tab
[12:59:37] Found snippy-vcf_report - /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/snippy-vcf_report
[12:59:37] Checking version: samtools --version is >= 1.7 - ok, have 1.9
[12:59:37] Checking version: bcftools --version is >= 1.7 - ok, have 1.9
[12:59:37] Checking version: freebayes --version is >= 1.1 - ok, have 1.3
[12:59:37] Checking version: snpEff -version is >= 4.3 - ok, have 4.3
[12:59:37] Using reference: example.gbk
[12:59:37] Treating reference as 'genbank' format.
[12:59:37] Will use 16 CPU cores.
[12:59:37] Using read file: test_R1.fastq.gz
[12:59:37] Using read file: test_R2.fastq.gz
[12:59:37] Creating folder: mysnps
[12:59:37] Changing working directory: mysnps
[12:59:37] Creating reference folder: reference
[12:59:37] Extracting FASTA and GFF from reference.
[12:59:37] Wrote 3 sequences to ref.fa
[12:59:37] Wrote 300 features to ref.gff
[12:59:37] Creating reference/snpeff.config
[12:59:37] Freebayes will process 31 chunks of 10411 bp, 16 chunks at a time.
[12:59:37] Using BAM RG (Read Group) ID: mysnps
[12:59:37] Running: samtools faidx reference/ref.fa 2>> snps.log
[12:59:37] Running: bwa index reference/ref.fa 2>> snps.log
[12:59:37] Running: mkdir -p reference/genomes && cp -f reference/ref.fa reference/genomes/ref.fa 2>> snps.log[12:59:37] Running: ln -sf reference/ref.fa . 2>> snps.log
[12:59:37] Running: ln -sf reference/ref.fa.fai . 2>> snps.log
[12:59:37] Running: mkdir -p reference/ref && gzip -c reference/ref.gff > reference/ref/genes.gff.gz 2>> snps.log
[12:59:37] Running: snpEff build -c reference/snpeff.config -dataDir . -gff3 ref 2>> snps.log
[12:59:40] Running: bwa mem  -Y -M -R '@RG\tID:mysnps\tSM:mysnps' -t 16 reference/ref.fa test_R1.fastq.gz test_R2.fastq.gz | samclip --max 10 --ref reference/ref.fa.fai | samtools sort -n -l 0 -T /tmp --threads 16 -m 250M | samtools fixmate -m - - | samtools sort -l 0 -T /tmp --threads 16 -m 250M | samtools markdup -T /tmp -r -s - - > snps.bam 2>> snps.log
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[samclip] samclip 0.2 by Torsten Seemann (@torstenseemann)
[samclip] Loading: reference/ref.fa.fai
[samclip] Found 3 sequences in reference/ref.fa.fai
[M::process] read 25626 sequences (3752821 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 25626 reads in 2.914 CPU sec, 0.736 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -Y -M -R @RG\tID:mysnps\tSM:mysnps -t 16 reference/ref.fa test_R1.fastq.gz test_R2.fastq.gz
[main] Real time: 0.869 sec; CPU: 2.994 sec
[samclip] Total SAM records 25626, removed 0, allowed 0, passed 25626
[samclip] Header contained 5 lines
[samclip] Done.
[bam_sort_core] merging from 0 files and 16 in-memory blocks...
[bam_sort_core] merging from 0 files and 16 in-memory blocks...
[12:59:41] Running: samtools index snps.bam 2>> snps.log
[12:59:41] Running: fasta_generate_regions.py reference/ref.fa.fai 10411 > reference/ref.txt 2>> snps.log
[12:59:41] Running: freebayes-parallel reference/ref.txt 16 -p 2 -P 0 -C 10 --min-repeat-entropy 1.5 --strict-vcf -q 13 -m 60 --min-coverage 10 -F 0.05  -f reference/ref.fa snps.bam > snps.raw.vcf 2>> snps.log
[12:59:41] Running: freebayes-parallel reference/ref.txt 16 -p 2 -P 0 -C 10 --min-repeat-entropy 1.5 --strict-
vcf -q 13 -m 60 --min-coverage 10 -F 0.05  -f reference/ref.fa snps.bam > snps.raw.vcf 2>> snps.log
[12:59:42] Running: bcftools view --include 'FMT/GT="1/1" && QUAL>=100 && FMT/DP>=10 && (FMT/AO)/(FMT/DP)>=0' snps.raw.vcf  | vt normalize -r reference/ref.fa - | bcftools annotate --remove '^INFO/TYPE,^INFO/DP,^INFO/RO,^INFO/AO,^INFO/AB,^FORMAT/GT,^FORMAT/DP,^FORMAT/RO,^FORMAT/AO,^FORMAT/QR,^FORMAT/QA,^FORMAT/GL' > snps.filt.vcf 2>> snps.log
normalize v0.5

options:     input VCF file                                  -
         [o] output VCF file                                 -
         [w] sorting window size                             10000
         [n] no fail on reference inconsistency for non SNPs false
         [q] quiet                                           false
         [d] debug                                           false
         [r] reference FASTA file                            reference/ref.fa


stats: biallelic
          no. left trimmed                      : 0
          no. right trimmed                     : 0
          no. left and right trimmed            : 0
          no. right trimmed and left aligned    : 0
          no. left aligned                      : 0

       total no. biallelic normalized           : 0

       multiallelic
          no. left trimmed                      : 0
          no. right trimmed                     : 0
          no. left and right trimmed            : 0
          no. right trimmed and left aligned    : 0
          no. left aligned                      : 0

       total no. multiallelic normalized        : 0

       total no. variants normalized            : 0
       total no. variants observed              : 0
       total no. reference observed             : 0

Time elapsed: 0.00s

[12:59:42] Running: snpEff ann -noLog -noStats -no-downstream -no-upstream -no-utr -t -c reference/snpeff.config -dataDir . ref snps.filt.vcf > snps.vcf 2>> snps.log
[12:59:42] Running: /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/snippy-vcf_to_tab --gff reference/ref.gff --ref reference/ref.fa --vcf snps.vcf > snps.tab 2>> snps.log
[12:59:43] Running: /usr/local/Anaconda/envs_app/snippy/4.4.1/bin/snippy-vcf_extract_subs snps.filt.vcf > snps.subs.vcf 2>> snps.log
[12:59:43] Running: bcftools convert -Oz -o snps.vcf.gz snps.vcf 2>> snps.log
[12:59:43] Running: bcftools index -f snps.vcf.gz 2>> snps.log
[12:59:43] Running: bcftools consensus -f reference/ref.fa -o snps.consensus.fa snps.vcf.gz 2>> snps.log
[12:59:43] Running: bcftools convert -Oz -o snps.subs.vcf.gz snps.subs.vcf 2>> snps.log
[12:59:43] Running: bcftools index -f snps.subs.vcf.gz 2>> snps.log
[12:59:43] Running: bcftools consensus -f reference/ref.fa -o snps.consensus.subs.fa snps.subs.vcf.gz 2>> snps.log
[12:59:43] Running: rm -f snps.subs.vcf.gz snps.subs.vcf.gz.csi snps.subs.vcf.gz.tbi 2>> snps.log
[12:59:43] Generating reference aligned/masked FASTA relative to reference: snps.aligned.fa
[12:59:43] Marked 0 heterozygous sites with 'n'
[12:59:43] Creating extra output files: BED GFF CSV TXT HTML
[12:59:43] Identified 0 variants.
[12:59:43] Result folder: mysnps
[12:59:43] Result files:
[12:59:43] * mysnps/snps.aligned.fa
[12:59:43] * mysnps/snps.bam
[12:59:43] * mysnps/snps.bam.bai
[12:59:43] * mysnps/snps.bed
[12:59:43] * mysnps/snps.consensus.fa
[12:59:43] * mysnps/snps.consensus.subs.fa
[12:59:43] * mysnps/snps.csv
[12:59:43] * mysnps/snps.filt.vcf
[12:59:43] * mysnps/snps.gff
[12:59:43] * mysnps/snps.html
[12:59:43] * mysnps/snps.log
[12:59:43] * mysnps/snps.raw.vcf
[12:59:43] * mysnps/snps.subs.vcf
[12:59:43] * mysnps/snps.tab
[12:59:43] * mysnps/snps.txt
[12:59:43] * mysnps/snps.vcf
[12:59:43] * mysnps/snps.vcf.gz
[12:59:43] * mysnps/snps.vcf.gz.csi
[12:59:43] Walltime used: 6 seconds
[12:59:43] May the SNPs be with you.
[12:59:43] Done.
An output folder mysnps is created:
[user@cn3200 ~]$ tree mysnps
mysnps
├ reference
│   ├ genomes
│   │   └ ref.fa
│   ├ ref
│   │   ├ genes.gff.gz
│   │   └ snpEffectPredictor.bin
│   ├ ref.fa
│   ├ ref.fa.amb
│   ├ ref.fa.ann
│   ├ ref.fa.bwt
│   ├ ref.fa.fai
│   ├ ref.fa.pac
│   ├ ref.fa.sa
│   ├ ref.gff
│   ├ ref.txt
│   └ snpeff.config
├ ref.fa -> reference/ref.fa
├ ref.fa.fai -> reference/ref.fa.fai
├ snps.aligned.fa
├ snps.bam
├ snps.bam.bai
├ snps.bed
├ snps.consensus.fa
├ snps.consensus.subs.fa
├ snps.csv
├ snps.filt.vcf
├ snps.gff
├ snps.html
├ snps.log
├ snps.raw.vcf
├ snps.subs.vcf
├ snps.tab
├ snps.txt
├ snps.vcf
├ snps.vcf.gz
└ snps.vcf.gz.csi

3 directories, 33 filesa
End the interactive session:
[user@cn3200 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$