SyRI on Biowulf

SyRI is a comprehensive tool for predicting genomic differences between related genomes using whole-genome assemblies (WGA). The assemblies are aligned using whole-genome alignment tools, and these alignments are then used as input to SyRI. SyRI identifies the syntenic path (the longest set of co-linear regions), structural rearrangements (inversions, translocations, and duplications), local variations (SNPs, indels, CNVs, etc.) within syntenic regions and structural rearrangements, and un-aligned regions.

References:

Goel, M., Sun, H., Jiao, WB. et al. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol 20, 277 (2019).

Documentation

Important Notes

Module Name: syri (see the modules page for more information)

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load syri
[+] Loading syri  1.6.3  on cn4294
[+] Loading singularity  4.0.1  on cn4294
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226

Example

Most jobs should be run as batch jobs.
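
A batch submission could look roughly like the sketch below. The script name syri_job.sh, the resource values, and the input file names (out.sam, refgenome, qrygenome, as produced in the example pipeline further down this page) are assumptions; adjust them to your data. The sketch only writes the script and checks its syntax; the submission command is shown as a comment.

```shell
# Write a hypothetical batch script (file name and inputs are assumptions).
cat > syri_job.sh <<'EOF'
#!/bin/bash
set -e
module load syri
# Use the CPUs allocated by Slurm; fall back to 1 if the variable is unset.
syri -c out.sam -r refgenome -q qrygenome -F S --nc "${SLURM_CPUS_PER_TASK:-1}"
EOF

# Syntax-check the script without running it.
bash -n syri_job.sh && echo "syri_job.sh: syntax OK"

# Submit with sbatch, for example (resource values are placeholders):
#   sbatch --cpus-per-task=4 --mem=8g syri_job.sh
```

Per the help text, --nc is capped at the number of chromosomes, so requesting more CPUs than chromosomes is unlikely to help.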

Running help command:

  [user@cn4338]$ syri -h
  usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B,P}] [-f] [-k] [--dir DIR] [--prefix PREFIX]
  [--seed SEED] [--nc NCORES] [--novcf] [--nosr] [--invgaplen INVGL] [--tdgaplen TDGL]
  [--tdmaxolp TDOLP] [-b BRUTERUNTIME] [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT]
  [--inc INCREASEBY] [--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET] [--cigar]
  [-s SSPATH] [--hdrseq] [--maxsize MAXS] [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--version]

  Input Files:
  -c INFILE File containing alignment coordinates (default: None)
  -r REF Genome A (which is considered as reference for the alignments). Required for local
  variation (large indels, CNVs) identification. (default: None)
  -q QRY Genome B (which is considered as query for the alignments). Required for local variation
  (large indels, CNVs) identification. (default: None)
  -d DELTA .delta file from mummer. Required for short variation (SNPs/indels) identification when
  CIGAR string is not available (default: None)

  Additional arguments:
  -F {T,S,B,P} Input file type. T: Table, S: SAM, B: BAM, P: PAF (default: T)
  -f As a default, syri filters out low quality and small alignments. Use this parameter to
  use the full list of alignments without any filtering. (default: True)
  -k Keep intermediate output files (default: False)
  --dir DIR path to working directory (if not current directory). All files must be in this
  directory. (default: None)
  --prefix PREFIX Prefix to add before the output file Names (default: )
  --seed SEED seed for generating random numbers (default: 1)
  --nc NCORES number of cores to use in parallel (max is number of chromosomes) (default: 1)
  --novcf Do not combine all files into one output file (default: False)

  SR identification:
  --nosr Set to skip structural rearrangement identification (default: False)
  --invgaplen INVGL Maximum allowed gap-length between two alignments of a multi-alignment inversion. It
  affects the selection of large inversions that can have different length in the
  reference and query genomes. (default: 1000000000)
  --tdgaplen TDGL Maximum allowed gap-length between two alignments of a multi-alignment translocation or
  duplication (TD). Larger values increases TD identification sensitivity but also
  runtime. (default: 500000)
  --tdmaxolp TDOLP Maximum allowed overlap between two translocations. Value should be in range (0,1].
  (default: 0.8)
  -b BRUTERUNTIME Cutoff to restrict brute force methods to take too much time (in seconds). Smaller
  values would make algorithm faster, but could have marginal effects on accuracy. In
  general case, would not be required. (default: 60)
  --unic TRANSUNICOUNT Number of uniques bps for selecting translocation. Smaller values would select smaller
  TLs better, but may increase time and decrease accuracy. (default: 1000)
  --unip TRANSUNIPERCENT
  Percent of unique region requried to select translocation. Value should be in range
  (0,1]. Smaller values would allow selection of TDs which are more overlapped with other
  regions. (default: 0.5)
  --inc INCREASEBY Minimum score increase required to add another alignment to translocation cluster
  solution (default: 1000)
  --no-chrmatch Do not allow SyRI to automatically match chromosome ids between the two genomes if they
  are not equal (default: False)

  ShV identification:
  --nosv Set to skip structural variation identification (default: False)
  --nosnp Set to skip SNP/Indel (within alignment) identification (default: False)
  --all Use duplications too for variant identification (default: False)
  --allow-offset OFFSET
  BPs allowed to overlap (default: 5)
  --cigar Find SNPs/indels using CIGAR string. Necessary for alignments generated using aligners other than nucmers (default: False)
  -s SSPATH path to show-snps from mummer (default: show-snps)
  --hdrseq Output highly-diverged regions (HDRs) sequence. (default: False)
  --maxsize MAXS Max size for printing sequence of large SVs (insertions, deletions and HDRs). Only
  affect printing (.out/.vcf file) and not the selection. SVs larger than this value would
  be printed as symbolic SVs. For no cut-off use -1. (default: -1)

  optional arguments:
  -h, --help show this help message and exit
  --log {DEBUG,INFO,WARN}
  log level (default: INFO)
  --lf LOG_FIN Name of log file (default: syri.log)
  --version show program's version number and exit
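
The -F option listed above has to match the format of the alignment file given to -c. As a small illustration, the hypothetical helper below (syri_format_flag is not part of SyRI) picks the value from the file extension, assuming conventional .sam, .bam, and .paf extensions, and falls back to T, the documented default.

```shell
# Hypothetical helper: map an alignment file name to a value for syri's -F option.
syri_format_flag() {
    case "$1" in
        *.sam) echo S ;;
        *.bam) echo B ;;
        *.paf) echo P ;;
        *)     echo T ;;   # -F defaults to T (table) per the help text
    esac
}

# Print the command that would be run; file names here are placeholders.
aln=out.sam
echo "syri -c $aln -r refgenome -q qrygenome -F $(syri_format_flag "$aln")"
```

The extension is only a hint; check the actual file format if your aligner uses different naming.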

Example

## An example pipeline for running SyRI
## Get yeast reference genome
[user@cn4338]$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_genomic.fna.gz
[user@cn4338]$ gzip -df GCA_000146045.2_R64_genomic.fna.gz
## Get query genome
[user@cn4338]$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/977/955/GCA_000977955.2_Sc_YJM1447_v1/GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.gz
[user@cn4338]$ gzip -df GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.gz
## Remove mitochondria
[user@cn4338]$ head -151797 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna > GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered

## Rename for clarity
[user@cn4338]$ ln -sf GCA_000146045.2_R64_genomic.fna refgenome
[user@cn4338]$ ln -sf GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered qrygenome

## Perform whole-genome alignment
# minimap2 is used here to generate the alignment. Any other whole-genome alignment tool can also be used.
[user@cn4338]$ module load minimap2
[+] Loading minimap2, version 2.26...

[user@cn4338]$ minimap2 -ax asm5 --eqx refgenome qrygenome > out.sam
[M::mm_idx_gen::0.219*0.87] collected minimizers
[M::mm_idx_gen::0.283*1.12] sorted minimizers
[M::main::0.283*1.12] loaded/built the index for 16 target sequence(s)
[M::mm_mapopt_update::0.296*1.11] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 16
[M::mm_idx_stat::0.306*1.11] distinct minimizers: 1142213 (98.22% are singletons); average occurrences: 1.058; average spacing: 9.990; total length: 12071326
[M::worker_pipeline::4.370*1.87] mapped 16 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax asm5 --eqx refgenome qrygenome
[M::main] Real time: 4.377 sec; CPU: 8.193 sec; Peak RSS: 0.898 GB

## Run SyRI with SAM or BAM file as input
[user@cn4338]$ syri -c out.sam -r refgenome -q qrygenome -k -F S
Reading Coords - WARNING - Chromosomes IDs do not match.
Reading Coords - WARNING - Matching them automatically. For each reference genome, most similar query genome will be selected. Check mapids.txt for mapping used.
TLOut.txt  is empty. Skipping analysing it.
invTLOut.txt  is empty. Skipping analysing it.
TL Out.txt is empty. Skipping analysing it.
invTL Out.txt is empty. Skipping analysing it.
TLOut.txt  is empty. Skipping analysing it.
invTLOut.txt  is empty. Skipping analysing it.
local_variation - WARNING - Finished syri
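
Once the run has finished, a quick tally of annotation types is a convenient sanity check. The sketch below assumes that the main output file is named syri.out (no --prefix given), that it is tab-separated, and that the annotation type is in column 11; verify this against the SyRI documentation for your version. The helper name count_annotations is hypothetical.

```shell
# Hypothetical helper: count records per annotation type in a SyRI .out file.
# Assumes tab-separated columns with the annotation type in column 11.
count_annotations() {
    awk -F'\t' 'NF >= 11 { n[$11]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Run only if the output file is present in the current directory.
if [ -f syri.out ]; then
    count_annotations syri.out
fi
```

Each output line is an annotation type followed by its count, sorted by type.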

For more examples, please visit the SyRI example page.
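
The "Remove mitochondria" step in the example pipeline keeps a fixed number of lines with head, which only works for that exact file. A header-based filter is one possible alternative. The helper name drop_fasta_records and the pattern "mitochondrion" are assumptions; inspect the FASTA headers first to choose a pattern that matches your assembly.

```shell
# Hypothetical helper: drop FASTA records whose header matches a pattern.
# Usage: drop_fasta_records PATTERN input.fna > output.fna
drop_fasta_records() {
    awk -v pat="$1" '
        /^>/ { keep = ($0 !~ pat) }   # decide per record at each header line
        keep                          # print lines belonging to kept records
    ' "$2"
}

# Example (pattern is an assumption; list the headers first to confirm it):
#   grep '^>' GCA_000977955.2_Sc_YJM1447_v1_genomic.fna
#   drop_fasta_records mitochondrion GCA_000977955.2_Sc_YJM1447_v1_genomic.fna > GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
```

The pattern is treated as a regular expression and matched against the whole header line.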