SyRI is a comprehensive tool for predicting genomic differences between related genomes using whole-genome assemblies (WGA). The assemblies are aligned using whole-genome alignment tools, and these alignments are then used as input to SyRI. SyRI identifies the syntenic path (the longest set of co-linear regions), structural rearrangements (inversions, translocations, and duplications), local variations (SNPs, indels, CNVs, etc.) within syntenic and structurally rearranged regions, and unaligned regions.
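As an illustration of the annotation classes listed above, the following sketch tallies the records of a finished run by type. It assumes the tab-separated syri.out table, with the annotation type (for example SYN, INV, TRANS, DUP, SNP, INS, DEL, NOTAL) in column 11; the helper name count_syri_types is hypothetical and not part of SyRI.

```shell
# Hypothetical helper (not shipped with SyRI): count records per annotation type.
# Assumption: the input is a tab-separated syri.out-style table whose
# 11th column holds the annotation type (SYN, INV, SNP, ...).
count_syri_types() {
    # $1: path to the syri.out-style file
    # Prints "<type><TAB><count>" lines, sorted by type name.
    awk -F'\t' '{ n[$11]++ } END { for (t in n) print t "\t" n[t] }' "$1" \
        | LC_ALL=C sort
}

# Usage, once a run has produced syri.out:
#   count_syri_types syri.out
```

Check the column layout against the output of your own run before relying on the counts.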
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load syri
[+] Loading syri 1.6.3 on cn4294
[+] Loading singularity 4.0.1 on cn4294

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
Running help command:
[user@cn4338]$ syri -h
usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B,P}] [-f]
            [-k] [--dir DIR] [--prefix PREFIX] [--seed SEED] [--nc NCORES]
            [--novcf] [--nosr] [--invgaplen INVGL] [--tdgaplen TDGL]
            [--tdmaxolp TDOLP] [-b BRUTERUNTIME] [--unic TRANSUNICOUNT]
            [--unip TRANSUNIPERCENT] [--inc INCREASEBY] [--no-chrmatch]
            [--nosv] [--nosnp] [--all] [--allow-offset OFFSET] [--cigar]
            [-s SSPATH] [--hdrseq] [--maxsize MAXS] [--log {DEBUG,INFO,WARN}]
            [--lf LOG_FIN] [--version]

Input Files:
  -c INFILE
      File containing alignment coordinates (default: None)
  -r REF
      Genome A (which is considered as reference for the alignments).
      Required for local variation (large indels, CNVs) identification.
      (default: None)
  -q QRY
      Genome B (which is considered as query for the alignments).
      Required for local variation (large indels, CNVs) identification.
      (default: None)
  -d DELTA
      .delta file from mummer. Required for short variation (SNPs/indels)
      identification when CIGAR string is not available (default: None)

Additional arguments:
  -F {T,S,B,P}
      Input file type. T: Table, S: SAM, B: BAM, P: PAF (default: T)
  -f
      As a default, syri filters out low quality and small alignments.
      Use this parameter to use the full list of alignments without any
      filtering. (default: True)
  -k
      Keep intermediate output files (default: False)
  --dir DIR
      path to working directory (if not current directory). All files
      must be in this directory. (default: None)
  --prefix PREFIX
      Prefix to add before the output file Names (default: )
  --seed SEED
      seed for generating random numbers (default: 1)
  --nc NCORES
      number of cores to use in parallel (max is number of chromosomes)
      (default: 1)
  --novcf
      Do not combine all files into one output file (default: False)

SR identification:
  --nosr
      Set to skip structural rearrangement identification (default: False)
  --invgaplen INVGL
      Maximum allowed gap-length between two alignments of a
      multi-alignment inversion. It affects the selection of large
      inversions that can have different length in the reference and
      query genomes. (default: 1000000000)
  --tdgaplen TDGL
      Maximum allowed gap-length between two alignments of a
      multi-alignment translocation or duplication (TD). Larger values
      increases TD identification sensitivity but also runtime.
      (default: 500000)
  --tdmaxolp TDOLP
      Maximum allowed overlap between two translocations. Value should
      be in range (0,1]. (default: 0.8)
  -b BRUTERUNTIME
      Cutoff to restrict brute force methods to take too much time (in
      seconds). Smaller values would make algorithm faster, but could
      have marginal effects on accuracy. In general case, would not be
      required. (default: 60)
  --unic TRANSUNICOUNT
      Number of uniques bps for selecting translocation. Smaller values
      would select smaller TLs better, but may increase time and
      decrease accuracy. (default: 1000)
  --unip TRANSUNIPERCENT
      Percent of unique region requried to select translocation. Value
      should be in range (0,1]. Smaller values would allow selection of
      TDs which are more overlapped with other regions. (default: 0.5)
  --inc INCREASEBY
      Minimum score increase required to add another alignment to
      translocation cluster solution (default: 1000)
  --no-chrmatch
      Do not allow SyRI to automatically match chromosome ids between
      the two genomes if they are not equal (default: False)

ShV identification:
  --nosv
      Set to skip structural variation identification (default: False)
  --nosnp
      Set to skip SNP/Indel (within alignment) identification
      (default: False)
  --all
      Use duplications too for variant identification (default: False)
  --allow-offset OFFSET
      BPs allowed to overlap (default: 5)
  --cigar
      Find SNPs/indels using CIGAR string. Necessary for alignments
      generated using aligners other than nucmers (default: False)
  -s SSPATH
      path to show-snps from mummer (default: show-snps)
  --hdrseq
      Output highly-diverged regions (HDRs) sequence. (default: False)
  --maxsize MAXS
      Max size for printing sequence of large SVs (insertions, deletions
      and HDRs). Only affect printing (.out/.vcf file) and not the
      selection. SVs larger than this value would be printed as symbolic
      SVs. For no cut-off use -1. (default: -1)

optional arguments:
  -h, --help
      show this help message and exit
  --log {DEBUG,INFO,WARN}
      log level (default: INFO)
  --lf LOG_FIN
      Name of log file (default: syri.log)
  --version
      show program's version number and exit
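To show how the options above combine, here is a minimal sketch of a wrapper that picks the -F value from the alignment file extension and passes the core count through --nc. The function name run_syri and the SYRI and NCORES environment variables are hypothetical conveniences introduced for this example; only the flags themselves (-c, -r, -q, -F, --nc) come from the help text.

```shell
# Hypothetical wrapper, not part of SyRI or the syri module.
# Chooses the -F input type from the alignment file extension.
run_syri() {
    aln=$1   # alignment file (.sam, .bam, .paf, or a coordinate table)
    ref=$2   # reference genome FASTA
    qry=$3   # query genome FASTA
    case "$aln" in
        *.sam) fmt=S ;;
        *.bam) fmt=B ;;
        *.paf) fmt=P ;;
        *)     fmt=T ;;   # assumption: anything else is a coordinate table
    esac
    # SYRI may be set to "echo" for a dry run (assumption of this sketch);
    # NCORES defaults to 1, matching the documented default of --nc.
    ${SYRI:-syri} -c "$aln" -r "$ref" -q "$qry" -F "$fmt" --nc "${NCORES:-1}"
}

# Dry run that only prints the arguments:
#   SYRI=echo run_syri out.sam refgenome qrygenome
```

Because the help text caps useful parallelism at the number of chromosomes, setting NCORES above that is unlikely to help.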
Example
## An example pipeline for running SyRI

## Get Yeast Reference genome
[user@cn4338]$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_genomic.fna.gz
[user@cn4338]$ gzip -df GCA_000146045.2_R64_genomic.fna.gz

## Get Query genome
[user@cn4338]$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/977/955/GCA_000977955.2_Sc_YJM1447_v1/GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.gz
[user@cn4338]$ gzip -df GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.gz

## Remove mitochondria
[user@cn4338]$ head -151797 GCA_000977955.2_Sc_YJM1447_v1_genomic.fna > GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered

## Rename for clarity
[user@cn4338]$ ln -sf GCA_000146045.2_R64_genomic.fna refgenome
[user@cn4338]$ ln -sf GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered qrygenome

## Perform whole genome alignment
# Using minimap2 for generating alignment. Any other whole genome alignment tool can also be used.
[user@cn4338]$ module load minimap2
[+] Loading minimap2, version 2.26...
[user@cn4338]$ minimap2 -ax asm5 --eqx refgenome qrygenome > out.sam
[M::mm_idx_gen::0.219*0.87] collected minimizers
[M::mm_idx_gen::0.283*1.12] sorted minimizers
[M::main::0.283*1.12] loaded/built the index for 16 target sequence(s)
[M::mm_mapopt_update::0.296*1.11] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 16
[M::mm_idx_stat::0.306*1.11] distinct minimizers: 1142213 (98.22% are singletons); average occurrences: 1.058; average spacing: 9.990; total length: 12071326
[M::worker_pipeline::4.370*1.87] mapped 16 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax asm5 --eqx refgenome qrygenome
[M::main] Real time: 4.377 sec; CPU: 8.193 sec; Peak RSS: 0.898 GB

## Run SyRI with SAM or BAM file as input
[user@cn4338]$ syri -c out.sam -r refgenome -q qrygenome -k -F S
Reading Coords - WARNING - Chromosomes IDs do not match.
Reading Coords - WARNING - Matching them automatically. For each reference genome, most similar query genome will be selected. Check mapids.txt for mapping used.
TLOut.txt is empty. Skipping analysing it.
invTLOut.txt is empty. Skipping analysing it.
TLOut.txt is empty. Skipping analysing it.
invTLOut.txt is empty. Skipping analysing it.
TLOut.txt is empty. Skipping analysing it.
invTLOut.txt is empty. Skipping analysing it.
local_variation - WARNING - Finished syri
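The "Remove mitochondria" step in the example relies on a fixed line count (head -151797), which only works for that exact file. A minimal alternative sketch, shown here as an option rather than the method used above, drops FASTA records by header text. It assumes the mitochondrial record can be recognised from its header (for example, that it contains the word "mitochondrion"); list the headers with grep '>' to confirm. The function name drop_fasta_records is hypothetical.

```shell
# Hypothetical helper: write a FASTA file to stdout without the records
# whose header line contains a given text (case-insensitive match).
# Assumption: the unwanted record is identifiable from its header.
drop_fasta_records() {
    # $1: text to look for in header lines, $2: input FASTA file
    awk -v pat="$1" '
        /^>/ { keep = (index(tolower($0), tolower(pat)) == 0) }  # new record: decide
        keep { print }                                           # emit kept records
    ' "$2"
}

# Usage mirroring the example above:
#   drop_fasta_records mitochondrion GCA_000977955.2_Sc_YJM1447_v1_genomic.fna \
#       > GCA_000977955.2_Sc_YJM1447_v1_genomic.fna.filtered
```

Matching on the header keeps the step valid if the assembly is updated and its line count changes.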
For more examples, please visit the SyRI Example Page.