Iterative error correction of long (250 or 300 bp) Illumina reads minimizes the number of erroneous reads, which improves contig assembly. This pipeline runs multiple rounds of k-mer-based correction with increasing k-mer sizes, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, the pipeline can correct more base substitution errors, especially in repeats. The final overlap-based correction round can also correct small insertions and deletions.
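The round structure described above can be pictured with a small dry-run sketch. It only prints commands and does not run them. The script that SGA-ICE.py actually generates is not shown on this page, so the sequence below is an assumption built from SGA's `index` and `correct` subcommands. The function name `plan_rounds` and all file names are hypothetical. The overlap settings (error rate 0.01, minimum overlap 40) are the values the tool reports in the sample session.

```bash
#!/bin/bash
# Dry-run sketch: print (do not execute) one plausible sequence of SGA
# commands for iterative correction. File names and the exact options are
# assumptions; the script written by SGA-ICE.py may differ.

plan_rounds() {
    local reads=$1        # input FASTQ file (hypothetical name)
    local kmers=$2        # comma-separated k-mer sizes, smallest first
    local threads=$3      # number of threads for each SGA step
    local overlap=${4:-yes}
    local k out

    # k-mer rounds: each round indexes and corrects the previous round's output
    for k in ${kmers//,/ }; do
        out="round.k${k}.fastq"
        echo "sga index -t ${threads} --no-reverse ${reads}"
        echo "sga correct -a kmer -k ${k} -t ${threads} -o ${out} ${reads}"
        reads=$out
    done

    # optional final round: overlap-based correction of the last output
    if [ "$overlap" = "yes" ]; then
        echo "sga index -t ${threads} ${reads}"
        echo "sga correct -a overlap -e 0.01 -m 40 -t ${threads} -o final.corrected.fastq ${reads}"
    fi
}

# Example: three k-mer rounds plus the overlap round, 8 threads
plan_rounds reads.fastq 40,60,100 8
```

Passing `no` as the fourth argument drops the overlap round, which mirrors what `--noOvlCorr` is described as doing in the sample session.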
References:
- Sameith K, Roscito J, Hiller M (2016). Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Briefings in Bioinformatics, doi: 10.1093/bib/bbw003
Important notes:
- Module Name: SGA-ICE (see the modules page for more information)
- Multithreaded app (use -t option)
- Example files in /usr/local/apps/SGA-ICE/TEST_DATA
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive -c 8 --mem=20g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load SGA-ICE
[user@cn3144 ~]$ cd /data/$USER
[user@cn3144 ~]$ cp -r /usr/local/apps/SGA-ICE/TEST_DATA .
[user@cn3144 ~]$ SGA-ICE.py /data/$USER/TEST_DATA -k 40,60,100,125,150,200 -t 8 --noCleanup --noOvlCorr --scriptName test_sga.sh
### Input directory is: TEST_DATA/
### Number of threads is: 8
### User provided k-mers are: [40, 60, 100, 125, 150, 200]
### User chose not to do overlap correction
### Temporary files will not be deleted
### Error rate for sga correct 0.01
### Minimum overlap for sga correct 40
Shell script is called: test_sga.sh
### These are the fastq files found in /data/teacher/TEST_DATA/:
SRR6219624_1.fastq
SRR6219624_2.fastq
### Your reads are 301 bp long, calculated from file: /data/teacher/TEST_DATA/SRR6219624_1.fastq
To run 6 correction round with k=40,60,100,125,150,200 using 8 threads, execute
/data/teacher/TEST_DATA//test_sga.sh

[user@cn3144 ~]$ cd TEST_DATA
[user@cn3144 ~]$ ./test_sga.sh
Temporary directory is: /tmp/tmp.K1DyIAaV1g
### Start sga preprocessing ###
preprocess: WARNING - it is suggested that the min read length is 40
preprocess: Using very short reads may considerably impact the performance
Parameters:
QualTrim: 0
QualFilter: no filtering
HardClip: 0
Min length: 0
Sample freq: 1
PE Mode: 0
Quality scaling: 2
MinGC: 0
MaxGC: 1
Outfile: SRR6219624_1.SGAICEpreproc.fastq
Orphan file: none
Seed: 1527619454
Processing SRR6219624_1.fastq
[...]

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
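In the session above the tool reports the read length (301 bp) before writing the correction script. Because a k-mer longer than a read cannot occur in that read, it can be useful to compare a planned `-k` list against the read length beforehand. The following is a minimal sketch, not part of SGA-ICE: the helper names `first_read_length` and `kmers_fit` are hypothetical, and it assumes an uncompressed, four-line-per-record FASTQ file in which the first read is representative of the rest.

```bash
#!/bin/bash
# Hypothetical helpers (not part of SGA-ICE) for checking a k-mer list
# against the read length before generating the correction script.

# Print the length of the first read: line 2 of the first FASTQ record.
first_read_length() {
    awk 'NR == 2 { print length($0); exit }' "$1"
}

# Return 0 only if every k in a comma-separated list is <= the read length.
kmers_fit() {
    local kmers=$1 read_len=$2 k
    for k in ${kmers//,/ }; do
        if [ "$k" -gt "$read_len" ]; then
            echo "k=$k exceeds read length $read_len" >&2
            return 1
        fi
    done
}

# Self-contained demo on a tiny temporary FASTQ file with one 12 bp read
demo=$(mktemp)
printf '@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' > "$demo"
len=$(first_read_length "$demo")
if kmers_fit 4,8 "$len"; then
    echo "k-mer list fits reads of length $len"
fi
rm -f "$demo"
```

With the test data, the equivalent call would be `kmers_fit 40,60,100,125,150,200 "$(first_read_length TEST_DATA/SRR6219624_1.fastq)"`.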
Create a batch input file (e.g. SGA-ICE.sh). For example:
#!/bin/bash
module load SGA-ICE
cd /data/teacher
cp -r /usr/local/apps/SGA-ICE/TEST_DATA .
SGA-ICE.py TEST_DATA -k 40,60,100,125,150,200 -t $SLURM_CPUS_PER_TASK --noCleanup --noOvlCorr --scriptName test_sga.sh
cd TEST_DATA
./test_sga.sh
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=8 --mem=20g SGA-ICE.sh
Create a swarmfile (e.g. SGA-ICE.swarm). For example:
cd /dir/fastq/1; SGA-ICE.py /dir/fastq/1 -t $SLURM_CPUS_PER_TASK [...] run1.sh; ./run1.sh
cd /dir/fastq/2; SGA-ICE.py /dir/fastq/2 -t $SLURM_CPUS_PER_TASK [...] run2.sh; ./run2.sh
cd /dir/fastq/3; SGA-ICE.py /dir/fastq/3 -t $SLURM_CPUS_PER_TASK [...] run3.sh; ./run3.sh
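With many sample directories, a swarmfile like the one above can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the function name `make_swarm` is hypothetical, each subdirectory of the parent directory is taken to hold one sample's FASTQ files, and only the `-t` and `--scriptName` options seen elsewhere on this page are written. Any further options would need to be added to the `echo` line.

```bash
#!/bin/bash
# Hypothetical helper (not part of SGA-ICE or swarm): write one swarm line
# per subdirectory of a parent directory and print how many were written.

make_swarm() {
    local parent=$1 swarmfile=$2
    local dir n=0

    : > "$swarmfile"                    # start with an empty swarmfile
    for dir in "$parent"/*/; do
        [ -d "$dir" ] || continue       # skip if the glob matched nothing
        dir=${dir%/}                    # drop the trailing slash
        n=$((n + 1))
        # \$ keeps SLURM_CPUS_PER_TASK literal so each subjob expands it itself
        echo "cd $dir; SGA-ICE.py $dir -t \$SLURM_CPUS_PER_TASK --scriptName run$n.sh; ./run$n.sh" >> "$swarmfile"
    done
    echo "$n"
}

# Example call (paths are placeholders):
#   make_swarm /dir/fastq SGA-ICE.swarm
```

Keeping `$SLURM_CPUS_PER_TASK` unexpanded in the file means the thread count follows whatever is requested with `-t` when the swarm is submitted.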
Submit this job using the swarm command.
swarm -f SGA-ICE.swarm -g 10 -t 8 --module SGA-ICE
where
-g # | Number of gigabytes of memory required for each process (one line in the swarm command file)
-t # | Number of threads/CPUs required for each process (one line in the swarm command file)
--module SGA-ICE | Loads the SGA-ICE module for each subjob in the swarm