Fastq-pair rewrites paired-end fastq files to make sure that all reads have a mate and to separate out singletons. This code does one thing: it takes two fastq files and generates four fastq files. That's right, for free it doubles the number of fastq files that you have!

Usually when you get paired-end read files you have two files with a /1 sequence in one and a /2 sequence in the other (or a /f and /r, or just two reads with the same ID). However, often when working with files from a third-party source (e.g. the SRA) there are different numbers of reads in each file (because some reads fail QC). SPAdes, Bowtie2 and other tools break because they demand that paired-end files have the same number of reads.

This program solves that problem. It rewrites the two files provided on the command line as matching files with the sequences in the same order, and places any single reads that are not matched in two separate files, one for each original file.

This code is designed to be fast and memory efficient, and works with large fastq files. It does not store the whole file in memory; it keeps in memory only the location of each read in the first file provided.
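The idea of indexing the first file and then streaming the second can be illustrated with a small shell sketch. This is not how fastq_pair is implemented: the hypothetical function unpaired_ids below keeps read IDs in memory rather than file locations, and it only lists the reads in the second file that have no mate in the first. It assumes strict four-line fastq records and IDs that differ at most by a trailing /1 or /2.

```shell
#!/bin/bash
# List read IDs in the second FASTQ file that have no mate in the first.
# Illustration only: assumes four-line records, and holds IDs in memory
# (fastq_pair itself stores file locations, not the IDs shown here).
unpaired_ids() {
    awk '
        FNR % 4 == 1 {                   # header line of each record
            id = substr($1, 2)           # drop the leading "@"
            sub(/\/[12]$/, "", id)       # drop a trailing /1 or /2
            if (FILENAME == ARGV[1]) seen[id] = 1
            else if (!(id in seen)) print id
        }
    ' "$1" "$2"
}
```

For example, unpaired_ids test/left.fastq test/right.fastq would print the IDs that appear only in the right-hand file; swapping the arguments gives the singletons of the left-hand file.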
Allocate an interactive session and run the program.
Sample session (user input follows each $ prompt):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load fastq-pair
[+] Loading fastq-pair 1.0 on cn3144
[user@cn3144 ~]$ cp -a $EXAMPLES .
[user@cn3144 ~]$ fastq_pair -t 1000 test/left.fastq test/right.fastq
Writing the paired reads to test/left.fastq.paired.fq and test/right.fastq.paired.fq.
Writing the single reads to test/left.fastq.single.fq and test/right.fastq.single.fq
Left paired: 50 Right paired: 50
[user@cn3144 ~]$ exit
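After a run like this, it can be useful to confirm that the two paired output files really do contain the same number of reads before passing them to an assembler or aligner. The following is a minimal sketch of such a check; count_reads and same_read_count are hypothetical helper names, and the count assumes strict four-line fastq records.

```shell
#!/bin/bash
# Count reads in a FASTQ file, assuming four lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Succeed only if both files hold the same number of reads.
same_read_count() {
    [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}
```

With the output names from the session above, the check could be run as: same_read_count test/left.fastq.paired.fq test/right.fastq.paired.fq && echo "paired files match"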
Create a batch input file (e.g. fastq-pair.sh). For example:
#!/bin/bash
set -e
module load fastq-pair
fastq_pair -t 1000 test/left.fastq test/right.fastq
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] fastq-pair.sh
Create a swarmfile (e.g. fastq-pair.swarm). For example:
fastq_pair [...] file1.fastq file2.fastq
fastq_pair [...] file3.fastq file4.fastq
fastq_pair [...] file5.fastq file6.fastq
fastq_pair [...] file7.fastq file8.fastq
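When there are many file pairs, the swarmfile can be generated rather than typed by hand. The sketch below writes one fastq_pair line per pair found in a directory. The function name make_swarmfile and the *_R1.fastq / *_R2.fastq naming convention are assumptions for illustration; adjust the pattern to match how your files are actually named.

```shell
#!/bin/bash
# Write one fastq_pair command per R1/R2 pair found in a directory.
# The *_R1.fastq / *_R2.fastq naming is an assumption for illustration.
# Usage: make_swarmfile DIR SWARMFILE [OPTIONS]
make_swarmfile() {
    local dir=$1 out=$2 opts=$3 r1 r2
    : > "$out"                               # start with an empty swarmfile
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue             # glob matched nothing
        r2=${r1%_R1.fastq}_R2.fastq
        if [ -e "$r2" ]; then
            echo "fastq_pair ${opts:+$opts }$r1 $r2" >> "$out"
        else
            echo "no mate file for $r1, skipping" >&2
        fi
    done
}
```

For example, make_swarmfile reads fastq-pair.swarm "-t 1000" would write a line for every pair in a directory named reads, passing the same -t value used in the examples above.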
Submit this job using the swarm command.
swarm -f fastq-pair.swarm [-g #] [-t #] --module fastq-pair
where
-g # | Number of gigabytes of memory required for each process (1 line in the swarm command file). |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module fastq-pair | Loads the fastq-pair module for each subjob in the swarm. |