fastq-pair on Biowulf

Fastq-pair rewrites paired-end fastq files to make sure that all reads have a mate and to separate out singletons. This code does one thing: it takes two fastq files and generates four fastq files. That's right, for free it doubles the number of fastq files that you have!

Usually when you get paired-end read files, you have two files with a /1 sequence in one and a /2 sequence in the other (or a /f and /r, or just two reads with the same ID). However, files from a third-party source (e.g. the SRA) often contain different numbers of reads in each file, because some reads fail QC. SPAdes, bowtie2 and other tools break because they require paired-end files to have the same number of reads.

This program solves that problem. It rewrites the two files provided on the command line as two matching files with the paired sequences in the same order, and places any reads without a mate in two separate files, one for each original file.

The code is designed to be fast and memory efficient, and works with large fastq files. It does not store a whole file in memory; it keeps in memory only the location of each read in the first file provided.
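
A quick way to see whether a pair of files has the mismatch described above is to compare their read counts. The following is a minimal sketch, assuming uncompressed fastq files with exactly four lines per record; the helper name count_reads is hypothetical and is not part of fastq-pair.

```shell
# Count reads in an uncompressed fastq file, assuming four lines per record.
# count_reads is a hypothetical helper, not part of fastq-pair.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical usage: compare the two files of a pair.
# left=$(count_reads left.fastq)
# right=$(count_reads right.fastq)
# [ "$left" -eq "$right" ] || echo "read counts differ: $left vs $right"
```

Unequal counts show that the files are out of sync. Equal counts do not prove that the reads are in the same order, so they are only a first check.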

References:

John A. Edwards, Robert A. Edwards. Fastq-pair: efficient synchronization of paired-end fastq files. bioRxiv 552885 (2019).

Documentation

Fastq-pair Main Site

Important Notes

Module Name: fastq-pair (see the modules page for more information)
Environment variables set
- EXAMPLES

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the `$` prompt):

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load fastq-pair
[+] Loading fastq-pair  1.0  on cn3144
[user@cn3144 ~]$ cp -a $EXAMPLES .
[user@cn3144 ~]$ fastq_pair -t 1000 test/left.fastq test/right.fastq
Writing the paired reads to test/left.fastq.paired.fq and test/right.fastq.paired.fq.
Writing the single reads to test/left.fastq.single.fq and test/right.fastq.single.fq
Left paired: 50         Right paired: 50
[user@cn3144 ~]$ exit

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. fastq-pair.sh). For example:

#!/bin/bash
set -e
module load fastq-pair
fastq_pair -t 1000 test/left.fastq test/right.fastq

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] fastq-pair.sh
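
The examples on this page pass -t 1000, which suits the small test files. The -t option sets the size of the hash table, and the upstream documentation suggests a value close to the number of reads in the input. The sketch below derives such a value, assuming uncompressed fastq files with four lines per record; suggest_table_size is a hypothetical helper, not part of fastq-pair.

```shell
# Suggest a -t value from the number of reads in a fastq file.
# suggest_table_size is a hypothetical helper; it assumes four lines per record
# and returns at least 1 so that an empty file does not yield a zero-sized table.
suggest_table_size() {
    awk 'END { n = int(NR / 4); print (n > 0 ? n : 1) }' "$1"
}

# Hypothetical use inside the batch script:
# fastq_pair -t "$(suggest_table_size test/left.fastq)" test/left.fastq test/right.fastq
```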

Swarm of Jobs

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. fastq-pair.swarm). For example:

fastq_pair [...] file1.fastq file2.fastq
fastq_pair [...] file3.fastq file4.fastq
fastq_pair [...] file5.fastq file6.fastq
fastq_pair [...] file7.fastq file8.fastq
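
For many samples, a swarmfile like the one above can be generated rather than typed. The following is a minimal sketch, assuming the mates are named sample_R1.fastq and sample_R2.fastq in one directory and that paths contain no spaces; the function name make_swarm and the naming scheme are assumptions for illustration, and any fastq_pair options still need to be added to the generated lines.

```shell
# Write one fastq_pair command per sample to a swarmfile.
# make_swarm and the *_R1.fastq / *_R2.fastq naming are assumptions for illustration.
make_swarm() {
    dir=$1
    out=$2
    : > "$out"                              # start with an empty swarmfile
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue            # glob matched nothing
        r2=${r1%_R1.fastq}_R2.fastq         # expected name of the mate file
        if [ -e "$r2" ]; then
            echo "fastq_pair $r1 $r2" >> "$out"
        else
            echo "skipping $r1: no mate file $r2" >&2
        fi
    done
}

# Hypothetical usage:
# make_swarm /path/to/fastq fastq-pair.swarm
```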

Submit this job using the swarm command.

swarm -f fastq-pair.swarm [-g #] [-t #] --module fastq-pair

where

`-g #`	Number of Gigabytes of memory required for each process (1 line in the swarm command file)
`-t #`	Number of threads/CPUs required for each process (1 line in the swarm command file).
`--module fastq-pair`	Loads the fastq-pair module for each subjob in the swarm