Cutadapt removes adapter sequences, primers, poly-A tails, low quality segments, and other unwanted sequence from your high-throughput sequencing reads.
Environment variables set: $CUTADAPT_TEST_DATA (sample paired-end data for testing)
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load cutadapt
[user@cn3144 ~]$ # copy some paired end RNASeq data (Illumina)
[user@cn3144 ~]$ cp $CUTADAPT_TEST_DATA/* .
[user@cn3144 ~]$ ls -lh
-rw-r--r-- 1 user group 45M Feb  2 07:22 read1_1000k.fastq.gz
-rw-r--r-- 1 user group 35M Feb  2 07:22 read2_1000k.fastq.gz
[user@cn3144 ~]$ cutadapt -q 10 --minimum-length 25 \
    -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o read1_trimmed.fastq.gz -p read2_trimmed.fastq.gz \
    read1_1000k.fastq.gz read2_1000k.fastq.gz
This is cutadapt 1.15 with Python 3.6.4
Command line parameters: -q 10 --minimum-length 25 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o read1_trimmed.fastq.gz -p read2_trimmed.fastq.gz read1_1000k.fastq.gz read2_1000k.fastq.gz
Running on 1 core
Trimming 2 adapters with at most 10.0% errors in paired-end mode ...
Finished in 53.95 s (54 us/read; 1.11 M reads/minute).

=== Summary ===

Total read pairs processed:          1,000,000
  Read 1 with adapter:                  22,329 (2.2%)
  Read 2 with adapter:                  23,325 (2.3%)
Pairs that were too short:             221,834 (22.2%)
Pairs written (passing filters):       778,166 (77.8%)

Total basepairs processed:   100,000,000 bp
  Read 1:    50,000,000 bp
  Read 2:    50,000,000 bp
Quality-trimmed:              29,853,878 bp (29.9%)
  Read 1:    14,926,939 bp
  Read 2:    14,926,939 bp
Total written (filtered):     77,202,392 bp (77.2%)
  Read 1:    38,478,257 bp
  Read 2:    38,724,135 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 22329 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 26.6%
  C: 28.3%
  G: 28.1%
  T: 16.9%
  none/other: 0.1%

Overview of removed sequences
length  count   expect  max.err error counts
3       17956   15625.0 0       17956
4       3301    3906.2  0       3301
...
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
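The sample session above runs on 1 core; the batch example below requests more with `--cores=$SLURM_CPUS_PER_TASK`. When the same command is tested interactively or outside a Slurm allocation, that variable may be unset, so a safe pattern is bash's default expansion. A minimal sketch (the fallback value of 1 is an assumption, not a Slurm or cutadapt default):

```shell
# Fall back to 1 core when Slurm did not set SLURM_CPUS_PER_TASK
# (e.g. when trying the command outside a batch job)
unset SLURM_CPUS_PER_TASK    # simulate running outside a Slurm allocation
echo "--cores=${SLURM_CPUS_PER_TASK:-1}"
```

Inside an sbatch job submitted with `--cpus-per-task=4`, the same expansion yields `--cores=4`.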
Create a batch script (e.g. cutadapt.sh). For example:
#!/bin/bash
set -e

r1=fastq/read1.fastq.gz
r2=fastq/read2.fastq.gz

module load cutadapt/1.15 || exit 1
mkdir -p fastq_clean
cutadapt -q 10 --trim-n --minimum-length 25 \
    --cores=$SLURM_CPUS_PER_TASK \
    -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o fastq_clean/${r1#fastq/} -p fastq_clean/${r2#fastq/} \
    $r1 $r2
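The output paths in the script are built with bash prefix removal: `#pattern` after a variable name strips the shortest matching prefix from its value, so the trimmed files keep their original names under fastq_clean/. A quick illustration with the same file names:

```shell
r1=fastq/read1.fastq.gz
# ${r1#fastq/} removes the leading "fastq/" directory component,
# leaving just the file name to place under fastq_clean/
echo "fastq_clean/${r1#fastq/}"
```

This prints `fastq_clean/read1.fastq.gz`.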
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=4 [--mem=#] cutadapt.sh
Create a swarmfile (e.g. cutadapt.swarm). For example:
cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o clean1.fq.gz dirty1.fq.gz
cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o clean2.fq.gz dirty2.fq.gz
cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o clean3.fq.gz dirty3.fq.gz
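With many input files, the swarm file can be generated with a loop instead of being written by hand. A minimal sketch, assuming the dirty*/clean* naming pattern from the example above (the loop itself is an illustration, not part of the swarm syntax):

```shell
# Write one cutadapt command per input file; clean${f#dirty}
# turns dirty1.fq.gz into clean1.fq.gz, and so on
for f in dirty1.fq.gz dirty2.fq.gz dirty3.fq.gz; do
    echo "cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o clean${f#dirty} $f"
done > cutadapt.swarm
```

In practice the file list would come from a glob such as `dirty*.fq.gz`.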
Submit this job using the swarm command.
swarm -f cutadapt.swarm [-g #] [-t #] --module cutadapt
where
-g #               Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #               Number of threads/CPUs required for each process (1 line in the swarm command file)
--module cutadapt  Loads the cutadapt module for each subjob in the swarm