High-Performance Computing at the NIH
cutadapt on Biowulf & Helix

Description

Cutadapt removes adapter sequences, primers, poly-A tails, low quality segments, and other unwanted sequence from your high-throughput sequencing reads.

There are multiple versions of cutadapt available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail cutadapt 

To select a module use

module load cutadapt/[version]

where [version] is the version of choice.

Note that a current version of cutadapt is also installed in all python environments:

module load python
cutadapt --version
# 1.9.1

Environment variables set

References

Documentation

Running cutadapt on Helix

Cutadapt is part of all python2 and python3 environments. The sample session below removes the universal TruSeq adapter sequence, low-quality 3' segments, and terminal N stretches from paired, compressed FASTQ files:


helix$ module load python
helix$ module list
Currently Loaded Modules:
  1) python/2.7.9
helix$ cd /data/$USER/test_data
helix$ cutadapt -q 10 --minimum-length 25 --trim-n \
  -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
  -o out/read1.fastq.gz -p out/read2.fastq.gz \
  read1.fastq.gz read2.fastq.gz
=== Summary ===

Total read pairs processed:            250,000
  Read 1 with adapter:                   6,123 (2.4%)
  Read 2 with adapter:                   6,348 (2.5%)
Pairs that were too short:              38,014 (15.2%)
Pairs written (passing filters):       211,986 (84.8%)

Total basepairs processed:    25,000,000 bp
  Read 1:    12,500,000 bp
  Read 2:    12,500,000 bp
Quality-trimmed:               2,670,047 bp (10.7%)
  Read 1:       838,585 bp
  Read 2:     1,831,462 bp
Total written (filtered):     21,070,344 bp (84.3%)
  Read 1:    10,514,326 bp
  Read 2:    10,556,018 bp
........
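
When the report is saved to a file, the pair counts can be cross-checked with a few lines of shell. The following is a minimal sketch, not part of cutadapt itself: `report.txt` and the helper `get_count` are hypothetical names, and the two report lines are copied from the sample session so that the snippet is self-contained.

```shell
#!/bin/bash
# Sketch: sanity-check the pair counts in a saved cutadapt report.
# report.txt is a hypothetical file name; it is filled here with two
# lines copied from the sample session above.
cat > report.txt <<'EOF'
Total read pairs processed:            250,000
Pairs written (passing filters):       211,986 (84.8%)
EOF

# Print the first number after the colon on the matching line,
# with the thousands separators removed.
get_count() {
    grep -F "$1" report.txt | head -n 1 \
        | sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]].*$//' -e 's/,//g'
}

total=$(get_count "Total read pairs processed")
written=$(get_count "Pairs written")
discarded=$((total - written))

echo "discarded pairs: $discarded of $total"
```

For the sample session this prints `discarded pairs: 38014 of 250000`, which matches the number of pairs reported as filtered out.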
Running a single cutadapt batch job on Biowulf

Set up a batch script for the adapter trimming job:


#! /bin/bash
#SBATCH --job-name=cutadapt

set -e

r1=fastq/read1.fastq.gz
r2=fastq/read2.fastq.gz

module load cutadapt/1.9.1 || exit 1
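cd "$SLURM_SUBMIT_DIR"
# make sure the output directory exists
mkdir -p fastq_clean
# ${r1#fastq/} strips the leading "fastq/" from the input path
cutadapt -q 10 --trim-n --minimum-length 25 \
  -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
  -o "fastq_clean/${r1#fastq/}" -p "fastq_clean/${r2#fastq/}" \
  "$r1" "$r2"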
The batch script is submitted for processing with

sbatch cutadapt_batch_script.sh
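
The batch script derives its output paths from the input paths with bash prefix removal (`${var#pattern}`). The sketch below only illustrates that expansion and runs without cutadapt; the variable names `with_slash` and `without_slash` are introduced here for illustration.

```shell
#!/bin/bash
# Sketch: ${var#pattern} removes the shortest matching prefix.
r1=fastq/read1.fastq.gz

with_slash="fastq_clean/${r1#fastq}"      # removes "fastq", keeps the "/"
without_slash="fastq_clean/${r1#fastq/}"  # removes "fastq/"

echo "$with_slash"     # fastq_clean//read1.fastq.gz
echo "$without_slash"  # fastq_clean/read1.fastq.gz
```

Both forms name the same file on a POSIX file system, since repeated slashes are treated as one, but stripping `fastq/` gives the cleaner path.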
Running a swarm of cutadapt batch jobs on Biowulf

The following swarm file removes TruSeq adapters from the 3' end and trims low-quality segments and N stretches from single-end FASTQ files:


cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o clean1.fq.gz dirty1.fq.gz
cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o clean2.fq.gz dirty2.fq.gz
cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o clean3.fq.gz dirty3.fq.gz

The swarm file is then executed with default settings


biowulf$ swarm -f swarmfile --module cutadapt/1.9.1
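
For more than a handful of input files, the swarm file does not have to be written by hand. The following is a minimal sketch that writes one cutadapt command per input file, using the options of the example with the quality cutoff written as `-q 10`. The `dirty*`/`clean*` naming follows the example; the sample list is a hypothetical placeholder and would normally come from a glob such as `dirty*.fq.gz`.

```shell
#!/bin/bash
# Sketch: write one cutadapt command per input file into a swarm file.
# The sample names are hypothetical placeholders.
samples="dirty1.fq.gz dirty2.fq.gz dirty3.fq.gz"

: > swarmfile   # start with an empty swarm file
for in_fq in $samples; do
    out_fq="clean${in_fq#dirty}"   # dirty1.fq.gz -> clean1.fq.gz
    echo "cutadapt -q 10 --trim-n --minimum-length 25 -a AGATCGGAAGAGC -o $out_fq $in_fq" >> swarmfile
done

cat swarmfile
```

Deriving each output name from its input name ensures that no two commands write to the same output file.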
Running an interactive job on Biowulf

It may be useful to run cutadapt jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as shown below and run the job there.


biowulf$ sinteractive
salloc.exe: Granted job allocation 133383
srun: error: x11: no local DISPLAY defined, skipping
cn0147$ module load python
cn0147$ cutadapt -q 10 --trim-n --minimum-length 25 \
    -a AGATCGGAAGAGC -o clean1.fq.gz dirty1.fq.gz
.....
cn0147$ exit
biowulf$