High-Performance Computing at the NIH
poretools on Biowulf & Helix

Description

Poretools is a toolkit for manipulating and exploring nanopore sequencing data sets. It operates on individual FAST5 files, directories of FAST5 files, and tar archives of FAST5 files.

There may be multiple versions of poretools available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail poretools 

To select a module use

module load poretools/[version]

where [version] is the version of choice.
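
After loading a module, a script can confirm that the tool is actually reachable before doing any work. The following is a minimal sketch; `require_command` is a hypothetical helper name, not part of poretools or the modules system, and it only checks that a command of the given name can be found on the PATH.

```shell
# require_command NAME: succeed if NAME can be found on the PATH,
# otherwise print a hint to stderr and return a non-zero status.
# (hypothetical helper, for illustration only)
require_command() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "error: $1 not found on PATH; was the module loaded?" >&2
        return 1
    }
}
```

In a script this could follow the module load line, for example as `require_command poretools || exit 1`.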

Environment variables set

$PORETOOLS_TEST_DATA (location of the example data used below)

Interactive job on Biowulf

Allocate an interactive session with sinteractive and set up the environment. The example data set is a subsample of a run from the Ebola surveillance project.

biowulf$ sinteractive --mem=6g
node$ module load poretools
[+] Loading poretools 0.6.1a1
node$ poretools --help
usage: poretools [-h] [-v]
                 {combine,fastq,fasta,...,yield_plot,occupancy,organise}
                 ...

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Installed poretools version

[sub-commands]:
  {combine,fastq,fasta,...,yield_plot,occupancy,organise}
    combine             Combine a set of FAST5 files in a TAR achive
    fastq               Extract FASTQ sequences from a set of FAST5 files
    fasta               Extract FASTA sequences from a set of FAST5 files
    stats               Get read size stats for a set of FAST5 files
    hist                Plot read size histogram for a set of FAST5 files
    events              Extract each nanopore event for each read.
    readstats           Extract signal information for each read over time.
    tabular             Extract the lengths and name/seq/quals from a set of
                        FAST5 files in TAB delimited format
    nucdist             Get the nucl. composition of a set of FAST5 files
    metadata            Return run metadata such as ASIC ID and temperature
                        from a set of FAST5 files
    index               Tabulate all file location info and metadata such as
                        ASIC ID and temperature from a set of FAST5 files
    qualdist            Get the qual score composition of a set of FAST5 files
    qualpos             Get the qual score distribution over positions in
                        reads
    winner              Get the longest read from a set of FAST5 files
    squiggle            Plot the observed signals for FAST5 reads.
    times               Return the start times from a set of FAST5 files in
                        tabular format
    yield_plot          Plot the yield over time for a set of FAST5 files
    occupancy           Inspect pore activity over time for a set of FAST5
                        files
    organise            Move FAST5 files into a useful folder hierarchy


node$ cp ${PORETOOLS_TEST_DATA}/ERA484348_014370_subset.tar .
node$ ls -lh ERA484348_014370_subset.tar
-rw-r--r-- 1 user group 2.1G Jun 30 08:42 ERA484348_014370_subset.tar
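
To see how many FAST5 files an archive holds without unpacking it, the member list can be read with tar. This is a minimal sketch; `count_fast5` is a hypothetical helper name, and it assumes the reads are stored as members whose names end in `.fast5`.

```shell
# count_fast5 ARCHIVE: print the number of members ending in ".fast5"
# in a tar archive, reading only the listing (nothing is extracted).
# (hypothetical helper; assumes the ".fast5" file name suffix)
count_fast5() {
    tar -tf "$1" | grep -c '\.fast5$'
}
```

For the example data this would be called as `count_fast5 ERA484348_014370_subset.tar`.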

Extract FASTQ-format sequences from all the pass reads in the example data set. Note that poretools can work on a tar archive directly, so there is no need to extract the archive and create large numbers of small files, which can degrade file system performance.

node$ poretools fastq ERA484348_014370_subset.tar | gzip -c - > 014370.fastq.gz
node$ ls -lh 014370.fastq.gz
-rw-r--r-- 1 user group 13M Jun 30 08:51 014370.fastq.gz
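
A quick sanity check on the compressed output is to count the records it contains. The sketch below assumes each FASTQ record occupies exactly four lines (header, sequence, separator, qualities); `count_fastq_records` is a hypothetical helper name.

```shell
# count_fastq_records FILE.gz: print the number of FASTQ records in a
# gzip-compressed file, assuming four lines per record.
# (hypothetical helper, for illustration only)
count_fastq_records() {
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}
```

For the file created above this would be `count_fastq_records 014370.fastq.gz`.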

Create a collector's curve of yield (cumulative number of reads over time)

node$ poretools yield_plot --plot-type reads --saveas yield.png \
            ERA484348_014370_subset.tar
[Figure: poretools yield curve (yield.png)]

Read size information

node$ poretools stats ERA484348_014370_subset.tar
total reads     11787
total base pairs        12470421
mean    1057.98
median  1013
min     320
max     2877
N25     1060
N50     1019
N75     981
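
The N25, N50 and N75 rows are easier to read with the conventional definition in mind: sort the read lengths from longest to shortest, add them up, and report the length of the read at which the running sum first reaches 25%, 50% or 75% of all bases. That is why N25 is never smaller than N50, and N50 never smaller than N75. The sketch below implements this conventional definition; whether poretools computes its values in exactly this way is an assumption, and `nxx` is a hypothetical helper name.

```shell
# nxx PCT: read one read length per line on stdin and print the Nxx value,
# i.e. the length of the read at which the cumulative sum of lengths
# (longest first) first reaches PCT percent of the total.
# (hypothetical helper implementing the conventional definition)
nxx() {
    sort -rn | awk -v pct="$1" '
        { len[NR] = $1; total += $1 }
        END {
            target = total * pct / 100
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= target) { print len[i]; exit }
            }
        }'
}
```

For the five lengths 2, 3, 4, 5 and 6 (20 bases in total), `printf '%s\n' 2 3 4 5 6 | nxx 50` prints 5, because the two longest reads together hold 11 of the 20 bases.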

node$ poretools stats --type fwd ERA484348_014370_subset.tar
total reads     3929
total base pairs        4112608
mean    1046.73
median  1001
min     320
max     2833
N25     1044
N50     1007
N75     973
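
Because the stats output is a simple two-column table of name and value, single values are easy to pull out for use in scripts. The following is a sketch that assumes the value is always the last whitespace-separated field on the line; `stat_field` is a hypothetical helper name.

```shell
# stat_field NAME: read "poretools stats"-style output on stdin and print
# the value for the row called NAME (for example "total reads").
# Assumes the value is the last whitespace-separated field of each row.
# (hypothetical helper, for illustration only)
stat_field() {
    awk -v key="$1" '
        {
            value = $NF              # last field is the value
            $NF = ""                 # drop it, leaving the row name
            sub(/[ \t]+$/, "")       # trim the trailing separator
            if ($0 == key) print value
        }'
}
```

It could be used as `poretools stats ERA484348_014370_subset.tar | stat_field 'total reads'`. In the two outputs above, the forward reads account for exactly one third of the total (3929 of 11787).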

node$ poretools hist --saveas size.png --theme-bw ERA484348_014370_subset.tar
[Figure: poretools size histogram (size.png)]

Quality score distribution by position

node$ poretools qualpos --saveas qual.pdf \
  --bin-width 100 ERA484348_014370_subset.tar
[Figure: poretools quality vs position (qual.pdf)]

Batch job on Biowulf

Create a batch script similar to the following example:

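#!/bin/bash
# this file is poretools.batch
# assumes the archive is in the directory the job is submitted from

# load the default poretools module; stop if it cannot be loaded
module load poretools || exit 1
# read size statistics, written to the file "stats"
poretools stats ERA484348_014370_subset.tar > stats
# quality score distribution by read position (bin width 100)
poretools qualpos --saveas qual.png --bin-width 100 ERA484348_014370_subset.tar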

Submit to the queue with sbatch, here with a time limit of 20 minutes:

biowulf$ sbatch --time=20 poretools.batch
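
When several archives need the same treatment, the commands for a batch script can be generated rather than typed. The sketch below only prints one `poretools stats` command per tar archive in a directory, so the result can be reviewed before it is pasted into a batch script. `print_stats_commands` is a hypothetical helper name, and the sketch assumes file names without whitespace.

```shell
# print_stats_commands DIR: print one "poretools stats" command for each
# .tar file in DIR; nothing is executed.
# (hypothetical helper; assumes file names contain no whitespace)
print_stats_commands() {
    for archive in "$1"/*.tar; do
        [ -e "$archive" ] || continue   # skip when the glob matches nothing
        name=$(basename "$archive" .tar)
        printf 'poretools stats %s > %s.stats\n' "$archive" "$name"
    done
}
```

For example, `print_stats_commands . >> poretools.batch` would append the generated lines to the batch script shown above.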