Biowulf High Performance Computing at the NIH
READemption on Biowulf

READemption is a RNA-Seq pipeline from alignment with segemehl to differential gene expression analysis with DESeq. READemption projects follow a specific layout by convention.

READemption is structured as a single executable with multiple subcommands:

usage: reademption [-h] [--version]

positional arguments:
    create              Create a project
    align               Run read alignings
    coverage            Create coverage (wiggle) files
    gene_quanti         Quantify the expression gene wise
    deseq               Compare expression pairwise using DESeq
    viz_align           Generate read processing and alignment visualisations.
    viz_gene_quanti     Generate gene wise quantification visualisations.
    viz_deseq           Generate DESeq visualisations.

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show version

Subcommands will take the analysis folder as an argument. If none is given, it is assumed that the current folder is the analysis folder.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=6g --cpus-per-task=10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load reademption

Create an analysis folder

[user@cn3144]$ cd /data/$USER/test_data
[user@cn3144]$ reademption create ecoli
[user@cn3144]$ tree ecoli

|-- [user   4.0K]  input
|   |-- [user   4.0K]  annotations
|   |-- [user   4.0K]  reads
|   `-- [user   4.0K]  reference_sequences
`-- [user   4.0K]  output
    |-- [user   4.0K]  align
    |   |-- [user   4.0K]  alignments
    |   |-- [user   4.0K]  index
    |   |-- [user   4.0K]  processed_reads
    |   |-- [user   4.0K]  reports_and_stats
    |   |   |-- [user   4.0K]  stats_data_json
    |   |   `-- [user    244]  versions_of_used_libraries.txt
    |   `-- [user   4.0K]  unaligned_reads
    |-- [user   4.0K]  coverage
    |   |-- [user   4.0K]  coverage-raw
    |   |-- [user   4.0K]  coverage-tnoar_mil_normalized
    |   `-- [user   4.0K]  coverage-tnoar_min_normalized
    |-- [user   4.0K]  deseq
    |   |-- [user   4.0K]  deseq_raw
    |   `-- [user   4.0K]  deseq_with_annotations
    |-- [user   4.0K]  gene_quanti
    |   |-- [user   4.0K]  gene_quanti_combined
    |   `-- [user   4.0K]  gene_quanti_per_lib
    |-- [user   4.0K]  viz_align
    |-- [user   4.0K]  viz_deseq
    `-- [user   4.0K]  viz_gene_quanti

Add some example data to the project directory (a single sample for simplicity)

[user@cn3144]$ cd ecoli
[user@cn3144]$ cp $READEMPTION_TEST_DATA/SRR1536586.fastq.gz input/reads
[user@cn3144]$ cp $READEMPTION_TEST_DATA/ecoli_mg1655.gff3 input/annotations
[user@cn3144]$ cp $READEMPTION_TEST_DATA/ecoli_mg1655.fa.gz input/reference_sequences

Align the reads to the reference (from within the project dir)

[user@cn3144]$ reademption align -p 10 -q

This creates preprocessed reads, aligned reads in bam format, and some stats files. For example

[user@cn3144]$ cat output/align/reports_and_stats/read_alignment_stats.csv
Libraries       SRR1536586
No. of input reads      6503557
No. of reads - PolyA detected and removed       0
No. of reads - Single 3' A removed      0
No. of reads - Unmodified       6503557
No. of reads - Removed as too short     0
No. of reads - Long enough and used for alignment       6503557
Total no. of aligned reads      5614693
Total no. of unaligned reads    888864
Total no. of uniquely aligned reads     3529743
Total no. of alignments 16776668
Total no. of split alignments   0
Percentage of aligned reads (compared to no. of input reads)    86.33
Percentage of aligned reads (compared to no. of long enough reads)      86.33
Percentage of uniquely aligned reads (in relation to all aligned reads) 62.87
Chromosome - No. of aligned reads       5614693
Chromosome - No. of uniquely aligned reads      3529743
Chromosome - No. of alignments  16776668
Chromosome - No. of split alignments    0

Next, create coverage tracks in wiggle format

[user@cn3144]$ reademption coverage -p 10

and quantitate gene expression

[user@cn3144]$ reademption gene_quanti -p 10 --pseudocounts

Skipping differential gene expression analysis with reademption deseq since this example just had one replicate of one condition.

Graphs of read lenghts before/after read clipping

[user@cn3144]$ reademption viz_align

After these processes complete, the output directory in the project will contain all the generated files. Don't forget to exit the interactive session when you are done:

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226

Batch job
Most jobs should be run as batch jobs.

Combine the first few steps of the READemption pipeline into a single batch script similar to the following:

#! /bin/bash
set -e

module load reademption/0.4.5 || exit 1
cd /data/$USER/reademption_test
reademption align -p $SLURM_CPUS_PER_TASK -q
reademption coverage -p $SLURM_CPUS_PER_TASK
reademption gene_quanti -p $SLURM_CPUS_PER_TASK --pseudocounts

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=10 --mem=10g
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. reademption.swarm). For example:

cd /data/user/reademption/experiment1 && reademption align -p $SLURM_CPUS_PER_TASK -q
cd /data/user/reademption/experiment2 && reademption align -p $SLURM_CPUS_PER_TASK -q
cd /data/user/reademption/experiment3 && reademption align -p $SLURM_CPUS_PER_TASK -q

Submit this job using the swarm command.

swarm -f reademption.swarm -g 10 -t 10 --module reademption
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module reademption Loads the reademption module for each subjob in the swarm