High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
READemption on Biowulf & Helix

Description

READemption is a RNA-Seq pipeline from alignment with segemehl to differential gene expression analysis with DESeq. READemption projects follow a specific layout by convention.

READemption is structured as a single executable with multiple subcommands:

usage: reademption [-h] [--version]
                   {create,align,coverage,gene_quanti,deseq,viz_align,viz_gene_quanti,viz_deseq}
                   ...

positional arguments:
  {create,align,coverage,gene_quanti,deseq,viz_align,viz_gene_quanti,viz_deseq}
                        commands
    create              Create a project
    align               Run read alignings
    coverage            Create coverage (wiggle) files
    gene_quanti         Quantify the expression gene wise
    deseq               Compare expression pairwise using DESeq
    viz_align           Generate read processing and alignment visualisations.
    viz_gene_quanti     Generate gene wise quantification visualisations.
    viz_deseq           Generate DESeq visualisations.

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show version

Subcommands will take the analysis folder as an argument. If none is given, it is assumed that the current folder is the analysis folder.

References

Web sites

On Helix

Some stages in the READemption pipeline are multithreaded. On helix, please limit any such stages to two threads. For more threads run stages as batch jobs or in interactive sessions.

Set up the environment

helix$ module load segemehl reademption

and create an analysis folder

helix$ cd /data/$USER/test_data
helix$ reademption create ecoli
helix$ tree ecoli

ecoli
|-- [user   4.0K]  input
|   |-- [user   4.0K]  annotations
|   |-- [user   4.0K]  reads
|   `-- [user   4.0K]  reference_sequences
`-- [user   4.0K]  output
    |-- [user   4.0K]  align
    |   |-- [user   4.0K]  alignments
    |   |-- [user   4.0K]  index
    |   |-- [user   4.0K]  processed_reads
    |   |-- [user   4.0K]  reports_and_stats
    |   |   |-- [user   4.0K]  stats_data_json
    |   |   `-- [user    244]  versions_of_used_libraries.txt
    |   `-- [user   4.0K]  unaligned_reads
    |-- [user   4.0K]  coverage
    |   |-- [user   4.0K]  coverage-raw
    |   |-- [user   4.0K]  coverage-tnoar_mil_normalized
    |   `-- [user   4.0K]  coverage-tnoar_min_normalized
    |-- [user   4.0K]  deseq
    |   |-- [user   4.0K]  deseq_raw
    |   `-- [user   4.0K]  deseq_with_annotations
    |-- [user   4.0K]  gene_quanti
    |   |-- [user   4.0K]  gene_quanti_combined
    |   `-- [user   4.0K]  gene_quanti_per_lib
    |-- [user   4.0K]  viz_align
    |-- [user   4.0K]  viz_deseq
    `-- [user   4.0K]  viz_gene_quanti

Let's add some example data to the project directory (a single sample for simplicity)

helix$ cd ecoli
helix$ ln -s /usr/local/apps/reademption/TEST_DATA/SRR1536586.fastq.gz \
  input/reads
helix$ ln -s /usr/local/apps/reademption/TEST_DATA/ecoli_mg1655.gff3 \
  input/annotations
helix$ ln -s /usr/local/apps/reademption/TEST_DATA/ecoli_mg1655.fa.gz \
  input/reference_sequences

Align the reads to the reference (from within the project dir)

helix$ reademption align -p 2 -q

This creates preprocessed reads, aligned reads in bam format, and some stats files. For example

helix$ cat output/align/reports_and_stats/read_alignment_stats.csv
Libraries       SRR1536586
No. of input reads      6503557
No. of reads - PolyA detected and removed       0
No. of reads - Single 3' A removed      0
No. of reads - Unmodified       6503557
No. of reads - Removed as too short     0
No. of reads - Long enough and used for alignment       6503557
Total no. of aligned reads      5614693
Total no. of unaligned reads    888864
Total no. of uniquely aligned reads     3529743
Total no. of alignments 16776668
Total no. of split alignments   0
Percentage of aligned reads (compared to no. of input reads)    86.33
Percentage of aligned reads (compared to no. of long enough reads)      86.33
Percentage of uniquely aligned reads (in relation to all aligned reads) 62.87
Chromosome - No. of aligned reads       5614693
Chromosome - No. of uniquely aligned reads      3529743
Chromosome - No. of alignments  16776668
Chromosome - No. of split alignments    0
helix$ cat output/align/reports_and_stats/stats_data_json/read_processing.json
{
    "SRR1536586": {
        "unmodified": 6503557,
        "read_length_before_processing_and_freq": {
            "50": 6503557
        },
        "single_a_removed": 0,
        "total_no_of_reads": 6503557,
        "too_short": 0,
        "read_length_after_processing_and_freq": {
            "50": 6503557
        },
        "polya_removed": 0,
        "long_enough": 6503557
    }
}

Next, create coverage tracks in wiggle format

helix$ reademption coverage -p 2

and quantitate gene expression

helix$ reademption gene_quanti -p 2 --pseudocounts

Skipping differential gene expression analysis with reademption deseq since this example just had one replicate of one condition.

Graphs of read lenghts before/after read clipping

helix$ reademption viz_align

After these processes complete, the output directory in the project will contain all the generated files

output/
|-- [user   4.0K]  align
|   |-- [user   4.0K]  alignments
|   |   |-- [user   273M]  SRR1536586_alignments_final.bam
|   |   `-- [user    10K]  SRR1536586_alignments_final.bam.bai
|   |-- [user   4.0K]  index
|   |   `-- [user    63M]  index.idx
|   |-- [user   4.0K]  processed_reads
|   |   |-- [user      0]  SRR1536586_processed.fa
|   |   `-- [user   140M]  SRR1536586_processed.fa.gz
|   |-- [user   4.0K]  reports_and_stats
|   |   |-- [user    829]  read_alignment_stats.csv
|   |   |-- [user   4.0K]  stats_data_json
|   |   |   |-- [user   1.7K]  read_alignments_final.json
|   |   |   `-- [user    389]  read_processing.json
|   |   `-- [user    244]  versions_of_used_libraries.txt
|   `-- [user   4.0K]  unaligned_reads
|       `-- [user   127M]  SRR1536586_unaligned.fa
|-- [user   4.0K]  coverage
|   |-- [user   4.0K]  coverage-raw
|   |   |-- [user    26M]  SRR1536586_forward.wig
|   |   `-- [user    28M]  SRR1536586_reverse.wig
|   |-- [user   4.0K]  coverage-tnoar_mil_normalized
|   |   |-- [user    54M]  SRR1536586_div_by_5614693.0_multi_by_1000000.0_forward.wig
|   |   `-- [user    58M]  SRR1536586_div_by_5614693.0_multi_by_1000000.0_reverse.wig
|   `-- [user   4.0K]  coverage-tnoar_min_normalized
|       |-- [user    26M]  SRR1536586_div_by_5614693.0_multi_by_5614693.0_forward.wig
|       `-- [user    28M]  SRR1536586_div_by_5614693.0_multi_by_5614693.0_reverse.wig
|-- [user   4.0K]  deseq
|   |-- [user   4.0K]  deseq_raw
|   `-- [user   4.0K]  deseq_with_annotations
|-- [user   4.0K]  gene_quanti
|   |-- [user   4.0K]  gene_quanti_combined
|   |   |-- [user   6.5M]  gene_wise_quantifications_combined.csv
|   |   |-- [user   6.6M]  gene_wise_quantifications_combined_rpkm.csv
|   |   `-- [user   6.7M]  gene_wise_quantifications_combined_tnoar.csv
|   `-- [user   4.0K]  gene_quanti_per_lib
|       `-- [user   3.3M]  SRR1536586_to_ecoli_mg1655.gff3.csv
|-- [user   4.0K]  viz_align
|   |-- [user    14K]  input_reads_length_distributions.pdf
|   `-- [user    14K]  processed_reads_length_distributions.pdf
|-- [user   4.0K]  viz_deseq
`-- [user   4.0K]  viz_gene_quanti
Batch job on Biowulf

Combine the first few steps of the READemption pipeline into a single batch script similar to the following:

#! /bin/bash
set -e

module load segemehl/0.2.0 reademption/0.3.7 || exit 80
cd /data/$USER/reademption_test
reademption align -p 10 -q
reademption coverage -p 10
reademption gene_quanti -p 10 --pseudocounts

And submit the batch script to the slurm queue

b2$ sbatch --mem=4G --cpus-per-task=10 reademption_batch.sh

Note that in this example a small genome is used. For larger genomes, the segemehl aligner consumes more memory.

Interactive job on Biowulf

Allocate an interactive session and set up the environment

b2$ sinteractive --mem=4G --cpus-per-task=10
[...snip...]
cn0034$ module load segemehl reademption
[+] Loading segemehl 0.2.0
[+] Loading reademption 0.3.7
cn0034$ cd /path/to/project
cn0034$ reademption align -p10
[...snip...]

After allocating an interactive session, use reademption as described for helix above.

Documentation

Web sites