Biowulf High Performance Computing at the NIH
Taiji on Biowulf

Taiji is a versatile genomics data analysis pipeline. It can be used to analyze ATAC-seq, RNA-seq, single-cell ATAC-seq and Drop-seq data. Taiji accepts many data formats: it can start from raw data such as FASTQ files or from processed files such as BAM or BED files.

Subcommands

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the shell prompt):

[user@biowulf]$ sinteractive --mem=16g --cpus-per-task=16 --time=30:00:00 
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load taiji
[user@cn3144 ~]$ mkdir -p /data/$USER/taiji; cd /data/$USER/taiji
[user@cn3144 ~]$ cp $TAIJI_TEST_DATA/* . 
# Edit input.tsv and the configuration file as needed.
# input.tsv and config.yml are based on GRCh38.
# Run taiji in a single thread.
[user@cn3144 ~]$ taiji run --config config.yml
WARNING: Bind mount '/data => /data' overlaps container CWD /data/apptest3/taiji, may not be available
[INFO][05-01 16:16] ATAC_Download_Data(4582..): Running ...
[INFO][05-01 16:18] ATAC_Download_Data(4582..): Complete!
[INFO][05-01 16:18] RNA_Read_Input(7785..): Running ...
[INFO][05-01 16:18] RNA_Read_Input(7785..): Complete!
[INFO][05-01 16:18] ATAC_Download_Data(7958..): Running ...
[INFO][05-01 16:21] ATAC_Download_Data(7958..): Complete!
[INFO][05-01 16:21] SCATAC_Read_Input(fa14..): Running ...
[...]

[user@cn3144 ~]$ tree output
output
├── ATACSeq
│   ├── Bam
│   │   ├── 2h_rep1.bam
│   │   ├── 2h_rep1_filt.bam
│   │   ├── 2h_rep1_filt_dedup.bam
│   │   ├── 4h_rep1.bam
│   │   ├── 4h_rep1_filt.bam
│   │   ├── 4h_rep1_filt_dedup.bam
│   │   ├── control_rep1.bam
│   │   ├── control_rep1_filt.bam
│   │   └── control_rep1_filt_dedup.bam
│   ├── Bed
│   │   ├── 2h_rep1.bed.gz
│   │   ├── 4h_rep1.bed.gz
│   │   └── control_rep1.bed.gz
│   ├── BigWig
│   │   ├── 2h_rep0.bw
│   │   ├── 4h_rep0.bw
│   │   └── control_rep0.bw
│   ├── Download
│   │   ├── ENCFF173INV.fastq.gz
│   │   ├── ENCFF322IZC.fastq.gz
│   │   ├── ENCFF443HVZ.fastq.gz
│   │   ├── ENCFF562JHD.fastq.gz
│   │   ├── ENCFF893KQZ.fastq.gz
│   │   └── ENCFF943SYH.fastq.gz
│   ├── GeneQuant
│   │   ├── 2h_rep0_gene_quant.tsv
│   │   ├── 4h_rep0_gene_quant.tsv
│   │   ├── control_rep0_gene_quant.tsv
│   │   └── expression_profile.tsv
│   ├── openChromatin.bed
│   ├── Peaks
│   │   ├── 2h_rep0.narrowPeak
│   │   ├── 2h_rep1.signal.bin
│   │   ├── 4h_rep0.narrowPeak
│   │   ├── 4h_rep1.signal.bin
│   │   ├── control_rep0.narrowPeak
│   │   └── control_rep1.signal.bin
│   ├── QC
│   │   └── qc.html
│   └── TFBS
│       ├── 2h_rep0.bed
│       ├── 4h_rep0.bed
│       ├── control_rep0.bed
│       ├── motif_sites_part17102-0.bed
│       ├── motif_sites_part17102-1.bed
│       ├── motif_sites_part17102-2.bed
│       ├── motif_sites_part17102-3.bed
│       ├── motif_sites_part17102-4.bed
│       ├── motif_sites_part17102-5.bed
│       ├── motif_sites_part17102-6.bed
│       ├── motif_sites_part44433-0.bed
│       └── motif_sites_part44433-1.bed
├── GeneRanks.html
├── GeneRanks_PValues.tsv
├── GeneRanks.tsv
├── GENOME
│   └── genome.index
└── Network
    ├── 2h
    │   ├── edges_binding.csv
    │   ├── edges_combined.csv
    │   └── nodes.csv
    ├── 4h
    │   ├── edges_binding.csv
    │   ├── edges_combined.csv
    │   └── nodes.csv
    └── control
        ├── edges_binding.csv
        ├── edges_combined.csv
        └── nodes.csv

14 directories, 58 files

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load taiji
[user@cn3144 ~]$ mkdir -p /data/$USER/taiji2; cd /data/$USER/taiji2
[user@cn3144 ~]$ cp $TAIJI_TEST_DATA/* . 
# Run taiji in multiple threads.
# input2.tsv and config2.yml are based on mm10.
[user@cn3144 ~]$ taiji run --config config2.yml -n 20 --cloud

[user@cn3144 ~]$ exit
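
After a run like the first sample session finishes, it can help to confirm that the main result files were written before moving on. The sketch below is not part of Taiji: check_taiji_output is a hypothetical helper name, and the file list is simply copied from the tree listing in the sample session, so other configurations may produce different files.

```shell
# Hypothetical helper (not a Taiji command): report which top-level results exist.
# The file names are taken from the sample "tree output" listing; adjust as needed.
check_taiji_output() {
    local outdir="${1:-output}"
    local missing=0
    local f
    for f in GeneRanks.tsv GeneRanks_PValues.tsv GeneRanks.html ATACSeq/QC/qc.html; do
        if [ -s "$outdir/$f" ]; then
            # File exists and is not empty.
            echo "found:   $outdir/$f"
        else
            echo "missing: $outdir/$f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected file was found.
    [ "$missing" -eq 0 ]
}
```

For the sample session this would be called as check_taiji_output /data/$USER/taiji/output; a non-zero exit status means at least one of the listed files is absent or empty.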

Batch job
Most jobs should be run as batch jobs.

Taiji can create and submit the batch job for you.

Create a batch input file (e.g. taiji.sh) that uses the input file 'input2.tsv' and the configuration file 'config2.yml'. For example:

#! /bin/bash
set -e
module load taiji
taiji run --config config2.yml -n 20 --cloud

Submit this job using the Slurm sbatch command.

sbatch taiji.sh
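
The batch script above does not request any resources, so the job would run with the partition defaults. As a sketch, the same script can carry #SBATCH directives. The values here are assumptions, not recommendations from the Taiji documentation: memory and walltime are copied from the sinteractive call in the sample session, and the CPU count is set to match -n 20. Whether those 20 workers run inside this allocation or as separate jobs under --cloud depends on the Taiji configuration, so treat the numbers as placeholders.

```shell
# Write taiji.sh with explicit Slurm resource requests (values are assumptions).
cat > taiji.sh <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem=16g
#SBATCH --time=30:00:00
set -e
module load taiji
taiji run --config config2.yml -n 20 --cloud
EOF

# Syntax check only; nothing in the script is executed here.
bash -n taiji.sh

# Submit on the cluster with:
#   sbatch taiji.sh
```

The same requests can instead be passed on the command line, for example sbatch --cpus-per-task=20 --mem=16g taiji.sh, which leaves the script itself unchanged.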

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. taiji.swarm). For example:

cd fd1; taiji run --config config_1.yml -n 20 --cloud
cd fd2; taiji run --config config_2.yml -n 20 --cloud
cd fd3; taiji run --config config_3.yml -n 20 --cloud

Submit this job using the swarm command.

swarm -f taiji.swarm --module taiji
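
When there are many run directories, the swarmfile can be generated instead of typed by hand. The sketch below assumes the layout used in the example swarmfile (directories fd1, fd2, ... each holding config_1.yml, config_2.yml, ...); make_taiji_swarm is a hypothetical helper name, not a Biowulf or Taiji command.

```shell
# Hypothetical helper: write one swarm line per run directory fd1..fdN.
make_taiji_swarm() {
    local n="$1"
    local out="${2:-taiji.swarm}"
    local i
    : > "$out"   # create or truncate the swarmfile
    for i in $(seq 1 "$n"); do
        echo "cd fd$i; taiji run --config config_$i.yml -n 20 --cloud" >> "$out"
    done
}

# Recreate the three-line example swarmfile and show it.
make_taiji_swarm 3
cat taiji.swarm
```

Per-command resources can be requested at submission time with the swarm options -g (memory in GB) and -t (threads), for example swarm -f taiji.swarm -g 16 -t 20 --module taiji. Those two values are assumptions carried over from the interactive example, not tested settings.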