Biowulf High Performance Computing at the NIH
Dr.seq on Biowulf

Description

Dr.seq is a quality control (QC) and analysis pipeline for Drop-seq data. It takes a fastq file with barcode data and a fastq file of reads along with supporting files (annotation and indices for alignment) to provide QC data at the level of reads, individual cells, bulk cells, and cell-clustering.

The command line interface changed between versions. The documentation here is intended to describe the newest version.

There may be multiple versions of Dr.seq available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail drseq 

To select a module use

module load drseq/[version]

where [version] is the version of choice.

Dr.seq is a multithreaded application. Make sure the number of threads passed to --thread matches the number of CPUs requested for the job with --cpus-per-task.
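
One way to keep the two numbers in sync is to take the thread count from the Slurm allocation instead of hard-coding it. The following is a minimal sketch: the helper name drseq_threads is hypothetical, and the fallback to a single thread outside of a job is an assumption, not Dr.seq behavior.

```shell
#!/bin/bash
# Hypothetical helper (not part of Dr.seq): print the value to pass to
# --thread, taken from the Slurm allocation when one exists.
drseq_threads() {
    # SLURM_CPUS_PER_TASK is set inside jobs started with --cpus-per-task;
    # fall back to 1 thread when it is unset or empty (assumption).
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

echo "would run DrSeq with --thread $(drseq_threads)"
```

Inside a job allocated with --cpus-per-task=6 this reports 6 threads, so the mapper does not start more threads than the job has CPUs.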

Environment variables set

Dependencies

Drseq can use either bowtie2 or STAR as a short read mapper. Since the choice is up to the user, neither of these modules is loaded by default. Please load the correct module for your analysis manually.
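
As a sketch of how the mapper choice translates into a module to load: the helper below maps the value given to --maptool to a module name. The function name mapper_module is hypothetical. The module name bowtie for bowtie2 is taken from the sample session on this page; the module name STAR and the --maptool value STAR are assumptions, so check them with module avail and the Dr.seq help output.

```shell
#!/bin/bash
# Hypothetical helper (not part of Dr.seq): map a --maptool value to the
# name of the module that provides the mapper.
mapper_module() {
    case "$1" in
        bowtie2) echo "bowtie" ;;   # module name used elsewhere on this page
        STAR)    echo "STAR" ;;     # assumed module name; verify with 'module avail'
        *)       echo "unknown mapper: $1" >&2; return 1 ;;
    esac
}

# Print the command rather than running it, so the sketch also works
# on machines without the module system.
echo "module load $(mapper_module bowtie2) drseq"
```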

When analyzing Drop-ChIP or ATAC-Seq data, please load the macs 1.4 module.

R, samtools, and bedtools are loaded automatically.

References

Documentation

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --cpus-per-task=6 --mem=10g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load bowtie drseq
[user@cn3144 ~]$ DrSeq simple \
  -b drseq_test_1.fastq \
  -r drseq_test_2.fastq \
  -n test -f \
  -g $PWD/mm10_refgenes.txt \
  --maptool bowtie2 \
  --mapindex /fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
  --cellbarcoderange 1:12 \
  --umirange 13:20 \
  --clean \
  --thread $SLURM_CPUS_PER_TASK
Start Drseq
Step0: Data integrate
Detected input file format is fastq
use bowtie2 as alignment tools
option setting:
mapping thread is 4
Step0 Data integrate DONE
Step1: alignment
[...snip...]
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
biowulf$

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is drseq_batch.sh
module load bowtie
module load drseq

DrSeq simple \
  -b drseq_test_1.fastq \
  -r drseq_test_2.fastq \
  -n test_out -f \
  -g $PWD/mm10_refgenes.txt \
  --maptool bowtie2 \
  --mapindex /fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
  --cellbarcoderange 1:12 \
  --umirange 13:20 \
  --clean \
  --thread $SLURM_CPUS_PER_TASK

Note that this example uses data obtained from the Dr.seq home page as well as annotation obtained from the UCSC browser.
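
Because the job fails only after it has left the queue if an input is missing, it can help to check the files before submitting. This is a minimal sketch; check_inputs is a hypothetical helper, not part of Dr.seq or the Biowulf tools, and the file names are those of the example above.

```shell
#!/bin/bash
# Hypothetical pre-flight check: report every input file that is missing
# or empty, and return non-zero if any file failed the check.
check_inputs() {
    local f missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use before submission (file names from the example above):
#   check_inputs drseq_test_1.fastq drseq_test_2.fastq mm10_refgenes.txt \
#     && sbatch --cpus-per-task=6 --mem=10g drseq_batch.sh
```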

Submit to the queue with sbatch:

biowulf$ sbatch --cpus-per-task=6 --mem=10g drseq_batch.sh

Swarm of jobs on Biowulf

Create a swarm command file similar to the following example:

# this file is drseq.swarm
DrSeq simple \
  -b sample1_1.fastq \
  -r sample1_2.fastq \
  -n sample1 -f \
  -g $PWD/mm10_refgenes.txt \
  --maptool bowtie2 \
  --mapindex /fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
  --cellbarcoderange 1:12 \
  --umirange 13:20 \
  --clean \
  --thread $SLURM_CPUS_PER_TASK
DrSeq simple \
  -b sample2_1.fastq \
  -r sample2_2.fastq \
  -n sample2 -f \
  -g $PWD/mm10_refgenes.txt \
  --maptool bowtie2 \
  --mapindex /fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
  --cellbarcoderange 1:12 \
  --umirange 13:20 \
  --clean \
  --thread $SLURM_CPUS_PER_TASK

Submit to the queue with swarm:

biowulf$ swarm -f drseq.swarm -g10 -t6 --module drseq --module bowtie
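
For more than a few samples, the command file can be generated instead of written by hand. The sketch below assumes inputs named <sample>_1.fastq (barcodes) and <sample>_2.fastq (reads), as in the example above; write_drseq_swarm is a hypothetical helper, and the options are copied unchanged from the swarm file shown on this page.

```shell
#!/bin/bash
# Hypothetical helper (not part of Dr.seq or swarm): write one DrSeq
# command per sample name to a swarm command file.
write_drseq_swarm() {
    local out=$1 sample
    shift
    : > "$out"    # start with an empty command file
    for sample in "$@"; do
        # \\ writes a literal line continuation; \$ keeps the variables
        # unexpanded so they are resolved when the swarm subjob runs.
        cat >> "$out" <<EOF
DrSeq simple \\
  -b ${sample}_1.fastq \\
  -r ${sample}_2.fastq \\
  -n ${sample} -f \\
  -g \$PWD/mm10_refgenes.txt \\
  --maptool bowtie2 \\
  --mapindex /fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \\
  --cellbarcoderange 1:12 \\
  --umirange 13:20 \\
  --clean \\
  --thread \$SLURM_CPUS_PER_TASK
EOF
    done
}

write_drseq_swarm drseq.swarm sample1 sample2
```

The resulting drseq.swarm has the same layout as the hand-written file above and can be submitted with the same swarm command.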