Biowulf High Performance Computing at the NIH
Dropseq on Biowulf

Drop-seq is a technology that allows biologists to analyze genome-wide gene expression in thousands of individual cells in a single experiment.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load dropseq

[user@cn3144 ~]$ cd /data/$USER/dir

[user@cn3144 ~]$ BamTagHistogram -- -h
USAGE: BAMTagHistogram [options]

Create a histogram of values for the given tag
Version: 1.0(a568873_1439010606)


Options:

--help
-h                            Displays options specific to this tool.

--stdhelp
-H                            Displays options specific to this tool AND options common to all Picard command line 
                              tools.

--version                     Displays program version.

INPUT=File
I=File                        The input SAM or BAM file to analyze.  Must be coordinate sorted. (???)  Required. 

OUTPUT=File
O=File                        Output file of histogram of tag value frequencies. This supports zipped formats like gz 
                              and bz2.  Required. 

TAG=String                    Tag to extract  Required. 

FILTER_PCR_DUPLICATES=Boolean Filter PCR Duplicates.  Default value: false. This option can be set to 'null' to clear 
                              the default value. Possible values: {true, false} 

READ_QUALITY=Integer          Read quality filter.  Filters all reads lower than this mapping quality.  Defaults to 10.  
                              Set to 0 to not filter reads by map quality.  Default value: 10. This option can be set 
                              to 'null' to clear the default value. 


[user@cn3144 ~]$ java -jar $DROPSEQ_JAR
USAGE: DropSeqMain  [-h]

Available Programs:
--------------------------------------------------------------------------------------
DropSeq Tools:                                   Tools for aligning or analyzing DropSeq experiments.
    BamTagHistogram                              Create a histogram of values for the given tag
    BamTagOfTagCounts                            For a given BAM tag, how many unique values of a second BAM tag are present?
    BaseDistributionAtReadPosition               Reads each base and generates a composition per-position matrix
    CollapseBarcodesInPlace                      Fold down barcodes, possibly in the context of another barcode (that has been folded down already.)
    CollapseTagWithContext                       Collapse barcodes in the context of one or more tags.)
    CompareAnnotationFlags                       Test program don't use.
    CompareBAMTagValues                          Tests that two BAMs have the same TAG values per read.
    CompareDropSeqAlignments                     Compare two alignments
    ComputeUMISharing                            Computes UMI sharing between uncollapsed and collapsed sets of reads.
    ConvertTagToReadGroup                        Convert from a cell barcode tag to a sample group
    CreateSnpIntervalFromVcf                     Creates an interval file of variants from a VCF
    DetectBeadSubstitutionErrors                 Collaps umambiguously related small barcodes into larger neighbors.)
    DetectBeadSynthesisErrors                    Detect barcode synthesis errors where the final base of a UMI is fixed across all UMIs of a cell.
    DigitalExpression                            Calculate Digital Expression
    FilterBam                                    Filters a BAM file by various qualities to produce a new subset of the BAM containing the reads of interest.
[...]
--------------------------------------------------------------------------------------
Reference:                                       Tools that analyze and manipulate FASTA format references
    MaskReferenceSequence                        Modify reference sequence fasta contig sequence.

--------------------------------------------------------------------------------------
SpermSeq Tools:                                  Tools for aligning or analyzing SpermSeq experiments.
    GenotypeSperm                                Detect which alleles of which SNPs are present in each sperm cell
    SpermSeqMarkDuplicates                       Mark SpermSeq PCR Duplicates 

--------------------------------------------------------------------------------------



[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. dropseq.sh). For example:

#!/bin/bash
module load dropseq 
cd /data/$USER/dir 
BamTagHistogram I=file1 O=file2 TAG=XC ....
....
....

Submit this job using the Slurm sbatch command.

sbatch dropseq.sh
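
The batch script above can be made a little more defensive. The following is a minimal sketch, not part of the Biowulf documentation: run_histogram is a hypothetical helper, TAG=XC is only an example tag value, and the KEY=value argument form is taken from the help output shown in the interactive session. Setting DRY_RUN=1 prints the command instead of running it.

```shell
#!/bin/bash
# run_histogram INPUT OUTPUT TAG   (hypothetical helper, for illustration only)
# Checks that INPUT is readable, then runs BamTagHistogram with the
# KEY=value arguments listed in the tool's help output.
# With DRY_RUN=1 the command line is printed instead of executed.
run_histogram() {
    local input=$1 output=$2 tag=$3

    # Fail early with a clear message rather than a Java stack trace.
    if [ ! -r "$input" ]; then
        echo "ERROR: cannot read input file: $input" >&2
        return 1
    fi

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "BamTagHistogram I=$input O=$output TAG=$tag"
    else
        BamTagHistogram "I=$input" "O=$output" "TAG=$tag"
    fi
}
```

In dropseq.sh the function would be defined after the module load dropseq line and called as, for example, run_histogram file1 file2 XC.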

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. dropseq.swarm). For example:

cd /data/user/run1/; BamTagHistogram I=file1 O=file2 TAG=XC
cd /data/user/run2/; BamTagHistogram I=file1 O=file2 TAG=XC
cd /data/user/run3/; BamTagHistogram I=file1 O=file2 TAG=XC
........

Submit this job using the swarm command.

swarm -f dropseq.swarm --module dropseq

where

--module dropseq    Loads the dropseq module for each subjob in the swarm.
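
When there are many run directories, the swarmfile can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: make_swarm is a hypothetical helper, the run directories are assumed to be named run* under one parent directory, and file1, file2 and TAG=XC are placeholders, as in the example swarmfile above.

```shell
#!/bin/bash
# make_swarm PARENT_DIR   (hypothetical helper, for illustration only)
# Prints one swarm command line per run* directory found under PARENT_DIR.
# file1, file2 and TAG=XC are placeholders; replace them with real values.
make_swarm() {
    local parent=$1 dir
    for dir in "$parent"/run*/; do
        # If the glob matches nothing it stays literal, so skip non-directories.
        [ -d "$dir" ] || continue
        echo "cd $dir; BamTagHistogram I=file1 O=file2 TAG=XC"
    done
}

# Example use (run on Biowulf, not executed here):
#   make_swarm /data/$USER > dropseq.swarm
#   swarm -f dropseq.swarm --module dropseq
```

Each generated line is an independent command, which is the form the swarm command expects.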