High-Performance Computing at the NIH
Cell Ranger on Biowulf & Helix

Description

From the Cell Ranger manual:

Cell Ranger is a set of analysis pipelines that processes Chromium single cell 3’ RNA-seq output to align reads, generate gene-cell matrices and perform clustering and gene expression analysis. There are two pipelines:
  • cellranger mkfastq wraps Illumina's bcl2fastq to correctly demultiplex Chromium-prepared sequencing samples and to convert barcode and read data to FASTQ files.
  • cellranger count takes FASTQ files from cellranger mkfastq and performs alignment, filtering, and UMI counting. It uses the Chromium cellular barcodes to generate gene-cell matrices and perform clustering and gene expression analysis.

Note that the command line interface has changed since version 1.1.

These pipelines combine Chromium-specific algorithms with the widely used RNA-seq aligner STAR. Output is delivered in standard BAM, MEX, CSV, and HTML formats that are augmented with cellular information.

Multiple versions of cellranger may be available. An easy way to select a version is via environment modules. To see the available modules, type

module avail cellranger

To select a module, use

module load cellranger/[version]

where [version] is the version of choice.
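
For example, to load a specific version and confirm what is loaded (the version number here is only illustrative; pick one listed by module avail):

biowulf$ module load cellranger/1.2.1
biowulf$ module list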

cellranger is a multithreaded application. Make sure the number of CPUs requested from Slurm matches the number of threads cellranger is allowed to use (--localcores), and keep --localmem slightly below the allocated memory. In addition, cellranger also has a cluster mode (see below).
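
As a quick sketch of this pairing (the values shown are illustrative and match the examples below):

biowulf$ sinteractive --cpus-per-task=6 --mem=35g
node$ module load cellranger
node$ cellranger count [other options] --localcores=$SLURM_CPUS_PER_TASK --localmem=34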

Environment variables set

  • $CELLRANGER_TEST_DATA: location of the test data used in the examples below
  • $CELLRANGER_REF: location of prebuilt reference packages (e.g. GRCh38)

Dependencies

Dependencies are loaded automatically by the Cell Ranger module.

Documentation

Interactive job on Biowulf

Allocate an interactive session with sinteractive, requesting enough memory to run STAR alignments, and set up the environment:

biowulf$ sinteractive --cpus-per-task=6 --mem=35g
node$ module load cellranger

Copy the BCL format test data and run the demultiplexing pipeline:

node$ cp $CELLRANGER_TEST_DATA/cellranger-tiny-bcl-1.2.0.tar.gz .
node$ cp $CELLRANGER_TEST_DATA/cellranger-tiny-bcl-samplesheet-1.2.0.csv .
node$ tar -xzf cellranger-tiny-bcl-1.2.0.tar.gz
node$ cellranger mkfastq --run=cellranger-tiny-bcl-1.2.0 \
             --samplesheet=cellranger-tiny-bcl-samplesheet-1.2.0.csv \
             --localcores=$SLURM_CPUS_PER_TASK \
             --localmem=34
cellranger mkfastq (1.2.1)
Copyright (c) 2016 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - 1.2.1 (2.1.2)
Running preflight checks (please wait)...
Checking run folder...
Checking RunInfo.xml...
Checking system environment...
Checking barcode whitelist...
Checking read specification...
Checking samplesheet specs...
2016-12-21 12:27:44 [runtime] (ready)           ID.H35KCBCXY.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
[...snip...]
Outputs:
- Run QC metrics:        /spin1/users/user/test_data/cellranger/H35KCBCXY/outs/qc_summary.json
- FASTQ output folder:   /spin1/users/user/test_data/cellranger/H35KCBCXY/outs/fastq_path
- Interop output folder: /spin1/users/user/test_data/cellranger/H35KCBCXY/outs/interop_path
- Input samplesheet:     /spin1/users/user/test_data/cellranger/H35KCBCXY/outs/input_samplesheet.csv

Pipestance completed successfully!

Generate counts per gene per cell

node$ cellranger count --id s1 \
                --fastqs H35KCBCXY/outs/fastq_path \
                --transcriptome=$CELLRANGER_REF/GRCh38 \
                --indices=SI-P03-C9 \
                --cells=1000 \
                --localcores=$SLURM_CPUS_PER_TASK \
                --localmem=34
cellranger count (1.2.1)
Copyright (c) 2016 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - 1.2.1 (2.1.2)
Running preflight checks (please wait)...
Checking sample info...
Checking FASTQ folder...
[...snip...]
Outputs:
- Run summary HTML:                      /spin1/users/wresch/test_data/cellranger/s1/outs/web_summary.html
- Run summary CSV:                       /spin1/users/wresch/test_data/cellranger/s1/outs/metrics_summary.csv
- BAM:                                   /spin1/users/wresch/test_data/cellranger/s1/outs/possorted_genome_bam.bam
- BAM index:                             /spin1/users/wresch/test_data/cellranger/s1/outs/possorted_genome_bam.bam.bai
- Filtered gene-barcode matrices MEX:    /spin1/users/wresch/test_data/cellranger/s1/outs/filtered_gene_bc_matrices
- Filtered gene-barcode matrices HDF5:   /spin1/users/wresch/test_data/cellranger/s1/outs/filtered_gene_bc_matrices_h5.h5
- Unfiltered gene-barcode matrices MEX:  /spin1/users/wresch/test_data/cellranger/s1/outs/raw_gene_bc_matrices
- Unfiltered gene-barcode matrices HDF5: /spin1/users/wresch/test_data/cellranger/s1/outs/raw_gene_bc_matrices_h5.h5
- Secondary analysis output CSV:         /spin1/users/wresch/test_data/cellranger/s1/outs/analysis
- Per-molecule read information:         /spin1/users/wresch/test_data/cellranger/s1/outs/molecule_info.h5

Pipestance completed successfully!

Saving pipestance info to s1/s1.mri.tgz
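
At this point the outputs listed above can be inspected directly, for example by listing the output folder:

node$ ls s1/outs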
node$ exit
biowulf$

The same job could also be run in cluster mode, where pipeline stages are submitted as individual batch jobs. This is done by setting --jobmode=slurm and limiting the maximum number of concurrent jobs with --maxjobs:

node$ cellranger count --id s1 \
                --fastqs H35KCBCXY/outs/fastq_path \
                --transcriptome=$CELLRANGER_REF/GRCh38 \
                --indices=SI-P03-C9 \
                --cells=1000 \
                --localcores=$SLURM_CPUS_PER_TASK \
                --localmem=34 \
                --jobmode=slurm --maxjobs=10
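
In cluster mode, cellranger submits its own sub-jobs to the queue; they can be monitored with the usual Slurm tools, for example:

node$ squeue -u $USER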

Don't forget to close the interactive session when done

node$ exit
biowulf$ 

Note that for this small example, cluster mode actually results in a longer overall runtime. Even when running in cluster mode, please run the main pipeline itself in an sinteractive session or as a batch job.

Batch job on Biowulf

For a local job (i.e. all computation done as part of a single job), create a batch script similar to the following example:

#!/bin/bash
# this file is cellranger_batch.sh
module load cellranger || exit 1
cellranger mkfastq --run=cellranger-tiny-bcl-1.2.0 \
        --samplesheet=cellranger-tiny-bcl-samplesheet-1.2.0.csv \
        --localcores=$SLURM_CPUS_PER_TASK \
        --localmem=34 || exit 1
cellranger count --id s1 \
        --fastqs H35KCBCXY/outs/fastq_path \
        --transcriptome=$CELLRANGER_REF/GRCh38 \
        --indices=SI-P03-C9 \
        --cells=1000 \
        --localcores=$SLURM_CPUS_PER_TASK \
        --localmem=34

Submit to the queue with sbatch:

biowulf$ sbatch --mem=35g --cpus-per-task=12 cellranger_batch.sh

Swarm of jobs on Biowulf

Create a swarm command file similar to the following example:

# this file is cellranger.swarm
cellranger mkfastq --run=./run1 --localcores=$SLURM_CPUS_PER_TASK --localmem=34
cellranger mkfastq --run=./run2 --localcores=$SLURM_CPUS_PER_TASK --localmem=34
cellranger mkfastq --run=./run3 --localcores=$SLURM_CPUS_PER_TASK --localmem=34

Submit to the queue with swarm, where -g specifies the memory per command in GB and -t the number of threads per command:

biowulf$ swarm -f cellranger.swarm -g 35 -t 16