Additional information for users of the nci-dragen partition

As of January 2024, the nci-dragen partition includes one DRAGEN server. It is funded by NCI/CBIIT until the end of FY 2027.

Notes:

Reference genomes

Originally, references were located in /staging/human. These still exist but are deprecated because reference versions are tied to DRAGEN versions. References are now located in /staging/ref/current. The following unmodified references were obtained from Illumina:

name                                        alt_aware  alt_masked  cnv    graph  hla    methylation  methylated_combined  rna
chm13_v2-cnv.graph.hla.rna-9-r3.0-1         False      False       True   True   True   False        False                True
chm13_v2-cnv.hla.rna-9-r3.0-1               False      False       True   False  True   False        False                True
hg19-alt_masked.cnv.graph.hla.rna-9-r3.0-1  False      True        True   True   True   False        False                True
hg19-alt_masked.cnv.hla.rna-9-r3.0-1        False      True        True   False  True   False        False                True
hg38-alt_masked.cnv.graph.hla.rna-9-r3.0-1  False      True        True   True   True   False        False                True
hg38-alt_masked.cnv.hla.rna-9-r3.0-1        False      True        True   False  True   False        False                True
hs37d5-cnv.graph.hla.rna-9-r3.0-1           False      False       True   True   True   False        False                True
hs37d5-cnv.hla.rna-9-r3.0-1                 False      False       True   False  True   False        False                True

See also /staging/ref/current/README
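
The reference names above appear to encode the genome build and the enabled features. The following is a minimal sketch for splitting a name into those parts, assuming the pattern <genome>-<feature>.<feature>...-<version> inferred from the table; this pattern is not a documented convention, and ref_genome and ref_features are hypothetical helper names.

```bash
# Split a reference directory name into genome build and feature list.
# ASSUMPTION: names follow <genome>-<feature>.<feature>...-<version>,
# as inferred from the table of references; this is not documented.

ref_genome() {
    local name=${1%/}          # drop a trailing slash, if any
    name=${name##*/}           # drop any leading directory
    echo "${name%%-*}"         # text before the first '-'
}

ref_features() {
    local name=${1%/}
    name=${name##*/}
    local rest=${name#*-}          # drop "<genome>-"
    local features=${rest%%-*}     # keep text up to the next '-'
    echo "${features//./ }"        # print features separated by spaces
}

ref_genome   /staging/ref/current/hg38-alt_masked.cnv.graph.hla.rna-9-r3.0-1
ref_features /staging/ref/current/hg38-alt_masked.cnv.graph.hla.rna-9-r3.0-1
```

For the hg38 graph reference this prints hg38 followed by alt_masked cnv graph hla rna. The README in /staging/ref/current remains the authoritative description.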

Efficient use of licenses

The DRAGEN license is metered. If you do not have access to the nci_dragen_turbo QOS, please do not do more than a few test runs without contacting staff@hpc.nih.gov. License usage can be optimized by making all needed variant calls for a sample in a single run. For example, when running with a single BAM file, i.e.

--bam-input /staging/${ID}/xxxxx.bam

SNV, CNV, and SV can all be called concurrently in a single run by enabling all three caller flags

Flag                            Calls
--enable-variant-caller true    For Germline SNV
--enable-cnv true               For Germline CNV
--enable-sv true                For Germline SV

Regardless of how many of these three flags are used in a single run, the license is charged only once.
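
As an illustration, a germline run with all three callers enabled might be assembled as follows. This is a sketch under assumptions, not a validated command line: the sample ID, output directory, output prefix, and choice of reference are placeholders, and each caller may need further options that are not covered here (see the Illumina documentation). The snippet only prints the command so that it can be reviewed first.

```bash
#!/bin/bash
# Sketch only: build and print a germline command with all three callers enabled.
# ASSUMPTIONS: ID, the output directory, the output prefix, and the reference
# are placeholders; caller-specific options are intentionally left out.

ID=${ID:-sample01}                          # hypothetical sample ID
genome=/staging/ref/current/hg38-alt_masked.cnv.hla.rna-9-r3.0-1
outdir="/staging/${ID}/germline"            # hypothetical output location

dragen_cmd=(
    dragen
    -r "$genome"
    --bam-input "/staging/${ID}/xxxxx.bam"
    --output-dir "$outdir"
    --output-file-prefix "${ID}_germline"
    --enable-variant-caller true            # germline SNV
    --enable-cnv true                       # germline CNV
    --enable-sv true                        # germline SV
)

# Print the command for review. To execute it inside a batch job on the
# nci-dragen partition, replace the next line with: "${dragen_cmd[@]}"
printf '%s ' "${dragen_cmd[@]}"; echo
```

Keeping the command in an array makes it easy to add or drop a caller flag while still launching a single run per sample.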

The same applies to somatic variant calling (i.e. a run that includes a tumor BAM with --tumor-bam-input). However, tumor-only and somatic variant calls cannot be combined into a single run. Therefore, in effect, a full tumor-normal run will charge the license for 2 samples (tumor/normal + germline).

Running a batch job

Create a batch script similar to the following, which aligns RNA-Seq data. Note that fusion detection requires a GTF file. The Gencode GTF files appear to be compatible with the hg38 references.

#! /bin/bash

# set up paths etc
source /etc/profile.d/edico.sh

RUNPATH=/fdb/app_testdata/fastq/Homo_sapiens
RUNFOLDER=SRR24373805
ANALYSIS="/staging/${RUNFOLDER}-$(date +%s)"
METRICS=${ANALYSIS}/Results/MetricsOutput.tsv
RESULTPATH=${PWD}/${RUNFOLDER}-dragen-results

# clean up after run
trap 'rm -rf "/staging/${RUNFOLDER}" "${ANALYSIS}"' EXIT

cp -r "${RUNPATH}/${RUNFOLDER}" /staging || exit 100
mkdir -p "${ANALYSIS}" || exit 101
genome=/staging/ref/current/hg38-alt_masked.cnv.hla.rna-9-r3.0-1
gtf=/fdb/GENCODE/Gencode_human/release_45/gencode.v45.primary_assembly.annotation.gtf

# Run the RNA pipeline with dragen (alignment, gene fusion detection, quantification)
dragen -r "$genome" \
    -1 "/staging/${RUNFOLDER}/SRR24373805_1.fastq.gz" \
    -2 "/staging/${RUNFOLDER}/SRR24373805_2.fastq.gz" \
    -a "$gtf" \
    --output-dir "${ANALYSIS}" \
    --output-file-prefix RNA_test \
    --enable-rna true \
    --enable-rna-gene-fusion true \
    --RGID rg \
    --RGSM sm \
    --enable-rna-quantification true

# copy results back to working directory
cp -r "${ANALYSIS}" "${RESULTPATH}" || exit 103

And submit with

[user@biowulf]$ sbatch --mem=0 --cpus-per-task=64 --partition nci-dragen --qos=nci_dragen_turbo dragen.sh
12345678

Note that the $ANALYSIS folder is larger than the input, with Logs_Intermediates taking up most of the space. The script above could be modified to transfer only a subset of files back to shared storage. Example output file generated:

[user@biowulf]$ cat ${RESULTPATH}/RNA_test.quant_metrics.csv
RNA QUANTIFICATION STATISTICS,,Library orientation,IU
RNA QUANTIFICATION STATISTICS,,Total Genes,63187
RNA QUANTIFICATION STATISTICS,,Total Transcripts,252930
RNA QUANTIFICATION STATISTICS,,Coding Genes,21567
RNA QUANTIFICATION STATISTICS,,Median transcript CV coverage,0.49
RNA QUANTIFICATION STATISTICS,,Median 5' coverage bias,0.3889
RNA QUANTIFICATION STATISTICS,,Median 3' coverage bias,0.0844
RNA QUANTIFICATION STATISTICS,,Number of genes with coverage > 1x,17094,27.05
RNA QUANTIFICATION STATISTICS,,Number of genes with coverage > 10x,12504,19.79
RNA QUANTIFICATION STATISTICS,,Number of genes with coverage > 30x,9801,15.51
RNA QUANTIFICATION STATISTICS,,Number of genes with coverage > 100x,5522,8.74
RNA QUANTIFICATION STATISTICS,,Transcript fragments,19861578,89.40
RNA QUANTIFICATION STATISTICS,,Forward transcript fragments,9990254,50.30
RNA QUANTIFICATION STATISTICS,,Ambiguous strand fragments,209080,0.94
RNA QUANTIFICATION STATISTICS,,Unknown transcript fragments,1389372,6.25
RNA QUANTIFICATION STATISTICS,,Intron fragments,596441,2.68
RNA QUANTIFICATION STATISTICS,,Intergenic fragments,114056,0.51
RNA QUANTIFICATION STATISTICS,,Fold coverage of all exons,68.68
RNA QUANTIFICATION STATISTICS,,Fold coverage of introns,0.11
RNA QUANTIFICATION STATISTICS,,Fold coverage of intergenic regions,0.03
RNA QUANTIFICATION STATISTICS,,Fold coverage of coding exons,106.68
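
Since Logs_Intermediates accounts for most of the analysis folder, one way to copy back only a subset of the results is to skip that directory. The following is a minimal sketch, assuming Logs_Intermediates sits at the top level of $ANALYSIS; copy_results is a hypothetical helper name, not part of DRAGEN or the cluster environment.

```bash
# Copy every top-level entry of a results directory except Logs_Intermediates.
# ASSUMPTION: Logs_Intermediates is a direct child of the source directory.
# Hidden files (names starting with '.') are not matched by the glob.
copy_results() {
    local src=$1 dest=$2 entry
    mkdir -p "$dest" || return 1
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue      # empty directory: nothing to copy
        [ "$(basename "$entry")" = "Logs_Intermediates" ] && continue
        cp -r "$entry" "$dest"/ || return 1
    done
    return 0
}

# Possible use in the batch script, in place of the final cp line:
#   copy_results "${ANALYSIS}" "${RESULTPATH}" || exit 103
```

Because the function returns non-zero as soon as a copy fails, the batch script can keep its existing exit code for the transfer step.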

Please send questions and comments to staff@hpc.nih.gov