gtex_rnaseq on Biowulf

This module provides the tools used in the GTEx RNA-seq pipeline. A combined pipeline is also planned but is not yet available.

The interactive session below runs the individual pipeline steps on the bundled test data:

[user@biowulf]$ sinteractive --mem=40g --time=1-12 --cpus-per-task=16 --gres=lscratch:400
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load gtex_rnaseq
# STAR writes its temporary files to the same directory as its output, so it is best to work in lscratch
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ cp -Lr ${GTEX_RNASEQ_TEST_DATA:-none} ./data
[user@cn3144]$ samples=( HepG2_ENCLB059ZZZ 
                            K562_ENCLB064ZZZ )
[user@cn3144]$ for sample in "${samples[@]}" ; do
    # align with STAR; the last positional argument is the output prefix
    run_STAR.py \
        --output_dir data/star_out \
        -t $SLURM_CPUS_PER_TASK \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/star_index_oh100 \
        data/${sample}_R1.fastq.gz \
        data/${sample}_R2.fastq.gz \
        ${sample}
done

# note that run_MarkDuplicates.py changes into the output directory,
# so the path to the input BAM has to be absolute or relative to
# that output directory
[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_MarkDuplicates.py \
        -o data \
        star_out/${sample}.Aligned.sortedByCoord.out.bam \
        ${sample}.Aligned.sortedByCoord.out.md
done

# the BAM file written by the mark-duplicates step has to be indexed manually
[user@cn3144]$ module load samtools
[user@cn3144]$ for sample in "${samples[@]}" ; do
    samtools index data/${sample}.Aligned.sortedByCoord.out.md.bam
done

# - Syntax for the V8 pipeline:
#   - needs the older Java (1.7)
#   - needs more memory
[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_rnaseqc.py \
        --output_dir=$PWD/data \
        --java /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java \
        --memory=16 \
        $PWD/data/${sample}.Aligned.sortedByCoord.out.md.bam \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/gencode.v38.primary_assembly.genes.gtf \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/GRCh38.primary_assembly.genome.fa \
        ${sample}
done

# - Syntax for the V10 pipeline. This step only works when it is run
#   from inside the data directory
[user@cn3144]$ pushd data
[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_rnaseqc.py \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/gencode.v38.primary_assembly.genes.gtf \
        ${sample}.Aligned.sortedByCoord.out.md.bam \
        ${sample}
done
[user@cn3144]$ popd

[user@cn3144]$ for sample in "${samples[@]}" ; do
    # quantify with RSEM; the last positional argument is the output prefix
    run_RSEM.py \
        --threads $SLURM_CPUS_PER_TASK \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/rsem_reference \
        $PWD/data/star_out/${sample}.Aligned.toTranscriptome.out.bam \
        ${sample}
done

# combine the per-sample exon read counts into a single table
[user@cn3144]$ combine_GCTs.py data/*.exon_reads.gct.gz data/combined_gcts_exon_reads

# copy relevant results back

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
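
The session above leaves the "copy relevant results back" step open. One way to do it is sketched here, assuming that everything under the data directory except the input FASTQ files is worth keeping. The function name copy_results is hypothetical and not part of the gtex_rnaseq module.

```shell
#!/bin/sh
# Hypothetical helper (not part of the gtex_rnaseq module): copy every file
# below a scratch directory to a destination, skipping the input FASTQ files.
# The body runs in a subshell, so the cd and the variables do not leak out.
copy_results() (
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; exit 1; }
    mkdir -p "$dest" || exit 1
    # Make the destination absolute so it stays valid after the cd below.
    dest=$(cd "$dest" && pwd) || exit 1
    cd "$src" || exit 1
    # Paths are listed relative to $src so the directory layout is preserved.
    # File names containing newlines are not supported by this loop.
    find . -type f ! -name '*.fastq.gz' | while IFS= read -r f; do
        mkdir -p "$dest/$(dirname "$f")" && cp -p "$f" "$dest/$f" || exit 1
    done
)
```

Inside the interactive session it could be called just before exit, for example as copy_results /lscratch/$SLURM_JOB_ID/data followed by a destination directory of your choice.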

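A single step can also be run as a batch job. Create a batch script along these lines (the file name gtex_rnaseq.sh is only an example):

#!/bin/bash
wd=$PWD                      # submission directory; results are copied back here
sample=HepG2_ENCLB059ZZZ     # one of the test samples
module load gtex_rnaseq/V8 || exit 1
cd /lscratch/$SLURM_JOB_ID || exit 1
mkdir data
cp -L ${GTEX_RNASEQ_TEST_DATA:-none}/${sample}* ./data
run_STAR.py \
    --output_dir $PWD/data/star_out \
    -t $SLURM_CPUS_PER_TASK \
    ${GTEX_RNASEQ_REF}/GRCh38_release38/star_index_oh100 \
    data/${sample}_R1.fastq.gz \
    data/${sample}_R2.fastq.gz \
    ${sample}
cp -r data $wd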
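
The batch script above expects data/${sample}_R1.fastq.gz and data/${sample}_R2.fastq.gz to exist after the cp step, but nothing checks this (for instance when GTEX_RNASEQ_TEST_DATA is unset and the path falls back to "none"). A small guard could stop the job before the alignment starts. The function check_fastq_pair below is a hypothetical sketch, not something the module provides.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if both mates of a paired-end sample exist
# and are non-empty. File naming follows the <sample>_R1/_R2.fastq.gz pattern
# used in the batch script.
check_fastq_pair() (
    dir=$1
    sample=$2
    for mate in R1 R2; do
        fq="$dir/${sample}_${mate}.fastq.gz"
        # -s is true only for an existing file with a size greater than zero
        if [ ! -s "$fq" ]; then
            echo "missing or empty: $fq" >&2
            exit 1
        fi
    done
)
```

In the batch script it would be called right after the cp step, for example as check_fastq_pair data "$sample" || exit 1.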

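Submit the batch script with the Slurm sbatch command, giving the script file (here assumed to be saved as gtex_rnaseq.sh) as the last argument:

sbatch --cpus-per-task=16 --mem=40g --gres=lscratch:75 gtex_rnaseq.sh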
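
The options on the sbatch command line are what the job relies on at run time: --cpus-per-task sets SLURM_CPUS_PER_TASK, and --gres=lscratch provides the /lscratch/$SLURM_JOB_ID directory. The following sketch shows how a script might fail early when either is missing. The function names job_threads and job_scratch are hypothetical, and the base directory is a parameter only so that the functions can be tried outside a job.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the gtex_rnaseq module.

# Print the number of CPUs granted by Slurm, or 1 when run outside a job.
job_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Print the job's scratch directory, or fail if none was allocated.
# $1: base directory for scratch space (assumed default: /lscratch)
job_scratch() (
    base=${1:-/lscratch}
    [ -n "${SLURM_JOB_ID:-}" ] || { echo "SLURM_JOB_ID is not set" >&2; exit 1; }
    dir="$base/$SLURM_JOB_ID"
    [ -d "$dir" ] || { echo "no scratch directory $dir" >&2; exit 1; }
    echo "$dir"
)
```

A batch script could then start with cd "$(job_scratch)" || exit 1 and pass "$(job_threads)" wherever a thread count is needed.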