This module makes available the tools used in the GTEx RNA-seq pipeline. A combined, end-to-end pipeline is planned but not yet available.
The module sets the following environment variables:
GTEX_RNASEQ_REF        -- reference files (STAR index, gene model GTF, genome fasta, RSEM reference)
GTEX_RNASEQ_TEST_DATA  -- downsampled test data used in the examples below
Allocate an interactive session and run the individual steps of the pipeline. Note that only some of the steps are resource intensive; for production work, please run the individual steps as batch jobs with appropriate resource allocations. This full-size example takes several hours to run even though the data is downsampled to just 10M reads per sample.
Note that in this interactive example all steps are run as part of the same allocation. That is fine for testing and debugging, but since the steps have quite different resource requirements (e.g., STAR scales well to multiple threads and requires around 45GB of memory, while rnaseqc and MarkDuplicates are mostly single threaded), for production each step should be run as a separate job.
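Such a per-step submission could be sketched as follows. This is a sketch, not a tested recipe: the step scripts (star.sh, markdup.sh, rnaseqc.sh, rsem.sh) and the non-STAR resource numbers are hypothetical placeholders, and submit() is a dry-run stand-in that prints the sbatch command and invents a job id instead of submitting.

```shell
#!/bin/bash
# Sketch: chain the pipeline steps as separate jobs, each with its own
# resources, using Slurm dependencies. The step scripts and non-STAR memory
# numbers are hypothetical. submit() is a dry-run stand-in; on the cluster
# its body would be:  last_jobid=$(sbatch --parsable "$@")
jobid=1000
submit() {
    echo "sbatch --parsable $*"
    jobid=$(( jobid + 1 ))
    last_jobid=$jobid
}

# STAR scales well to multiple threads and needs ~45GB of memory
submit --cpus-per-task=16 --mem=45g --gres=lscratch:400 star.sh
star=$last_jobid

# MarkDuplicates and rnaseqc are mostly single threaded
submit --cpus-per-task=2 --mem=8g --dependency=afterok:$star markdup.sh
md=$last_jobid
submit --cpus-per-task=2 --mem=20g --dependency=afterok:$md rnaseqc.sh

# RSEM only needs the STAR transcriptome bam, so it can run after STAR
submit --cpus-per-task=16 --mem=32g --dependency=afterok:$star rsem.sh
```

On the cluster the invented ids would come from sbatch --parsable, which prints only the job id of the submitted job.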
[user@biowulf]$ sinteractive --mem=40g --time=1-12 --cpus-per-task=16 --gres=lscratch:400
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load gtex_rnaseq

# temp files for STAR are located in the same directory as the output,
# so it is best to work in lscratch
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ cp -Lr ${GTEX_RNASEQ_TEST_DATA:-none} ./data
[user@cn3144]$ samples=( HepG2_ENCLB059ZZZ HepG2_ENCLB060ZZZ K562_ENCLB063ZZZ K562_ENCLB064ZZZ )
[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_STAR.py \
        --output_dir data/star_out \
        -t $SLURM_CPUS_PER_TASK \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/star_index_oh100 \
        data/${sample}_R1.fastq.gz \
        data/${sample}_R2.fastq.gz \
        ${sample}
done

# Note that run_MarkDuplicates.py changes into the output directory, so the
# path to the input bam has to be absolute or relative to that output
# directory. The default memory allocation is 3GB; if you get garbage
# collection or memory errors with 3GB, allocate more, as in this example
# where we use 6GB.
[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_MarkDuplicates.py -m 6 \
        -o data \
        star_out/${sample}.Aligned.sortedByCoord.out.bam \
        ${sample}.Aligned.sortedByCoord.out.md
done

# the MarkDuplicates bam output has to be indexed manually
[user@cn3144]$ module load samtools
[user@cn3144]$ for sample in "${samples[@]}" ; do
    samtools index data/${sample}.Aligned.sortedByCoord.out.md.bam
done

# Syntax for the V8 pipeline:
#  - need to use the older java
#  - need more memory
[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_rnaseqc.py \
        --output_dir=$PWD/data \
        --java /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java \
        --memory=16 \
        ${sample}.Aligned.sortedByCoord.out.md.bam \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/gencode.v38.primary_assembly.genes.gtf \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/GRCh38.primary_assembly.genome.fa \
        ${sample}
done

# Syntax for the V10 pipeline. Need to change into the data directory for
# this step to work.
[user@cn3144]$ pushd data
[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_rnaseqc.py \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/gencode.v38.primary_assembly.genes.gtf \
        ${sample}.Aligned.sortedByCoord.out.md.bam \
        ${sample}
done
[user@cn3144]$ popd

[user@cn3144]$ for sample in "${samples[@]}" ; do
    run_RSEM.py \
        --threads $SLURM_CPUS_PER_TASK \
        ${GTEX_RNASEQ_REF}/GRCh38_release38/rsem_reference \
        $PWD/data/star_out/${sample}.Aligned.toTranscriptome.out.bam \
        $PWD/data/${sample}
done

[user@cn3144]$ combine_GCTs.py data/*.exon_reads.gct.gz data/combined_gcts_exon_reads

# copy relevant results back

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
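The final "copy relevant results back" step could look like the following sketch. To keep the example self-contained, src and dst are temporary directories and the file names are dummies mimicking the layout of the session above; on the cluster src would be /lscratch/$SLURM_JOB_ID/data and dst the original working directory.

```shell
#!/bin/bash
# Sketch of copying results back: keep the small result files (GCTs, QC and
# RSEM output) and leave STAR's large intermediate output in lscratch.
# src/dst are temporary stand-ins for /lscratch/$SLURM_JOB_ID/data and the
# original working directory; the file names below are dummies.
src=$(mktemp -d)
dst=$(mktemp -d)

# dummy results mimicking the layout produced by the session above
mkdir -p "$src/star_out"
touch "$src/HepG2_ENCLB059ZZZ.exon_reads.gct.gz"
touch "$src/star_out/HepG2_ENCLB059ZZZ.Aligned.sortedByCoord.out.bam"

# copy only the top-level result files, skipping the star_out directory
find "$src" -maxdepth 1 -type f -exec cp {} "$dst" \;
ls "$dst"
```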
Create a batch script (e.g. gtex_rnaseq.sh) for a single step of the pipeline. For example, for the STAR alignment of one sample:
#!/bin/bash
module load gtex_rnaseq/V8 || exit 1

wd=$PWD
sample=HepG2_ENCLB059ZZZ
cd /lscratch/$SLURM_JOB_ID || exit 1
mkdir data
cp -L ${GTEX_RNASEQ_TEST_DATA:-none}/${sample}* ./data
run_STAR.py \
    --output_dir $PWD/data/star_out \
    -t $SLURM_CPUS_PER_TASK \
    ${GTEX_RNASEQ_REF}/GRCh38_release38/star_index_oh100 \
    data/${sample}_R1.fastq.gz \
    data/${sample}_R2.fastq.gz \
    ${sample}
cp -r data $wd
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=16 --mem=40g --gres=lscratch:75 gtex_rnaseq.sh
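To run this step for all four test samples, the same script could be submitted once per sample, assuming it is modified to take the sample name from the environment (e.g. sample=${sample:-HepG2_ENCLB059ZZZ}) instead of hardcoding it. The leading echo makes this a dry run that only prints the submissions; remove it to actually submit.

```shell
#!/bin/bash
# Dry-run sketch: submit gtex_rnaseq.sh once per sample, passing the sample
# name via --export. Assumes the batch script reads $sample from the
# environment. Remove 'echo' to actually submit.
samples=( HepG2_ENCLB059ZZZ HepG2_ENCLB060ZZZ K562_ENCLB063ZZZ K562_ENCLB064ZZZ )
for sample in "${samples[@]}" ; do
    echo sbatch --cpus-per-task=16 --mem=40g --gres=lscratch:75 \
        --export=ALL,sample=$sample gtex_rnaseq.sh
done
```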