Biowulf High Performance Computing at the NIH
tetoolkit on Biowulf

A package for including transposable elements in differential enrichment analysis of sequencing datasets.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=10g --cpus-per-task=12 --gres=lscratch:50
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load tetoolkit
[user@cn3144]$ cp -rL $TETOOLKIT_TEST_DATA/01fastq .
[user@cn3144]$ cp $TETOOLKIT_TEST_DATA/GTF/dm6_rmsk_TE.gtf.gz .
[user@cn3144]$ gunzip dm6_rmsk_TE.gtf.gz

This test data contains 4 untreated samples (GSM461176, GSM461177, GSM461178, GSM461182) and 3 samples with RNAi depleted Pasilla gene expression (GSM461179, GSM461180, GSM461181). First, let's align it in a way that is compatible with TEtranscripts. This will take about 1h.

[user@cn3144]$ module load STAR/2.7.3a
[user@cn3144]$ genome=/fdb/STAR_indices/2.7.3a/UCSC/dm6/genes-75
[user@cn3144]$ mkdir -p 02aln
[user@cn3144]$ for fq in 01fastq/*.fastq.gz ; do
     STAR --runThreadN $SLURM_CPUS_PER_TASK \
          --genomeDir $genome \
          --readFilesCommand zcat \
          --outSAMtype BAM Unsorted \
          --winAnchorMultimapNmax 200 \
          --outFilterMultimapNmax 100 \
          --readFilesIn "$fq" \
          --outFileNamePrefix "./02aln/$(basename $fq .fastq.gz)"
 done

Now we can run TEtoolkit on the alignments.

[user@cn3144]$ untreated=(
     02aln/GSM461176Aligned.out.bam
     02aln/GSM461177Aligned.out.bam
     02aln/GSM461178Aligned.out.bam
     02aln/GSM461182Aligned.out.bam )
[user@cn3144]$ treated=(
     02aln/GSM461179Aligned.out.bam
     02aln/GSM461180Aligned.out.bam
     02aln/GSM461181Aligned.out.bam )
[user@cn3144]$ TEtranscripts --mode multi \
    --TE dm6_rmsk_TE.gtf \
    --GTF /fdb/STAR_indices/2.7.3a/UCSC/dm6/genes.gtf \
    --minread 100 \
    --project test \
    -c "${untreated[@]}" \
    -t "${treated[@]}"
...
[user@cn3144]$ ls -1
drwxr-xr-x 2 user group 4.0K Apr 10 08:50 01fastq
drwxr-xr-x 2 user group 4.0K Apr 10 09:38 02aln
-rw-r--r-- 1 user group 5.3M Apr 10 08:51 dm6_rmsk_TE.gtf
-rw-r--r-- 1 user group 521K Apr 10 14:13 test.cntTable
-rw-r--r-- 1 user group  735 Apr 10 14:13 test_DESeq2.R
-rw-r--r-- 1 user group 878K Apr 10 14:13 test_gene_TE_analysis.txt
-rw-r--r-- 1 user group  94K Apr 10 14:13 test_sigdiff_gene_TE.txt

[user@cn3144]$ head -5 test_sigdiff_gene_TE.txt
baseMean        log2FoldChange  lfcSE   stat    pvalue  padj
AGBE    1668.50772310032        0.489686064951153       0.162938655205519       3.00534004244418        0.00265284088823233   0.0288848429674012
AGO2    8993.79284144723        -0.265578080875274      0.0892885493032728      -2.97438006270239       0.0029358119956592    0.0313958714926737
ATP7    1750.30129130752        0.526853871760357       0.161346748429612       3.26535165343106        0.00109328255464026   0.0139041285415587
Aats-cys        1256.3667600502 -0.670235513690024      0.105205551162389       -6.37072384759898       1.8813812974601e-10   1.13699601511619e-08

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. tetoolkit.sh), which uses the input file 'tetoolkit.in'. For example:

#! /bin/bash
# this is tetranscripts.sh

module load tetoolkit/2.1.4 || exit 1
untreated=(
     02aln/GSM461176Aligned.out.bam
     02aln/GSM461177Aligned.out.bam
     02aln/GSM461178Aligned.out.bam
     02aln/GSM461182Aligned.out.bam )
treated=(
     02aln/GSM461179Aligned.out.bam
     02aln/GSM461180Aligned.out.bam
     02aln/GSM461181Aligned.out.bam )
TEtranscripts --mode multi \
    --TE dm6_rmsk_TE.gtf \
    --GTF /fdb/STAR_indices/2.7.3a/UCSC/dm6/genes.gtf \
    --minread 100 \
    --project test \
    -c "${untreated[@]}" \
    -t "${treated[@]}"

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=2 --mem=10g tetoolkit.sh