High-Performance Computing at the NIH
tetoolkit on Biowulf & Helix

Description

TEToolkit contains two tools:

  1. TEpeaks is an extension of the MACS peak finder for ChIP-seq data. It takes into account reads with more than one possible alignment, offers a different normalization method, and uses DESeq for differential binding analysis. Including ambiguously mapped reads makes it possible to identify enriched repetitive regions.
  2. TEtranscripts takes RNA-Seq data, quantifies the expression of genes and transposable elements, and uses DESeq for differential expression analysis.

Data

Annotated GTF files for selected organisms can be found in

/fdb/tetoolkit

References

Web sites

On Helix
Set up the environment and run TEtranscripts on the single-end test data set available in /usr/local/apps/tetoolkit/TEST_DATA:
helix$ module load tetoolkit
helix$ cp /usr/local/apps/tetoolkit/TEST_DATA/testdata_SE/* .
helix$ cp /usr/local/apps/tetoolkit/TEST_DATA/testdata_GTF/dm3* .
helix$ gunzip *.gz
helix$ TEtranscripts --sortByPos --mode multi \
  --TE dm3_rmsk_TE.gtf --GTF dm3_refGene.gtf \
  --project singleEnd_test -t test_data_SE_treatment.bam \
  -c test_data_SE_control.bam
helix$ ls -1
dm3_refGene.gtf
dm3_rmsk_TE.gtf
singleEnd_test.cntTable
singleEnd_test_DESeq.R
singleEnd_test_gene_TE_analysis.txt
singleEnd_test_sigdiff_gene_TE.txt
test_data_SE_control.bam
test_data_SE_treatment.bam
helix$ head singleEnd_test.cntTable
gene/TE test_data_SE_treatment.bam.T    test_data_SE_control.bam.C
"128up" 32      15
"14-3-3epsilon" 471     442
"14-3-3zeta"    449     382
"140up" 5       5
"18w"   15      11
"26-29-p"       34      43
"2mit"  0       0
"312"   2       3
"4EHP"  14      24
Batch job on Biowulf

Set up a batch script similar to the following:

#! /bin/bash
# this is tetranscripts.sh

module load tetoolkit || exit 1
TEtranscripts --sortByPos --mode multi --verbose \
  --TE dm3_rmsk_TE.gtf --GTF dm3_refGene.gtf \
  --project singleEnd_test -t test_data_SE_treatment.bam \
  -c test_data_SE_control.bam 2> log

Submit to the batch queue with

b2$ sbatch tetranscripts.sh
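A batch job that starts without its input files fails only after it has waited in the queue. One possible safeguard is a small pre-flight check placed before the TEtranscripts call in the script. The helper name check_inputs below is hypothetical and not part of tetoolkit; it only tests that each file exists and is not empty.

```shell
# check_inputs FILE...: report every missing or empty file on stderr;
# return non-zero if any was found (hypothetical helper, not part of tetoolkit).
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty input: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Possible use inside tetranscripts.sh, before the TEtranscripts command:
# check_inputs dm3_rmsk_TE.gtf dm3_refGene.gtf \
#     test_data_SE_treatment.bam test_data_SE_control.bam || exit 1
```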
Swarm of jobs on Biowulf

Set up a swarm file similar to the following (one command per line; line continuations are optional):

TEtranscripts --sortByPos --mode multi \
  --TE dm3_rmsk_TE.gtf --GTF dm3_refGene.gtf \
  --project singleEnd_test1 -t treatment1.bam -c control1.bam
TEtranscripts --sortByPos --mode multi \
  --TE dm3_rmsk_TE.gtf --GTF dm3_refGene.gtf \
  --project singleEnd_test2 -t treatment2.bam -c control2.bam

Submit the jobs to the batch system with

b2$ swarm -f tetranscripts.swarm
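With many sample pairs, a swarm file like the one above can be generated instead of typed by hand. The sketch below assumes the BAM files follow the treatmentN.bam / controlN.bam naming used in the example; the function name make_swarm is hypothetical. It writes each command on a single line, which is equivalent to the continued lines shown above.

```shell
# make_swarm N: print one TEtranscripts command per sample pair 1..N
# (hypothetical helper; file naming follows the example swarm file).
make_swarm() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    printf 'TEtranscripts --sortByPos --mode multi --TE dm3_rmsk_TE.gtf --GTF dm3_refGene.gtf --project singleEnd_test%s -t treatment%s.bam -c control%s.bam\n' "$i" "$i" "$i"
    i=$((i + 1))
  done
}

# Print the commands for two pairs.
make_swarm 2
```

Redirect the output to a file (for example, make_swarm 2 > tetranscripts.swarm) and submit it as shown above.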
Interactive job on Biowulf

Allocate an interactive session, load the module, and then run the tools as described in the previous sections:

b2$ sinteractive
salloc.exe: Pending job allocation 5568277
salloc.exe: job 5568277 queued and waiting for resources
salloc.exe: job 5568277 has been allocated resources
salloc.exe: Granted job allocation 5568277
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1883 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
cn1883$ module load tetoolkit
cn1883$ TEtranscripts --sortByPos --mode multi \
  --TE dm3_rmsk_TE.gtf --GTF dm3_refGene.gtf \
  --project singleEnd_test -t test_data_SE_treatment.bam \
  -c test_data_SE_control.bam
cn1883$ exit
b2$
Documentation