A package for including transposable elements in differential enrichment analysis of sequencing datasets.
$TETOOLKIT_TEST_DATA
/fdb/tetoolkit/
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=10g --cpus-per-task=12 --gres=lscratch:50 salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144]$ module load tetoolkit [user@cn3144]$ cp -rL $TETOOLKIT_TEST_DATA/01fastq . [user@cn3144]$ cp $TETOOLKIT_TEST_DATA/GTF/dm6_rmsk_TE.gtf.gz . [user@cn3144]$ gunzip dm6_rmsk_TE.gtf.gz
This test data contains 4 untreated samples (GSM461176, GSM461177, GSM461178, GSM461182) and 3 samples with RNAi depleted Pasilla gene expression (GSM461179, GSM461180, GSM461181). First, let's align it in a way that is compatible with TEtranscripts. This will take about 1h.
[user@cn3144]$ module load STAR/2.7.3a [user@cn3144]$ genome=/fdb/STAR_indices/2.7.3a/UCSC/dm6/genes-75 [user@cn3144]$ mkdir -p 02aln [user@cn3144]$ for fq in 01fastq/*.fastq.gz ; do STAR --runThreadN $SLURM_CPUS_PER_TASK \ --genomeDir $genome \ --readFilesCommand zcat \ --outSAMtype BAM Unsorted \ --winAnchorMultimapNmax 200 \ --outFilterMultimapNmax 100 \ --readFilesIn "$fq" \ --outFileNamePrefix "./02aln/$(basename $fq .fastq.gz)" done
Now we can run TEtoolkit on the alignments.
[user@cn3144]$ untreated=( 02aln/GSM461176Aligned.out.bam 02aln/GSM461177Aligned.out.bam 02aln/GSM461178Aligned.out.bam 02aln/GSM461182Aligned.out.bam ) [user@cn3144]$ treated=( 02aln/GSM461179Aligned.out.bam 02aln/GSM461180Aligned.out.bam 02aln/GSM461181Aligned.out.bam ) [user@cn3144]$ TEtranscripts --mode multi \ --TE dm6_rmsk_TE.gtf \ --GTF /fdb/STAR_indices/2.7.3a/UCSC/dm6/genes.gtf \ --minread 100 \ --project test \ -c "${untreated[@]}" \ -t "${treated[@]}" ... [user@cn3144]$ ls -1 drwxr-xr-x 2 user group 4.0K Apr 10 08:50 01fastq drwxr-xr-x 2 user group 4.0K Apr 10 09:38 02aln -rw-r--r-- 1 user group 5.3M Apr 10 08:51 dm6_rmsk_TE.gtf -rw-r--r-- 1 user group 521K Apr 10 14:13 test.cntTable -rw-r--r-- 1 user group 735 Apr 10 14:13 test_DESeq2.R -rw-r--r-- 1 user group 878K Apr 10 14:13 test_gene_TE_analysis.txt -rw-r--r-- 1 user group 94K Apr 10 14:13 test_sigdiff_gene_TE.txt [user@cn3144]$ head -5 test_sigdiff_gene_TE.txt baseMean log2FoldChange lfcSE stat pvalue padj AGBE 1668.50772310032 0.489686064951153 0.162938655205519 3.00534004244418 0.00265284088823233 0.0288848429674012 AGO2 8993.79284144723 -0.265578080875274 0.0892885493032728 -2.97438006270239 0.0029358119956592 0.0313958714926737 ATP7 1750.30129130752 0.526853871760357 0.161346748429612 3.26535165343106 0.00109328255464026 0.0139041285415587 Aats-cys 1256.3667600502 -0.670235513690024 0.105205551162389 -6.37072384759898 1.8813812974601e-10 1.13699601511619e-08 [user@cn3144]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf]$
Create a batch input file (e.g. tetoolkit.sh), which uses the input file 'tetoolkit.in'. For example:
#! /bin/bash # this is tetranscripts.sh module load tetoolkit/2.1.4 || exit 1 untreated=( 02aln/GSM461176Aligned.out.bam 02aln/GSM461177Aligned.out.bam 02aln/GSM461178Aligned.out.bam 02aln/GSM461182Aligned.out.bam ) treated=( 02aln/GSM461179Aligned.out.bam 02aln/GSM461180Aligned.out.bam 02aln/GSM461181Aligned.out.bam ) TEtranscripts --mode multi \ --TE dm6_rmsk_TE.gtf \ --GTF /fdb/STAR_indices/2.7.3a/UCSC/dm6/genes.gtf \ --minread 100 \ --project test \ -c "${untreated[@]}" \ -t "${treated[@]}"
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=10g tetoolkit.sh