High-Performance Computing at the NIH
HTSeq on Helix & Biowulf

HTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays. It is developed by Simon Anders at EMBL Heidelberg.

Example files can be downloaded from http://www-huber.embl.de/users/anders/HTSeq/HTSeq_example_data.tgz.

For more detailed command examples, see http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html#tour

Running on Helix
$ module load htseq
$ htseq-count [options] alignment_file gff_file
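
As an illustration only (the file names and option values below are placeholders, not recommendations for your data), a typical gene-level count from a BAM file against a GTF annotation might look like:

$ htseq-count -f bam -s no -t exon -i gene_id sample.bam genes.gtf > sample.counts.txt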

Running a batch job

First, create two scripts for each sample you are going to process.

The first is the Python script that you plan to use for HTSeq, along the following lines:

#!/usr/local/Python/2.7.8/bin/python
import HTSeq
fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )
# ... further HTSeq commands for your analysis go here ...

Let's just call this Python script /data/$USER/htseq/run1/htseq1.py
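
As a fuller sketch of what htseq1.py might contain (assuming the 36 bp reads from the HTSeq example data linked above), the following computes the average quality score at each read position:

#!/usr/local/Python/2.7.8/bin/python
import numpy
import HTSeq

# read the example fastq file (Solexa quality encoding)
fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )

# accumulate quality scores position by position (assumes 36 bp reads)
qualsum = numpy.zeros( 36, numpy.int64 )
nreads = 0
for read in fastq_file:
    qualsum += read.qual
    nreads += 1

# report the mean quality at each read position
print( qualsum / float( nreads ) )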

The second script is a bash wrapper that calls this Python script and looks something like this:

#!/bin/bash

module load htseq

cd /data/$USER/htseq/run1
/data/$USER/htseq/run1/htseq1.py
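
Because the wrapper calls the Python script directly, the script must be executable (alternatively, invoke it as 'python /data/$USER/htseq/run1/htseq1.py'):

$ chmod +x /data/$USER/htseq/run1/htseq1.py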

Submit the second script from the Biowulf login node:

$ sbatch --mem=10g myscript
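
Additional resources can be requested on the sbatch command line; the values below are illustrative only and should be adjusted to your data:

$ sbatch --mem=20g --time=8:00:00 myscript
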
Running an interactive job

Users may occasionally need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below and run the interactive job there.


[user@biowulf]$ sinteractive --mem=10g
salloc.exe: Granted job allocation 1528

[user@pXXXX]$ cd /data/$USER/myruns

[user@pXXXX]$ module load htseq

[user@pXXXX]$ python
Python 2.7.2 (default, Aug 15 2011, 13:51:43) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )
>>> [...etc...]
>>> quit()

[user@pXXXX]$ exit
salloc.exe: Relinquishing job allocation 1528
[user@biowulf]$ 

Running a swarm job

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/$USER/cmdfile). Here is a sample file:

cd /data/$USER/Dir1; htseq-count [options] sam_file gff_file
cd /data/$USER/Dir2; htseq-count [options] sam_file gff_file
[.....]
cd /data/$USER/Dir15; htseq-count [options] sam_file gff_file
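
A command file like this can also be generated with a small shell loop; the directory pattern, alignment file, annotation, and htseq-count options below are placeholders for illustration:

for d in /data/$USER/Dir{1..15}; do
    echo "cd $d; htseq-count -f bam -s no aln.bam genes.gtf > counts.txt"
done > /data/$USER/cmdfile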

The '-f' and '--module' options for swarm are required:

swarm -f /data/$USER/cmdfile --module htseq

To request more memory, use the -g flag. For example, to request 10 GB of memory per process:

swarm -g 10 -f /data/$USER/cmdfile --module htseq

For more information regarding running swarm, see the swarm documentation (swarm.html).

Documentation

HTSeq documentation at embl.de