High-Performance Computing at the NIH
Telseq on Biowulf and Helix

TelSeq is a software package that estimates telomere length from whole-genome sequencing data (BAM files).

Running on Helix

Sample session:

helix$ module load telseq
helix$ telseq 
Program: TelSeq
Version: 0.0.1
Contact: Zhihao Ding [zd1@sanger.ac.uk]

Usage: telseq [OPTION] <in.1.bam> <in.2.bam> <...>
Scan BAM and estimate telomere length.
   <in.bam>                 one or more BAM files to be analysed. File names can also be passed from a pipe,
                            with each row containing one BAM path.
   -f, --bamlist=STR        a file that contains a list of file paths of BAMs. It should has only one column,
                            with each row a BAM file path. -f has higher priority than <in.bam>. When specified,
                            <in.bam> are ignored.
   -o, --output_dir=STR     output file for results. Ignored when input is from stdin, in which case output will be stdout. 
   -H                       remove header line, which is printed by default.
   -h                       print the header line only. The text can be used to attach to result files, useful
                            when the headers of the result files are suppressed. 
   -m                       merge read groups by taking a weighted average across read groups of a sample, weighted by 
                            the total number of reads in read group. Default is to output each readgroup separately.
   -u                       ignore read groups. Treat all reads in BAM as if they were from a same read group.
   -k                       threshold of the amount of TTAGGG/CCCTAA repeats in read for a read to be considered telomeric. default = 7.

Testing functions
------------
   -r                       read length. default = 100
   -z                       use user specified pattern for searching [ATGC]*.
   -e, --exomebed=STR       specifiy exome regions in BED format. These regions will be excluded 
   -w,                      consider BAMs in the speicfied bamlist as one single BAM. This is useful when 
                            the initial alignemt is separated for some reason, such as one for mapped and one for ummapped reads. 
   --help                   display this help and exit

Report bugs to zd1@sanger.ac.uk
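
The -f option described above takes a one-column file of BAM paths. The following is a minimal sketch of how such a list could be built; the helper name make_bamlist and all paths are hypothetical placeholders, not part of TelSeq.

```shell
#!/bin/bash
# Hypothetical helper: write one BAM path per line, which is the
# single-column format the -f/--bamlist option expects.
make_bamlist() {
    local bam_dir="$1" list_file="$2"
    # Only look at *.bam files directly inside bam_dir; sort for a stable order.
    find "$bam_dir" -maxdepth 1 -name '*.bam' | sort > "$list_file"
}

# Example usage (placeholder paths, adjust to your own data):
#   make_bamlist /data/$USER/bams /data/$USER/bamlist.txt
#   telseq -f /data/$USER/bamlist.txt -o /data/$USER/telseq_results.txt
```

The paths in the list are written as find reports them, so pass an absolute directory if the list will be used from a different working directory.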

Submitting a single batch job

1. Create a script file containing lines similar to those below. Change the directory path to match the location of your own data before running.

#!/bin/bash 

module load telseq
cd /data/$USER/somewhere
telseq infile
....
....

2. Submit the script on Biowulf.

$ sbatch myscript

See the Biowulf user guide for more options, such as allocating more memory or a longer walltime.
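
Memory and walltime can also be requested inside the script itself using the standard Slurm #SBATCH directives --mem and --time. The sketch below writes such a script; the helper name write_telseq_job, the 4 GB and 8 hour values, and the output file naming are illustrative assumptions, not recommended settings.

```shell
#!/bin/bash
# Hypothetical helper: write a batch script that runs telseq on one BAM file.
# The resource values are placeholders; choose them to suit your data.
write_telseq_job() {
    local workdir="$1" bam="$2" script="$3"
    # Unquoted heredoc delimiter: the variables are expanded now, so the
    # generated script contains the literal directory and file names.
    cat > "$script" <<EOF
#!/bin/bash
#SBATCH --mem=4g
#SBATCH --time=08:00:00

module load telseq
cd "$workdir" || exit 1
telseq -o "${bam%.bam}.telseq.txt" "$bam"
EOF
}

# Example usage (placeholder paths):
#   write_telseq_job /data/$USER/somewhere sample.bam myscript
#   sbatch myscript
```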

Submitting a swarm of jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/$USER/swarmfile). Here is a sample file:

cd /data/user/run1/; telseq infile
cd /data/user/run2/; telseq infile
cd /data/user/run3/; telseq infile
........

The -f flag is required; it specifies the name of the swarm command file.

Submit the swarm job:

$ swarm -f swarmfile --module telseq

- Use the -g flag to request more memory (default 1.5 GB per line in the swarm file)

- Use the --time flag to request a longer walltime (default 4 hours)

For more information regarding running swarm, see swarm.html
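
When there are many run directories, the swarm command file can be generated rather than typed by hand. This is a minimal sketch, assuming each run directory sits directly under one data directory and holds its BAM files at the top level; the helper name make_swarmfile, the directory layout, and the output file naming are hypothetical. It also assumes paths contain no spaces.

```shell
#!/bin/bash
# Hypothetical helper: emit one swarm command line per BAM file found in
# the immediate subdirectories of data_dir (e.g. run1/, run2/, ...).
make_swarmfile() {
    local data_dir="$1" swarm_file="$2" bam
    : > "$swarm_file"                      # start with an empty file
    for bam in "$data_dir"/*/*.bam; do
        [ -e "$bam" ] || continue          # skip the unexpanded glob if nothing matches
        printf 'cd %s; telseq -o %s %s\n' \
            "$(dirname "$bam")" \
            "$(basename "$bam" .bam).telseq.txt" \
            "$(basename "$bam")" >> "$swarm_file"
    done
}

# Example usage (placeholder paths and resource values):
#   make_swarmfile /data/$USER /data/$USER/swarmfile
#   swarm -f /data/$USER/swarmfile -g 4 --time 08:00:00 --module telseq
```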

 

Running an interactive job

Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as shown below and run the job there.

[user@biowulf]$ sinteractive 

[user@pXXXX]$ cd /data/$USER/myruns

[user@pXXXX]$ module load telseq

[user@pXXXX]$ telseq infile
[user@pXXXX]$ exit
slurm stepepilog here!
                   
[user@biowulf]$ 

Documentation

https://github.com/zd1/telseq