High-Performance Computing at the NIH
CGAT on Biowulf & Helix

Description

The CGAT code collection grew out of a decade of comparative genomics work by the Ponting group. It has since been extended with tools for next-generation sequencing analysis.

There may be multiple versions of CGAT available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail CGAT 

To select a module use

module load CGAT/[version]

where [version] is the version of choice.

Environment variables set

$CGAT_TESTDATA: directory containing the test data used in the examples below

Dependencies

Dependencies are loaded automatically by the CGAT environment module.

Documentation

On Helix

Load the module

helix$ module load CGAT
[+] Loading rpy2 2.7.0 (R version 3.2.2) ...
[+] Loading bedtools 2.25.0
[+] Loading UCSC Utilities 314 ...
[+] Loading CGAT 0.2.5 ...

Create three bed files with the overlapping intervals shown below, then count how many of the files have intervals overlapping each interval in the union of all intervals:

                 1         2         3         4
       012345678901234567890123456789012345678901234
a.bed: -------          -----               -------
b.bed:      -----        --
c.bed:  ---

Union: ----------       -----               -------

helix$ tr '|' '\t' > a.bed <<EOF
chr1|0|7
chr1|17|22
chr1|37|44
EOF
helix$ tr '|' '\t' > b.bed <<EOF
chr1|5|10
chr1|18|20
EOF
helix$ tr '|' '\t' > c.bed <<EOF
chr1|1|4
EOF
helix$ cgat beds2counts a.bed b.bed c.bed
[...snip...]
contig  start   end     count
## 2015-07-30 14:03:17,818 INFO outputting result
chr1    37      44      1
chr1    17      22      2
chr1    0       10      3
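To make the counting logic concrete, the following is a minimal sketch that reproduces it with sort and awk only. It is not how CGAT implements beds2counts. It assumes half-open bed coordinates, merges touching as well as overlapping intervals when building the union (an assumption that does not matter for this data), and writes to a scratch directory; the file name union.bed is chosen here for illustration.

```shell
#!/bin/sh
# Sketch of the counting logic only; not CGAT's implementation.
set -eu

workdir=$(mktemp -d)   # scratch directory for the illustration
cd "$workdir"

# Same intervals as in the example above, tab-separated.
printf 'chr1\t0\t7\nchr1\t17\t22\nchr1\t37\t44\n' > a.bed
printf 'chr1\t5\t10\nchr1\t18\t20\n' > b.bed
printf 'chr1\t1\t4\n' > c.bed

# Step 1: union of all intervals (merge overlapping or touching ones).
sort -k1,1 -k2,2n a.bed b.bed c.bed | awk 'BEGIN { OFS = "\t" }
  NR == 1 { c = $1; s = $2 + 0; e = $3 + 0; next }
  $1 == c && $2 + 0 <= e { if ($3 + 0 > e) e = $3 + 0; next }
  { print c, s, e; c = $1; s = $2 + 0; e = $3 + 0 }
  END { if (NR > 0) print c, s, e }' > union.bed

# Step 2: for each union interval, count the input files that have
# at least one interval overlapping it.
counts=$(awk 'BEGIN { OFS = "\t" }
  FILENAME == "union.bed" { n++; c[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }
  { for (i = 1; i <= n; i++)
      if ($1 == c[i] && $2 + 0 < e[i] && $3 + 0 > s[i]) seen[i, FILENAME] = 1 }
  END { for (i = 1; i <= n; i++) {
          k = 0
          for (f = 2; f < ARGC; f++) if ((i, ARGV[f]) in seen) k++
          print c[i], s[i], e[i], k } }' union.bed a.bed b.bed c.bed)

printf '%s\n' "$counts"
```

For the three files above this yields the same three union intervals with counts 3, 2 and 1, listed in ascending start order rather than the order shown in the cgat output.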

Collect some summary stats on a fasta file

helix$ cgat fasta2table --section=length,na,dn < $CGAT_TESTDATA/na_test.fasta

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is cgat_job.sh

module load CGAT || exit 1
cgat bam2wiggle --output-format=bigwig \
  $CGAT_TESTDATA/paired_shifted.bam paired.bw

Submit to the queue with sbatch:

biowulf$ sbatch cgat_job.sh

Swarm of jobs on Biowulf

Create a swarm command file similar to the following example:

# this file is cgat_jobs.swarm
cgat bam2wiggle --output-format=bigwig in1.bam out1.bw
cgat bam2wiggle --output-format=bigwig in2.bam out2.bw
cgat bam2wiggle --output-format=bigwig in3.bam out3.bw
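With more than a handful of samples, a command file like the one above may be easier to generate than to type. The following is a small sketch under the assumption that inputs follow the inN.bam / outN.bw naming of the example; make_swarm_lines is a hypothetical helper defined here, not part of CGAT or swarm.

```shell
# Sketch: print one bam2wiggle command per sample number, following the
# inN.bam -> outN.bw naming used in the example above.
# make_swarm_lines is a hypothetical helper, not part of CGAT or swarm.
make_swarm_lines() {
    for i in "$@"; do
        printf 'cgat bam2wiggle --output-format=bigwig in%s.bam out%s.bw\n' "$i" "$i"
    done
}

# Write the command file for samples 1 to 3.
make_swarm_lines 1 2 3 > cgat_jobs.swarm
```

Each line of the resulting file is an independent command, so the sample numbers can be passed in any order.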

Submit to the queue with swarm:

biowulf$ swarm -f cgat_jobs.swarm

Interactive job on Biowulf

Allocate an interactive session with sinteractive and run CGAT as described above:

biowulf$ sinteractive 
node$ module load CGAT
node$ cgat fasta2table ...
node$ exit
biowulf$