Biowulf High Performance Computing at the NIH
seqOutBias on Biowulf

From the seqOutBias documentation:

Molecular biology enzymes have nucleic acid preferences for their substrates; the preference of an enzyme is typically dictated by the sequence at or near the active site of the enzyme. This bias may result in spurious read count patterns when used to interpret high-resolution molecular genomics data. The seqOutBias program aims to correct this issue by scaling the aligned read counts by the ratio of genome-wide observed read counts to the expected sequence based counts for each k-mer. The sequence based k-mer counts take into account mappability at a given read length using Genome Tools’ Tallymer program.
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. In this example we will create a normalized bigWig of Mouse ENCODE DNase-Seq data (cell line NIH3T3) and save the resulting kmer index files for later use. Note that the original indexing run takes time.


[user@biowulf]$ sinteractive --mem=40g --gres=lscratch:150
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load seqoutbias
[user@cn3144]$ cp /fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa mm10.fa
[user@cn3144]$ # fetch mouse encode DNase-Seq data, already aligned
[user@cn3144]$ wget https://www.encodeproject.org/files/ENCFF409IYK/@@download/ENCFF409IYK.bam
[user@cn3144]$ # limit the bam file to contigs in the genome used
[user@cn3144]$ module load samtools
[user@cn3144]$ samtools index ENCFF409IYK.bam
[user@cn3144]$ samtools view -h ENCFF409IYK.bam chr{1..19} chrM chrY chrX \
                      | egrep -v 'chrUn_|_GL|_JH|chr_rDNA' \
                      | samtools view -b - \
                      > ENCFF409IYK_filtered.bam
[user@cn3144]$ # create scaled output
[user@cn3144]$ seqOutBias mm10.fa ENCFF409IYK_filtered.bam --read-size=36 \
                    --kmer-size=6 --bw=ENCFF409IYK_scaled.bw --plus-offset=3 \
                    --minus-offset=3 --shift-counts --skip-bed
[user@cn3144]$ ls -lh
-rw-r--r-- 1 user group 1.1G Jun 29 08:06 ENCFF409IYK_filtered.bam
-rw-r--r-- 1 user group 142M Jun 29 08:15 ENCFF409IYK_scaled.bw   
-rw-r--r-- 1 user group 7.4G Jun 28 22:44 mm10_36.6.3.3.tbl       
-rwxr-xr-x 1 user group 2.6G Jun 28 17:35 mm10.fa                 
-rw-r--r-- 1 user group  136 Jun 28 17:37 mm10.sft.des            
-rw-r--r-- 1 user group 650M Jun 28 17:38 mm10.sft.esq            
-rw-r--r-- 1 user group 2.6G Jun 28 18:28 mm10.sft.lcp            
-rw-r--r-- 1 user group 1.7G Jun 28 18:28 mm10.sft.llv            
-rw-r--r-- 1 user group  726 Jun 28 17:37 mm10.sft.md5            
-rw-r--r-- 1 user group  520 Jun 28 18:28 mm10.sft.prj            
-rw-r--r-- 1 user group  168 Jun 28 17:37 mm10.sft.sds            
-rw-r--r-- 1 user group   96 Jun 28 17:38 mm10.sft.ssp            
-rw-r--r-- 1 user group  21G Jun 28 18:28 mm10.sft.suf            
-rw-r--r-- 1 user group 1.9G Jun 28 19:51 mm10.tal_36.gtTxt.gz    
-rw-r--r-- 1 user group 8.2M Jun 28 18:36 mm10.tal_36.mbd         
-rw-r--r-- 1 user group  50M Jun 28 18:36 mm10.tal_36.mct         
-rw-r--r-- 1 user group 433M Jun 28 18:36 mm10.tal_36.mer         

[user@cn3144]$ # save the index for later re-use
[user@cn3144]$ mkdir /data/user/ref/seqoutbias/mm10.36.6.3.3
[user@cn3144]$ cp mm10* /data/user/ref/seqoutbias/mm10.36.6.3.3

[user@cn3144]$ # create unscaled output
[user@cn3144]$ seqOutBias mm10.fa ENCFF409IYK_filtered.bam --read-size=36 --no-scale \
                    --kmer-size=6 --bw=ENCFF409IYK_unscaled.bw --plus-offset=3 \
                    --minus-offset=3 --shift-counts --skip-bed
[user@cn3144]$ 

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

The generation of the index uses up to 25GiB of memory for this particular run and takes ~4h. If the index was created previously, memory consumption is low (less than 2GiB). So the index has to be reused for efficient processing of multiple samples.

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. seqoutbias.sh), which re-uses the the index created previously.

#!/bin/bash
module load seqoutbias/1.1.3
seqOutBias /data/user/ref/seqoutbias/mm10.36.6.3.3/mm10.fa \
  ENCFF409IYK_filtered.bam --read-size=36 \
  --kmer-size=6 --bw=ENCFF409IYK_scaled.bw --plus-offset=3 \
  --minus-offset=3 --shift-counts --skip-bed

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=2 --mem=3g seqoutbias.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. seqoutbias.swarm). For example:

seqOutBias /data/user/ref/seqoutbias/mm10.36.6.3.3/mm10.fa \
  sample1.bam --read-size=36 \
  --kmer-size=6 --bw=sample1_scaled.bw --plus-offset=3 \
  --minus-offset=3 --shift-counts --skip-bed
seqOutBias /data/user/ref/seqoutbias/mm10.36.6.3.3/mm10.fa \
  sample2.bam --read-size=36 \
  --kmer-size=6 --bw=sample2_scaled.bw --plus-offset=3 \
  --minus-offset=3 --shift-counts --skip-bed

Submit this job using the swarm command.

swarm -f seqoutbias.swarm -g 3 -t 1 -p2 --module seqoutbias/1.1.3
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module seqoutbias Loads the seqoutbias module for each subjob in the swarm