High-Performance Computing at the NIH
Pseudogenome Tools on Biowulf & Helix

Pseudogenome Tools is a suite of tools that simplify the incorporation of pseudogenomes into standard analysis and hiseq pipelines.

Modtools is used to generate standard reference genome and pseudogenome sequences.

Lapels is used to remap pseudogenome alignments, in the form of a BAM file, back to the reference sequence. This entails removing all indels (by modifying the CIGAR string; the underlying sequence is unaltered) and adjusting the starting positions of the fragment and its mate. Lapels also annotates the number and types (SNPs, insertions, and deletions) of sequence variants seen in each read. The input consists of the BAM file of the pseudogenome alignment and the MOD file associated with the FASTA sequences used in the alignment. (Please bundle the MOD and FASTA files when downloading.) The output is a BAM file with corrected read positions, CIGAR strings, and annotated tags. It has been tested for compatibility with downstream tools such as IGV (using the reference genome) and Cufflinks (using any reference-based transcript library).

Suspenders merges the results of multiple alignments (BAM files) applied to the same set of reads. It is used when working with F1 and RIX crosses, where we suggest performing a separate alignment to each parental genome. Suspenders then merges and annotates these separate BAM files into a single consensus BAM file. When a read maps to the same genomic location in both alignments, only one copy is output. Where the alignments differ in either mapping position or multiplicity, Suspenders determines the most likely alignment and source genome for the read and writes that alignment to the output BAM file. When there is no significant difference between the alignments, all of the multiple mappings are output.

The following apps belong to this package:

hap2mod  get_refmeta  vcf2mod  insilico  refmaker  modstat  modmap  pylapels  fixmate  pysuspenders

Running on Helix

$ module load pseudogenome
$ cd /data/$USER/
$ vcf2mod  
usage: vcf2mod [-h] [-q | -v] [-f] [-n] [-a alias.csv] [-c chromList] [-o mod] \
ref_name ref_meta_fn sample_name vcf [vcf ...]

Running a single batch job on Biowulf

1. Create a script file similar to the lines below.


#!/bin/bash
module load pseudogenome
cd /data/$USER/
pylapels -p $SLURM_CPUS_PER_TASK in.mod in.bam

2. Submit the script on Biowulf:

$ sbatch --cpus-per-task=4 jobscript

--cpus-per-task: Some binaries in this package can run multithreaded with the -p flag. To run multithreaded, pass -p $SLURM_CPUS_PER_TASK in the script and set the --cpus-per-task flag when submitting the job.
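
If the same script is sometimes run outside a batch job, $SLURM_CPUS_PER_TASK may be unset and -p would receive an empty value. The following is a minimal sketch of one way to guard against that. The pick_threads helper is hypothetical (it is not part of the package), and the fallback of one thread is an assumption:

```shell
#!/bin/bash
# pick_threads: print the thread count to pass to -p.
# Uses the Slurm allocation when present; otherwise falls back to 1
# (the fallback value is an assumption, adjust as needed).
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads="$(pick_threads)"

# Print the command that would be run; in a real job script,
# drop the echo to execute pylapels directly.
echo "pylapels -p $threads in.mod in.bam"
```

Submitted with --cpus-per-task=4, the script would pass -p 4; run by hand with no allocation, it would pass -p 1.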

To request more memory than the default (4 GB), use the --mem flag:

$ sbatch --mem=10g jobscript

Running a swarm of jobs on Biowulf

Set up a swarm command file:

  cd /data/$USER/dir1; pylapels -p $SLURM_CPUS_PER_TASK in.mod in.bam
  cd /data/$USER/dir2; pylapels -p $SLURM_CPUS_PER_TASK in.mod in.bam
  cd /data/$USER/dir3; pylapels -p $SLURM_CPUS_PER_TASK in.mod in.bam

Submit the swarm file:

  $ swarm -f swarmfile -t 4 --module pseudogenome

-f: name of the swarm command file
-t: number of threads per command; pass '-p $SLURM_CPUS_PER_TASK' in each command line to use them
--module: load the named module (setting its environment variables) for each command line in the file

To allocate more memory, use the -g flag:

  $ swarm -f swarmfile -g 10 --module pseudogenome

-g: memory to allocate per command, in GB

For more information regarding running swarm, see swarm.html
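
When there are many input directories, the swarm command file can be generated rather than typed by hand. The sketch below is one way to do it. The make_swarm_lines helper is hypothetical, and it assumes every directory holds files named in.mod and in.bam and that directory names contain no spaces:

```shell
#!/bin/bash
# make_swarm_lines: print one swarm command line per directory argument.
make_swarm_lines() {
    for dir in "$@"; do
        # The single-quoted format keeps $SLURM_CPUS_PER_TASK literal,
        # so it is expanded when the swarm job runs, not here.
        printf 'cd %s; pylapels -p $SLURM_CPUS_PER_TASK in.mod in.bam\n' "$dir"
    done
}

# Print the lines for three example directories.
make_swarm_lines /data/"$USER"/dir1 /data/"$USER"/dir2 /data/"$USER"/dir3
```

Redirecting the output to a file (for example, > swarmfile) gives a file that can be submitted with the swarm command shown earlier.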

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive 
salloc.exe: Granted job allocation 16535

cn999$ module load pseudogenome
cn999$ cd /data/$USER/
cn999$ pylapels in.mod in.bam

cn999$ exit


Make sure to exit the job once finished.

If more memory is needed, use the --mem flag. For example:

biowulf$ sinteractive --mem=10g



For usage information on any tool in this package, type the command name followed by -h at the command prompt (for example, vcf2mod -h).
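
To review the help text of every tool in the package in one pass, a small loop can be used. This is a sketch only. The show_help helper is hypothetical, and it reports any tool that is not on the PATH instead of failing:

```shell
#!/bin/bash
# show_help: print the -h output of each named tool,
# or a note when the tool cannot be found on the PATH.
show_help() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $tool =="
            "$tool" -h
        else
            echo "$tool: not found (run 'module load pseudogenome' first)"
        fi
    done
}

show_help hap2mod get_refmeta vcf2mod insilico refmaker \
          modstat modmap pylapels fixmate pysuspenders
```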