Biowulf High Performance Computing at the NIH
mbin on Biowulf

The mBin pipeline is designed to discover the unique signals of DNA methylation in metagenomic SMRT sequencing reads and leverage them for organism binning of assembled contigs or unassembled reads. Because all cellular DNA is modified by the same set of methyltransferases encoded in the genome, DNA methylation signals can be used for binning not just chromosomal sequences, but also extrachromosomal mobile genetic elements like plasmids.


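Important Notes

Module name: mbin
Multithreaded: buildcontrols accepts --procs; the examples on this page set it from $SLURM_CPUS_PER_TASK
Test data: $MBIN_TEST_DATA (available after loading the module)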

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

[user@biowulf ~]$ sinteractive -c 4 --mem 10g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load mbin

[user@cn3144 ~]$ cp $MBIN_TEST_DATA/* .

[user@cn3144 ~]$ buildcontrols -i --procs=$SLURM_CPUS_PER_TASK --control_pkl_name=control_means.pkl aligned_reads.cmp.h5 
2019-04-18 14:37:44 [INFO] Initiating dictionary of all possible motifs...
2019-04-18 14:37:44 [INFO]   - Adding 256 4-mer motifs...
2019-04-18 14:37:44 [INFO] Done: 256 possible contiguous motifs

2019-04-18 14:37:44 [INFO]   - Adding 1024 5-mer motifs...
2019-04-18 14:37:44 [INFO] Done: 1536 possible contiguous motifs

2019-04-18 14:37:44 [INFO]   - Adding 4096 6-mer motifs...
2019-04-18 14:37:44 [INFO] Done: 7680 possible contiguous motifs

2019-04-18 14:37:44 [INFO]   - Adding bipartite motifs to search space...
2019-04-18 14:37:45 [INFO] Done: 194560 possible bipartite motifs
2019-04-18 14:38:37 [INFO]   - Control data: chunk 98/100
2019-04-18 14:38:38 [INFO] Combining motifs from all chunks of control data...
2019-04-18 14:38:38 [INFO] Done.
2019-04-18 14:38:38 [INFO] 
2019-04-18 14:38:38 [WARNING] WARNING: could not find sufficient instances (>=10) for 197495 motifs (out of 202240 total) in control data!
2019-04-18 14:38:38 [WARNING]    * If this is alarming, try reducing --min_motif_count or increasing --N_reads, although you just might not have those motifs in your reference sequence.
2019-04-18 14:38:38 [INFO] 
2019-04-18 14:38:38 [INFO] Writing control data to a pickled file: /spin1/scratch/teja/test/control_means.pkl
2019-04-18 14:38:39 [INFO] 
2019-04-18 14:38:39 [INFO] Cleaning up temp files from control data processing...
mBin control extraction has finished running. See log for details.

[user@cn3144 ~]$ mapfeatures --help
Usage: mapfeatures [--help] [options] _methyl_features.txt _other_features.txt

  mapfeatures visualizes the landscape of high-dimensional sequence features 
  using the Barnes Hut approximation of t-SNE (PCA support coming soon). The 
  sequence features that are output from methylprofiles are often high-
  dimensional (>3D), making it difficult to visualize the sequences. To ease 
  this visualization for resolution of discrete sequence clusters in the feature 
  space, t-SNE is used to reduce the dimensionality of the methylation, 
  composition, and coverage features to 2D. The resulting 2D maps, which can be 
  overlaid with sequence annotation labels (generated with Kraken, for instance), 
  often reveals sequence clustering in the 2D feature space representing distinct
  taxonomical groups for binning.

  Two files from methylprofiles serve as input to mapfeatures: 
     (1) <prefix>_<seq_type>_methyl_features.txt
     (2) <prefix>_<seq_type>_other_features.txt 

  <prefix> is defined by --prefix and <seq_type> is the sequence data type: contig, align, 
  or read.
  -h, --help            show this help message and exit
  -d, --debug           Increase verbosity of logging
  -i, --info            Add basic logging
  --logFile=LOGFILE     Write logging to file [log.controls]
  --prefix=PREFIX       Prefix to use for output files [None]
  --size_markers        Adjust marker size in plot according to sequence
                        length [False]
                        Dimensionality reduction algorithm to apply (bhtsne or
                        pca) [bhtsne]
  --labels=LABELS       Tab-delimited file (no header) of sequence labels
                        (seq_name     label_name) [None]
  --l_min=L_MIN         Minimum read length to include for analysis [0]
  --n_seqs=N_SEQS       Number of sequences to subsample [all]
  --n_dims=N_DIMS       Number of dimensions to reduce to for visualization
                        (only n_dims=2 will be plotted) [2]
  --n_iters=N_ITERS     Number of iterations to use for BH-tSNE [500]

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
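
The WARNING line in the buildcontrols output above reports how many motifs fell below the minimum instance count. As a rough sketch, and assuming the log keeps the exact wording shown in the sample session, a small shell function can turn that line into a percentage. The function name motif_coverage is hypothetical and not part of mBin.

```shell
# motif_coverage: read buildcontrols log text on stdin and summarize the
# "could not find sufficient instances" warning, if present.
# Assumption: the warning wording matches the sample session above.
motif_coverage() {
    # Pull out the two counts: motifs below threshold, and total motifs
    sed -n 's/.*could not find sufficient instances ([^)]*) for \([0-9][0-9]*\) motifs (out of \([0-9][0-9]*\) total).*/\1 \2/p' |
    awk '{ printf "%d of %d motifs below threshold (%.1f%%)\n", $1, $2, 100 * $1 / $2 }'
}
```

For example, if the session output was saved to a file, running "motif_coverage < saved_output.txt" prints one summary line per warning. A high percentage on the small test data set is expected, since most motifs do not occur often enough in a short reference.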

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. mbin.sh). For example:

#!/bin/bash
set -e
module load mbin
buildcontrols -i --procs=$SLURM_CPUS_PER_TASK --control_pkl_name=control_means.pkl aligned_reads.cmp.h5

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=8 --mem=20g mbin.sh
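
A batch job that fails on a missing input file wastes queue time, so it can help to check the input before calling buildcontrols. The following is a minimal sketch, not part of mBin: the function name run_buildcontrols is hypothetical, and the fallback to one process when SLURM_CPUS_PER_TASK is unset is an assumption intended for testing outside of Slurm.

```shell
# run_buildcontrols INPUT: check the input, then call buildcontrols with the
# same options used in the batch example above.
run_buildcontrols() {
    input=$1
    # Assumption: default to 1 process when not running under Slurm
    procs=${SLURM_CPUS_PER_TASK:-1}
    # -s is true only if the file exists and is not empty
    if [ ! -s "$input" ]; then
        echo "ERROR: input file '$input' is missing or empty" >&2
        return 1
    fi
    buildcontrols -i --procs="$procs" --control_pkl_name=control_means.pkl "$input"
}
```

In a batch script, the function would be defined after "module load mbin" and then called as "run_buildcontrols aligned_reads.cmp.h5".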
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. mbin.swarm). For example:

buildcontrols -i --procs=$SLURM_CPUS_PER_TASK --control_pkl_name=control1_means.pkl aligned_reads1.cmp.h5
buildcontrols -i --procs=$SLURM_CPUS_PER_TASK --control_pkl_name=control2_means.pkl aligned_reads2.cmp.h5
buildcontrols -i --procs=$SLURM_CPUS_PER_TASK --control_pkl_name=control3_means.pkl aligned_reads3.cmp.h5

Submit this job using the swarm command.

swarm -f mbin.swarm -g 20 -t 8 --module mbin
-g #           Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t #           Number of threads/CPUs required for each process (1 line in the swarm command file)
--module mbin  Loads the mbin module for each subjob in the swarm
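
With many input files, the swarmfile can be generated rather than typed by hand. The sketch below assumes input files are named like the example above (aligned_reads1.cmp.h5, aligned_reads2.cmp.h5, ...); the function name write_swarm is hypothetical and not part of mBin or swarm.

```shell
# write_swarm OUTFILE INPUT...: write one buildcontrols command per input file.
# Assumption: inputs are named aligned_reads<N>.cmp.h5, as in the example above.
write_swarm() {
    out=$1
    shift
    : > "$out"                      # create or truncate the swarmfile
    for f in "$@"; do
        base=$(basename "$f")
        n=${base#aligned_reads}     # strip the prefix, e.g. "1.cmp.h5"
        n=${n%.cmp.h5}              # strip the suffix, e.g. "1"
        # The single-quoted format keeps $SLURM_CPUS_PER_TASK literal, so it
        # is expanded inside each subjob rather than when the file is written
        printf 'buildcontrols -i --procs=$SLURM_CPUS_PER_TASK --control_pkl_name=control%s_means.pkl %s\n' "$n" "$f" >> "$out"
    done
}
```

Calling "write_swarm mbin.swarm aligned_reads*.cmp.h5" in a directory holding the three example inputs would reproduce the swarmfile shown above, which can then be submitted with the same swarm command.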