Juicer on Biowulf & Helix

Juicer is a system for analyzing loop-resolution Hi-C experiments. It was developed in the Aiden lab at Baylor College of Medicine/Rice University. [Juicer website]

The environment variable $JUICER is set when you run 'module load juicer'. To run Juicer, you need:

  1. a genome. hg19 is the default, and the fasta file and bwa indexes are available in $JUICER/references/
  2. a restriction enzyme site file. DpnII is the default, and the file is available in $JUICER/restriction_sites/. If you wish to use a different restriction enzyme, generate the site file with a command like:
    module load juicer
    generate_site_positions.py DpnII hg19
    
  3. a chrom.sizes file. This can be downloaded from UCSC (e.g. hg19). The chrom.sizes file for hg19 and mm9 is in $JUICER/references/, but for other genomes you will need to download or build the file yourself. The last column of the restriction site file is the size of the chromosome, so you can also generate a chrom.sizes file via
    awk 'BEGIN{OFS="\t"} {print $1, $NF}' <restriction_site_file>
    
    Note that the chrom.sizes file MUST be tab-separated. A combined setup example is sketched below.
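
For reference, the preparation steps above can be strung together in one short shell session. This is only a sketch: it reuses the DpnII/hg19 example from item 2, and the output file names (hg19_DpnII.txt, hg19.chrom.sizes) are assumptions for illustration; for hg19 with DpnII the ready-made files under $JUICER already cover steps 2 and 3.

    module load juicer
    # generate the restriction site file (written to the current directory)
    generate_site_positions.py DpnII hg19
    # build a tab-separated chrom.sizes file from the last column of the site file
    awk 'BEGIN{OFS="\t"} {print $1, $NF}' hg19_DpnII.txt > hg19.chrom.sizes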

On Helix

Juicer submits jobs to the cluster, and is therefore not suitable for Helix.

On Biowulf

You will need to run juicer on the Biowulf login node. The juicer.sh script is very lightweight, and will simply set up and submit jobs to the cluster.

First create a directory for your Juicer run. A subdirectory called 'fastq' within it should contain the fastq files. For example:

/data/$USER/juicer
+-- fastq
|   +-- reads_R1.fastq
|   +-- reads_R2.fastq
The fastq file names must end in _R1.fastq and _R2.fastq (e.g. reads_R1.fastq and reads_R2.fastq), as this naming pattern is built into the script. If your fastq files have different names, you can rename them or create symlinks, as in the example below.
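
For example, existing fastq files with other names could be linked into place rather than copied (a sketch; the source paths and the name 'mysample' are made up):

    cd /data/$USER/juicer/fastq
    # link the existing files under the required _R1/_R2 naming scheme
    ln -s /data/$USER/rawdata/sample_1.fq mysample_R1.fastq
    ln -s /data/$USER/rawdata/sample_2.fq mysample_R2.fastq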

For your first run we recommend using the test data, which can be copied from /usr/local/apps/juicer/examples/. e.g.

biowulf% cd /data/$USER/juicer
biowulf% cp -r /usr/local/apps/juicer/examples/fastq .

Juicer will create subdirectories aligned, HIC_tmp, debug, splits. The HIC_tmp subdirectory will get deleted at the end of the run.

By default, running juicer.sh with no options will use the hg19 reference file, and the DpnII restriction site file. You need to explicitly specify the chrom.sizes file as in the example below.
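
The sample trace below uses these defaults. For a different genome and enzyme, the flags described in the help text at the bottom of this page can be combined along the following lines. This is only a sketch: the mm10 reference, site file, and chrom.sizes paths are assumptions, and the BWA index files must sit in the same directory as the fasta file.

    module load juicer
    juicer.sh -s MboI \
              -z /data/$USER/refs/mm10.fa \
              -y /data/$USER/refs/mm10_MboI.txt \
              -p /data/$USER/refs/mm10.chrom.sizes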

Sample screen trace:

[susanc@biowulf juicer]$ module load juicer

[susanc@biowulf juicer]$ juicer.sh -p $JUICER/references/hg19.chrom.sizes
Running Juicer version 1.5
(-: Looking for fastq files...fastq files exist
(-: Aligning files matching /data/susanc/juicer/fastq/*_R*.fastq*
 in queue norm to genome hg19 with site file /usr/local/apps/juicer/juicer-1.5/SLURM/restriction_sites/hg19_DpnII.txt
(-: Created /data/susanc/juicer/splits and /data/susanc/juicer/aligned.
 Splitting files
Submitted split job 23782370
Submitted split job 23782371
srun: job 23782372 queued and waiting for resources
srun: job 23782372 has been allocated resources
(-: Starting job to launch other jobs once splitting is complete
Submitting count_ligations job
23782878
Submitting BWA alignment job
Submitted align1 job 23782879
Submitted align2 job 23782880
Submitted merge job 23782881
Submitting count_ligations job
23782882
Submitting BWA alignment job
Submitted align1 job 23782883
Submitted align2 job 23782884
Submitted merge job 23782885
Submmitted dedup job 23782888
Submitted post-dedup job 23782889
Submitted stats0 job 23782890
Submitted stats30 job 23782891
Submitted abnormals job 23782892
(-: Finished adding all jobs... Now is a good time to get that cup of coffee..

After the juicer.sh script exits, you should see a set of jobs in running and queued state, with dependencies:

[susanc@biowulf juicer]$ sjobs
User    JobId JobName Part St  Reason      Runtime  Walltime     Nodes  CPUs     Memory    Dependency          
===============================================================================================================
susanc  2348  a147_  norm  R   ---             2:41  24:00:00      1    24  12GB/node    cn0237
susanc  2349  a147_  norm  R   ---             2:41  24:00:00      1    24  12GB/node    cn0310
susanc  2350  a147_  norm  PD  Dependency      0:00  24:00:00      1     8  20GB/node  afterok:2348,afterok:2349
susanc  2353  a147_  norm  R   ---             2:41  24:00:00      1    24  12GB/node     cn0256
susanc  2354  a147_  norm  PD  Dependency      0:00  24:00:00      1     8  20GB/node  afterok:2353
susanc  2355  a147_  norm  PD  Dependency      0:00  24:00:00      1     1  16GB/node  afterok:2350,afterok:2354
susanc  2356  a147_  norm  PD  JobHeldUser     0:00     10:00      1     1    2GB/cpu  afterok:2357
susanc  2357  a147_  norm  PD  Dependency      0:00  24:00:00      1     1   2GB/node  afterok:2355
susanc  2358  a147_  norm  PD  Dependency      0:00   1:40:00      1     1    2GB/cpu  afterok:2356
susanc  2359  a147_  norm  PD  Dependency      0:00  24:00:00      1     1   6GB/node  afterok:2358
susanc  2360  a147_  norm  PD  Dependency      0:00  24:00:00      1     1   6GB/node  afterok:2358
susanc  2361  a147_  norm  PD  Dependency      0:00  24:00:00      1     1   6GB/node  afterok:2358
susanc  2362  a147_  norm  PD  Dependency      0:00  24:00:00      1     1  32GB/node  afterok:2359,afterok:2360
susanc  2363  a147_  norm  PD  Dependency      0:00  24:00:00      1     1  32GB/node  afterok:2359,afterok:2360
susanc  2364  a147_  gpu   PD  Dependency      0:00  24:00:00      1     1   2GB/node  afterok:2362,afterok:2363
susanc  2365  a147_  norm  PD  Dependency      0:00  24:00:00      1     1   2GB/node  afterok:2362,afterok:2363
susanc  2374  a147_  norm  PD  Dependency      0:00  20:00:00      1     1   2GB/node  afterok:2364,afterok:2365
===============================================================================================================
You can follow the progress of the run by watching the jobs with 'sjobs', and by examining the log files in the debug subdirectory.
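
For example, to keep an eye on a run you might list and scan those log files (a sketch; the exact file names vary from run to run):

    cd /data/$USER/juicer
    ls -lrt debug/           # most recently updated log files appear last
    grep -li error debug/*   # log files that mention errors, if any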

Modified juicer script

We also provide a modified juicer script (juicer_nih.sh) that allows easier modification of the job submission parameters. Basic usage is to generate a template configuration file, adapt some settings (mostly the time limits) and then run the script with the configuration file.

Here is the same example as above but with the modified script.

biowulf$ module load juicer
biowulf$ # generate a config file template
biowulf$ juicer_nih.sh -t > juicer.conf
biowulf$ cat juicer.conf

# Cluster configuration settings; use this template to
# generate a config file that determines resources for
# various cluster jobs in the pipeline

# default queue for job steps not listed below
queue="norm"
SB_SPLIT="--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm"
SB_COUNT_LIGATION="--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm"
# do not include memory in align - set by the pipeline automatically
SB_ALIGN="--time=08:00:00 --cpus-per-task=24 --partition=norm"
# merge should be at least 20g
SB_MERGE="--time=08:00:00 --cpus-per-task=8 --mem=20g --partition=norm"
# fragmerge takes a lot of memory
SB_FRAGMERGE="--time=16:00:00 --cpus-per-task=8 --mem=247g --partition=norm"
SB_DEDUP="--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm"
SB_POST_DEDUP="--time=01:40:00 --cpus-per-task=2 --mem=4g --partition=norm"
SB_STATS="--time=04:00:00 --cpus-per-task=2 --mem=20g --partition=norm"
SB_ABNORMALS="--time=04:00:00 --cpus-per-task=2 --mem=6g --partition=norm"
SB_HIC="--time=04:00:00 --cpus-per-task=2 --mem=50g --partition=norm"
# this has to run on a gpu partition!
SB_HICCUPS_WRAP="--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=gpu --gres=gpu:k20x:1"
SB_ARROWHEAD_WRAP="--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm"
SB_PREP_DONE="--time-20:00:00 --cpus-per-task=2 --mem=4g --partition=norm"

biowulf$ # edit the config file - change --time to 16h for SB_ALIGN,
biowulf$ # the alignment step
biowulf$ cat juicer.conf
[...snip...]
SB_ALIGN="--time=16:00:00 --cpus-per-task=24 --partition=norm"
[...snip...]
biowulf$ # Run the pipeline
biowulf$ juicer_nih.sh -c juicer.conf -p $JUICER/references/hg19.chrom.sizes
-- CLUSTER SETTINGS ----------------------------------------------
queue=norm
SB_SPLIT=--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm
SB_COUNT_LIGATION=--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm
SB_ALIGN=--time=16:00:00 --cpus-per-task=24 --partition=norm
SB_MERGE=--time=08:00:00 --cpus-per-task=8 --mem=20g --partition=norm
SB_FRAGMERGE=--time=16:00:00 --cpus-per-task=8 --mem=247g --partition=norm
SB_DEDUP=--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm
SB_POST_DEDUP=--time=01:40:00 --cpus-per-task=2 --mem=4g --partition=norm
SB_STATS=--time=04:00:00 --cpus-per-task=2 --mem=20g --partition=norm
SB_ABNORMALS=--time=04:00:00 --cpus-per-task=2 --mem=6g --partition=norm
SB_HIC=--time=04:00:00 --cpus-per-task=2 --mem=50g --partition=norm
SB_HICCUPS_WRAP=--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=gpu --gres=gpu:k20x:1
SB_ARROWHEAD_WRAP=--time=04:00:00 --cpus-per-task=2 --mem=4g --partition=norm
SB_PREP_DONE=--time=20:00:00 --cpus-per-task=2 --mem=4g --partition=norm
------------------------------------------------------------------
Running Juicer version 1.5
(-: Looking for fastq files...fastq files exist
(-: Aligning files matching /data/wresch/test_data/juicer/fastq/*_R*.fastq*
 to genome hg19 with site file /usr/local/apps/juicer/juicer-1.5/SLURM/restriction_sites/hg19_DpnII.txt
(-: Created /data/wresch/test_data/juicer/splits and /data/wresch/test_data/juicer/aligned.
(-: Starting job to launch other jobs once splitting is complete
[...snip...]
(-: Finished adding all jobs... Now is a good time to get that cup of coffee..

biowulf$
GPU Batch job for Hiccups etc.

To run a batch job for the Juicebox tools, e.g. hiccups, arrowhead, dump, pre, or apa, set up a batch script along the following lines:

#!/bin/bash

cd /data/$USER/myfiles
module load juicer

${JUICER}/scripts/juicebox  hiccups -m 500 -r 5000 -k KR -f 0.1 -p 4 -i 10 -t 0.01,1.5,1.75,2 ./input.hic output

# use ${JUICER}/scripts/juicebox48g if more memory is required

Submit this job to a GPU node with:

sbatch  -p gpu --mem=25g  --constraint=gpuk20x --gres=gpu:k20x:1   myscript

This command will submit the job to a single GPU, and allocate 25 GB of memory. You can check whether the GPU is being utilized with rsh nodename nvidia-smi. 'jobload' or 'sjobs' will give you the nodename of your allocated node.

[susanc@biowulf ~]$ rsh cn1511 nvidia-smi
Thu Apr 28 14:58:16 2016
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         Off  | 0000:08:00.0     Off |                    0 |
| N/A   27C    P8    31W / 235W |     13MiB /  5759MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         Off  | 0000:27:00.0     Off |                    0 |
| N/A   43C    P0    86W / 235W |     99MiB /  5759MiB |     66%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1      3634    C   java                                            83MiB |
+-----------------------------------------------------------------------------+
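
If hiccups runs out of Java heap space, the comment in the batch script above points to ${JUICER}/scripts/juicebox48g. A submission for that variant might look like the following sketch; 'myscript48g' is a hypothetical copy of the script that calls juicebox48g, and the 50 GB request is an assumption chosen to leave headroom above the larger Java heap.

    # sketch only: myscript48g is a copy of the batch script above using juicebox48g
    sbatch -p gpu --mem=50g --constraint=gpuk20x --gres=gpu:k20x:1 myscript48g
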
Swarm of jobs

Juicer is not really suitable for swarm jobs, as there is an interactive component to the juicer.sh script.

Interactive job

Juicer is not suitable for an interactive batch job, as the 'srun' command in the juicer.sh script will not work correctly within an interactive batch session.

However, the Juicebox tools, e.g. hiccups, arrowhead, can be run interactively.

[susanc@biowulf ~]$ sinteractive --mem=25g  --constraint=gpuk20x --gres=gpu:k20x:1
salloc.exe: Pending job allocation 17584320
salloc.exe: job 17584320 queued and waiting for resources
salloc.exe: job 17584320 has been allocated resources
salloc.exe: Granted job allocation 17584320
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1511 are ready for job

[susanc@cn1511 ~]$ module load juicer

[susanc@cn1511 ~]$ ${JUICER}/scripts/juicebox
Juicebox Command Line Tools Usage:
       juicebox dump       
       juicebox pre    
       juicebox apa   
       juicebox arrowhead  
       juicebox hiccups   
Type juicebox  for further usage instructions

[susanc@cn1511 ~]$ ${JUICER}/scripts/juicebox  hiccups -m 500 -r 5000 -k KR -f 0.1 -p 4 -i 10 -t 0.01,1.5,1.75,2 ./input.hic output
Reading file: ./input.hic
HiC file version: 8
Running HiCCUPS for resolution 5000
2%
4%
[...]

[susanc@cn1511 ~]$ exit
srun: error: cn1511: task 0: Exited with exit code 1
salloc.exe: Relinquishing job allocation 17584320
salloc.exe: Job allocation 17584320 has been revoked.
[susanc@biowulf ~]

During the run, you can check the GPU usage by finding the node name (from 'jobload' or 'sjobs') and then running 'rsh nodename nvidia-smi', as shown in the GPU batch job section above. For a juicebox job, you should see a 'java' process running on the GPU.
Documentation

Juicer website

Typing 'juicer.sh -h' will print the help menu.

 juicer.sh -h
Running Juicer version 1.5
Usage: juicer.sh [-g genomeID] [-d topDir] [-q queue] [-l long queue] [-s site]
                 [-a about] [-R end] [-S stage] [-p chrom.sizes path]
                 [-y restriction site file] [-z reference genome file]
                 [-C chunk size] [-D Juicer scripts directory]
                 [-Q queue time limit] [-L long queue time limit] [-r] [-h] [-x]
* [genomeID] must be defined in the script, e.g. "hg19" or "mm10" (default
  "hg19"); alternatively, it can be defined using the -z command
* [topDir] is the top level directory (default
  "/data/susanc/juicer/debug")
     [topDir]/fastq must contain the fastq files
     [topDir]/splits will be created to contain the temporary split files
     [topDir]/aligned will be created for the final alignment
* [queue] is the queue for running alignments (default "norm")
* [long queue] is the queue for running longer jobs such as the hic file
  creation (default "norm")
* [site] must be defined in the script, e.g.  "HindIII" or "MboI"
  (default "DpnII")
* [about]: enter description of experiment, enclosed in single quotes
* -r: use the short read version of the aligner, bwa aln
  (default: long read, bwa mem)
* [end]: use the short read aligner on read end, must be one of 1 or 2
* [stage]: must be one of "merge", "dedup", "final", "postproc", or "early".
    -Use "merge" when alignment has finished but the merged_sort file has not
     yet been created.
    -Use "dedup" when the files have been merged into merged_sort but
     merged_nodups has not yet been created.
    -Use "final" when the reads have been deduped into merged_nodups but the
     final stats and hic files have not yet been created.
    -Use "postproc" when the hic files have been created and only
     postprocessing feature annotation remains to be completed.
    -Use "early" for an early exit, before the final creation of the stats and
     hic files
* [chrom.sizes path]: enter path for chrom.sizes file
* [restriction site file]: enter path for restriction site file (locations of
  restriction sites in genome; can be generated with the script
  misc/generate_site_positions.py)
* [reference genome file]: enter path for reference sequence file, BWA index
  files must be in same directory
* [chunk size]: number of lines in split files, must be multiple of 4
  (default 90000000, which equals 22.5 million reads)
* [Juicer scripts directory]: set the Juicer directory,
  which should have scripts/ references/ and restriction_sites/ underneath it
  (default /usr/local/apps/juicer/juicer-1.5/SLURM)
* [queue time limit]: time limit for queue, i.e. -W 12:00 is 12 hours
  (default 1200)
* [long queue time limit]: time limit for long queue, i.e. -W 168:00 is one week
  (default 3600)
* -x: exclude fragment-delimited maps from hic file creation
* -h: print this help and exit
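
As an example of the -S stage option, a run that failed partway could be resumed from the deduplication step with something along these lines (a sketch; it assumes the earlier stages completed successfully in the same top-level directory):

    cd /data/$USER/juicer
    module load juicer
    # resume at the dedup stage, reusing the files already in splits/ and aligned/
    juicer.sh -S dedup -p $JUICER/references/hg19.chrom.sizes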