Biowulf High Performance Computing at the NIH
cromwell on Biowulf

A Workflow Management System geared towards scientific workflows.

Documentation

Cromwell is a complex tool for running workflows described in the Workflow Description Language (WDL). On Biowulf, cromwell is used to run either local workflows (all tasks on one node) or distributed workflows (each task a separate slurm job). Server mode is not supported. Comprehensive documentation of cromwell or WDL is beyond the scope of this brief document. See the following links for detailed documentation:

Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive -c6 --mem=20g --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load cromwell
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ cp -r ${CROMWELL_TEST_DATA:-none}/* .
[user@cn3144]$ tree -s
.
|-- [user   4.0K]  data
|   |-- [user    15M]  sample1.fq.gz
|   |-- [user    15M]  sample2.fq.gz
|   |-- [user    15M]  sample3.fq.gz
|   |-- [user    15M]  sample4.fq.gz
|   |-- [user    15M]  sample5.fq.gz
|   `-- [user    15M]  sample6.fq.gz
|-- [user    189]  input.json
`-- [user   1.1K]  wf.wdl

First, let's look at the workflow:

[user@cn3144]$ cat wf.wdl

task lengths {
  File fq_file
  String prefix = sub(fq_file, ".fq.gz", "")
  command <<<
    zcat ${fq_file} \
        | awk 'NR % 4 == 2 {print length($1)}' \
        | sort \
        | uniq -c \
        | sort -k2,2n \
        | awk '{print $2"\t"$1}' \
        > ${prefix}.len
    gnuplot -e "set term dumb; set style data lines; plot '${prefix}.len'" > ${prefix}.len.plot
  >>>
  output {
    File dat  = "${prefix}.len"
    File plot = "${prefix}.len.plot"
  }
  runtime { rt_mem: 2000 rt_time: 10 }
}

task reads {
  File fq_file
  String prefix = sub(fq_file, ".fq.gz", "")
  command <<<
    zcat ${fq_file} \
        | awk 'NR % 4 == 2' \
        | sort \
        | uniq -c \
        | sort -k1,1nr \
        > ${prefix}.fq.nr
  >>>
  output {
    File dat  = "${prefix}.fq.nr"
  }
  runtime { rt_mem: 2000 rt_time: 10 }
}

workflow wf {
  Array[File] fq_files
  scatter(s in fq_files) {
    call lengths { input: fq_file=s }
    call reads {input: fq_file=s }
  }
}
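The core of the lengths task is a plain shell pipeline that can be tried outside of cromwell. A minimal sketch on a made-up FASTQ file (mini.fq.gz below is illustrative, not part of the test data):

```shell
# Illustrative only: mini.fq.gz is a made-up FASTQ with two 4 bp reads
# and one 8 bp read; the pipeline is the same one the 'lengths' task runs.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n@r3\nACGT\n+\nIIII\n' \
    | gzip -c > mini.fq.gz
zcat mini.fq.gz \
    | awk 'NR % 4 == 2 {print length($1)}' \
    | sort \
    | uniq -c \
    | sort -k2,2n \
    | awk '{print $2"\t"$1}'
```

This prints one line per read length with its count (here: length 4 twice, length 8 once); the gnuplot call in the task merely plots this two-column table.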

The workflow contains two tasks: lengths summarizes the lengths of the trimmed reads in each input file and creates an ASCII graph; reads tabulates all non-redundant reads. These are steps you might perform in a miRNA analysis. Let's validate the WDL file with wdltool:

[user@cn3144]$ java -jar ${WDLTOOL_JAR} validate wf.wdl

[user@cn3144]$ echo $?
0

So wdltool returned a 0 and did not report any errors - the workflow is valid. Next we can combine the workflow with a JSON file specifying the input files. To do that, we need to modify input.json to reflect the current path:

[user@cn3144]$ sed -i "s:XXX:$PWD:" input.json
[user@cn3144]$ cat input.json
{
  "wf.fq_files": [
      "/path/to/here/data/sample1.fq.gz",
      "/path/to/here/data/sample2.fq.gz",
      "/path/to/here/data/sample3.fq.gz",
      "/path/to/here/data/sample4.fq.gz",
      "/path/to/here/data/sample5.fq.gz",
      "/path/to/here/data/sample6.fq.gz"
   ]

}
[user@cn3144]$ module load gnuplot
[user@cn3144]$ java -jar ${CROMWELL_JAR} run -i input.json wf.wdl
[2017-09-08 16:04:32,62] [info] Slf4jLogger started
[2017-09-08 16:04:32,71] [info] RUN sub-command
[2017-09-08 16:04:32,71] [info]   WDL file: /path/to/here/wf.wdl
[2017-09-08 16:04:32,71] [info]   Inputs: /path/to/here/input.json
[...snip...]

[user@cn3144]$ cat data/sample1.len.plot
    400000 ++---------+----------+----------+---------+----------+---------++
           +'/path/to/here/data/sample1.len' ******                         +
    350000 ++             *                                                ++
           |              **                                                |
           |             * *                                                |
    300000 ++            * *                                               ++
           |             * *                                                |
    250000 ++            *  *                                              ++
           |             *  *                                               |
           |             *  *                                               |
    200000 ++           *   *                                              ++
           |            *    *                                              |
    150000 ++           *    *                                             ++
           |            *     *                                             |
           |           *      *                                             |
    100000 ++          *      *                                            ++
           |           *       *                                            |
     50000 ++         *        *                                           ++
           |         **         *                                           |
           +     **** +          **         +         +          +          +
         0 ******-----+----------+-*********************************-------++
           15         20         25         30        35         40         45
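In addition to the output files shown above, cromwell keeps per-call execution directories (submit scripts, stdout/stderr, return codes) under a cromwell-executions directory in the working directory, laid out as cromwell-executions/&lt;workflow&gt;/&lt;run-id&gt;/call-&lt;task&gt;/. One way to list them after a run:

```shell
# List the per-call execution directories from previous cromwell runs;
# 'find .' succeeds (printing nothing) even before any run has happened.
find . -type d -path '*/cromwell-executions/*' -name 'call-*' 2>/dev/null
```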

Similarly, the same workflow can be run by submitting each task to slurm as a separate job. This requires a configuration file. A working example can be found in ${CROMWELL_CONFIG}, and for this simple workflow that file will suffice. For a more complicated workflow you may have to copy it and modify it according to your needs.

[user@cn3144]$ cat ${CROMWELL_CONFIG}

# include the application.conf at the top
include required(classpath("application"))


backend {
  default = "Slurm"
  providers {
    Slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        concurrent-job-limit = 10
        runtime-attributes = """
        Int rt_time = 600
        Int rt_cpus = 2
        Int rt_mem = 4000
        String rt_queue = "norm"
        """

        submit = """
            sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} -t ${rt_time} -p ${rt_queue} \
            ${"-c " + rt_cpus} --mem=${rt_mem} \
            ${script}
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "(\\d+)"
      }
    }
  }
}
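An aside on job-id-regex: slurm's sbatch reports the id of a newly submitted job on stdout, and cromwell applies the regex to that output to recover the id passed to the kill and check-alive commands. The extraction can be mimicked with grep:

```shell
# sbatch prints a line of this form; the config's "(\d+)" captures the
# numeric job id that cromwell later hands to scancel and squeue:
echo "Submitted batch job 46116226" | grep -oE '[0-9]+'
```

which prints 46116226.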
[user@cn3144]$ java -Dconfig.file=${CROMWELL_CONFIG} \
                       -jar ${CROMWELL_JAR} run -i input.json wf.wdl
[2017-09-08 16:08:25,56] [info] Slf4jLogger started
[2017-09-08 16:08:25,64] [info] RUN sub-command
[2017-09-08 16:08:25,64] [info]   WDL file: /spin1/users/wresch/test_data/cromwell/fnord/wf.wdl
[2017-09-08 16:08:25,64] [info]   Inputs: /spin1/users/wresch/test_data/cromwell/fnord/input.json
[...snip...]

End the sinteractive session

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. cromwell.sh). For example:

#!/bin/bash
module load cromwell/31
java -jar ${CROMWELL_JAR} run -i input.json wf.wdl

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] cromwell.sh
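To have the batch job distribute its tasks to slurm as in the interactive example, point cromwell at the backend configuration inside the script. A minimal sketch (the filename cromwell_slurm.sh is arbitrary; CROMWELL_CONFIG and CROMWELL_JAR are set by the module):

```shell
# Create a batch script that runs cromwell with the slurm backend config,
# then syntax-check it; submit with: sbatch cromwell_slurm.sh
cat > cromwell_slurm.sh <<'EOF'
#!/bin/bash
module load cromwell/31
java -Dconfig.file=${CROMWELL_CONFIG} \
     -jar ${CROMWELL_JAR} run -i input.json wf.wdl
EOF
bash -n cromwell_slurm.sh
```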