Biowulf High Performance Computing at the NIH
Snakemake on Biowulf

Snakemake aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern domain-specific language (DSL) in Python style.

References:

Documentation
Important Notes

Example pipeline

For a general introduction on how to use Snakemake, please read through the official Documentation and Tutorial or any of the other materials mentioned above.

In addition, you can have a look at a set of exercises on our GitHub site.
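
The commands below assume a ChIP-seq-style pipeline with rules such as fastqc, align, and find_narrow_peaks; the full Snakefile is not reproduced here. As a minimal, hypothetical sketch of the layout those commands rely on (the sample names, file names, and some_aligner command are placeholders), a Snakefile would declare cheap rules as localrules and give cluster-bound rules a threads value:

# Hypothetical sketch only -- SAMPLES, file names, and some_aligner are placeholders
SAMPLES = ["sample1", "sample2"]

# cheap rules run inside the main snakemake process instead of as cluster jobs
localrules: all, flagstat_bam, index_bam

rule all:
    input:
        expand("{s}.flagstat", s=SAMPLES),
        expand("{s}.bam.bai", s=SAMPLES)

rule align:
    input: "{s}.clean.fastq.gz"
    output: "{s}.bam"
    threads: 8    # picked up as {threads} in the sbatch templates shown below
    shell: "some_aligner --threads {threads} -o {output} {input}"

rule flagstat_bam:
    input: "{s}.bam"
    output: "{s}.flagstat"
    shell: "samtools flagstat {input} > {output}"

rule index_bam:
    input: "{s}.bam"
    output: "{s}.bam.bai"
    shell: "samtools index {input}"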

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and use it as described below. If all tasks are submitted as cluster jobs, this session may only require 2 CPUs. If some or all of the rules are run locally (i.e. within the interactive session), please adjust the resource requirements accordingly. In the example above there are at least 3 rules that will be run locally, so we request 8 CPUs.

[user@biowulf]$ sinteractive --mem=6g --cpus-per-task=8
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load snakemake
[user@cn3144 ~]$ # for the local rules
[user@cn3144 ~]$ module load samtools/1.3.1
[user@cn3144 ~]$ cd /path/to/snakefile
[user@cn3144 ~]$ ls -lh
-rw-rw-r-- 1 user group  606 Jan 30 12:56 cluster.json
-rw-rw-r-- 1 user group 4.0K May 12  2015 config.yaml
-rw-rw-r-- 1 user group 4.8K Jan 30 15:06 Snakefile

Run the pipeline locally (i.e. each task is run as part of this interactive session). Note that snakemake by default assumes that your pipeline is in a file called 'Snakefile'. If that is not the case, provide the filename with '-s SNAKEFILENAME'. Running everything locally would take a long time:

[user@cn3144 ~]$ snakemake -pr --keep-going -j $SLURM_CPUS_PER_TASK all
Provided cores: 8
Job counts:
        count   jobs
        14      align
        1       all
        14      clean_fastq
        14      fastqc
        6       find_broad_peaks
        8       find_narrow_peaks
        14      flagstat_bam
        14      index_bam
        85
[...snip...]

To submit subjobs that are not marked as localrules to the cluster, it is necessary to provide an sbatch template string that uses variables from the Snakefile (such as {threads}) or from the cluster configuration file (a hypothetical cluster.json is sketched after the transcript below):

[user@cn3144 ~]$ sbcmd="sbatch --cpus-per-task={threads} --mem={cluster.mem}"
[user@cn3144 ~]$ sbcmd+=" --time={cluster.time} --partition={cluster.partition}"
[user@cn3144 ~]$ sbcmd+=" --out={cluster.out} {cluster.extra}"
[user@cn3144 ~]$ snakemake -pr --keep-going --local-cores $SLURM_CPUS_PER_TASK \
             --jobs 10 --cluster-config cluster.json --cluster "$sbcmd" \
             --latency-wait 120 all
Provided cluster nodes: 10
Job counts:
        count   jobs
        14      align
        1       all
        14      clean_fastq
        14      fastqc
        6       find_broad_peaks
        8       find_narrow_peaks
        14      flagstat_bam
        14      index_bam
        85
[...snip...]
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Note that --latency-wait 120 is required for pipelines that submit cluster jobs because output files generated on other nodes may not become visible to the parent snakemake job until after some delay.
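
The {cluster.mem}, {cluster.time}, {cluster.partition}, {cluster.out}, and {cluster.extra} placeholders in the sbatch template are filled in from cluster.json. The actual file used for this example is not reproduced here; a hypothetical cluster.json with the same keys might look like this (all names and sizes are illustrative, and the logs directory has to exist before jobs are submitted):

{
    "__default__": {
        "mem": "4g",
        "time": "02:00:00",
        "partition": "norm",
        "out": "logs/slurm-%j.out",
        "extra": ""
    },
    "align": {
        "mem": "16g",
        "time": "08:00:00"
    }
}

Entries are looked up by rule name, with __default__ providing values for any rule that has no entry of its own.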

Batch job
Most jobs should be run as batch jobs.

In this usage mode the main snakemake job is itself submitted as a batch job. It is still possible to either run all the rules locally as part of that single job or have the main job submit each (non-local) rule as a separate batch job, as described above. The example below uses the latter pattern; a sketch of the former follows it.

Create a batch input file (e.g. snakemake.sh) similar to the following:

#! /bin/bash
# this file is snakemake.sh
module load snakemake samtools || exit 1

sbcmd="sbatch --cpus-per-task={threads} --mem={cluster.mem}"
sbcmd+=" --time={cluster.time} --partition={cluster.partition}"
sbcmd+=" --out={cluster.out} {cluster.extra}"

snakemake -pr --keep-going --local-cores $SLURM_CPUS_PER_TASK \
    --jobs 10 --cluster-config cluster.json --cluster "$sbcmd" \
    --latency-wait 120 all

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=2 --mem=8g snakemake.sh
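
For the former pattern, in which every rule runs inside the single batch allocation, a hypothetical script (here called snakemake_local.sh, with illustrative resource sizes) could look like this:

#! /bin/bash
# this file is snakemake_local.sh (hypothetical): run all rules inside this job
module load snakemake samtools || exit 1

snakemake -pr --keep-going -j $SLURM_CPUS_PER_TASK all

It would be submitted with resources sized for the whole pipeline rather than just the main snakemake process, for example sbatch --cpus-per-task=16 --mem=32g snakemake_local.sh.
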
Using remote files on our object store

Snakemake can make use of remote files stored on a number of different storage services (AWS S3, Google Cloud Storage, SFTP, HTTP(S), ...). The S3 remote provider can be adapted to use data stored on the HPC object store. If you haven't done so yet, please set up a ~/.boto or ~/.aws/credentials file as described in our object storage user guide.

By default snakemake will fetch the remote file into the current working directory (the directory where snakemake was started) when a rule needs it and remove it when no other rules depend on it. Since this working directory will usually be on /data, it is more efficient to specify stay_on_remote=True and either fetch the file to lscratch, pipe it into downstream tools, or use tools that can access the object store directly. The latter options come at the price of more complexity.

In the following example replace VAULT with your vault name.

# disable warnings about the object store's certificate
import warnings
warnings.filterwarnings("ignore", "Unverified HTTPS request is being made")

# pick one of the accessors os{1,2}naccess{1,2,3}
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(endpoint_url="https://os1naccess1", verify=False)

rule remote_in_cwd:
    """
    snakemake will automatically fetch the object from the object store
    as needed and remove it when done. The local copy will be in the current
    working directory
    """
    input: S3.remote("VAULT/some_project/aln/sample1.bam")
    #                       ^--------------------------- object name starts here
    output: "sample1.flagstat"
    shell:
        """
        module load samtools
        samtools flagstat {input} > {output}
        """

rule stream_from_remote:
    """
    Don't have snakemake store a local copy of the object. Use a tool
    that can directly access data on the object store. Our version of
    samtools can do that, for example.
    """
    input: S3.remote("VAULT/some_project/aln/sample2.bam", stay_on_remote=True)
    output: "sample2.flagstat"
    shell:
        """
        module load samtools
        objname=$(echo {input:q} | sed -e 's|^s3://||')
        samtools flagstat s3+http://obj@$objname > {output}
        """

rule copy_to_lscratch:
    """
    Don't have snakemake store a local copy of the object. Copy it to
    lscratch instead. Again, some name mangling is required
    """
    input: S3.remote("VAULT/some_project/aln/sample3.bam", stay_on_remote=True)
    output: "sample3.flagstat"
    shell:
        """
        module load samtools
        bam=$(echo {input:q} | sed -e 's|^s3://VAULT/||')
        obj get -D /lscratch/$SLURM_JOB_ID $bam
        samtools flagstat  /lscratch/$SLURM_JOB_ID/$bam > {output}
        """