nextflow on Biowulf

Nextflow is a domain-specific language modelled after UNIX pipes. It simplifies writing parallel and scalable pipelines. The version installed on our systems can run jobs locally (on the same machine) or submit them to Slurm.

The code that is executed at each pipeline stage can be written in a number of different languages (shell, Python, R, ...).

Intermediate results for workflows are stored in the $PWD/work directory, which allows resuming execution of pipelines.

The language used to write pipeline scripts is an extension of Groovy.

Nextflow is a complex workflow management tool. Please read the manual carefully and make sure to place appropriate limits on your pipeline to avoid submitting too many jobs or running too many local processes.
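
As a minimal sketch of such limits, the snippet below writes a small extra configuration file that caps how many tasks each executor handles at once. The file name limits.config and the numeric values are assumptions chosen for illustration, not site recommendations; queueSize is the same executor setting that appears in the Biowulf nextflow.config shown further down this page.

```shell
# Hypothetical file name; the values are illustrative only.
# The quoted 'EOF' keeps the shell from expanding $local and $slurm.
cat > limits.config <<'EOF'
executor {
    $local {
        // handle at most 4 local tasks at a time
        queueSize = 4
    }
    $slurm {
        // keep at most 50 jobs queued or running in Slurm
        queueSize = 50
    }
}
EOF

# Pass the extra configuration file when starting the workflow:
# nextflow run hello.nf -c limits.config
```

Settings given with -c are merged with any nextflow.config found in the launch directory, so a small file like this can tighten limits without editing the main configuration.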

When running many tasks, Nextflow appears to create many temporary files in the ./work directory. Please make sure that your pipeline does not inadvertently create millions of small files, which would degrade file system performance.
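
One simple way to keep an eye on this is to count the files below the work directory from time to time. The helper below is only a sketch: the function name count_work_files is made up for this example, and it uses nothing beyond find and wc.

```shell
# Count regular files below a directory (defaults to ./work).
count_work_files() {
    local dir="${1:-work}"
    find "$dir" -type f | wc -l
}

# Report the count only if a work directory exists in the current directory.
if [ -d work ]; then
    echo "files in work: $(count_work_files work)"
fi
```

If the count grows into the hundreds of thousands, it is probably worth checking which process writes the files before scaling the pipeline up further.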

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

First, let's do some basic local execution. For this we will allocate an interactive session:

[user@biowulf]$ sinteractive --mem=10g -c2 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ module load nextflow

For the traditional hello world example we will parallelize the uppercasing of greetings in different languages:

# create file of greetings
[user@cn3144]$ cat > greetings.txt <<EOF
Hello world!
Hallo world!
Ciao world!
Salut world!
Bongiorno world!
Servus world!
Gruess Gott world!
Na was los world!
Gruetzi world!
Hello world!
Come va world!
Ca va world!
Hi world!
Good bye world!
EOF

We then create a file called hello.nf that describes the workflow to be executed:

// vim: set ft=groovy:

params.file = file('greetings.txt').toAbsolutePath()

process splitLetters {
    output:
    file 'chunk_*' into letters mode flatten

    """
    pwd
    split -l1 '${params.file}' chunk_
    """
}

process convertToUpper {
    input:
    file x from letters

    output:
    stdout result
    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}

result.subscribe {
    println it.trim()
}

The workflow is executed with:

[user@cn3144]$ nextflow run hello.nf
N E X T F L O W  ~  version 20.10.0
Launching `hello.nf` [prickly_knuth] - revision: f195027c60
executor >  local (5)
[ee/06a621] process > splitLetters       [100%] 1 of 1 ✔
[d9/5e328a] process > convertToUpper (3) [  0%] 0 of 11
SALUT WORLD!
BONGIORNO WORLD!
CIAO WORLD!
HELLO WORLD!
HALLO WORLD!
SERVUS WORLD!
GRUETZI WORLD!
COME VA WORLD!
CA VA WORLD!
executor >  local (15)
[ee/06a621] process > splitLetters       [100%] 1 of 1 ✔
executor >  local (15)
[ee/06a621] process > splitLetters        [100%] 1 of 1 ✔
[b4/2b9395] process > convertToUpper (13) [100%] 14 of 14 ✔

Note that the results are printed out of order because the convertToUpper tasks run in parallel.

The same workflow can run each of the processes as a Slurm job by creating a nextflow.config file. We provide a file with correct settings for Biowulf at /usr/local/apps/nextflow/nextflow.config. If you use this file, please don't change the settings for job submission and querying (pollInterval, queueStatInterval, and submitRateLimit). Other settings can be adapted to your workflow; in particular, you might want to remove the lscratch allocation for all jobs if it does not apply.

[user@cn3144]$ cp /usr/local/apps/nextflow/nextflow.config .
[user@cn3144]$ cat nextflow.config
params {
  config_profile_description = 'Biowulf nf-core config'
  config_profile_contact = 'staff@hpc.nih.gov'
  config_profile_url = 'https://hpc.nih.gov/apps/nextflow.html'
  max_memory = '224 GB'
  max_cpus = 32
  max_time = '72 h'

  igenomes_base = '/fdb/igenomes/'
}

container_mounts =  '-B/gs10,/gs11,/gs12,/gs6,/gs8,/gs9,/vf,/spin1,/data,/fdb,/gpfs,/lscratch'

// use a local executor for short jobs. For this the
// settings below may have to be adapted to the allocation for
// the main nextflow job
executor {
    $local {
        queueSize = 100
        memory = '12 G'
        cpus = '6'
    }
    $slurm {
        queue = 'norm'
        queueSize = 200
        pollInterval = '1 min'
        queueStatInterval = '5 min'
        submitRateLimit = '6/1min'
        retry.maxAttempts = 1
    }
}


profiles {
    biowulf {
        process {
            executor = 'slurm'
            errorStrategy = 'finish'
            maxRetries = 0
            clusterOptions = ' --gres=lscratch:200 '
            containerOptions = " $container_mounts "

            scratch = '/lscratch/$SLURM_JOBID'
            // with the default stageIn and stageOut settings using scratch can
            // result in humungous work folders
            // see https://github.com/nextflow-io/nextflow/issues/961 and
            //     https://www.nextflow.io/docs/latest/process.html?highlight=stageinmode
            stageInMode = 'symlink'
            stageOutMode = 'rsync'
        }

        // example for setting different parameters for jobs with a 'gpu' label
        //withLabel:gpu {
        //    queue = 'gpu'
        //    time = '36h'
        //    clusterOptions = " --gres=lscratch:400,gpu:v100x:1 "
        //    containerOptions = " --nv $mounts "
        //}

        singularity.enabled = true
        singularity.autoMounts = true
        singularity.cacheDir = "$PWD/singularity"
        singularity.envWhitelist='https_proxy,http_proxy,ftp_proxy,DISPLAY,SLURM_JOBID'
        
        timeline.enabled = true
        report.enabled = true
    }
}
[user@cn3144]$ nextflow run -profile biowulf hello.nf
N E X T F L O W  ~  version 20.10.0
Launching `hello.nf` [intergalactic_cray] - revision: f195027c60
executor >  slurm (15)
[34/d935ef] process > splitLetters        [100%] 1 of 1 ✔
HELLO WORLD!
[...snip...]
[97/85354f] process > convertToUpper (11) [100%] 14 of 14 ✔

The master process that submits jobs should be run either as a batch job or on an interactive node, not on the Biowulf login node.
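
A batch job for the master process could look roughly like the sketch below. The script name nextflow_main.sh and the resource values (CPUs, memory, walltime) are assumptions for illustration only; adjust them to your workflow. The module load and nextflow run lines are the same commands used in the interactive session above.

```shell
# Write a batch script for the master Nextflow process (file name is hypothetical).
cat > nextflow_main.sh <<'EOF'
#!/bin/bash
# Resource values are placeholders; the master process mostly waits on Slurm jobs.
#SBATCH --cpus-per-task=2
#SBATCH --mem=4g
#SBATCH --time=24:00:00

module load nextflow
nextflow run -profile biowulf hello.nf
EOF

# Submit from the directory that holds hello.nf and nextflow.config:
# sbatch nextflow_main.sh
```

The walltime should cover the whole workflow, since the master process has to stay alive until the last task has finished.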