nextflow on Biowulf

Nextflow is a domain-specific language modelled after UNIX pipes. It simplifies writing parallel and scalable pipelines. The version installed on our systems can run jobs locally (on the same machine) or by submitting to Slurm.

nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow. There are more than 90 pipelines available as part of nf-core.

EPI2ME Labs maintains a collection of Nextflow bioinformatics workflows tailored to Oxford Nanopore Technologies long-read sequencing data. Workflow projects are prefixed with wf-.

The code that is executed at each pipeline stage can be written in a number of different languages (shell, python, R, ...).

Intermediate results for workflows are stored in the $PWD/work directory, which allows resuming execution of pipelines.
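
For example, an interrupted run can be restarted from the cached results in ./work by adding -resume, and the work directory can be cleaned up afterwards with nextflow clean (main.nf below is just a placeholder for your pipeline script):

nextflow run main.nf              # initial run
nextflow run main.nf -resume      # re-run, reusing cached results from ./work
nextflow clean -f                 # remove the work directories of the previous run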

The language used to write pipeline scripts is an extension of Groovy.
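
For example, ordinary Groovy constructs such as variables, maps, and closures can be used directly in a pipeline script; a minimal sketch:

// plain Groovy inside a Nextflow script
def samples = ['wt': 'wild type', 'ko': 'knockout']
samples.each { id, label -> println "$id -> $label" }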

Nextflow is a complex workflow management tool. Please read the manual carefully and make sure to place appropriate limits on your pipeline to avoid submitting too many jobs or running too many local processes.

When running many tasks, Nextflow creates many temporary files in the ./work directory. Please make sure that your pipeline does not inadvertently create millions of small files, which would degrade file system performance.

By default (with no config file), Nextflow spawns parallel task executions on the computer on which it is running. This is not good practice on HPC systems, which are designed to share compute resources across many users. Please use -profile biowulflocal to utilize the resources allocated to your job.
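
As a sketch, limits on concurrency can be set in your own nextflow.config; the numbers below are illustrative only and should be adapted to your allocation:

// illustrative limits only
executor {
    queueSize       = 100       // at most 100 tasks queued/running at a time
    submitRateLimit = '6/1min'  // at most 6 job submissions per minute
}
process {
    maxForks = 4                // at most 4 parallel instances of any single process
}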

Common pitfalls
FATAL error while mounting /gsx
The /gsx filesystems have been retired from the cluster. If you see an error like this:
WARNING: skipping mount of /gs6: stat /gs6: no such file or directory FATAL: container creation failed: mount /gs6->/gs6 error: while mounting /gs6: mount source /gs6 doesn't exist.

Please update to the most recent version of nextflow.config, and then run your pipeline again:
cp /usr/local/apps/nextflow/nextflow.config .

docker: command not found
The two most popular containerization systems are Singularity/Apptainer and Docker. HPC facilities do not run Docker because it provides root access to the host system; they use Singularity instead. For most pipelines, running with the biowulf or biowulflocal profile after copying the config file to your working directory will avoid this error:

cp /usr/local/apps/nextflow/nextflow.config . # only need to copy once

nextflow run xxxx -profile biowulflocal # run inside an interactive session
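
If a pipeline still tries to call docker, you can also disable Docker and force Singularity explicitly in your own config; a minimal sketch (singularity.enabled and autoMounts are already set in the provided nextflow.config):

// force Singularity, keep Docker off
singularity.enabled    = true
singularity.autoMounts = true
docker.enabled         = false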


Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

First, let's do some basic local execution. For this we will allocate an interactive session:

[user@biowulf]$ sinteractive --mem=10g -c2 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ module load nextflow
[user@cn3144]$ nextflow run hello
N E X T F L O W  ~  version 23.10.0
Pulling nextflow-io/hello ...
 downloaded from https://github.com/nextflow-io/hello.git
 Launching `https://github.com/nextflow-io/hello` [elegant_almeida] DSL2 - revision: 7588c46ffe [master]
 executor >  local (4)
 [05/a7b6cf] process > sayHello (4) [100%] 4 of 4 ✔
 Ciao world!

 Bonjour world!

 Hello world!

 Hola world!
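
The hash shown in brackets (e.g. 05/a7b6cf) is the prefix of that task's directory under ./work. Inspecting it is the easiest way to debug a failing task; it typically contains the executed script (.command.sh), the captured output (.command.log, .command.out, .command.err), and the exit status (.exitcode):

[user@cn3144]$ ls -a work/05/a7b6cf*/
[user@cn3144]$ cat work/05/a7b6cf*/.command.sh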

For the traditional hello world example, we will parallelize the uppercasing of greetings in different languages:

# create file of greetings
[user@cn3144]$ mkdir testdir;cat > ./testdir/test1 <<EOF
Hello world!
Hallo world!
Ciao world!
Salut world!
Bongiorno world!
Servus world!
EOF
[user@cn3144]$ cat > ./testdir/test2 <<EOF
Gruess Gott world!
Na was los world!
Gruetzi world!
Hello world!
Come va world!
Ca va world!
Hi world!
Good bye world!
EOF
[user@cn3144]$ cat > test.R <<EOF
args <- commandArgs(trailingOnly = TRUE)
library(readr)
df <- read_lines(args[1])

# create output
sink(args[2])

for (each in df){
  cat (toupper(each))
  cat ('\n')
  }

sink()
EOF
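
Before wiring the script into a workflow, it can be tested on its own (assuming the R module, which provides readr, is loaded):

[user@cn3144]$ module load R
[user@cn3144]$ Rscript test.R ./testdir/test1 /tmp/test1_upper.txt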

We then create a file called rhello.nf that describes the workflow to be executed:

// Declare syntax version
nextflow.enable.dsl=2

params.output_dir = './results'

process getsbatchlist {
  module 'R'
  publishDir "${params.output_dir}"

  input:
   path(input_file)
   each Rs

  output:
   path "${input_file}.txt"

  script:
   """
   Rscript ${Rs} ${input_file} ${input_file}.txt
   """
}

workflow {
  def inputf =  Channel.fromPath('./testdir/test*')
  def Rs = Channel.fromPath('./test.R')
  getsbatchlist(inputf,Rs) | view
}

The workflow is executed with

[user@cn3144]$ nextflow run rhello.nf
N E X T F L O W  ~  version 23.04.1
Launching `rhello.nf` [hopeful_cray] DSL2 - revision: 7401a333f4
executor >  local (2)
[28/6ccadb] process > getsbatchlist (2) [100%] 2 of 2 ✔
/gpfs/gsfs8/users/apptest2/work/82/9d153e5b2a5ab4399ab36beb01e552/test2.txt
/gpfs/gsfs8/users/apptest2/work/28/6ccadb35b01a2755c9670375ec1a05/test1.txt

[user@cn3144]$ cat results/test1.txt
HELLO WORLD!
HALLO WORLD!
CIAO WORLD!
SALUT WORLD!
BONGIORNO WORLD!
SERVUS WORLD!

Note that results are out of order.
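
Channel items are emitted in task completion order, not input order. If a deterministic order matters, the results can be collected and sorted before viewing; a sketch for the workflow above:

workflow {
  def inputf = Channel.fromPath('./testdir/test*')
  def Rs     = Channel.fromPath('./test.R')
  // gather all output paths into one list, sorted by file name, then print it
  getsbatchlist(inputf, Rs) | toSortedList { it.name } | view
}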

The same workflow can be used to run each of the processes as a Slurm job by creating a nextflow.config file. We provide a file with correct settings for Biowulf at /usr/local/apps/nextflow/nextflow.config. If you use this file, please don't change the settings that control job submission and queue querying (pollInterval, queueStatInterval, and submitRateLimit). Other settings can be adapted to your workflow; for example, you might remove the lscratch allocation if it does not apply, although we encourage using lscratch as much as you can.

[user@cn3144]$ cp /usr/local/apps/nextflow/nextflow.config .
[user@cn3144]$ cat nextflow.config
params {
  config_profile_description = 'Biowulf nf-core config'
  config_profile_contact = 'staff@hpc.nih.gov'
  config_profile_url = 'https://hpc.nih.gov/apps/nextflow.html'
  max_memory = '224 GB'
  max_cpus = 32
  max_time = '72 h'

  igenomes_base = '/fdb/igenomes_nf/'
}


// Use the local executor for short jobs. The interactive session must be
// allocated with -c and --mem so that Nextflow can pick up the available
// resources automatically; the settings below may have to be adapted to
// the allocation of the main nextflow job.

singularity {
    enabled = true
    autoMounts = true
    cacheDir = "/data/$USER/nxf_singularity_cache"
    envWhitelist='https_proxy,http_proxy,ftp_proxy,DISPLAY,SLURM_JOB_ID,SINGULARITY_BINDPATH'
}

env {
    SINGULARITY_CACHEDIR="/data/$USER/.singularity"
    PYTHONNOUSERSITE = 1
}

profiles {
    biowulflocal {
        process {
            executor = 'local'
            cache = 'lenient'
            maxRetries = 3
            queueSize = 100
            memory = "$SLURM_MEM_PER_NODE MB"
            cpus = "$SLURM_CPUS_PER_TASK"
       }
    }

    biowulf {
        process {
            executor = 'slurm'
            maxRetries = 1
            queue = 'norm'
            queueSize = 200
            pollInterval = '2 min'
            queueStatInterval = '5 min'
            submitRateLimit = '6/1min'
            retry.maxAttempts = 1

            clusterOptions = ' --gres=lscratch:200 '

            scratch = '/lscratch/$SLURM_JOB_ID'
            // with the default stageIn and stageOut settings using scratch can
            // result in humungous work folders
            // see https://github.com/nextflow-io/nextflow/issues/961 and
            //     https://www.nextflow.io/docs/latest/process.html?highlight=stageinmode
            stageInMode = 'symlink'
            stageOutMode = 'rsync'

            // for running pipeline on group sharing data directory, this can avoid inconsistent files timestamps
            cache = 'lenient'

        // example for setting different parameters for jobs with a 'gpu' label
        // withLabel:gpu {
        //    queue = 'gpu'
        //    time = '4h'
        //    clusterOptions = " --gres=lscratch:400,gpu:1 "
        //    clusterOptions = ' --constraint="gpua100|gpuv100|gpuv100x" '
        //    containerOptions = " --nv "
        // }

        // example for setting different parameters for a process name
        //  withName: 'FASTP|MULTIQC' {
        //  cpus = 6
        //  queue = 'quick'
        //  memory = '6 GB'
        //  time = '4h'
        // }

        // example for setting different parameters for jobs with a resource label
        //  withLabel:process_low {
        //  cpus = 2
        //  memory = '12 GB'
        //  time = '4h'
        // }
        // withLabel:process_medium {
        //  cpus = 6
        //  memory = '36 GB'
        //  time = '12h'
        // }
        // withLabel:process_high {
        //  cpus = 12
        //  memory = '72 GB'
        //  time = '16 h'
        // }
     }
        timeline.enabled = true
        report.enabled = true
    }
}
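
Pipeline- or process-specific resources (such as the commented-out withName/withLabel examples above) are best kept in a small additional config file and layered on top with -c; a sketch, where my_overrides.config is an arbitrary file name:

// my_overrides.config -- loaded in addition to ./nextflow.config
process {
    withName: 'FASTP' {
        cpus   = 6
        memory = '6 GB'
        time   = '4h'
    }
}

It is then passed on the command line, e.g. nextflow run -profile biowulf -c my_overrides.config hello.nf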


[user@cn3144]$ nextflow run -profile biowulf hello.nf
N E X T F L O W  ~  version 20.10.0
Launching `hello.nf` [intergalactic_cray] - revision: f195027c60
executor >  slurm (15)
[34/d935ef] process > splitLetters        [100%] 1 of 1 ✔
HELLO WORLD!
[...snip...]
[97/85354f] process > convertToUpper (11) [100%] 14 of 14 ✔

Running Nextflow with the biowulf profile (slurm executor) using test input from nf-core:
[user@cn3144]$ nextflow run nf-core/sarek -profile test,biowulf --outdir testout
N E X T F L O W  ~  version 22.10.4
Launching `https://github.com/nf-core/sarek` [agitated_noyce] DSL2 - revision: c87f4eb694 [master]

WARN: Found unexpected parameters:
* --test_data_base: https://raw.githubusercontent.com/nf-core/test-datasets/modules
- Ignore this warning: params.schema_ignore_params = "test_data_base"



------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
      ____
    .´ _  `.
   /  |\`-_ \      __        __   ___
  |   | \  `-|    |__`  /\  |__) |__  |__/
   \ |   \  /     .__| /¯¯\ |  \ |___ |  \
    `|____\´

  nf-core/sarek v3.1.2
------------------------------------------------------
...

Run Nextflow with the local executor (biowulflocal profile) to utilize the CPUs and memory allocated on the compute node. Allocating with --mem and -c is essential, and using lscratch as the work directory is recommended and makes troubleshooting easier:

[user@biowulf]$ sinteractive --mem=80g -c 32 --gres=lscratch:200
[user@cn3144]$ nextflow run nf-core/sarek -r 3.2.3 \
-profile biowulflocal \
--wes \
--joint_germline \
--input test.csv \
--tools haplotypecaller,vep,snpeff \
--outdir /data/$USER/sarek/ \
--genome GATK.GRCh38 \
--igenomes_base /fdb/igenomes_nf \
--save_output_as_bam \
-w /lscratch/$SLURM_JOB_ID \
--cache_version 110 \
--vep_cache /fdb/VEP/110/cache \
--snpeff_cache /fdb/snpEff/5.1d/data/
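
Once a run has started, Nextflow's own bookkeeping can be used to follow it: nextflow log lists previous executions launched from the current directory, and .nextflow.log holds the detailed log of the latest run:

[user@cn3144]$ nextflow log              # list runs with their name, status, and duration
[user@cn3144]$ tail -f .nextflow.log     # follow the detailed log of the current run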

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. nf_main.sh) to run the master process. For example:

#!/bin/bash
#SBATCH --job-name=nextflow-main
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --gres=lscratch:200
#SBATCH --time=24:00:00

module load nextflow
export NXF_SINGULARITY_CACHEDIR=/data/$USER/nxf_singularity_cache;
export SINGULARITY_CACHEDIR=/data/$USER/.singularity;
export TMPDIR=/lscratch/$SLURM_JOB_ID
export NXF_JVM_ARGS="-Xms2g -Xmx4g"

nextflow run nf-core/rnaseq -r 3.13.2 \
-profile biowulf \
--input samplesheet_test.csv \
--outdir /data/$USER/rnaseq_out \
--gtf /fdb/igenomes_nf/Homo_sapiens/Ensembl/pub/release-110/gtf/Homo_sapiens.GRCh38.110.gtf \
--fasta \
/fdb/igenomes_nf/Homo_sapiens/Ensembl/pub/release-110/fasta/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--star_index /fdb/igenomes_nf/Homo_sapiens/Ensembl/pub/release-110/STARindex/ \
--igenomes_ignore --genome null \
-resume


Submit this job using the Slurm sbatch command.

sbatch nf_main.sh
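
The master job and any jobs it submits can be monitored with the usual Slurm tools, and because the script above passes -resume, resubmitting the same batch file after a failure continues from the cached results:

squeue -u $USER       # shows the nextflow-main job plus tasks submitted by the slurm executor
sbatch nf_main.sh     # resubmit after a failure; cached results in ./work are reused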

The master process submitting jobs should be run either as a batch job or on an interactive node - not on the Biowulf login node.