nextflow on Biowulf

Nextflow is a domain-specific language modelled after UNIX pipes. It simplifies writing parallel and scalable pipelines. The version installed on our systems can run jobs locally (on the same machine) or by submitting to Slurm.

nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow. There are more than 90 pipelines available as part of nf-core. For a basic introduction to running an nf-core pipeline, see hello nf-core.

EPI2ME Labs maintains a collection of Nextflow bioinformatics workflows tailored to Oxford Nanopore Technologies long-read sequencing data. Workflow projects are prefixed with wf-.

The code that is executed at each pipeline stage can be written in a number of different languages (shell, python, R, ...).

Intermediate results for workflows are stored in the $PWD/work directory which allows resuming execution of pipelines.
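
For example, re-running a pipeline with -resume reuses the cached results in the work directory instead of recomputing them (shown here with the hello pipeline used below):

nextflow run hello              # initial run
nextflow run hello -resume      # re-run; completed tasks are restored from the cache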

The language used to write pipeline scripts is an extension of Groovy.

Nextflow is a complex workflow management tool. Please read the manual carefully and make sure to place appropriate limits on your pipeline to avoid submitting too many jobs or running too many local processes.
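
As a minimal sketch (values are illustrative, and the Biowulf-provided nextflow.config already sets submission limits), settings like these in a nextflow.config cap how many jobs or local tasks run at once:

// sketch of concurrency limits in a nextflow.config
executor {
    $slurm {
        queueSize = 100      // at most 100 Slurm jobs in flight at a time
    }
    $local {
        cpus = 4             // local executor uses at most 4 CPUs
        memory = '8 GB'      // ... and 8 GB of memory
    }
}
process {
    maxForks = 10            // at most 10 parallel instances of any single process
}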

Nextflow, when running many tasks, creates many temporary files in the ./work directory. Please make sure that your pipeline does not inadvertently create millions of small files, which would result in degraded file system performance.
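
Once a run no longer needs to be resumed, its cached files can be removed with nextflow clean; a dry run first shows what would be deleted:

nextflow log                    # list previous runs and their names
nextflow clean -n <run_name>    # dry run: show what would be removed
nextflow clean -f <run_name>    # delete the work files of that run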

By default, with no config file, Nextflow spawns parallel task executions on the computer on which it is running. This is not good practice on HPC systems, which are designed to share compute resources across many users. Please use -profile biowulflocal to utilize the resources allocated to your job.

Some EPI2ME pipelines do not work under nextflow/24.04, so please load nextflow/23.10 instead.
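
For example (available module versions may change over time):

module avail nextflow           # list installed nextflow versions
module load nextflow/23.10      # load the older version for EPI2ME workflows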

Common pitfalls
The slurm profile submits jobs with unlimited time and maximum memory, which get stuck in pending status
Please update to the most recent version of nextflow.config to avoid this bug:
cp /usr/local/apps/nextflow/nextflow.config .

FATAL error while mounting /gsX
The /gsX file systems have been retired from the cluster, so you may see errors like this:
WARNING: skipping mount of /gs6: stat /gs6: no such file or directory FATAL: container creation failed: mount /gs6->/gs6 error: while mounting /gs6: mount source /gs6 doesn't exist.

Please update to the most recent version of nextflow.config, and then run your pipeline again:
cp /usr/local/apps/nextflow/nextflow.config .

docker: command not found
The two most popular containerization systems are Singularity/Apptainer and Docker. HPC facilities do not use Docker because it provides root access to the host system; they use Singularity instead. For most pipelines, running with the biowulf or biowulflocal profile after copying the config file to your working directory will avoid this error:

cp /usr/local/apps/nextflow/nextflow.config . # only need to copy once

nextflow run xxxx -profile biowulflocal # run inside of interactive session


Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

First, let's do some basic local execution. For this we will allocate an interactive session:

[user@biowulf]$ sinteractive --mem=10g -c2 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ module load nextflow
[user@cn3144]$ nextflow run hello
N E X T F L O W  ~  version 23.10.0
Pulling nextflow-io/hello ...
 downloaded from https://github.com/nextflow-io/hello.git
 Launching `https://github.com/nextflow-io/hello` [elegant_almeida] DSL2 - revision: 7588c46ffe [master]
 executor >  local (4)
 [05/a7b6cf] process > sayHello (4) [100%] 4 of 4 ✔
 Ciao world!

 Bonjour world!

 Hello world!

 Hola world!

For the traditional hello world example we will parallelize the uppercasing of different language greetings:

# create file of greetings
[user@cn3144]$ mkdir testdir;cat > ./testdir/test1 <<EOF
Hello world!
Hallo world!
Ciao world!
Salut world!
Bongiorno world!
Servus world!
EOF
[user@cn3144]$ cat > ./testdir/test2 <<EOF
Gruess Gott world!
Na was los world!
Gruetzi world!
Hello world!
Come va world!
Ca va world!
Hi world!
Good bye world!
EOF
[user@cn3144]$ cat > test.R <<EOF
args <- commandArgs(trailingOnly = TRUE)
library(readr)
df <- read_lines(args[1])

# create output
sink(args[2])

for (each in df){
  cat (toupper(each))
  cat ('\n')
  }

sink()
EOF

We then create a file called rhello.nf that describes the workflow to be executed:

// Declare syntax version
nextflow.enable.dsl=2

params.output_dir = './results'

process getsbatchlist {
  module 'R'

  publishDir "${params.output_dir}"

  input:
   path(input_file)
   each Rs

  output:
   path "${input_file}.txt"

  script:
   """
   Rscript ${Rs} ${input_file} ${input_file}.txt
   """
}

workflow {
  def inputf =  Channel.fromPath('./testdir/test*')
  def Rs = Channel.fromPath('./test.R')
  getsbatchlist(inputf,Rs) | view
}

The workflow is executed with

[user@cn3144]$ nextflow run rhello.nf
N E X T F L O W  ~  version 23.04.1
Launching `rhello.nf` [hopeful_cray] DSL2 - revision: 7401a333f4
executor >  local (2)
[28/6ccadb] process > getsbatchlist (2) [100%] 2 of 2 ✔
/gpfs/gsfs8/users/apptest2/work/82/9d153e5b2a5ab4399ab36beb01e552/test2.txt
/gpfs/gsfs8/users/apptest2/work/28/6ccadb35b01a2755c9670375ec1a05/test1.txt

[user@cn3144]$ cat results/test1.txt
HELLO WORLD!
HALLO WORLD!
CIAO WORLD!
SALUT WORLD!
BONGIORNO WORLD!
SERVUS WORLD!

Note that results are out of order.

Config

The same workflow can be used to run each of the processes as a Slurm job by creating a nextflow.config file. We provide a file with the correct settings for Biowulf at /usr/local/apps/nextflow/nextflow.config. If you use this file, please don't change the settings for job submission and querying (pollInterval, queueStatInterval, and submitRateLimit). You may, however, want to remove the lscratch allocation if it does not apply to your workflow, although using lscratch as much as possible is encouraged.
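
If individual processes need different resources, they can be overridden in a small config of your own passed with -c, rather than by editing the provided file. A minimal sketch (the process name and values are hypothetical):

// my_overrides.config -- a sketch; MY_HEAVY_STEP is a hypothetical process name
process {
    withName: 'MY_HEAVY_STEP' {
        cpus = 8
        memory = '32 GB'
        time = '12h'
    }
}

Pass it in addition to the provided settings, e.g. nextflow run rhello.nf -profile biowulf -c my_overrides.config; -c merges your file with nextflow.config rather than replacing it.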

[user@cn3144]$ cp /usr/local/apps/nextflow/nextflow.config .
[user@cn3144]$ cat nextflow.config
[user@cn3144]$ nextflow run -profile biowulf hello.nf
N E X T F L O W  ~  version 20.10.0
Launching `hello.nf` [intergalactic_cray] - revision: f195027c60
executor >  slurm (15)
[34/d935ef] process > splitLetters       [100%] 1 of 1 ✔
HELLO WORLD!
[...snip...]
[97/85354f] process > convertToUpper (11) [100%] 14 of 14 ✔

Running Nextflow with the biowulf profile (slurm executor) using test input from nf-core:

[user@cn3144]$ nextflow run nf-core/sarek -profile test,biowulf --outdir testout
N E X T F L O W  ~  version 22.10.4
Launching `https://github.com/nf-core/sarek` [agitated_noyce] DSL2 - revision: c87f4eb694 [master]

WARN: Found unexpected parameters:
* --test_data_base: https://raw.githubusercontent.com/nf-core/test-datasets/modules
- Ignore this warning: params.schema_ignore_params = "test_data_base"



------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
      ____
    .´ _  `.
   /  |\`-_ \      __        __   ___
  |   | \  `-|    |__`  /\  |__) |__  |__/
   \ |   \  /     .__| /¯¯\ |  \ |___ |  \
    `|____\´

  nf-core/sarek v3.1.2
------------------------------------------------------
...

Run Nextflow with the local executor (biowulflocal profile) to utilize the CPUs and memory allocated on the compute node. The --mem and -c options for sinteractive are essential, and using lscratch as the work directory helps with troubleshooting:

[user@biowulf]$ sinteractive --mem=80g -c 32 --gres=lscratch:200
[user@cn3144]$ nextflow run nf-core/sarek -r 3.2.3 \
-profile biowulflocal \
--wes \
--joint_germline \
--input test.csv \
--tools haplotypecaller,vep,snpeff \
--outdir /data/$USER/sarek/ \
--genome GATK.GRCh38 \
--igenomes_base /fdb/igenomes_nf \
--save_output_as_bam \
-w /lscratch/$SLURM_JOB_ID \
--cache_version 110 \
--vep_cache /fdb/VEP/110/cache \
--snpeff_cache /fdb/snpEff/5.1d/data/

Pipeline settings can be provided in a YAML file via -params-file:

[user@cn3144]$ nextflow run nf-core/hic -profile biowulflocal -params-file params.yaml
[user@cn3144]$ cat params.yaml
input: './samplesheet.csv'
outdir: './results/'
fasta: 'https://github.com/nf-core/test-datasets/raw/hic/reference/W303_SGD_2015_JRIU00000000.fsa'
digestion: 'hindiii'
schema_ignore_params: 'genomes,digest,input_paths,input'
min_mapq: 10
min_restriction_fragment_size: 100
max_restriction_fragment_size: 100000
min_insert_size: 100
max_insert_size: 600
bin_size: '2000,1000'
res_dist_decay: '1000'
res_tads: '1000'
tads_caller: 'insulation,hicexplorer'
res_compartments: '2000'

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. nf_main.sh) to run the master process. For example:

#! /bin/bash
#SBATCH --job-name=nextflow-main
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --gres=lscratch:200
#SBATCH --time=24:00:00

module load nextflow
export NXF_SINGULARITY_CACHEDIR=/data/$USER/nxf_singularity_cache;
export SINGULARITY_CACHEDIR=/data/$USER/.singularity;
export TMPDIR=/lscratch/$SLURM_JOB_ID
export NXF_JVM_ARGS="-Xms2g -Xmx4g"

nextflow run nf-core/rnaseq -r 3.13.2 \
-profile biowulf \
--input samplesheet_test.csv \
--outdir /data/$USER/rnaseq_out \
--gtf /fdb/igenomes_nf/Homo_sapiens/Ensembl/pub/release-110/gtf/Homo_sapiens.GRCh38.110.gtf \
--fasta \
/fdb/igenomes_nf/Homo_sapiens/Ensembl/pub/release-110/fasta/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--star_index /fdb/igenomes_nf/Homo_sapiens/Ensembl/pub/release-110/STARindex/ \
--igenomes_ignore --genome null \
-resume


Submit this job using the Slurm sbatch command.

sbatch nf_main.sh

Create a batch script (e.g. nf_local.sh) to run with biowulflocal profile. For example:

#! /bin/bash
#SBATCH --job-name=nextflow-local
#SBATCH --cpus-per-task=32
#SBATCH --mem=80G
#SBATCH --gres=lscratch:200
#SBATCH --time=24:00:00

module load nextflow
export NXF_SINGULARITY_CACHEDIR=/data/$USER/nxf_singularity_cache;
export SINGULARITY_CACHEDIR=/data/$USER/.singularity;
export TMPDIR=/lscratch/$SLURM_JOB_ID

nextflow run nf-core/hic -profile biowulflocal -params-file params.yaml

Submit this job using the Slurm sbatch command.

sbatch nf_local.sh

Create a batch script (e.g. wf_basecalling_local.sh) to run with biowulflocal profile. For example:

#! /bin/bash
#SBATCH --job-name=wf-basecalling
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G
#SBATCH --time=4:00:00
#SBATCH --gres=lscratch:200,gpu:1
#SBATCH --partition=gpu
#SBATCH --constraint="gpua100|gpuv100x|gpuv100"

module load nextflow
export NXF_SINGULARITY_CACHEDIR=/data/$USER/nxf_singularity_cache;
export SINGULARITY_CACHEDIR=/data/$USER/.singularity;
export TMPDIR=/lscratch/$SLURM_JOB_ID

nextflow run epi2me-labs/wf-basecalling \
-profile biowulflocal \
-resume \
--input wf-basecalling-demo/input \
--ref wf-basecalling-demo/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \
--dorado_ext pod5 \
--out_dir output \
--basecaller_cfg dna_r10.4.1_e8.2_400bps_hac@v4.1.0 \
--remora_cfg "dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2"

Submit this job using the Slurm sbatch command.

sbatch wf_basecalling_local.sh

The master process that submits jobs should be run either as a batch job or on an interactive node - not on the Biowulf login node.