cmdstan on Biowulf
Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.
-- from https://mc-stan.org/

cmdstan is the command line interface to Stan. Because it is meant to be used from within its own source tree, Biowulf provides a helper script that assists users in compiling their own local copy of cmdstan, as well as an example model, data, and batch script for illustration.

Stan models have to be explicitly written to use within-chain parallelism. See the stan documentation.

R and python include stan interfaces as well.

Batch job

cmdstan assumes that it is run from within the source tree it was built in. To make it simple to use, we provide a wrapper script that builds a local copy of cmdstan with a given compiler/MPI combination. The script, along with the source tarballs, is all the module provides.

As a first step, create a cmdstan build:

[user@biowulf]$ module load cmdstan
[+] Loading cmdstan  2.21.0-mpi
[user@biowulf]$ cmdstan help
NAME
  cmdstan -- cmdstan installation wrapper
SYNOPSIS
  cmdstan 
COMMANDS
  help     show this help message
  compile [--mpi|--nompi]
           submit a compile job. Defaults to --mpi
  example [--nobuild]
           copy example to local stan directory and build it
           (unless the --nobuild option is used)
DESCRIPTION
  The `cmdstan` wrapper script will submit a batch job to build
  cmdstan in the current directory. You will be notified by
  mail when the build is finished.
  It allows building either the default build or an MPI-based
  build.
VERSIONS
  cmdstan 2.30.1
  gcc     11.3.0
  mpi     openmpi/4.1.3/gcc-11.3.0
  arch    x2680
[user@biowulf]$ cmdstan compile
Submitted compile job 53209944
You will receive an email when compile is complete

This will submit a batch job that compiles cmdstan in the current working directory, using the processor architecture, compiler, and MPI module indicated in the help message. An email notification is sent once the job is done. Use --nompi to build cmdstan with TBB (thread) based parallelism for running within a single node. In either case, models have to be written explicitly to enable within-chain parallelism, and between-chain parallelism may not function properly with MPI.
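
For example, to build the threads-only version instead:

[user@biowulf]$ cmdstan compile --nompi

A model built against a threads-only cmdstan is typically run on a single node with the STAN_NUM_THREADS environment variable set to the number of allocated CPUs. A minimal sketch, assuming a hypothetical model ./my_model with data file my_data.R and a model written to use reduce_sum or map_rect:

# sketch only - my_model and my_data.R are placeholders
export STAN_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_model sample data file=my_data.R output file=output.csv

Once the MPI build is done: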

[user@biowulf]$ ls -lh
drwxr-xr-x 10 user group   4096 Mar 24 09:00 cmdstan-2.21.0-mpi
-rw-r--r--  1 user group 126916 Mar 24 09:04 cmdstan-2.21.0-mpi.53209944.log
[user@biowulf]$ cd cmdstan-2.21.0-mpi

Stan models have to be compiled into standalone executables from within this directory. The wrapper script can compile an example - again as a batch job, because it requires the MPI module and needs to be compiled on an InfiniBand node:

[user@biowulf]$ cmdstan example
running from within compiled cmdstan home dir
copying example
one sec - compiling model as a batch job
  command run: module load openmpi/3.1.4/gcc-9.2.0; make linreg/linreg_par
  This command can be run from a batch job or sinteractive session
  where the openmpi library can be loaded.

Submitted compile job 53210535
You will receive an email when compile is complete
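
As the message notes, the printed make command can also be run as a one-off batch job instead of an sinteractive session. A minimal sketch using sbatch --wrap, with resources matching the interactive example further below:

[user@biowulf]$ sbatch --constraint=x2680 --mem=4g \
      --wrap="module load openmpi/3.1.4/gcc-9.2.0; make linreg/linreg_par"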

The cmdstan example command creates a linreg directory containing the model source, the compiled model, and a batch script for submission. Let's run the example model on a single node with 28 tasks:

[user@biowulf]$ cd linreg
[user@biowulf]$ cat batch.sh
#! /bin/bash
#SBATCH --ntasks=28
#SBATCH --ntasks-per-core=1
#SBATCH --constraint=x2680
#SBATCH --exclusive
#SBATCH --partition=multinode
#SBATCH --gres=lscratch:10
#SBATCH --mem-per-cpu=2g

module load openmpi/3.1.4/gcc-9.2.0

N=1000
n=500
S=$SLURM_NTASKS
O=4
fn=/lscratch/$SLURM_JOB_ID/dat_order_${O}_shards_${S}.Rdump

# copy the input file and set the number of shards
sed "s/shards <- S/shards <- $S/" \
    dat_order_${O}_shards_S.Rdump > $fn


srun --mpi=pmix ./linreg_par  \
    sample num_samples=$N num_warmup=$n \
    data file=$fn \
    output file=output_order_${O}_shards_${S}.csv
[user@biowulf]$ sbatch batch.sh
53161946
[user@biowulf]$ sleep 15m
[user@biowulf]$ cat slurm-53161946.out
...snip...
Gradient evaluation took 0.07 seconds
1000 transitions using 10 leapfrog steps per transition would take 700 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 1500 [  0%]  (Warmup)
Iteration:  100 / 1500 [  6%]  (Warmup)
Iteration:  200 / 1500 [ 13%]  (Warmup)

This model scales well up to 16 tasks (32 CPUs, since we set --ntasks-per-core=1 in this example) and still runs at about 70% efficiency with 28 tasks (56 CPUs). In addition, it appears to do fairly well with 2 tasks per core. In the following graph, runtime is shown as a function of the number of CPUs: light grey circles used 1 task per core, black circles used 2 tasks per core, and the line shows ideal scaling.

[Figure: cmdstan scaling - runtime as a function of the number of CPUs]
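
To try 2 tasks per core on the same node type, only the corresponding sbatch directives in batch.sh need to change. A sketch (56 tasks fill the 28 physical cores of an x2680 node at 2 tasks per core):

#SBATCH --ntasks=56
#SBATCH --ntasks-per-core=2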

Here is how you would compile your own model. We'll use the same example as above.

[user@biowulf]$ mkdir linreg2
[user@biowulf]$ cp linreg/linreg_par.stan linreg2
[user@biowulf]$ sinteractive --constraint=x2680 --mem=4g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load openmpi/3.1.4/gcc-9.2.0
[user@cn3144]$ make linreg2/linreg_par
...
[user@cn3144]$ exit
[user@biowulf]$
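
To run the newly compiled model, a batch script following the pattern of the example above can be used. A minimal sketch, in which the data file, sampler settings, and task count are placeholders to adapt to your own model:

#! /bin/bash
#SBATCH --ntasks=16
#SBATCH --ntasks-per-core=1
#SBATCH --constraint=x2680
#SBATCH --partition=multinode
#SBATCH --mem-per-cpu=2g

module load openmpi/3.1.4/gcc-9.2.0

# my_data.Rdump and the sampler settings are placeholders
srun --mpi=pmix ./linreg2/linreg_par \
    sample num_samples=1000 num_warmup=500 \
    data file=my_data.Rdump \
    output file=output.csv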