R/Bioconductor

R (the R Project) is a language and environment for statistical computing and graphics. Similar to the S language, R provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, ...), is highly extensible, and serves as an open-source alternative to S.

R is designed as a true computer language with control-flow constructions for iteration and alternation, and it allows users to add functionality by defining new functions. For computationally intensive tasks, C, C++ and Fortran code can be linked and called at run time. Lists of the installed packages and of obsolete packages are maintained on this site.

NOTE: Starting with version 2.14, R ships with built-in support for parallel programming. R is nevertheless single-threaded unless you take advantage of the parallel package(s) installed on the systems. Single, serial jobs are best run on your desktop machine or on Helix. There are two situations in which it is an advantage to run R on Biowulf: (1) submitting many independent R jobs simultaneously, e.g. with swarm, and (2) running parallel R code on multiple CPUs, e.g. with the parallel package, Rmpi, or snow.

There are multiple versions of R available. An easy way of selecting a version is to use modules. To see the available R modules, type

[user@helix ~]$ module avail R

To select a module, type

[user@helix ~]$ module load R/[ver]

where [ver] is the version of choice. The default version is 3.2.2.
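
After loading a module, you can confirm from within R which interpreter you actually got; a minimal check using only base R:

R.version.string                 # print the version string of the running R interpreter
# getRversion() returns a comparable version object, e.g.:
if (getRversion() < "3.2.0") warning("an older R module is loaded than expected")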

On Helix, once you have loaded the R module, type R at the prompt to get started. When the R module is loaded, several other modules (e.g., a compiler, MPI, Tcl/Tk, and math libraries such as ATLAS; see the module output in the interactive sessions below) are loaded as well, and they are unloaded again when the R module is unloaded.

R sessions are not allowed on the Biowulf login node. Please submit a batch job or allocate an interactive node.

For details on submitting jobs, see the Biowulf User Guide.

Create a script such as the following:

                   script file /home/username/runR
--------------------------------------------------------------------------
#!/bin/bash
# This file is runR
#
#SBATCH -J runR

date

module load R
R --vanilla < /data/username/Rtests/Rtest.r > /data/username/Rtests/Rtest.out
--------------------------------------------------------------------------

Submit the script using the 'sbatch' command, e.g.

sbatch /home/username/runR

To request more memory, use the --mem flag, e.g.

sbatch --mem=6g /home/username/runR
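
If you are unsure how much memory to request, one rough approach (a sketch, not an exact accounting) is to run a scaled-down version of the job and inspect R's peak memory use with gc(); the matrix below is just a stand-in for your real workload:

invisible(gc(reset = TRUE))                 # reset the "max used" statistics
x <- matrix(rnorm(1e7), ncol = 100)         # stand-in workload
g <- gc()
cat(sprintf("peak memory used: ~%.0f MB\n", sum(g[, 6])))   # column 6 is "max used" (Mb)

This only measures R's own allocations, so request some headroom beyond the reported peak.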

The swarm program is a convenient way to submit large numbers of jobs. Create a swarm command file containing a single job on each line, e.g.

                swarm command file /home/username/Rjobs
--------------------------------------------------------------------------
R --vanilla < /data/username/R/R1  > /data/username/R/R1.out
R --vanilla < /data/username/R/R2  > /data/username/R/R2.out
R --vanilla < /data/username/R/R3  > /data/username/R/R3.out
R --vanilla < /data/username/R/R4  > /data/username/R/R4.out
R --vanilla < /data/username/R/R5  > /data/username/R/R5.out
....
--------------------------------------------------------------------------

If each R process (a single line in the file above) requires less than 1 GB of memory, submit the file by typing:

swarm -f /home/username/Rjobs --module R

Swarm will create the SLURM batch scripts and submit the jobs to the system. If each process needs more than 1 GB of memory, request more with swarm's -g flag (gigabytes per process). See the Swarm documentation for more information.
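
If the input files follow a regular naming pattern, as above, the swarm command file itself can be generated from within R; a minimal sketch, assuming the paths from the example above:

# Write one swarm command line per input file R1 ... R100.
n <- 100
cmds <- sprintf("R --vanilla < /data/username/R/R%d > /data/username/R/R%d.out", 1:n, 1:n)
writeLines(cmds, "/home/username/Rjobs")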

The parallel package is installed on Biowulf. It provides functions for parallel execution of R code on machines with multiple CPUs. Unlike some other parallel processing methods, spawned jobs share the full state of R at the time of spawning, so no data or code needs to be initialized explicitly. The actual spawning is very fast as well, since no new R instance needs to be started.
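
For example, because functions like mclapply fork the current R process, workers can read objects defined in the master session without any explicit export. A minimal sketch (the data and the worker function are purely illustrative):

library(parallel)

big <- rnorm(1e6)                  # created once in the master session

# Each forked worker reads 'big' directly; nothing is copied or exported by hand.
chunks <- split(seq_along(big), rep(1:4, each = length(big) / 4))
res <- mclapply(chunks, function(idx) mean(big[idx]), mc.cores = 2)
str(res)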

The parallel package includes the detectCores function, which is often used to automatically detect the number of available CPUs. However, it always reports all CPUs physically present on a node, irrespective of how many CPUs were allocated to the job. Batch jobs should therefore use the following function to detect the number of allocated CPUs:

detectBatchCPUs <- function() {
    # CPUs allocated with --cpus-per-task
    ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
    if (is.na(ncores)) {
        # fall back to the per-node CPU count set by Slurm
        ncores <- as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE"))
    }
    if (is.na(ncores)) {
        return(4) # neither variable set; assume an interactive session on helix
    }
    return(ncores)
}

The number of detected cores can then be used as usual:

ncpus <- detectBatchCPUs() 
options(mc.cores = ncpus) 
mclapply(..., mc.cores = ncpus) 
makeCluster(ncpus)
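
Putting the pieces together, here is a hedged sketch of a complete parallel R batch job (the file name and the bootstrap workload are illustrative, not prescriptive):

# A sketch of, e.g., /data/username/Rtests/parjob.R
library(parallel)

detectBatchCPUs <- function() {
    ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
    if (is.na(ncores)) {
        ncores <- as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE"))
    }
    if (is.na(ncores)) {
        return(4)                  # fallback for interactive/helix sessions
    }
    return(ncores)
}

ncpus <- detectBatchCPUs()
options(mc.cores = ncpus)

# Illustrative workload: bootstrap the mean of a sample across the allocated CPUs.
x <- rnorm(1e5)
boot <- mclapply(1:1000, function(i) mean(sample(x, replace = TRUE)), mc.cores = ncpus)
cat(sprintf("%d CPUs, bootstrap SE of the mean: %f\n", ncpus, sd(unlist(boot))))

Such a script could be run from a batch script with, e.g., Rscript /data/username/Rtests/parjob.R.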

Batch R jobs using detectBatchCPUs can then be submitted with sbatch --cpus-per-task=X or swarm -t X and will behave properly.

Alternatively, if your code can make use of all CPUs on a node efficiently, you could allocate nodes exclusively (sbatch --exclusive or swarm -t auto) and use the built-in detectCores function.

R processes running on helix should limit themselves to 2-4 cores at the most.

Documentation for parallel

R can make use of implicit multithreading when calling a subset of optimized functions in its standard library, or functions that take advantage of parallelized routines in the lower-level math libraries.

The function crossprod(m), for example, which is equivalent to calculating t(m) %*% m, makes use of implicit parallelism in the underlying math libraries and can benefit from using more than one thread. The number of threads used by such functions is regulated by the environment variable OMP_NUM_THREADS, which the R module sets automatically when loaded as part of a batch or interactive job. Here is the runtime of this function for different values of OMP_NUM_THREADS:

[Figure: crossprod benchmark - elapsed time vs. OMP_NUM_THREADS]

The code used for this benchmark was

# this file is benchmark2.R
runs <- 3
o <- 2^13
b <- 0

for (i in 1:runs) {
  a <- matrix(rnorm(o*o), o, o)
  invisible(gc())
  timing <- system.time({
    b <- crossprod(a)		# equivalent to: b <- t(a) %*% a
  })[3]
  cat(sprintf("%f\n", timing))
}

And was called with

node$ module load R/3.4.0_gcc-4.9.1
node$ OMP_NUM_THREADS=1 Rscript benchmark2.R
node$ OMP_NUM_THREADS=2 Rscript benchmark2.R
...
node$ OMP_NUM_THREADS=32 Rscript benchmark2.R

from within a job that had been allocated 32 CPUs.
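
Within R itself, you can verify which value is in effect:

# Check the thread count visible to the underlying math libraries.
Sys.getenv("OMP_NUM_THREADS")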

Notes:

There also appears to be another level of parallelism within the R libraries. One function that takes advantage of it is dist. The level of parallelism allowed by this mechanism is set with two internal R functions (setMaxNumMathThreads and setNumMathThreads). Note that this is a distinct mechanism: setting OMP_NUM_THREADS has no impact on dist, and setMaxNumMathThreads has no impact on the performance of crossprod. Here is the performance of dist with different numbers of threads:

[Figure: dist benchmark - elapsed time vs. number of math threads]

The timings for this example were created with

# this file is benchmark1.R
rt <- data.frame()
o <- 2^12
m <- matrix(rnorm(o*o), o, o)
for (nt in c(1, 2, 4, 8, 16, 32)) {
    .Internal(setMaxNumMathThreads(nt)) 
    .Internal(setNumMathThreads(nt))
    res <- system.time(d <- dist(m))
    rt <- rbind(rt, c(nt, o, res[3]))
}
colnames(rt) <- c("threads", "order", "elapsed")
write.csv(rt, file="benchmark1.csv", row.names=F)

This was run with the following command, within an allocation of 32 CPUs:

node$ OMP_NUM_THREADS=1 Rscript benchmark1.R

The same notes about benchmarking as above apply. Also note that there is very little documentation about this to be found online.

Rmpi provides an MPI interface for R [Rmpi documentation].
The snow package (Simple Network of Workstations) implements a simple mechanism for using a workstation cluster for "embarrassingly parallel" computations in R [snow documentation].

Sample Rmpi batch script:

#!/bin/bash


module load R
R --vanilla > myrmpi.out_slurm <<EOF
library(Rmpi)
mpi.spawn.Rslaves(nslaves=$SLURM_NTASKS)
mpi.remote.exec(mpi.get.processor.name())
n <- 3
mpi.remote.exec(double, n)
mpi.close.Rslaves()
mpi.quit()
EOF

Submit the Rmpi example with:

sbatch --ntasks=4 --ntasks-per-core=1 filename.bat

If using more than 16 processors, you will need to request the multinode partition:

sbatch --ntasks=64 --partition=multinode filename.bat

Sample batch script using snow:

#!/bin/bash

module load R

R --vanilla > myrmpioutsnow_slurm <<EOF

library(snow)
cl <- makeCluster($SLURM_NTASKS, type = "MPI")
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
clusterCall(cl, runif, $SLURM_NTASKS)
stopCluster(cl)

EOF

Submit the snow example with:

sbatch --ntasks=4 --ntasks-per-core=1 filename.bat

Note that it is entirely up to the user to run the appropriate number of processes for the nodes requested. In the example above, Slurm sets the $SLURM_NTASKS variable to 4 based on the sbatch command, and the script uses this variable to start 4 snow processes. The file myrmpioutsnow_slurm contains the results of the finished job.

Production runs should be submitted as batch jobs as above, but for testing purposes an occasional interactive run may be useful.

Sample interactive session with Rmpi:

[user@biowulf ~]$ sinteractive -J myRmpitest --ntasks 4
salloc.exe: Granted job allocation 23208
[user@cn0004 ~]$ 
[user@cn0004 ~]$ module load R
[+] Loading gcc 4.4.7 ...
[+] Loading OpenMPI 1.8.1 for GCC 4.4.7 (ethernet) ...
[+] Loading tcl_tk 8.6.1
[+] Loading ATLAS 3.8.4 libraries...
[+] Loading R 3.2.0 on cn0004

[user@cn0004 ~]$ R --vanilla
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(Rmpi)
> mpi.spawn.Rslaves(nslaves=4)
[cn0004:57808] [[42360,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 2776104960
        4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: cn0004 
slave1 (rank 1, comm 1) of size 5 is running on: cn0004 
slave2 (rank 2, comm 1) of size 5 is running on: cn0004 
slave3 (rank 3, comm 1) of size 5 is running on: cn0004 
slave4 (rank 4, comm 1) of size 5 is running on: cn0004 

> demo("simplePI")
...
> simple.pi(10000)
[1] 3.141593
> mpi.close.Rslaves()
[1] 1
> mpi.quit()                      #very important
[user@cn0004 ~]
[user@cn0004 ~]$ exit
exit
salloc.exe: Relinquishing job allocation 23208
salloc.exe: Job allocation 23208 has been revoked.

[user@biowulf ~]

Sample interactive session with snow:

[user@biowulf ~]$ sinteractive -J mysnowtest --ntasks=4
salloc.exe: Granted job allocation 23210
[user@cn0004 ~]$ 
[user@cn0004 ~]$ module load R
[+] Loading gcc 4.4.7 ...
[+] Loading OpenMPI 1.8.1 for GCC 4.4.7 (ethernet) ...
[+] Loading tcl_tk 8.6.1
[+] Loading ATLAS 3.8.4 libraries...
[+] Loading R 3.2.0 on cn0004

[user@cn0004 ~]$ R --vanilla

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(snow)
> cl <- makeCluster(4, type = "MPI")
Loading required package: Rmpi
[cn0004:59204] [[41964,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 2750152704
        4 slaves are spawned successfully. 0 failed.

> sum(parApply(cl, matrix(1:100,10), 1, sum))
[1] 5050
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename  machine 
"cn0004" "x86_64" 

[[2]]
nodename  machine 
"cn0004" "x86_64" 

[[3]]
nodename  machine 
"cn0004" "x86_64" 

[[4]]
nodename  machine 
"cn0004" "x86_64" 

> clusterCall(cl, runif, 3)
[[1]]
[1] 0.4372988 0.9606632 0.9975864

[[2]]
[1] 0.7348379 0.8070261 0.8572994

[[3]]
[1] 0.7965404 0.7158484 0.3440546

[[4]]
[1] 0.83963913 0.70896938 0.06466967

> stopCluster(cl)
[1] 1
> mpi.quit()
[user@cn0004 ~]$ 
[user@cn0004 ~]$ exit
exit
salloc.exe: Relinquishing job allocation 23210
salloc.exe: Job allocation 23210 has been revoked.
[user@biowulf ~]

'Rswarm' is a utility that creates a series of R input files from a single (master) R template file, each with a different output filename and a unique seed for the random number generator. It simultaneously creates a swarm command file that can be used to submit the resulting swarm of R jobs. Rswarm was originally developed by Lori Dodd and Trevor Reeve with modifications by the Biowulf staff. We first provide an example of its use, followed by a more detailed description of its usage.

Suppose, for the purposes of this example, that the goal of the simulation study is to evaluate properties of the t-test. The code below is meant to be sufficiently general yet simple: the function sim.fun repeatedly generates random normal data, performs the t-test, and extracts the p-value. Suppose the following code is saved in a file called sim.fun.R:

sim.fun <- function(n.samp=100, mu=0, sd=1, n.sim, output1, seed){

#######################################
# n.samp is the number of samples generated within each simulation
# mu is the specified mean
# sd is the standard deviation
# n.sim is the number of simulations, which will be specified as 50
# output1 is the name of the output file
# seed is the seed for set.seed
#######################################

    set.seed(seed)

    for (i in 1:n.sim){
        y <- rnorm(n.samp, mean=mu, sd=sd)
        out.table1 <- t.test(y)$p.value
        # overwrite the output file on the first iteration, append afterwards
        if (i==1) APPEND <- FALSE
        else APPEND <- TRUE

        write.table(out.table1, output1, append=APPEND, row.names=FALSE, col.names=FALSE)
    }
}

Now suppose you create a two-line file called Rfile.R:

source("sim.fun.R")
sim.fun(n.sim=DUMX, output1="DUMY1",seed=DUMZ)

To swarm this code, we need replicates of Rfile.R (say, 100 of them), each with a different seed and a different output file. The Rswarm utility will create the specified number of replicates, supply each with a different seed (taken from an external file of seed numbers), and create a unique output file name for each replicate. Note that you specify the number of simulations within each file in addition to the number of replicates.
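
The seed file is simply a text file with one seed per line; as a hypothetical illustration, it could be generated from R like this (the file name and seed count are arbitrary):

# Write 100 random integer seeds, one per line, to seedfile.txt.
set.seed(2015)                     # illustrative: makes the seed file itself reproducible
writeLines(as.character(sample.int(1e6, 100)), "seedfile.txt")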

Typing the following Rswarm command at the Biowulf prompt will create 2 replicate files, each specifying 50 simulations, a different seed taken from seedfile.txt, and a unique output file:

Rswarm --rfile=Rfile.R --sfile=seedfile.txt --path=/data/user/dir/ \
    --reps=2 --sims=50 --start=0 --ext1=.txt

The first of the two output files will be named Rfile1.R, with the placeholders replaced:

source("sim.fun.R")
sim.fun(n.sim=50, output1="/data/user/dir/Rfile1.txt", seed=25)

The second file is similar, except that the output file is named Rfile2.txt and the seed differs. A corresponding swarm command file (Rfile.sw) is also generated, so the two jobs can be submitted simply by typing:

swarm -f Rfile.sw --module R

Additional details on Rswarm usage:

Usage: Rswarm [options]
   --rfile=[file]   (required) R program requiring replication
   --sfile=[file]   (required) file with generated seeds, one per line
   --path=[path]    (required) directory for output of all files
   --reps=[i]       (required) number of replicates desired
   --sims=[i]       (required) number of sims per file
   --start=[i]      (required) starting file number
   --ext1=[string]    (optional) file extension for output file 1
   --ext2=[string]    (optional) file extension for output file 2
   --help, -h         print this help text

To use Rswarm, create an R template file containing the desired R commands (see the example template above). Within the template file, four items must be specified, each with a specific notation that the Rswarm utility recognizes:

              Item in R template file                      Notation
--------------------------------------------------------------------------
Number of simulations in each replicate file               DUMX
Output file 1, which captures results                      "DUMY1"
Output file 2 (optional)                                   "DUMY2"
Random seed                                                DUMZ

For example, DUMX indicates the number of simulations to be performed: when creating the replicate files, Rswarm replaces occurrences of DUMX with the specified number of simulations. Likewise, Rswarm replaces occurrences of "DUMY1" with the name of the output file, and occurrences of DUMZ with a unique seed pulled from the seed file. In this example the R template file is called Rfile.R, and the seed file is a plain text file containing one randomly generated seed per line. Typing

Rswarm --rfile=Rfile.R --sfile=seedfile.txt --path=/data/user/dir/ 
   --reps=2 --sims=50 --start=0 --ext1=.txt
(all on one line) will produce 2 files, named Rfile1.R and Rfile2.R, each of which performs 50 simulations. The seed for Rfile1.R will be the first entry in seedfile.txt, and the seed for Rfile2.R will be the second. The output files will be written to /data/user/dir/ as Rfile1.txt and Rfile2.txt. Note that if a start number other than 0 is specified, the files are numbered differently: with --start=5, for example, the files will be Rfile6.R and Rfile7.R, and the corresponding output file names change accordingly.

This creates the two replicate files and the swarm command file. At the Biowulf prompt, typing

swarm -f Rfile.sw --module R

will execute the swarm command for these files.

The output files Rfile1.txt and Rfile2.txt will be created. After the program has completed, these files can be concatenated into a single file named outRfile.txt with the following command:

cat Rfile*.txt > outRfile.txt
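
Back in R, the combined results can then be summarized. A minimal sketch: since sim.fun generates data under the null hypothesis (mu=0), the fraction of p-values below 0.05 estimates the t-test's type I error rate:

# Read the concatenated p-values and estimate the rejection rate at the 0.05 level.
pvals <- scan("outRfile.txt")
cat(sprintf("%d simulations, estimated type I error rate: %.3f\n",
            length(pvals), mean(pvals < 0.05)))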