There are three distinct ways to run Python code or applications on Biowulf:
- Using one of the general purpose Python modules (see the table below for which modules are available and what packages are installed). The many packages available in these environments are updated regularly. These modules are convenient and are most useful for development or when the exact version of installed packages is not important. Note that `/usr/local/bin/python` and `/usr/local/bin/python3` are links to general purpose Python 3 environments.
- Using a private installation of mambaforge. Users who need stable, reproducible environments are encouraged to install mambaforge in their data directory and create their own private environments. This is less convenient than the general use environments, but it allows users to quickly install packages themselves, create invariant environments for stability, and install packages that may not be suited for inclusion in the general purpose environments.
- Using the system Python 3.6, which is located at `/usr/bin/python3`. This is the Python installed with the operating system; it contains few packages and is rarely used.
- Jan 2025: Addition of Python 3.11 and 3.12
- Environments for Python 3.11 and 3.12 have been added. These contain new versions of PyTorch, TensorFlow, and other CUDA-utilizing packages.
- Due to the CUDA versions used, these environments are not compatible with our K80 GPUs. Modules for Python 3.10 or earlier must be used on K80 GPUs.
- The modules for 3.11 and 3.12 no longer provide the `conda` command.
- Python 3.11+ modules now set the environment variable `PYTHONNOUSERSITE=1`. This prevents installation of packages into the home directory on top of the module Python unless you opt out by unsetting or changing that variable.
- Jun 2023: Cluster update from CentOS 7 to Rocky 8
Coincident with the update of the Biowulf operating system, there were several major changes to the Python setup:
- Python 2.7 is no longer provided in any form and must be run from user environments.
- The `python/2.7` and `python/3.7` environments and modules were retired.
- The operating system Python is now Python 3.6 at `/usr/bin/python3`.
- `/usr/local/bin/python` is now the Python 3.9 interpreter.
- A `python/3.10` module is now available and is the default version.
- User environments that were using symbolic links to the central package cache likely were broken during the transition and will have to be rebuilt.
- Jan 2022
- A python/3.9 module is now available
- The python/3.8 module is now the default Python module
- `/usr/local/bin/python3` is now Python 3.8
- The python/3.5 and 3.6 modules are deprecated. Please move to python>=3.7.
- Various package updates.
- mpi4py has been updated and migrated to a different MPI library. Use `mpiexec` instead of `srun`.
This is a list of common pitfalls for python users on the cluster. Some of these are discussed in other sections of this documentation in more detail.
- Some commonly used command line tools are also included in the Python environments
- Some of the Python packages in our environments depend on commonly used command line tools. Since our environments are created with conda, conda insists on installing these tools along with the Python package. A common example is samtools/bcftools. As a result, loading a python module after loading a samtools module will result in the samtools from the Python environment appearing first in your PATH.
- Implicit multithreading while using multiprocessing
- Certain Python libraries can use implicit multithreading to speed up some numerical algorithms. The number of threads used for this is determined by the environment variable `OMP_NUM_THREADS`. If this is done while also using multiprocessing, batch jobs can become overloaded. For example, creating a multiprocessing pool with 10 worker processes while also setting `OMP_NUM_THREADS` to 10 results in 10 * 10 = 100 threads. To avoid this happening accidentally, the python modules set `OMP_NUM_THREADS` to 1; it needs to be set to more than one explicitly to take advantage of parallelized math libraries.
- Overloaded jobs due to nested parallelism
- One example of this is passing a model that parallelizes, such as `XGBClassifier` (xgboost), to `RandomizedSearchCV` (scikit-learn), which parallelizes the search across a parameter space. Both will by default attempt to use all CPUs. So if there are N CPUs detected, RandomizedSearchCV will start N processes running XGBClassifier, and each process will run N threads. This can be prevented by explicitly specifying the number of parallel threads/processes at each level. In this case, both components take `n_jobs` as an argument, and the product of the two should equal the number of available CPUs, as in the sketch below.
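A minimal sketch of what this can look like; the dataset, parameter grid, and the particular 2 × (ncpus // 2) split are only illustrative and should be adapted to your allocation:
```python
import os

from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# CPUs actually allocated to this job (2 is the minimum allocation)
ncpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "2"))

X, y = make_classification(n_samples=1000, n_features=20)

# let each model fit use 2 threads ...
model = XGBClassifier(n_jobs=2)

# ... and let the search run ncpus // 2 fits in parallel, so the
# product of the two n_jobs values matches the allocation
search = RandomizedSearchCV(
    model,
    param_distributions={"max_depth": [2, 4, 6, 8]},
    n_iter=4,
    n_jobs=max(ncpus // 2, 1),
)
search.fit(X, y)
print(search.best_params_)
```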
- Implicit multithreading without setting OMP_NUM_THREADS
- If `OMP_NUM_THREADS` is not set, the numerical libraries will attempt to use all CPUs on a compute node. Unless all CPUs have been allocated, this will result in overloading the job. The general use python modules set `OMP_NUM_THREADS` to 1, requiring users to explicitly set the variable higher to take advantage of implicit multithreading. If you are using your own Python environments, please be sure to set `OMP_NUM_THREADS` explicitly for all code that can potentially multithread.
- multiprocessing.cpu_count()
- The `cpu_count()` function of the `multiprocessing` package always reports the total CPU count for a node. This is often the cause of overloaded Python batch jobs. Instead, query the environment for `SLURM_CPUS_PER_TASK` and default to 2, the minimum allocation.
```
[user@biowulf ~]$ sinteractive --cpus-per-task=2
...
[user@cn3444 ~]$ module load python/3.9
[user@cn3444 ~]$ python
Python 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:41:03)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>> multiprocessing.cpu_count()
56  # <-- wrong
>>> import os
>>> int(os.environ.get('SLURM_CPUS_PER_TASK', '2'))
2   # <-- right
```
- Hung multiprocessing pools
- If a worker in a multiprocessing pool gets killed, as can happen if a job exceeds its allocated memory, the whole pool will wait indefinitely (i.e. until the job times out) for the killed worker to return results. Make sure to allocate sufficient memory and to test your code.
- Python startup problem
- Python looks through a lot of paths for certain files during startup. Therefore, starting up many concurrent python processes can strain the shared file system. This is especially true if running large swarms where each subjob runs multiple short python processes. If you have very short running python processes, please modify the code such that each script does more work.
- Packages installed in the home directory
- Python includes `~/.local/lib/pythonX.X/site-packages` in the package search path. If you have packages installed there, they can override the centrally installed packages, which can lead to hard-to-diagnose problems. If a problem is resolved by adding the `-s` option (which excludes the user site-packages directory), a broken package installed in your home directory may be the cause. Python modules 3.11 and newer set the environment variable `PYTHONNOUSERSITE`, which prevents this issue by default. This does prevent you from `pip`-installing packages on top of the modules. While we recommend conda environments when customizing Python environments, you can restore the old behavior by unsetting this variable.
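One quick way to check for this, sketched below with numpy as the example, is to look at where a suspect package is imported from and which user site directories are on the search path:
```python
import sys
import numpy  # replace with any package you suspect of being shadowed

# a path under ~/.local/lib/pythonX.X/site-packages indicates that a
# home-directory install is overriding the centrally installed package
print(numpy.__file__)

# show any user site-packages directories currently on the search path
print([p for p in sys.path if "/.local/" in p])
```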
- matplotlib in non-interactive environments
- By default matplotlib uses an interactive backend. This works if the backend can connect to a graphical user interface (e.g. if using X11 forwarding through an ssh tunnel or NX). However, if there is no GUI to connect to, this will fail with an error message that includes `Could not connect to any X display`.
This can be solved by using a non-interactive backend for plotting. Which backend matplotlib uses can be changed in a couple of different ways:
matplotlib settings can be modified using a `matplotlibrc` file. This file can be placed either in the current working directory or in `~/.config/matplotlib`. If placed in the former, it will only affect jobs run from this directory. If placed in the latter, it will affect all your calls to matplotlib. For example, if you'd like to change the default backend, create the file `~/.config/matplotlib/matplotlibrc` with the following content:
```
backend: agg
```
Alternatively, you can set the `MPLBACKEND` environment variable either ad hoc or in your .bashrc:
```
export MPLBACKEND=agg
```
And finally, it can also be set programmatically:
```
>>> import matplotlib
>>> matplotlib.use("agg")
```
Available backends and the currently active backend can be queried interactively in Python:
```
>>> import matplotlib
>>> matplotlib.rcsetup.interactive_bk
['GTK', 'GTKAgg', 'GTKCairo', 'GTK3Agg', 'GTK3Cairo', 'nbAgg', 'Qt4Agg', 'Qt4Cairo', 'Qt5Agg', 'Qt5Cairo', 'TkAgg', 'TkCairo', 'WebAgg', 'WX', 'WXAgg', 'WXCairo']
>>> matplotlib.rcsetup.non_interactive_bk
['agg', 'cairo', 'gdk', 'pdf', 'pgf', 'ps', 'svg', 'template']
>>> import matplotlib.pyplot as plt
>>> plt.get_backend()
'agg'
```
- python environments and NoMachine problems
- If you add the executables of the conda dbus package to your PATH in one of your startup files (e.g. ~/.bashrc), NoMachine may fail with a black screen. This can happen when you load a python module or activate your own conda installation. The solution is to remove any unnecessary initialization from your .bashrc or, if this is an environment under your control, remove the dbus package from the environment.
The general purpose Python environments are made available through the module system:
```
[user@cn3444 ~]$ module -t avail python
/usr/local/lmod/modulefiles:
python/3.8
python/3.9
python/3.10
python/3.11
python/3.12

[user@cn3444 ~]$ module load python/3.10
[+] Loading python 3.10  ...
[user@cn3444 ~]$ python
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>>
```
These are fully featured (conda) environments with many installed packages, including the usual modules that make up the scientific Python stack (see below). These environments are best suited for development and for running code that is not dependent on specific versions of the installed packages. Packages in these environments are updated regularly. If you need more stability to ensure reproducibility, or because your code depends on the presence of certain fixed versions of packages, you can also build your own environments.
Python scientific stack
The usual scientific Python stack (numpy, scipy, scikit-learn, numba, pandas, ...) is included in all the general purpose environments. Numpy and scipy are compiled against the accelerated Intel MKL library. Because they are compiled against MKL, some mathematical operations (e.g. SVD) can make use of multithreading to accelerate computation. The number of threads such operations will create is determined by the environment variable `OMP_NUM_THREADS`. To avoid accidentally overloading jobs, especially when also doing multiprocessing, this variable is set to 1 by the python modules. If your code can take advantage of implicit parallelism, you can set this variable to match the number of allocated CPUs for cluster jobs, or adjust it such that the product of `OMP_NUM_THREADS` and the number of processes equals the number of allocated CPUs. For example, if allocating 16 CPUs and planning on using multiprocessing with 4 workers, `OMP_NUM_THREADS` can be set as high as 4 without overloading the allocation.
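As a sketch of how a script might divide up its allocation (the worker count of 4 is just an assumed example; the split should match how the script actually uses multiprocessing):
```python
import os

# assumed example: this script will start 4 multiprocessing workers
nworkers = 4

# CPUs actually allocated to the job (2 is the minimum allocation)
ncpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "2"))

# give each worker an equal share of the allocation; set this before
# numpy/scipy are imported so the math libraries pick up the value
os.environ["OMP_NUM_THREADS"] = str(max(ncpus // nworkers, 1))

import numpy as np  # imported only after OMP_NUM_THREADS has been set

# threaded operations (e.g. SVD) are now limited to the per-worker share
u, s, vt = np.linalg.svd(np.random.rand(500, 500))
```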
Some common issues with using multiprocessing on Biowulf were already pointed out in the common pitfalls section. Beyond those, multiprocessing can be made more robust by setting workers up to ignore the SIGINT signal so that a multiprocessing script can be terminated cleanly with `scancel` or `Ctrl-C`. The following script also makes sure to correctly detect the number of available CPUs in batch jobs:
```python
#! /usr/bin/env python

from multiprocessing import Pool
import signal
import sys
import os

def init_worker():
    """
    This is necessary to be able to interrupt the script with CTRL-C
    (or scancel for that matter). It makes sure that the workers ignore
    SIGINT so that any SIGINT sent goes to the master process
    """
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def worker(i):
    return i*i

if __name__ == '__main__':
    # the number of allocated cpus (or 2 if not run as a slurm job)
    nproc = int(os.environ.get("SLURM_CPUS_PER_TASK", "2"))
    print("Running on %d CPUs" % nproc)

    # set up 100 tasks
    tasks = range(0, 100)
    p = Pool(nproc, init_worker)
    try:
        # run the processing pool
        results = p.map(worker, tasks)
    except (KeyboardInterrupt, SystemExit):
        p.terminate()
        p.join()
        sys.exit(1)
    else:
        p.close()
        p.join()

    # result summary
    print("\n".join("%d * %d = %d" % (a, a, b)
                    for a, b in zip(tasks, results)))
```
It is also important to benchmark the scaling of your code. Many algorithms won't scale well beyond a certain number of parallel multiprocessing workers, which can lead to very inefficient resource usage (e.g. allocating 56 CPUs when parallel efficiency drops below 50% at 24 CPUs).
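A rough way to do this, sketched here with a toy workload, is to time the same set of tasks with increasing worker counts and watch where the efficiency drops off (the task and worker counts are arbitrary examples):
```python
import time
from multiprocessing import Pool

def work(n):
    # stand-in for one unit of real work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    tasks = [200_000] * 256
    baseline = None
    for nproc in (1, 2, 4, 8, 16):
        start = time.time()
        with Pool(nproc) as pool:
            pool.map(work, tasks)
        elapsed = time.time() - start
        baseline = baseline or elapsed
        # parallel efficiency = serial time / (workers * parallel time)
        print(f"{nproc:2d} workers: {elapsed:6.2f} s, "
              f"efficiency {baseline / (nproc * elapsed):.0%}")
```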
In order to use the rpy2 package on Biowulf, it is necessary to load a separate rpy2 module, which allows the package to find the correct R installation.
```
[user@cn3444 ~]$ module load python/3.7 rpy2
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import rpy2
>>> import rpy2.robjects as robjects
>>> pi = robjects.r['pi']
>>> pi[0]
3.141592653589793
```
Ray is a framework for distributed computing. It accomplishes this by
- Providing simple primitives for building and running distributed applications.
- Enabling end users to parallelize single machine code, with little to zero code changes.
- Including a large ecosystem of applications, libraries, and tools on top of the core Ray to enable complex applications.
Ray can be used to parallelize work within a single node, in which case it is run as a standard single-node job. To run a Ray cluster on Biowulf and parallelize across several nodes, multinode jobs with one task per node are used. The tasks don't have to be exclusive, but for real world use they often will be. If you allocate the nodes exclusively, make sure to also allocate all resources available on the node.
An example script you can use to configure your own ray workloads is available on GitHub.
```
[user@biowulf ~]$ git clone https://github.com/NIH-HPC/biowulf_ray.git
[user@biowulf ~]$ sbatch submit-ray
```
Note: Ray, like many other tools, will by default try to use all resources on a node even if they were not all allocated to the job. The sample script above explicitly assigns resources to the workers and to the cluster head process. Similarly, if you use `ray.init` in, for example, a single-node job, make sure to specify the number of CPUs, GPUs, and the memory.
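For instance, a minimal single-node sketch that claims only the allocated CPUs could look like the following (`num_gpus` and memory limits would be passed in the same call where needed; the toy task is only for illustration):
```python
import os
import ray

# use only the CPUs that slurm allocated to this job (minimum allocation is 2)
ncpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "2"))
ray.init(num_cpus=ncpus)

@ray.remote
def square(x):
    return x * x

# run the tasks on the local ray instance and collect the results
print(ray.get([square.remote(i) for i in range(10)]))
ray.shutdown()
```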
Spyder is a Python IDE focused on scientific python. It has the ability to connect to a remote spyder kernel running on a compute node using ssh tunnels. Currently the setup is not the most convenient but this may improve in the future.
Create an sinteractive session with the resources you need and 5 tunnels
```
[user@biowulf ~]$ sinteractive -TTTTT --mem=12g --cpus-per-task=2
salloc: Pending job allocation 9569310
salloc: job 9569310 queued and waiting for resources
salloc: job 9569310 has been allocated resources
salloc: Granted job allocation 9569310
salloc: Waiting for resource configuration
salloc: Nodes cn4270 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
error: unable to open file /tmp/slurm-spank-x11.9569310.0
slurmstepd: error: x11: unable to read DISPLAY value

Created 5 generic SSH tunnel(s) from this compute node to biowulf for your
use at port numbers defined in the $PORTn ($PORT1, ...) environment variables.

Please create a SSH tunnel from your workstation to these ports on biowulf.
On Linux/MacOS, open a terminal and run:

ssh -L 41699:localhost:41699 -L 40387:localhost:40387 -L 42289:localhost:42289 [...snip...]

For Windows instructions, see https://hpc.nih.gov/docs/tunneling

[user@cn4270]$
```
Follow our tunneling instructions to create the tunnels from your local machine to Biowulf. Then start a Spyder kernel in the sinteractive session. In the example below we are using one of the general purpose python modules, but you can also install the spyder kernel module into your own environment.
```
[user@cn4270]$ module load python/3.10
[user@cn4270]$ spyder_kernel start
starting kernel with python '/usr/local/Anaconda/envs/py3.10/bin/python' - please hold on
----------------------------------------------------
spyder kernel started successfully
----------------------------------------------------
Kernel connection file copied to '~/kernel-652313.json' for convenience

You can use this file via hpcdrive to establish a connection to this kernel
using a remote kernel with host biowulf.nih.gov

Delete the kernel file when you are done

[user@cn4270]$ spyder_kernel status
spyder kernel state file exits
spyder kernel is running
```
The `spyder_kernel` helper copies a connection file to your home directory, in the example above `~/kernel-652313.json`. You can use this file to connect your locally installed Spyder to the running kernel, either by copying the file to your system or by mounting your home directory as an hpcdrive. In the example below I am using the connection file directly via hpcdrive:
Since we are forwarding all the required ports, it isn't necessary to check the 'remote connection' box. Alternatively, you can skip setting up the local tunnels with the command generated by `sinteractive` and instead provide your login credentials in the remote kernel section of the dialog.
Once the connection has been established, verify that you are on a compute node.
Stop the kernel when you are done by running `exit()` in the Spyder session, or stop it from within the sinteractive session with:
```
[user@cn4270]$ spyder_kernel stop
spyder kernel has been stopped
```