Biowulf High Performance Computing at the NIH
genoml on Biowulf

GenoML is an automated machine learning (AutoML) tool that optimizes machine learning pipelines for genomic data. It automates the most tedious part of machine learning by exploring thousands of candidate models to find the best one for your data.

GenoML was developed by Faraz Faghri, Sayed Hadi Hashemi, Hampton Leonard, Cornelis Blauwendraat, Hirotaka Iwaki, Lana Sargeant, Rafael Jordá Muñoz, Juan A. Botia, Roy H. Campbell, Andrew B. Singleton, and Mike A. Nalls.
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows each prompt):

[user@biowulf]$ sinteractive --mem=10g --cpus-per-task=16
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ mkdir /data/$USER/genoml; cd /data/$USER/genoml

[user@cn3144 ~]$ unzip /usr/local/apps/genoml/exampleData.zip

[user@cn3144 ~]$ module load genoml
[+] Loading genoml 1.0.3  ...

[user@cn3144 ~]$ genoml-train --geno-prefix=./exampleData/training \
    --pheno-file=./exampleData/training.pheno --model-file=./exampleModel
======> Dependency Check
======> Dependency Check: [Done]
/var/tmp/tmpl1zdk7d6
======> Pruning the SNPs
Mapping files: 100%|....| 3/3 [00:00<00:00, 61.83it/s]
==========> Pairwise SNP pruning
==========> Pairwise SNP pruning: [Done]
==========> Merging input dataset for model training
==========> Merging input dataset for model training: [Done]
======> Pruning the SNPs: [Done]
======> Training the ML model
======> Training the ML model: [Done]
======> Tuning the ML model
======> Tuning the ML model: [Done]

[user@cn3144 ~]$ genoml-inference --model-file=./exampleModel --valid-dir=./outData \
      --valid-geno-prefix=./exampleData/validation  --valid-pheno-file=./exampleData/validation.pheno
======> Dependency Check
======> Dependency Check: [Done]

======> Validation of ML Model
======> Validation of ML Model: [Done]
====> Automated Machine Learning for Genomic: [Done]

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. genoml.sh). For example:

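#!/bin/bash
# Stop at the first failing command.
set -e
module load genoml
# Create the working directory if it does not exist, then switch to it.
mkdir -p "/data/$USER/genoml-test"
cd "/data/$USER/genoml-test"
# -o overwrites an earlier extraction without prompting; a batch job cannot answer prompts.
unzip -o /usr/local/apps/genoml/exampleData.zip
# Train a model on the example training set.
genoml-train --geno-prefix=./exampleData/training \
      --pheno-file=./exampleData/training.pheno --model-file=./exampleModel
# Validate the trained model on the example validation set.
genoml-inference --model-file=./exampleModel --valid-dir=./outData \
      --valid-geno-prefix=./exampleData/validation --valid-pheno-file=./exampleData/validation.pheno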
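
Because a training run can take a while, it may be worth checking that the input files exist before it starts. The following is a minimal sketch, not part of genoml: the helper names require_file and preflight are hypothetical, and only the two phenotype file paths shown in the example are assumed to exist after unzipping.

```shell
#!/bin/bash
# Preflight sketch (hypothetical helpers, not provided by genoml).

# require_file PATH: succeed if PATH is a readable regular file;
# otherwise print a message to stderr and return non-zero.
require_file() {
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "missing input: $1" >&2
        return 1
    fi
}

# preflight PATH...: check every path and report all missing files,
# returning non-zero if at least one check failed.
preflight() {
    local status=0
    local f
    for f in "$@"; do
        require_file "$f" || status=1
    done
    return "$status"
}
```

In genoml.sh, a call such as `preflight ./exampleData/training.pheno ./exampleData/validation.pheno || exit 1` could be placed after the unzip step and before genoml-train.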

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] genoml.sh
Note: genoml multi-threads effectively across all the processors on a single node, so it is reasonable to request up to 56 CPUs (the maximum available on a Biowulf node). If the cluster is busy and 56-CPU nodes are not available, however, your job may wait in the queue longer than desired. Use 'freen' to check node/CPU availability and adjust your submission request accordingly.
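
As an illustration of filling in the bracketed sbatch options, the sketch below assembles a submission command and prints it for review rather than submitting it. The helper name build_sbatch_cmd is hypothetical, and the defaults of 16 CPUs and 10g of memory simply mirror the interactive example; they are assumptions, not tuned recommendations.

```shell
#!/bin/bash
# Hypothetical helper: assemble the sbatch command for a genoml batch script.
# Arguments: CPUs per task, memory, batch script name.
# Defaults mirror the interactive example (16 CPUs, 10g) and are assumptions.
build_sbatch_cmd() {
    local cpus="${1:-16}"
    local mem="${2:-10g}"
    local script="${3:-genoml.sh}"
    printf 'sbatch --cpus-per-task=%s --mem=%s %s\n' "$cpus" "$mem" "$script"
}

# Print the command so it can be checked before submission.
build_sbatch_cmd 16 10g genoml.sh
# prints: sbatch --cpus-per-task=16 --mem=10g genoml.sh
```

After checking availability with 'freen', the CPU and memory values can be raised or lowered and the printed command run on the Biowulf login node to submit the job.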