Biowulf High Performance Computing at the NIH
GRNBoost: Scalable inference of gene regulatory networks using Apache Spark and XGBoost.

GRNBoost is a library built on top of Apache Spark that implements a scalable strategy for gene regulatory network (GRN) inference. GRNBoost was inspired by GENIE3, a popular algorithm for GRN inference. GRNBoost adopts GENIE3's algorithmic blueprint and aims at improving its runtime performance and data size capability.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=8g 
[user@cn3199 ~]$ module load grnboost
[+] Loading java 1.8.0_11  ...
[+] Loading scala  2.12.2
[+] Loading Apache Spark 2.1.1  (Hadoop 2.7) ...
[+] Loading grnboost  20191216
Start the Spark cluster:
[user@cn3199 ~]$ spark start -t 120 2
INFO: Submitted job for cluster fwPsex
Run a sample GRNBoost command:
[user@cn3199 ~]$ $SPARK_HOME/bin/spark-submit \
    --class org.aertslab.grnboost.GRNBoost \
    --master spark://cn3199:7077 \
    --deploy-mode client \
    --jars $GRNBOOST_SRC/lib_amazon_linux/xgboost4j-0.7.jar \
    $GRNBOOST_SRC/target/scala-2.11/GRNBoost.jar -h

GRNBoost 0.1
Usage: GRNBoost [infer] [options]

  -h | --help

  Prints this usage text.

  -v | --version

  Prints the version number.

Command: infer [options]

  -i <file> | --input <file>

  REQUIRED. Input file or directory.

  -o <file> | --output <file>

  REQUIRED. Output directory.

  -tf <file> | --regulators <file>

  REQUIRED. Text file containing the regulators (transcription factors), one regulator per line.

  -skip <nr> | --skip-headers <nr>

  The number of input file header lines to skip. Default: 0.

  --delimiter <del>

  The delimiter to use in input and output files. Default: TAB.

  -s <nr> | --sample <nr>

  Use a sample of size <nr> of the observations to infer the GRN.

  --targets <gene1,gene2,gene3...>

  List of genes for which to infer the putative regulators.

  -p:<key>=<value> | --xgb-param:<key>=<value>

  Add or overwrite an XGBoost booster parameter. Default parameters are:
  * eta ->      0.01
  * max_depth   ->      1
  * nthread     ->      1
  * silent      ->      1

  -r <nr> | --nr-boosting-rounds <nr>

  Set the number of boosting rounds. Default: heuristically determined nr of boosting rounds.

  --estimation-genes <gene1,gene2,gene3...>

  List of genes to use for estimating the nr of boosting rounds.

  --nr-estimation-genes <nr>

  Nr of randomly selected genes to use for estimating the nr of boosting rounds. Default: 20.


  Enable regularization (using the triangle method). Default: disabled
  When enabled, only regulations approved by the triangle method will be emitted.
  When disabled, all regulations will be emitted.
  Use the 'include-flags' option to specify whether to output the include flags in the result list.


  Enable normalization by dividing the gain scores of the regulations per target over the sum of gain scores.
  Default = disabled.

  --include-flags <true/false>

  Flag whether to output the regularization include flags in the output. Default: false.

  --truncate <nr>

  Only keep the specified number regulations with highest importance score. Default: unlimited.
  (Motivated by the 100.000 regulations limit for the DREAM challenges.)

  -par <nr> | --nr-partitions <nr>

  The number of Spark partitions used to infer the GRN. Default: nr of available processors.


  Inference nor auto-config will launch if this flag is set. Use for parameters inspection.


  Auto-config will launch, inference will not if this flag is set. Use for config testing.

  --report <true/false>

  Set whether to write a report about the inference run to file. Default: true.
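
The help text above lists the options but does not show a complete inference run, so the following is a minimal sketch of one, built only from options in that listing. Everything specific in it is an assumption: the working directory, the file names (expression.tsv, tfs.txt, grn_out), the gene names TF1 to TF3, the option values, and the master URL, which is simply copied from the sample session and may differ for your cluster. The script only prints the assembled command so it can be checked before anything is submitted.

```shell
#!/bin/sh
# Sketch only: all paths and option values are placeholders, not Biowulf defaults.
WORKDIR="${WORKDIR:-$(mktemp -d)}"        # hypothetical working directory
EXPR="$WORKDIR/expression.tsv"            # hypothetical expression matrix (must exist before a real run)
TFS="$WORKDIR/tfs.txt"                    # regulators file
OUT="$WORKDIR/grn_out"                    # output directory
MASTER="${MASTER:-spark://cn3199:7077}"   # master URL as in the sample session (assumption)

# Regulators file: one transcription factor per line (gene names are made up).
printf '%s\n' TF1 TF2 TF3 > "$TFS"

# Assemble the command as positional parameters so it can be inspected first.
# -skip 1 assumes the input has exactly one header line.
set -- "$SPARK_HOME/bin/spark-submit" \
    --class org.aertslab.grnboost.GRNBoost \
    --master "$MASTER" \
    --deploy-mode client \
    --jars "$GRNBOOST_SRC/lib_amazon_linux/xgboost4j-0.7.jar" \
    "$GRNBOOST_SRC/target/scala-2.11/GRNBoost.jar" \
    infer -i "$EXPR" -tf "$TFS" -o "$OUT" \
    -skip 1 --truncate 100000 -p:eta=0.05

# Print the command for review.
echo "$@"
# "$@"    # uncomment to actually submit the job
```

Run the script after 'module load grnboost' so that SPARK_HOME and GRNBOOST_SRC are set. Keeping the command in the positional parameters means the printed line and the submitted line cannot drift apart.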

Stop the Spark cluster, using the cluster name reported when the cluster was started (fwPsex in this session):
[user@cn3199 ~]$ spark stop fwPsex
End the interactive session:
[user@cn3199 ~]$ exit
[user@biowulf ~]$