regenie: whole genome regression modelling of large genome-wide association studies.

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center. regenie uses the BGEN library.

References:

Joelle Mbatchou, Leland Barnard, Joshua Backman, Anthony Marcketta, Jack A. Kosmicki, Andrey Ziyatdinov, Christian Benner, Colm O'Dushlaine, Mathew Barber, Boris Boutkov, Lukas Habegger, Manuel Ferreira, Aris Baras, Jeffrey Reid, Goncalo Abecasis, Evan Maxwell, Jonathan Marchini.
Computationally efficient whole genome regression for quantitative and binary traits
bioRxiv (2020); doi: https://doi.org/10.1101/2020.06.19.162354

Band, G. and Marchini, J.
BGEN: a binary file format for imputed genotype and haplotype data
bioRxiv (2018); doi: https://doi.org/10.1101/308296

Important Notes

Module Name: regenie (see the modules page for more information)
Environment variables set by the module:
- REGENIE_HOME installation directory
- REGENIE_BIN executable directory
- REGENIE_SRC source code directory
- REGENIE_DATA sample data and checkpoints directory

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
[user@cn3101 ~]$ module load regenie/3.0.3
[+] Loading singularity  3.10.0  on cn3063
[+] Loading regenie  3.0.3

The available executables are:

[user@cn3101]$ ls $REGENIE_BIN 
bgenix  cat-bgen  edit-bgen  regenie  zstd

The command line options of the regenie executable are listed with --help:

[user@cn3101]$ regenie --help
              |============================|
              |        REGENIE v3.0.3      |
              |============================|

Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.


Usage:
  /regenie/regenie [OPTION...]

  -h, --help      print list of available options
      --helpFull  print list of all available options

 Main options:
      --step INT                specify if fitting null model (=1) or
                                association testing (=2)
      --bed PREFIX              prefix to PLINK .bed/.bim/.fam files
      --pgen PREFIX             prefix to PLINK2 .pgen/.pvar/.psam files
      --bgen FILE               BGEN file
      --sample FILE             sample file corresponding to BGEN file
      --ref-first               use the first allele as the reference for
...

To run regenie on the sample data, first copy the sample data to the current folder:

[user@cn3101]$ cp $REGENIE_DATA/* .

A sample command for step 1, which fits the null model and writes the LOCO predictions:

[user@cn3101]$ regenie \
                  --step 1 \
                  --bgen example.bgen \
                  --out my_output \
                  --bsize 200 \
                  --phenoFile phenotype_bin.txt
Start time: Tue Aug 16 13:24:00 2022

              |============================|
              |        REGENIE v3.0.3      |
              |============================|

Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.

Log of output saved in file : my_output.log

Options in effect:
  --bgen example.bgen \
  --out my_output \
  --step 1 \
  --bsize 200 \
  --phenoFile phenotype_bin.txt

Fitting null model
 * bgen             : [example.bgen]
   -summary : bgen file (v1.2 layout, zlib compressed) with 500 named samples and 1000 variants with 8-bit encoding.
   -index bgi file [example.bgen.bgi]
 * phenotypes       : [phenotype_bin.txt] n_pheno = 2
   -keeping and mean-imputing missing observations (done for each trait)
   -number of phenotyped individuals  = 500
 * number of individuals used in analysis = 500
   -residualizing and scaling phenotypes...done (0ms)
 * # threads        : [55]
 * block size       : [200]
 * # blocks         : [5] for 1000 variants
 * # CV folds       : [5]
 * ridge data_l0    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 2MB
 * setting memory...done

Chromosome 1
 block [1] : 200 snps  (4ms)
   -residualizing and scaling genotypes...done (3ms)
   -calc working matrices...done (420ms)
   -calc level 0 ridge...done (79ms)
 block [2] : 200 snps  (2ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (439ms)
   -calc level 0 ridge...done (79ms)
 block [3] : 200 snps  (2ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (483ms)
   -calc level 0 ridge...done (81ms)
 block [4] : 200 snps  (3ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (366ms)
   -calc level 0 ridge...done (78ms)
 block [5] : 200 snps  (2ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (485ms)
   -calc level 0 ridge...done (78ms)

 Level 1 ridge...
   -on phenotype 1 (Y1)...done (0ms)
   -on phenotype 2 (Y2)...done (0ms)

Output
------
phenotype 1 (Y1) :
  0.01  : Rsq = 0.00292408, MSE = 0.995083<- min value
  0.25  : Rsq = 0.00619743, MSE = 0.998022
  0.5   : Rsq = 0.00679147, MSE = 1.00153
  0.75  : Rsq = 0.00753375, MSE = 1.00367
  0.99  : Rsq = 0.00733694, MSE = 1.01373
  * making predictions...writing LOCO predictions...done (9ms)

phenotype 2 (Y2) :
  0.01  : Rsq = 0.012437, MSE = 0.98745<- min value
  0.25  : Rsq = 0.00739346, MSE = 0.997094
  0.5   : Rsq = 0.00612812, MSE = 1.00169
  0.75  : Rsq = 0.00621549, MSE = 1.00343
  0.99  : Rsq = 0.0082828, MSE = 1.00621
  * making predictions...writing LOCO predictions...done (9ms)

List of blup files written to: [my_output_pred.list]

Elapsed time : 2.66076s
End time: Tue Aug 16 13:24:02 2022
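
Step 2 locates the per-phenotype LOCO prediction files through the list file named at the end of this log (my_output_pred.list). The following is a minimal sketch of a check that can be run before step 2. It assumes each line of the list holds a phenotype name followed by the path to its .loco file; that layout is an assumption, not something shown in the log above, and check_pred_list is a hypothetical helper name.

```shell
# Hypothetical helper: verify that every prediction file named in a
# regenie step 1 list file exists and is non-empty.
# Assumed line layout: <phenotype name> <path to .loco file>
check_pred_list() {
    local list="$1" missing=0 pheno path
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r pheno path || [ -n "$pheno" ]; do
        [ -z "$pheno" ] && continue          # skip blank lines
        if [ ! -s "$path" ]; then
            echo "missing LOCO file for $pheno: $path" >&2
            missing=1
        fi
    done < "$list"
    return "$missing"
}
```

For example, `check_pred_list my_output_pred.list && echo "ready for step 2"` prints the message only when all listed files are present. This is mainly useful when the step 1 outputs have been moved since they were written, because the list may then point to paths that no longer exist.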

A sample command for step 2, which runs association testing using the predictions written in step 1:

[user@cn3101]$ regenie \
                  --bgen example.bgen \
                  --step 2 \
                  --bsize 200 \
                  --threads 1 \
                  --covarFile covariates.txt \
                  --phenoFile phenotype_bin_wNA.txt \
                  --bt --firth --approx \
                  --pred my_output_pred.list \
                  --out my_output_step2.txt
Association testing mode with fast multithreading using OpenMP
 * bgen             : [example.bgen]
   -summary : bgen file (v1.2 layout, zlib compressed) with 500 named samples and 1000 variants with 8-bit encoding.
   -index bgi file [example.bgen.bgi]
 * phenotypes       : [phenotype_bin_wNA.txt] n_pheno = 2
   -number of phenotyped individuals  = 500
 * covariates       : [covariates.txt] n_cov = 3
   -number of individuals with covariate data = 500
 * number of individuals used in analysis = 500
 * case-control counts for each trait:
   - 'Y1': 111 cases and 339 controls
   - 'Y2': 115 cases and 385 controls
 * LOCO predictions : [my_output_pred.list]
   -file [/vf/users/denisovga/regenie/test/my_output_1.loco] for phenotype 'Y1'
   -file [/vf/users/denisovga/regenie/test/my_output_2.loco] for phenotype 'Y2'
 * # threads        : [1]
 * block size       : [200]
 * # blocks         : [5]
 * approximate memory usage : 2MB
 * using minimum MAC of 5 (variants with lower MAC are ignored)
 * using fast Firth correction for logistic regression p-values less than 0.05

Chromosome 1 [5 blocks in total]
   -reading loco predictions for the chromosome...done (0ms)
   -fitting null logistic regression on binary phenotypes...done (1ms)
   -fitting null Firth logistic regression on binary phenotypes...done (0ms)
 block [1/5] : done (10ms)
 block [2/5] : done (8ms)
 block [3/5] : done (8ms)
 block [4/5] : done (7ms)
 block [5/5] : done (8ms)

Association results stored separately for each trait in files :
* [my_output_step2.txt_Y1.regenie]
* [my_output_step2.txt_Y2.regenie]

Number of tests with Firth correction : 108
Number of failed tests : (0/108)
Number of ignored tests due to low MAC : 0

Elapsed time : 0.086111s
End time: Mon Dec 16 15:21:50 2024
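
The per-trait results files named in the log can be screened directly from the shell. The sketch below assumes the file is whitespace-delimited with a header row that includes a LOG10P column; that layout is an assumption, not something shown in the log above. top_hits is a hypothetical helper name, and the default threshold of 7.3 (roughly p = 5e-8) is only illustrative.

```shell
# Hypothetical helper: print the header and every row whose LOG10P value
# exceeds a threshold. The column is located by name in the header row,
# so its position in the file does not matter.
top_hits() {
    local file="$1" threshold="${2:-7.3}"
    awk -v thr="$threshold" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "LOG10P") col = i
            if (!col) { print "no LOG10P column found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Skip missing values, then compare numerically.
        $col != "NA" && $col + 0 > thr + 0
    ' "$file"
}
```

For example, `top_hits my_output_step2.txt_Y1.regenie 3` would list the rows with LOG10P above 3 for the first trait.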

End the interactive session:

[user@cn3101 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$