regenie: whole genome regression modelling
of large genome-wide association studies.
regenie is a C++ program for whole genome regression modelling
of large genome-wide association studies. It is developed and supported
by a team of scientists at the Regeneron Genetics Center.
regenie employs the BGEN library.
References:
- Joelle Mbatchou, Leland Barnard, Joshua Backman, Anthony Marcketta, Jack A. Kosmicki, Andrey Ziyatdinov, Christian Benner, Colm O'Dushlaine, Mathew Barber, Boris Boutkov, Lukas Habegger, Manuel Ferreira, Aris Baras, Jeffrey Reid, Goncalo Abecasis, Evan Maxwell, Jonathan Marchini.
Computationally efficient whole genome regression for quantitative and binary traits
bioRxiv (2020); doi: https://doi.org/10.1101/2020.06.19.162354
- Band, G. and Marchini, J.
BGEN: a binary file format for imputed genotype and haplotype data
bioRxiv (2018); doi: https://doi.org/10.1101/308296
Documentation
Important Notes
- Module Name: regenie (see the modules page for more information)
- Unusual environment variables set:
  - REGENIE_HOME: installation directory
  - REGENIE_BIN: executable directory
  - REGENIE_SRC: source code directory
  - REGENIE_DATA: sample data and checkpoints directory
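
Once the module is loaded, the variables listed above can be inspected from the shell. The following is a minimal sketch, not part of the module itself: the helper name check_regenie_env is hypothetical, and it only reports whether each variable is defined in the current shell.

```shell
# Hypothetical helper (not provided by the regenie module):
# print each regenie variable, or a warning on stderr if it is unset.
# Returns 0 when all four variables are set, 1 otherwise.
check_regenie_env() {
    status=0
    for var in REGENIE_HOME REGENIE_BIN REGENIE_SRC REGENIE_DATA; do
        # Look up the variable whose name is stored in $var
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$var" "$value"
        else
            printf '%s is not set (was the module loaded?)\n' "$var" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_regenie_env before the module is loaded should report each variable as not set; after the module is loaded it should print one NAME=path line per variable.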
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive
[user@cn3101 ~]$ module load regenie/3.0.3
[+] Loading singularity 3.10.0 on cn3063
[+] Loading regenie 3.0.3

The available executables are:
[user@cn3101]$ ls $REGENIE_BIN
bgenix cat-bgen edit-bgen regenie zstd

In particular, the command line options of the executable regenie are as follows:
[user@cn3101]$ regenie --help
 |============================|
 |       REGENIE v3.0.3       |
 |============================|
Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.

Usage:
  /regenie/regenie [OPTION...]
  -h, --help        print list of available options
      --helpFull    print list of all available options

 Main options:
      --step INT    specify if fitting null model (=1) or association testing (=2)
      --bed PREFIX  prefix to PLINK .bed/.bim/.fam files
      --pgen PREFIX prefix to PLINK2 .pgen/.pvar/.psam files
      --bgen FILE   BGEN file
      --sample FILE sample file corresponding to BGEN file
      --ref-first   use the first allele as the reference for ...

To run regenie on the sample data, first copy the sample data to the current folder:
[user@cn3101]$ cp $REGENIE_DATA/* .

A sample command to run regenie:
[user@cn3101]$ regenie \
  --step 1 \
  --bgen example.bgen \
  --out my_output \
  --bsize 200 \
  --phenoFile phenotype_bin.txt
Start time: Tue Aug 16 13:24:00 2022
 |============================|
 |       REGENIE v3.0.3       |
 |============================|
Copyright (c) 2020-2022 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.

Log of output saved in file : my_output.log

Options in effect:
  --bgen example.bgen \
  --out my_output \
  --step 1 \
  --bsize 200 \
  --phenoFile phenotype_bin.txt

Fitting null model
 * bgen : [example.bgen]
   -summary : bgen file (v1.2 layout, zlib compressed) with 500 named samples and 1000 variants with 8-bit encoding.
   -index bgi file [example.bgen.bgi]
 * phenotypes : [phenotype_bin.txt] n_pheno = 2
   -keeping and mean-imputing missing observations (done for each trait)
   -number of phenotyped individuals = 500
 * number of individuals used in analysis = 500
   -residualizing and scaling phenotypes...done (0ms)
 * # threads : [55]
 * block size : [200]
 * # blocks : [5] for 1000 variants
 * # CV folds : [5]
 * ridge data_l0 : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1 : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 2MB
 * setting memory...done

Chromosome 1
 block [1] : 200 snps (4ms)
   -residualizing and scaling genotypes...done (3ms)
   -calc working matrices...done (420ms)
   -calc level 0 ridge...done (79ms)
 block [2] : 200 snps (2ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (439ms)
   -calc level 0 ridge...done (79ms)
 block [3] : 200 snps (2ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (483ms)
   -calc level 0 ridge...done (81ms)
 block [4] : 200 snps (3ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (366ms)
   -calc level 0 ridge...done (78ms)
 block [5] : 200 snps (2ms)
   -residualizing and scaling genotypes...done (1ms)
   -calc working matrices...done (485ms)
   -calc level 0 ridge...done (78ms)

 Level 1 ridge...
   -on phenotype 1 (Y1)...done (0ms)
   -on phenotype 2 (Y2)...done (0ms)

Output
------
phenotype 1 (Y1) :
  0.01 : Rsq = 0.00292408, MSE = 0.995083<- min value
  0.25 : Rsq = 0.00619743, MSE = 0.998022
  0.5 : Rsq = 0.00679147, MSE = 1.00153
  0.75 : Rsq = 0.00753375, MSE = 1.00367
  0.99 : Rsq = 0.00733694, MSE = 1.01373
  * making predictions...writing LOCO predictions...done (9ms)

phenotype 2 (Y2) :
  0.01 : Rsq = 0.012437, MSE = 0.98745<- min value
  0.25 : Rsq = 0.00739346, MSE = 0.997094
  0.5 : Rsq = 0.00612812, MSE = 1.00169
  0.75 : Rsq = 0.00621549, MSE = 1.00343
  0.99 : Rsq = 0.0082828, MSE = 1.00621
  * making predictions...writing LOCO predictions...done (9ms)

List of blup files written to: [my_output_pred.list]

Elapsed time : 2.66076s
End time: Tue Aug 16 13:24:02 2022

Another sample command:
[user@cn3101]$ regenie \
  --bgen example.bgen \
  --step 2 \
  --bsize 200 \
  --threads 1 \
  --covarFile covariates.txt \
  --phenoFile phenotype_bin_wNA.txt \
  --bt --firth --approx \
  --pred my_output_pred.list \
  --out my_output_step2.txt
Association testing mode with fast multithreading using OpenMP
 * bgen : [example.bgen]
   -summary : bgen file (v1.2 layout, zlib compressed) with 500 named samples and 1000 variants with 8-bit encoding.
   -index bgi file [example.bgen.bgi]
 * phenotypes : [phenotype_bin_wNA.txt] n_pheno = 2
   -number of phenotyped individuals = 500
 * covariates : [covariates.txt] n_cov = 3
   -number of individuals with covariate data = 500
 * number of individuals used in analysis = 500
 * case-control counts for each trait:
   - 'Y1': 111 cases and 339 controls
   - 'Y2': 115 cases and 385 controls
 * LOCO predictions : [my_output_pred.list]
   -file [/vf/users/denisovga/regenie/test/my_output_1.loco] for phenotype 'Y1'
   -file [/vf/users/denisovga/regenie/test/my_output_2.loco] for phenotype 'Y2'
 * # threads : [1]
 * block size : [200]
 * # blocks : [5]
 * approximate memory usage : 2MB
 * using minimum MAC of 5 (variants with lower MAC are ignored)
 * using fast Firth correction for logistic regression p-values less than 0.05

Chromosome 1 [5 blocks in total]
   -reading loco predictions for the chromosome...done (0ms)
   -fitting null logistic regression on binary phenotypes...done (1ms)
   -fitting null Firth logistic regression on binary phenotypes...done (0ms)
 block [1/5] : done (10ms)
 block [2/5] : done (8ms)
 block [3/5] : done (8ms)
 block [4/5] : done (7ms)
 block [5/5] : done (8ms)

Association results stored separately for each trait in files :
 * [my_output_step2.txt_Y1.regenie]
 * [my_output_step2.txt_Y2.regenie]

Number of tests with Firth correction : 108
Number of failed tests : (0/108)
Number of ignored tests due to low MAC : 0

Elapsed time : 0.086111s
End time: Mon Dec 16 15:21:50 2024

End the interactive session:
[user@cn3101 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
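
The step 2 run above stores association results in one file per trait (my_output_step2.txt_Y1.regenie and my_output_step2.txt_Y2.regenie). The following is a minimal sketch for listing the strongest associations in such a file. It assumes the file is whitespace-delimited with a single header line that contains a column named LOG10P; this is an assumption about the output layout, so check the header of your own file first. The function name top_hits is hypothetical.

```shell
# Hypothetical helper: print the header and the N rows with the largest
# value in the LOG10P column of a whitespace-delimited results file.
# Assumes one header line naming the columns (layout is an assumption).
top_hits() {
    file=$1
    n=${2:-10}   # number of rows to show, default 10

    # Locate the LOG10P column by name rather than by fixed position
    col=$(head -n 1 "$file" |
          awk '{ for (i = 1; i <= NF; i++) if ($i == "LOG10P") print i }')
    if [ -z "$col" ]; then
        echo "no LOG10P column found in $file" >&2
        return 1
    fi

    # Header first, then data rows sorted numerically, largest first
    head -n 1 "$file"
    tail -n +2 "$file" | sort -g -r -k "$col,$col" | head -n "$n"
}
```

For example, top_hits my_output_step2.txt_Y1.regenie 5 would print the header followed by the five rows with the largest LOG10P for trait Y1, provided the file has the assumed layout.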