NgsRelate on Biowulf

NgsRelate can be used to infer relatedness, inbreeding coefficients and many other summary statistics for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population allele frequencies and have genotype likelihoods.


Important Notes

Getting Started
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input in bold):

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load ngsrelate
[+] Loading ngsrelate  2.0  on cn3144
[+] Loading samtools 1.19  ...

[user@cn3144 ~]$ ngsRelate
Usage main analyses: ./ngsrelate  [options]
   -f        Name of file with frequencies
   -O        Output filename
   -L             Number of genomic sites. Must be provided if -f (allele frequency file) is NOT provided
   -m         model 0=normalEM 1=acceleratedEM
   -i        Maximum number of EM iterations
   -t           Tolerance for breaking EM
   -r           Seed for rand
   -R     Region for analysis (only for bcf)
   -g gfile            Name of glf (compressed binary) file
   -G gfile            Name of beagle (compressed) file
   -p             threads (default 4)
   -c             Should call genotypes instead?
   -s             Should you swich the freq with 1-freq?
   -F             Estimate inbreeding instead of estimating the nine jacquard coefficients
   -o             estimating the 3 jacquard coefficient, assumming no inbreeding
   -v             Verbose. print like per iteration
   -e           Errorrates when calling genotypes?
   -a             First individual used for analysis? (zero offset)
   -b             Second individual used for analysis? (zero offset)
   -B             Number of bootstrap replicates for (only for single pairs)
   -N             How many times to start each pair with random seed?
   -n             Number of samples in glf.gz
   -l           minMaf or 1-Maf filter (default: 0.05)
   -z             Name of file with IDs (optional)
   -T          For -h vcf use PL (default) or GT tag
   -A          For -h vcf use allele frequency TAG e.g. AFngsrelate (default)
   -P        plink name of the binary plink file (excluding the .bed)

 ./ngsrelate extract_freq_bim pos.glf.gz plink.bim plink.freq
 ./ngsrelate extract_freq .mafs.gz .pos.glf.gz [-rmTrans]
 ./ngsrelate -h my.bcf [DEVELOPMENT ONLY]

Most jobs should be run as batch jobs.

Calculate Allele frequencies

[user@cn3144 ~]$ mkdir angsd && cd angsd
[user@cn3144 ~]$ module load angsd
[user@cn3144 ~]$ cp $ANGSD_TESTDATA/* .
[user@cn3144 ~]$ tar xf bams.tar.gz 
[user@cn3144 ~]$ for i in bams/*.bam;do samtools index $i;done
[user@cn3144 ~]$ ls bams/*.bam > bam.filelist

#### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz)
[user@cn3144 ~]$  angsd -b bam.filelist -gl 2 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3 
-> angsd version: 0.940-dirty (htslib: 1.16) build(Oct 21 2022 11:56:57)
-> angsd -b bam.filelist -gl 2 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3 
-> No '-out' argument given, output files will be called 'angsdput'
-> Inputtype is BAM/CRAM
[multiReader] 10 samples in 10 input files
-> SNP-filter using a pvalue: 1.000000e-06 correspond to 23.928127 likelihood units
-> Parsing 10 number of samples 
-> Printing at chr: 20 pos:14078459 chunknumber 3400 contains 584 sites
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
-> Mon Jun  3 15:57:24 2024
-> Arguments and parameters for all analysis are located in .arg file
-> Total number of sites analyzed: 1702624
-> Number of sites retained after filtering: 3878 
[ALL done] cpu-time used =  13.33 sec
[ALL done] walltime used =  14.00 sec

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
[user@cn3144 ~]$  zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq

### run NgsRelate
[user@cn3144 ~]$  ngsRelate -g angsdput.glf.gz -n 10 -f freq -O newres
-> Seed is: 1851012280
-> Frequency file: 'freq' contain 3878 number of sites
-> nind:10 overall_number_of_sites:3878
-> Done reading data from file: 0.01 0.00
-> Starting analysis now
-> length of joblist:45
[ALL done] cpu-time used =  0.36 sec (filereading took: 0.01 sec)
[ALL done] walltime used =  0.00 sec (filereading took: 0.00 sec)

For more examples please see the Examples page