Prerequisites

This tutorial requires a basic understanding of high throughput sequencing, genomics, high performance computing and bash scripting.

Software

The following tools are used in this tutorial:

  • GATK 4.3.0.0
  • fastp 0.20.1
  • bwa 0.7.17
  • samtools 1.11
  • mosdepth 0.3.0

All are available on Biowulf as modules.

Sequencing data

In this tutorial we will analyze a trio from the Coriell CEPH/UTAH 1463 pedigree. The sequencing data is part of the illumina platinum genomes project (Eberle et al. 2017).

Pedigree of the family sequenced for this tutorial (CEPH pedigree 1463)

Figure 0.1: Pedigree of the family sequenced for this tutorial (CEPH pedigree 1463)

Data is available from the European Nucleotide archive under accession ERP001960 and dbGAP under accession phs001224.v1.p1.

For convenience, data for the three individuals used in this tutorial are available on Biowulf at /fdb/app_testdata/fastq/Homo_sapiens/platinum_genomes split by flowcell and lane to make assignment of read groups during alignment easier.

Individual EBI accession Type Pair count
NA12891 ERR194160 PE100 775,617,169
NA12892 ERR194161 PE100 843,454,257
NA12878 ERR194147 PE100 787,265,109

To run the whole pipeline, you will need about 700GB in your /data directory. Please run checkquota to make sure you have enough disk storage. If not, please request a storage increase here.

References

Eberle, Michael A., Epameinondas Fritzilas, Peter Krusche, Morten Källberg, Benjamin L. Moore, Mitchell A. Bekritsky, Zamin Iqbal, et al. 2017. “A Reference Data Set of 5.4 Million Phased Human Variants Validated by Genetic Inheritance from Sequencing a Three-Generation 17-Member Pedigree.” Genome Research 27 (1): 157–64. https://doi.org/10.1101/gr.210500.116.