Chapter 1 Data preprocessing
1.1 Brief introduction
Data preprocessing includes read trimming, alignment, sorting by coordinate, and marking duplicates. Duplicate marking itself is discussed in Chapter 3. GATK’s duplicate marking tools perform more efficiently with queryname-grouped input as generated by the aligner and produce coordinate-sorted BAM output themselves, so the most efficient pipeline would not write coordinate-sorted BAM files from the aligner. In other cases, however, a coordinate-sorted BAM file prior to marking duplicates may be desired. Therefore, this Chapter presents benchmarks and scripts for producing either queryname-grouped (unsorted) or coordinate-sorted BAM output from FASTQ files.
For efficiency, trimming, alignment, and conversion (or sorting) to BAM format are run concurrently, with data flowing through Linux pipes to avoid unnecessary IO. The throughput of each of the three components should be similar and within the range of good parallel efficiency (70-80% of ideal performance).
Whether sorting or not, it is important not to write uncompressed SAM files to disk, since conversion to BAM is not computationally intensive and BAM files are much smaller than SAM files.
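As a minimal illustration of this point (the file names here are hypothetical), the aligner output can be converted to BAM on the fly in the same pipe rather than written to a SAM file first:

bwa mem -t 16 $genome reads_R1.fastq.gz reads_R2.fastq.gz \
    | samtools view -b -o aligned.bam -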
Note that the GATK best practices pipeline starts from an unaligned BAM file (uBAM) and includes some additional steps, but for the purposes of this tutorial these are omitted.
1.3 Individual tool benchmarks
First we optimize each individual step.
1.3.1 Adapter trimming
While we use fastp for this step, there are many other ways of trimming adapters. In this Section read pairs are written interleaved to stdout and discarded, because in the later preprocessing Section 1.4.2 read pairs will be piped directly to bwa to avoid unnecessarily writing temporary data to disk. FASTQ files were read from lscratch to avoid inconsistent results due to file system load, and all tests were run on x2695 nodes. The command used was:
fastp -i /lscratch/${SLURM_JOB_ID}/$(basename $r1) \
-I /lscratch/${SLURM_JOB_ID}/$(basename $r2) \
--stdout --thread $threads \
-j "${logs}/fastp-${SLURM_JOB_ID}.json" \
-h "${logs}/fastp-${SLURM_JOB_ID}.html" \
> /dev/null
This will use the default quality filters and remove adapters based on pair overlap. Other filters may reduce throughput.
Figure 1.1 shows that fastp is fast but parallel efficiency drops quickly beyond 6 threads. Therefore fastp should be limited to no more than 6 threads. See table 1.1 for a summary of throughput and efficiency.
threads | Efficiency [%] | pairs/s |
---|---|---|
2 | 100 | 49243 |
4 | 109 | 107353 |
6 | 83 | 123228 |
8 | 69 | 134955 |
10 | 60 | 148827 |
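The efficiency values in this and the following tables appear to be computed relative to the 2-thread baseline, i.e. the throughput at n threads divided by (n/2) times the 2-thread throughput. A quick sketch of that calculation for the 8-thread row (the formula is an inference from the table values, which it reproduces):

# parallel efficiency relative to the 2-thread baseline (49243 pairs/s)
awk 'BEGIN { base = 49243; n = 8; rate = 134955;
             printf "efficiency at %d threads: %.0f%%\n", n, 100 * rate / (base * n / 2) }'   # -> 69%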
1.3.2 Alignment
In this Section standalone bwa is used to produce the primary alignment. For the benchmark, bwa is fed separate FASTQ files from lscratch and outputs SAM format alignments to /dev/null. In the pipeline from later Section 1.4.2 it will ingest interleaved trimmed reads from fastp and output directly to samtools. Here is the bwa command used in the benchmarks:
bwa mem -M -t ${threads} \
-R "@RG\tID:$id\tPL:ILLUMINA\tLB:$lb\tSM:$sm" \
$genome \
/lscratch/$SLURM_JOBID/$(basename ${r1[small]}) \
/lscratch/$SLURM_JOBID/$(basename ${r2[small]}) \
> /dev/null
Note that the content of the read group (@RG) SAM header field will be discussed in Section 1.5. As for fastp, all benchmarks were run on x2695 nodes. Figure 1.2 shows the scaling of bwa with increasing numbers of threads. bwa scales well to 32 threads (table 1.2). Even though bwa scales much better than fastp, it does much more computing per read pair and therefore processes far fewer reads per second than fastp. That means theoretically 2 threads of fastp (49243 pairs/s) are sufficient to feed 32 threads of bwa (22087 pairs/s).
threads | Efficiency [%] | pairs/s |
---|---|---|
2 | 100 | 1741 |
4 | 89 | 3087 |
6 | 91 | 4761 |
8 | 84 | 5865 |
10 | 96 | 8392 |
12 | 88 | 9167 |
14 | 90 | 10915 |
16 | 87 | 12085 |
24 | 86 | 18009 |
32 | 79 | 22087 |
1.3.3 Alternative A: Sorting by coordinate
In the later alignment script (Section 1.4.1), the SAM output of bwa will be piped directly into samtools to avoid unnecessary writes to disk. However, to test samtools sort scaling in this Section, we use an unsorted BAM file as input. This is not ideal, but it gives us a good idea of samtools throughput and scaling. Note that samtools needs to spill temporary files to disk if the alignment does not fit in memory, which will generally be the case for whole genome alignments. To improve performance and avoid unnecessary load on the shared file systems, it is very important to use lscratch for these temporary files.
The samtools command used in the tests:
samtools sort -T /lscratch/$SLURM_JOBID/part \
-m ${mem_mb_per_thread}M \
-@ ${threads} \
-O BAM \
/lscratch/$SLURM_JOBID/$(basename $bam) \
> /dev/null
Notes:
- The input BAM file was 32GB, located in lscratch.
- Temporary files were stored in lscratch.
- Output was discarded.
- For these benchmarks the sort was run with different amounts of total memory (10GiB, 20GiB, and 30GiB) and different numbers of threads.
- The amount of temporary space used on lscratch was about 1.5x the size of the input BAM file and 1.75x the size of the input compressed FASTQ files (see the sketch after this list).
- The job was allocated about 5GB more memory than threads * mem_per_thread to avoid out of memory kills (i.e. 15GB, etc.).
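Given the ~1.5x temporary space observation above, the lscratch allocation for a sort job can be sized from the input BAM with some headroom. A rough sketch only; the --gres=lscratch:N syntax and the headroom are assumptions about the local Slurm configuration, not part of the benchmarks:

# hypothetical sizing: ~1.5x the input BAM plus headroom, in GB
bam_gb=32
lscratch_gb=$(( bam_gb * 3 / 2 + 10 ))
sbatch --gres=lscratch:${lscratch_gb} sort_benchmark.sh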
Figure 1.3 shows that samtools sort scales efficiently up to 10 threads in this test, and that the amount of total memory across all threads did not result in measurably different performance.
threads | Efficiency [%] | pairs/s |
---|---|---|
2 | 100 | 24232 |
4 | 102 | 49383 |
6 | 94 | 68466 |
8 | 99 | 95778 |
10 | 88 | 106872 |
12 | 79 | 114514 |
14 | 79 | 133980 |
16 | 63 | 121872 |
1.3.4 Alternative B: Queryname-grouped (unsorted) BAM
In the later alignment script (Section 1.4.2), the SAM output of bwa will be piped directly into samtools to avoid unnecessary writes to disk. Unlike in Section 1.3.3, in this test the SAM file is converted to a BAM file without sorting (queries with the same name remain grouped) for further processing as described in Chapter 3.
The samtools command used in the tests:
samtools view -@ ${threads} \
-O BAM \
/lscratch/$SLURM_JOBID/$(basename $sam) \
> /dev/null
Notes:
- The input SAM file was 29GB, located in lscratch.
- Output was discarded.

samtools view uses comparatively little memory, 4GB in the tests. Figure 1.4 shows that samtools view scales efficiently all the way to 16 threads, the largest number of threads tested.
threads | Efficiency [%] | pairs/s |
---|---|---|
2 | 100 | 35000 |
4 | 99 | 69346 |
6 | 104 | 109026 |
8 | 86 | 119928 |
10 | 85 | 148816 |
12 | 95 | 199880 |
14 | 91 | 224042 |
16 | 94 | 263068 |
1.4 Optimized script
Based on the individual benchmarks from Section 1.3 we can design an efficient combined pipeline with each tool running at > 80% parallel efficiency.
1.4.1 Alternative A: producing sorted BAM output
- 2 threads of fastp piping interleaved read pairs to bwa.
- 32 threads of bwa piping output.
- 12 threads of samtools with 20GB of memory; temporary files in lscratch.
In code:
fastp -i $read1 -I $read2 \
--stdout --thread 2 \
-j "${logs}/fastp-${SLURM_JOB_ID}.json" \
-h "${logs}/fastp-${SLURM_JOB_ID}.html" \
2> "${logs}/fastp-${SLURM_JOB_ID}.log" \
| bwa mem -v 2 -M -t 32 -p \
-R "@RG\tID:$id\tPL:ILLUMINA\tLB:$lb\tSM:$sm" \
$genome - 2> ${logs}/bwa-${SLURM_JOB_ID}.log \
| samtools sort -T /lscratch/$SLURM_JOBID/part \
-m 1706M \
-@ 12 \
-O BAM \
-o /lscratch/$SLURM_JOBID/$bam \
2> ${logs}/samtools-${SLURM_JOB_ID}.log
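The -m 1706M above is simply the 20GB sort memory budget from the list above split evenly across the 12 samtools sort threads:

# per-thread sort memory: 20 GiB across 12 sort threads
echo "$(( 20 * 1024 / 12 ))M"   # -> 1706M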
For an input file with 201,726,334 read pairs we would expect a runtime of about 2.5h given the rate limit of ~22,000 pairs/sec in the alignment step. Since we were not sure about the total memory usage, we ran this job with 46 CPUs and 70GB of memory. Figure 1.5 shows the memory and CPU usage.
Some observations:
- Memory usage peaked just above 60GB.
- The pipeline used only about 32 CPUs most of the time. This is because bwa and samtools generally do not use a lot of CPU at the same time: samtools accumulates alignments until its buffers are full and then sorts, and while it sorts bwa uses less CPU.
- The throughput of this run was similar to the throughput of bwa alone - about 20-24k read pairs per second.
The code above was run on 32 to 48 CPUs to see if throughput could be maintained at a reduced core count. As can be seen in figure 1.6 (A), the throughput increases roughly linearly with the number of allocated CPUs from 32 to 48. However, the increase is modest, and efficiency, as measured in read pairs per second per CPU, decreases from 34 to 48 CPUs (figure 1.6 (B)).
Based on this test we conclude that running the combined preprocessing steps with 46 threads on 34 CPUs is the most efficient approach and only results in a 12% increase in runtime compared to running on 46 CPUs.
1.4.2 Alternative B: BAM output without coordinate sorting
- 2 threads of fastp piping interleaved read pairs to bwa.
- 32 threads of bwa piping output.
- 16 threads of samtools.
In code:
fastp -i $read1 -I $read2 \
--stdout --thread 2 \
-j "${logs}/fastp-${SLURM_JOB_ID}.json" \
-h "${logs}/fastp-${SLURM_JOB_ID}.html" \
2> "${logs}/fastp-${SLURM_JOB_ID}.log" \
| bwa mem -v 2 -M -t 32 -p \
-R "@RG\tID:$id\tPL:ILLUMINA\tLB:$lb\tSM:$sm" \
$genome - 2> ${logs}/bwa-${SLURM_JOB_ID}.log \
| samtools view -@ 16 \
-O BAM \
-o /lscratch/$SLURM_JOBID/$bam \
2> ${logs}/samtools-${SLURM_JOB_ID}.log
For an input file with 201,726,334 read pairs we would expect a runtime of about 2.5h given the rate limit of ~22,000 pairs/sec in the alignment step. Some initial experiments showed that 42GB of memory were required for all test jobs to finish. As for alternative A, we tested different numbers of CPUs - from a full allocation of 50 CPUs down to 32 CPUs - to account for the fact that the three tools do not all operate at full capacity at the same time.
As can be seen in figure 1.7 (A), the throughput increases roughly linearly with the number of allocated CPUs from 32 to 50. However, the increase is modest, and efficiency, as measured in read pairs per second per CPU, decreases from 36 to 50 CPUs (figure 1.7 (B)).
Based on this test we conclude that running the combined preprocessing steps with 50 threads on 34 CPUs is the most efficient approach.
1.5 Read groups
The SAM format specification uses read groups (the @RG header) to associate reads with different sources (flowcells, lanes, libraries, etc.). Downstream tools use this information as covariates when modeling errors. However, read groups are used slightly differently by different tools; see how GATK and sentieon recommend using read groups.
In this example there are 4 lanes of HiSeq 2000 data for each sample spread across a number of flow cells. The unique read group id for each lane is sample.flowcell.lane.acc.
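As a sketch of how such a read group string can be assembled for the -R option of bwa used throughout this Chapter (the variable values below are placeholders; only the sample.flowcell.lane.acc id convention is taken from above):

# placeholder values for a single lane of one sample
sm="sample1"; flowcell="flowcellA"; lane="1"; acc="acc1"; lb="lib1"
id="${sm}.${flowcell}.${lane}.${acc}"
echo -e "@RG\tID:${id}\tPL:ILLUMINA\tLB:${lb}\tSM:${sm}"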
1.6 Processing the example trio
Example scripts that implement the optimized approaches from both alternatives described above are available for download here.
The 4 lanes of sequencing data for each of the three samples in our example trio can be trimmed, aligned, and sorted as shown below.
tar -xzf trio_tutorial.tgz
module load jq
cd trio_tutorial/01-preproc
for sample in NA12878 NA12891 NA12892 ; do
[[ -e ${sample}.json ]] || ./mkconfig $sample | jq '.' > ${sample}.json
done
### check out the created config files
### then submit jobs
mkdir -p data
for sample in NA12878 NA12891 NA12892 ; do
# data below are for sorted output. You can choose the unsorted script
# instead
sbatch ./01-preproc.sh ${sample}.json \
    /fdb/GATK_resource_bundle/hg38-v0/Homo_sapiens_assembly38.fasta \
    data/${sample}.bam
done
Logfiles, diagnostic files, and sorted BAM files will be saved in ./data/. The jobs were not constrained to a specific node type, but some basic stats of the full runs of this Chapter are summarized below (for sorted BAM output):
Sample | Runtime | BAM file size | Mean coverage | CPUh |
---|---|---|---|---|
NA12891 | 11.5h | 115GB | 47x | 391 |
NA12892 | 11.7h | 125GB | 51x | 398 |
NA12878 | 11.5h | 120GB | 47.4x | 391 |
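The CPUh column is consistent with the wall-clock runtime multiplied by the 34 allocated CPUs recommended in Section 1.4, e.g. for NA12878:

# CPU hours = runtime [h] * allocated CPUs
awk 'BEGIN { printf "%.0f CPUh\n", 11.5 * 34 }'   # -> 391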
mosdepth is a good tool to look at coverage across all sites:
mkdir -p mosdepth
export MOSDEPTH_Q0=NO_COVERAGE # 0 -- defined by the arguments to --quantize
export MOSDEPTH_Q1=LOW_COVERAGE # 1..4
export MOSDEPTH_Q2=CALLABLE # 5..149
export MOSDEPTH_Q3=HIGH_COVERAGE # 150 ...
module load mosdepth/0.3.0
for sample in NA12891 NA12892 NA12878 ; do
if [[ -e mosdepth/${sample}.quantized.mosdepth.summary.txt ]] ; then
printf "skipping %s - already done\n" "$sample"
else
mosdepth -t3 -n --quantize 0:1:5:150: mosdepth/$sample.quantized data/$sample.bam
fi
done
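The genome-wide mean coverage for each sample can then be pulled from the total row of the mosdepth summary files. A small sketch, assuming the standard mosdepth summary.txt layout (mean depth in column 4):

for sample in NA12891 NA12892 NA12878 ; do
    awk -v s=$sample '$1 == "total" { printf "%s: %sx\n", s, $4 }' \
        mosdepth/${sample}.quantized.mosdepth.summary.txt
done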
Figure 1.8 (A) shows coverage of all autosomes for the three samples. Figure 1.8 (B) shows coverage of X and Y chromosomes consistent with the expected sex.