High-Performance Computing at the NIH
samtools on Biowulf & Helix

Description

Samtools is a suite of applications for processing high-throughput sequencing data.

Each of the main tools and file formats has its own manpage (e.g. man samtools or man sam).

There may be multiple versions of samtools available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail samtools 

To select a module use

module load samtools/[version]

where [version] is the version of choice.

Some samtools tools are multithreaded. Make sure to match the number of CPUs requested with the number of threads, as in the sketch below.
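
A minimal sketch of keeping the two in sync by reading the $SLURM_CPUS_PER_TASK variable that Slurm sets for the job; the file names are placeholders:

node$ # inside a job allocated with --cpus-per-task=4
node$ module load samtools
node$ samtools sort -@ $SLURM_CPUS_PER_TASK -o input_sorted.bam input.bam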

Notes on releases

For more detailed release notes see the GitHub release pages for htslib and samtools.


Interactive job on Biowulf

Allocate an interactive session with sinteractive and use it as shown in the example below. Note that interactive work with samtools should not be done on the Biowulf login node nor on Helix.

Here we allocate 8 GB of memory and 4 CPUs so that more than one compression thread can be used. In addition, 30 GB of lscratch is allocated to store temporary files.

biowulf$ sinteractive --mem=8g --cpus-per-task=4 --gres=lscratch:30
salloc.exe: Granted job allocation 17404
node$ module load samtools
node$ cd /data/$USER/test_data

Convert a sam file to a bam file - 4 compression threads.

node$ samtools view -Sb -@ 4 bam/input.sam > bam/output.bam

Sort and convert a sam file - 4 sorting and compression threads. Note that the default memory per thread for the sort can be changed with -m and the default location of the temporary files with -T. Once each thread reads in enough data to fill its memory, it writes the data to a temporary file before continuing. Therefore, for performance it is best to keep the whole sort in memory. If that is not possible, use lscratch for the temporary files.

If the memory per thread is set to a low value, for example by omitting a unit [K/M/G], a potentially very large number of temporary files will be created. Please take care to set this value properly. Starting with version 1.4 there is a minimum amount of memory per thread; check the manpage for details.

node$ samtools sort -O bam -T /lscratch/$SLURM_JOB_ID/read1_250k \
             -@4 -m 1800M bam/read1_250k.bam > bam/read1_250k_sorted.bam

Index the resulting bam file and get some basic stats.

node$ samtools index bam/read1_250k_sorted.bam
node$ samtools flagstat bam/read1_250k_sorted.bam
     170016 + 0 in total (QC-passed reads + QC-failed reads)
     0 + 0 secondary
     0 + 0 supplementary
     0 + 0 duplicates
     170016 + 0 mapped (100.00%:-nan%)
     0 + 0 paired in sequencing
     0 + 0 read1
     0 + 0 read2
     0 + 0 properly paired (-nan%:-nan%)
     0 + 0 with itself and mate mapped
     0 + 0 singletons (-nan%:-nan%)
     0 + 0 with mate mapped to a different chr
     0 + 0 with mate mapped to a different chr (mapQ>=5)

Much more detailed stats (coverage, GC content, quality, ...)

node$ samtools stats bam/read1_250k_sorted.bam

Bgzip and index a vcf file:

node$ bgzip -c vcf/gcat_set_053_raw.vcf > vcf/gcat_set_053_raw.vcf.gz
node$ tabix vcf/gcat_set_053_raw.vcf.gz

Accessing data stored on Amazon AWS and Google cloud storage

Access public data via https or on Amazon S3 and Google Cloud Storage (GCS). Only samtools 1.4 and newer support both the S3 and GCS protocols:

node$ module load samtools/1.4
node$ samtools view \
https://storage.googleapis.com/genomics-public-data/platinum-genomes/bam/NA12877_S1.bam \
chr20:100000-100001 | head -5
[M::test_and_fetch] downloading file 'https://storage.googleapis.com/genomics-public-data/platinum-genomes/bam/NA12877_S1.bam.bai' to local directory
[...snip...]
node$ samtools view \
s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam \
20:100000-100001
[M::test_and_fetch] downloading file 's3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.bai' to local directory
[...snip...]
node$ unset GCS_OAUTH_TOKEN
node$ samtools view \
gs://genomics-public-data/platinum-genomes/bam/NA12877_S1.bam \
chr20:100000-100001 | head -5
[...snip...]

Note that if you have the environment variable GCS_OAUTH_TOKEN set, the last command will fail with a permissions error. This variable is, however, needed to access data in your private GCS buckets.

Now, upload data to your private GCS bucket and access it with samtools. This assumes that you have GCS and the google-cloud-sdk set up and initialized. In the example below, the bucket is named test-2050be34. One complication is that the python environment required for the Google Cloud SDK also contains a samtools version which may be older than 1.4 and may not work. If you get 'Protocol not supported' errors, check that a samtools >= 1.4 is first on your path.

node$ module load google-cloud-sdk
node$ # upload some data to your bucket
node$ gsutil cp gcat_set_053.bam* gs://test-2050be34/
node$ # create an application token with gcloud. This should only be
      # necessary once; this involves a sign-in on the web
node$ gcloud auth application-default login
node$ # save token to environment variable for samtools.
node$ export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
node$ # make sure the python/bin/samtools does not interfere
node$ module unload google-cloud-sdk
node$ samtools view gs://test-2050be34/gcat_set_053.bam chr20:100000-200001 | head -5

For S3, samtools will try to obtain credentials from the URL, environment variables, ~/.aws/credentials, or ~/.s3cfg to access private buckets. Profiles can be used, in which case URLs take the format s3://profile@bucket/..., as sketched below.
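
As a hedged illustration, a named profile could be stored in ~/.aws/credentials and referenced in the URL; the profile name, keys, bucket, and file below are placeholders:

node$ cat >> $HOME/.aws/credentials <<EOF
[myprofile]
aws_access_key_id = [...your key...]
aws_secret_access_key = [...your secret...]
EOF
node$ samtools view -H s3://myprofile@my-private-bucket/example.bam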

Accessing data on the HPC object store

Starting with version 1.5, samtools can access data in our local object store with some preparation. First, create or modify a ~/.s3cfg file to be able to access the object store. Note that the settings are kept in a named profile (obj) rather than the default profile. This helps to avoid interfering with the commands above. However, you may choose to make the local object store the default, in which case the profile name does not have to be included in the s3 URLs. The important settings are access_key, secret_key, host_base, and url_mode.

node$ cat >> $HOME/.s3cfg <<EOF
[obj]
access_key = [...your key...]
secret_key = [...your secret...]
host_base = os2access1
host_bucket = os2access1/$USER
url_mode = path
proxy_host = dtn03-e0
proxy_port = 3128
use_https = False

encrypt = False
server_side_encryption = False
human_readable_sizes = True
website_endpoint = http://os2access1/$USER
EOF

Now copy a bam file and its index to the object store:

node$ obj_put gcat_set_053.bam
node$ obj_put gcat_set_053.bam.bai

And access them directly with samtools. Note that obj@ is only required if your HPC object store settings are not in the default profile in ~/.s3cfg. Note also that, for now, +http is required to force samtools to use http; this may change in the future.

node$ samtools view -H s3+http://obj@wresch/gcat_set_053.bam
@HD     VN:1.3  SO:coordinate
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
[...snip...]
node$ samtools view s3+http://obj@wresch/gcat_set_053.bam chr1:2000000-2000100
11V6WR1:111:D1375ACXX:1:2304:6028:48213 99      chr1    1999930 60      100M    = 
[...snip...]

Leave the interactive session

node$ exit
biowulf$ 
Batch job on Biowulf

Create a batch script for the job. For example, the following script collects statistics on a bam file and then uses plot-bamstats to create various graphs from the stats file.

#! /bin/bash
set -e

module load samtools/1.3
module load gnuplot
cd /data/$USER/test_data
samtools stats bam/gcat_set_025.bam > bam/gcat_set_025.bam.stats
plot-bamstats -p  bam/gcat_set_025_stats/ bam/gcat_set_025.bam.stats

and submit the script to the batch queue with

biowulf$ sbatch sam.sh
 17399
biowulf$ jobload -u $USER
      JOBID      RUNTIME     NODES   CPUS    AVG CPU%            MEMORY
                                                                Used/Alloc
      17399     00:00:07     p1005      2       39.55       1.5 MB/1.5 GB
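
If a batch script runs multithreaded samtools commands, request a matching number of CPUs at submission time and let the script pick up the allocation via $SLURM_CPUS_PER_TASK. A minimal sketch (the resource values are only examples):

biowulf$ sbatch --cpus-per-task=4 --mem=8g sam.sh

with samtools invoked inside the script as, for example, samtools sort -@ $SLURM_CPUS_PER_TASK ... so that the thread count always follows the allocation.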
Swarm of jobs on Biowulf

Create a command file with one command per line (line continuations are allowed). For example, to remove duplicate reads from a set of bam files:

samtools rmdup bam/read1_250k_sorted.bam bam/read1_250k_sorted_rmdup.bam
samtools rmdup bam/read1_500k_sorted.bam bam/read1_500k_sorted_rmdup.bam

The command file is used by swarm to submit one job per command as a job array:

biowulf$ swarm -f sam.swarm
  2 commands run in 2 tasks, each requiring 1 gb, 1 thread, and 0 gb diskspace
  17402
biowulf$ jobload -u $USER
        JOBID      RUNTIME     NODES   CPUS    AVG CPU%            MEMORY
                                                                 Used/Alloc
      17402_1     00:00:02     p1005      2        0.00            0/1.0 GB
      17402_0     00:00:02     p1005      2        0.00            0/1.0 GB
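
If the swarm commands are themselves multithreaded, request matching threads and memory per command with the swarm -t and -g options. A minimal sketch, assuming each command sorts with 4 threads (file names follow the examples above):

biowulf$ cat sort.swarm
samtools sort -@ 4 -m 1800M -o bam/read1_250k_sorted.bam bam/read1_250k.bam
samtools sort -@ 4 -m 1800M -o bam/read1_500k_sorted.bam bam/read1_500k.bam
biowulf$ swarm -f sort.swarm -t 4 -g 8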