samtools on Biowulf

Samtools is a suite of applications for processing high-throughput sequencing data, including tools for manipulating alignments in the SAM, BAM, and CRAM formats.

Each of the main tools and file formats has its own manpage (e.g. man samtools or man sam).

Notes on releases

For more detailed release notes see the GitHub release pages for htslib and samtools.


Documentation
Important Notes

If the memory per thread (samtools sort -m) is set to a low value, for example by omitting the unit suffix [K/M/G], a potentially very large number of temporary files will be created. Please take care to set this value properly. Starting with version 1.4 there is a minimum amount of memory per thread; check the manpage for details.
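For example, a sort invocation with an explicit unit suffix could look like the following; the file names and values here are illustrative:

# 1 GiB of sort memory per thread -- note the explicit 'G' suffix
samtools sort -m 1G -@ 4 -o input_sorted.bam input.bam
# a bare number such as '-m 1024' would be interpreted as 1024 bytes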

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=8g --cpus-per-task=4 --gres=lscratch:30
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load samtools
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ cp $SAMTOOLS_TEST_DATA/* .
[user@cn3144 ~]$ ls -l
-rw-r--r-- 1 user group 45M Jan 18 16:15 mm10_500k_pe.bam
[user@cn3144 ~]$ # show the header
[user@cn3144 ~]$ samtools view -H mm10_500k_pe.bam
@SQ     SN:chr10        LN:130694993
@SQ     SN:chr11        LN:122082543
@SQ     SN:chr12        LN:120129022
@SQ     SN:chr13        LN:120421639
...snip...

[user@cn3144 ~]$ # sort using lscratch for temp files
[user@cn3144 ~]$ samtools sort -O bam -o mm10_500k_pe_sorted.bam \
                       -T /lscratch/$SLURM_JOB_ID/mm10_500k_pe \
                       -@4 -m 1800M mm10_500k_pe.bam
[user@cn3144 ~]$ # index the sorted file
[user@cn3144 ~]$ samtools index mm10_500k_pe_sorted.bam
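[user@cn3144 ~]$ # (illustrative) with the index in place, reads overlapping a region
[user@cn3144 ~]$ # can be counted directly; the coordinates below are arbitrary
[user@cn3144 ~]$ samtools view -c mm10_500k_pe_sorted.bam chr10:3000000-3100000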

[user@cn3144 ~]$

Accessing data stored on Amazon AWS and Google cloud storage

Samtools can access public data via https or on Amazon S3 and Google Cloud Storage (GCS). Only versions 1.4 and newer support both protocols:

[user@cn3144 ~]$ module load samtools
[user@cn3144 ~]$ samtools view \
https://storage.googleapis.com/genomics-public-data/platinum-genomes/bam/NA12877_S1.bam \
chr20:100000-100001 | head -5
[M::test_and_fetch] downloading file 'https://storage.googleapis.com/genomics-public-data/platinum-genomes/bam/NA12877_S1.bam.bai' to local directory
...snip...
[user@cn3144 ~]$ export HTS_S3_V2=1  ### only required with samtools 1.10 for the public buckets
[user@cn3144 ~]$ samtools view \
s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam \
20:100000-100001
[M::test_and_fetch] downloading file 's3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.bai' to local directory
...snip...
[user@cn3144 ~]$ unset GCS_OAUTH_TOKEN
[user@cn3144 ~]$ samtools view \
gs://genomics-public-data/platinum-genomes/bam/NA12877_S1.bam \
chr20:100000-100001 | head -5
...snip...

Note that if you have the environment variable GCS_OAUTH_TOKEN set, the last command will fail with a permissions error. This variable is, however, needed to access data in your private GCS buckets.

Now, upload data to your private GCS bucket and access it with samtools. This assumes that you have a GCS bucket and that the google-cloud-sdk is set up and initialized. In the example below, the bucket is named test-2050be34. One complication is that the python environments required for the Google Cloud SDK also contain a samtools version, which may be older than 1.4 and may not work. If you get 'Protocol not supported' errors, check that a samtools >= 1.4 is first on your path.
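If in doubt, one quick way to check which samtools is first on the path (output not shown):

[user@cn3144 ~]$ type samtools
[user@cn3144 ~]$ samtools --version | head -1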

[user@cn3144 ~]$ module load google-cloud-sdk
[user@cn3144 ~]$ # upload some data to your bucket
[user@cn3144 ~]$ gsutil cp gcat_set_053.bam* gs://test-2050be34/
[user@cn3144 ~]$ # create an application token with gcloud. This should only be
      # necessary once; it involves a sign-in on the web
[user@cn3144 ~]$ gcloud auth application-default login
[user@cn3144 ~]$ # save token to environment variable for samtools.
[user@cn3144 ~]$ export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
[user@cn3144 ~]$ # make sure the python/bin/samtools does not interfere
[user@cn3144 ~]$ module unload google-cloud-sdk
[user@cn3144 ~]$ samtools view gs://test-2050be34/gcat_set_053.bam chr20:100000-200001 | head -5

For S3, samtools will try to obtain credentials from the URL, environment variables, ~/.aws/credentials, or ~/.s3cfg to access private buckets. Profiles can be used, in which case URLs take the format s3://profile@bucket/....
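As an illustration, credentials for a named profile could be kept in ~/.aws/credentials and referenced in the URL; the profile, bucket, and file names below are made up:

[user@cn3144 ~]$ cat ~/.aws/credentials
[myprofile]
aws_access_key_id = ...your key...
aws_secret_access_key = ...your secret...
[user@cn3144 ~]$ samtools view -H s3://myprofile@my-private-bucket/example.bam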

Accessing data on the HPC object store

Currently not available for 1.10

Starting with version 1.5, samtools can access data in our local object store with some preparation. First, create or modify a ~/.s3cfg file to be able to access the object store. Note that the settings are kept in a named profile (obj) rather than the default profile. This helps to avoid interfering with the commands above. However, you may choose to make the local object store the default, in which case the profile name does not have to be included in the s3 URLs. The important settings are access_key, secret_key, host_base, and url_mode.

node$ cat >> $HOME/.s3cfg <<EOF
[obj]
access_key = [...your key...]
secret_key = [...your secret...]
host_base = os2access1
host_bucket = os2access1/$USER
url_mode = path
proxy_host = dtn03-e0
proxy_port = 3128
use_https = False

encrypt = False
server_side_encryption = False
human_readable_sizes = True
website_endpoint = http://os2access1/$USER
EOF

Now copy a bam file and its index to the object store:

[user@cn3144 ~]$ obj_put gcat_set_053.bam
[user@cn3144 ~]$ obj_put gcat_set_053.bam.bai

And access them directly with samtools. Note that the obj@ is only required if your HPC object store settings are not the default in ~/.s3cfg. Note also that for now the +http is required to force samtools to use http. This may change in the future.

[user@cn3144 ~]$ samtools view -H s3+http://obj@user/gcat_set_053.bam
@HD     VN:1.3  SO:coordinate
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
[...snip...]
[user@cn3144 ~]$ samtools view s3+http://obj@user/gcat_set_053.bam chr1:2000000-2000100
11V6WR1:111:D1375ACXX:1:2304:6028:48213 99      chr1    1999930 60      100M    = 
[...snip...]

End the interactive session

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. samtools.sh). For example:

#!/bin/bash
set -e

module load samtools 
module load gnuplot 
cd /data/$USER/test_data
samtools stats bam/gcat_set_025.bam > bam/gcat_set_025.bam.stats
plot-bamstats -p  bam/gcat_set_025_stats/ bam/gcat_set_025.bam.stats

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] samtools.sh
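For example, to request 2 CPUs and 4 GB of memory (illustrative values):

sbatch --cpus-per-task=2 --mem=4g samtools.sh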
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. samtools.swarm). For example:

samtools rmdup bam/read1_250k_sorted.bam bam/read1_250k_sorted_rmdup.bam
samtools rmdup bam/read1_500k_sorted.bam bam/read1_500k_sorted_rmdup.bam

Submit this job using the swarm command.

swarm -f samtools.swarm [-g #] [-t #] --module samtools
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module samtools Loads the samtools module for each subjob in the swarm
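For example, to run each line of the swarmfile with 4 GB of memory and 2 CPUs (illustrative values):

swarm -f samtools.swarm -g 4 -t 2 --module samtools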