Pysamstats is a fast Python and command-line utility for extracting simple statistics against genome positions based on sequence alignments from a SAM or BAM file.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=4g
[user@cn3200 ~]$ module load pysamstats
[+] Loading java 1.8.0_11 ...
[+] Loading samtools 1.9 ...
[+] Loading pysamstats 0.24.3
[user@cn3200 ~]$ pysamstats --help
Usage: pysamstats [options] FILE
Calculate statistics against genome positions based on sequence alignments
from a SAM or BAM file and print them to stdout.
Options:
-h, --help show this help message and exit
-t TYPE, --type=TYPE Type of statistics to print, one of: alignment_binned,
baseq, baseq_ext, baseq_ext_strand, baseq_strand,
coverage, coverage_binned, coverage_ext,
coverage_ext_binned, coverage_ext_strand, coverage_gc,
coverage_strand, mapq, mapq_binned, mapq_strand, tlen,
tlen_binned, tlen_strand, variation, variation_strand.
-c CHROMOSOME, --chromosome=CHROMOSOME
Chromosome name.
-s START, --start=START
Start position (1-based).
-e END, --end=END End position (1-based).
-z, --zero-based Use zero-based coordinates (default is false, i.e.,
use one-based coords).
-u, --truncate Truncate pileup-based stats so no records are emitted
outside the specified position range.
-d, --pad Pad pileup-based stats so a record is emitted for
every position (default is only covered positions).
-D MAX_DEPTH, --max-depth=MAX_DEPTH
Maximum read depth permitted in pileup-based
statistics. The default limit is 8000.
-f FASTA, --fasta=FASTA
Reference sequence file, only required for some
statistics.
-o, --omit-header Omit header row from output.
-p N, --progress=N Report progress every N rows.
--window-size=N Size of window for binned statistics (default is 300).
--window-offset=N Window offset to use for deciding which genome
position to report binned statistics against. The
default is 150, i.e., the middle of 300bp window.
--format=FORMAT Output format, one of {tsv, csv, hdf5} (defaults to
tsv). N.B., hdf5 requires PyTables to be installed.
--output=OUTPUT Path to output file. If not provided, write to stdout.
--fields=FIELDS Comma-separated list of fields to output (defaults to
all fields).
--hdf5-group=HDF5_GROUP
Name of HDF5 group to write to (defaults to the root
group).
--hdf5-dataset=HDF5_DATASET
Name of HDF5 dataset to create (defaults to "data").
--hdf5-complib=HDF5_COMPLIB
HDF5 compression library (defaults to zlib).
--hdf5-complevel=HDF5_COMPLEVEL
HDF5 compression level (defaults to 5).
--hdf5-chunksize=HDF5_CHUNKSIZE
Size of chunks in number of bytes (defaults to 2**20).
Pileup-based statistics types (each row has statistics over reads in a pileup column):
* coverage - Number of reads aligned to each genome position
(total and properly paired).
* coverage_strand - As coverage but with forward/reverse strand counts.
* coverage_ext - Various additional coverage metrics, including
coverage for reads not properly paired (mate
unmapped, mate on other chromosome, ...).
* coverage_ext_strand - As coverage_ext but with forward/reverse strand counts.
* coverage_gc - As coverage but also includes a column for %GC.
* variation - Numbers of matches, mismatches, deletions,
insertions, etc.
* variation_strand - As variation but with forward/reverse strand counts.
* tlen - Insert size statistics.
* tlen_strand - As tlen but with statistics by forward/reverse strand.
* mapq - Mapping quality statistics.
* mapq_strand - As mapq but with statistics by forward/reverse strand.
* baseq - Base quality statistics.
* baseq_strand - As baseq but with statistics by forward/reverse strand.
* baseq_ext - Extended base quality statistics, including qualities
of bases matching and mismatching reference.
* baseq_ext_strand - As baseq_ext but with statistics by forward/reverse strand.
Binned statistics types (each row has statistics over reads aligned starting within a genome window):
* coverage_binned - As coverage but binned.
* coverage_ext_binned - As coverage_ext but binned.
* mapq_binned - Similar to mapq but binned.
* alignment_binned - Aggregated counts from cigar strings.
* tlen_binned - As tlen but binned.
Examples:
pysamstats --type coverage example.bam > example.coverage.txt
pysamstats --type coverage --chromosome Pf3D7_v3_01 --start 100000 --end 200000 example.bam > example.coverage.txt
Version: 0.24.3 (pysam 0.8.4)
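The options above can be combined. For example, to keep only a few columns and write CSV instead of TSV, a command along the following lines should work (example.bam and the chosen field names are illustrative; the available fields depend on the statistics type):
pysamstats --type coverage --fields chrom,pos,reads_all --format csv --output example.coverage.csv example.bam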
Copy the pysamstats sample data to your current directory:
[user@cn3200 ~]$ cp $PYSAMSTATS_DATA/* .
Run pysamstats on the sample data:
[user@cn3200 ~]$ pysamstats -t coverage test.bam
chrom pos reads_all reads_pp
Pf3D7_01_v3 925 12 9
Pf3D7_01_v3 926 21 15
Pf3D7_01_v3 927 32 21
Pf3D7_01_v3 928 44 26
Pf3D7_01_v3 929 61 39
Pf3D7_01_v3 930 76 53
Pf3D7_01_v3 931 90 65
Pf3D7_01_v3 932 107 76
Pf3D7_01_v3 933 114 80
Pf3D7_01_v3 934 129 90
Pf3D7_01_v3 935 140 96
Pf3D7_01_v3 936 160 108
Pf3D7_01_v3 937 170 117
Pf3D7_01_v3 938 185 123
...
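To restrict the statistics to a region of interest, add the chromosome and coordinate options from the help text; the region below is picked arbitrarily from the sample output shown above:
[user@cn3200 ~]$ pysamstats -t coverage -c Pf3D7_01_v3 -s 925 -e 1000 test.bam > test.coverage.txt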
[user@cn3200 ~]$ pysamstats -t alignment_binned test.bam
chrom pos reads_all bases_all M I D N S H P = X
Pf3D7_01_v3 151 0 0 0 0 0 0 0 0 0 0 0
Pf3D7_01_v3 451 0 0 0 0 0 0 0 0 0 0 0
Pf3D7_01_v3 751 0 0 0 0 0 0 0 0 0 0 0
Pf3D7_01_v3 1051 1228 93328 92479 37 42 0 812 0 0 0 0
Pf3D7_01_v3 1351 497 37772 37700 1 4 0 71 0 0 0 0
Pf3D7_01_v3 1651 218 16568 16472 2 2 0 94 0 0 0 0
Pf3D7_01_v3 1951 118 8968 8944 2 2 0 22 0 0 0 0
...
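Binned statistics use 300 bp windows by default; this can be changed with --window-size, e.g. to aggregate over 1 kb windows (output not shown):
[user@cn3200 ~]$ pysamstats -t alignment_binned --window-size 1000 test.bam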
[user@cn3200 ~]$ pysamstats -t baseq test.bam
chrom pos reads_all reads_pp rms_baseq rms_baseq_pp
Pf3D7_01_v3 925 12 9 32 32
Pf3D7_01_v3 926 21 15 28 27
Pf3D7_01_v3 927 32 21 30 30
Pf3D7_01_v3 928 44 26 29 28
Pf3D7_01_v3 929 61 39 32 32
...
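Some statistics types, such as variation, compare aligned bases against the reference and therefore require the --fasta option. Assuming a matching reference FASTA is available (called ref.fa below purely for illustration), the call would look like:
[user@cn3200 ~]$ pysamstats -t variation -f ref.fa -c Pf3D7_01_v3 test.bam > test.variation.txt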
[user@cn3200 ~]$ python
>>> import pysam
>>> import pysamstats
>>> import matplotlib.pyplot as plt
>>> mybam = pysam.AlignmentFile('test.bam')
>>> a = pysamstats.load_coverage(mybam, chrom='Pf3D7_01_v3', start=100, end=2000)
>>> plt.plot(a.pos, a.reads_all)
>>> plt.show()
[user@cn3200 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. pysamstats.sh). For example:
#!/bin/bash
#SBATCH --mem=4g
module load pysamstats
pysamstats [options] file1.bam > output1
pysamstats [options] file2.bam > output2
...
Submit this job using the Slurm sbatch command.
sbatch pysamstats.sh
Create a swarmfile (e.g. pysamstats.swarm). For example:
cd /scratch/$USER; pysamstats [options] file1.bam > output1
cd /scratch/$USER; pysamstats [options] file2.bam > output2
...
Submit this job using the swarm command.
swarm -f pysamstats.swarm -g 4 --module pysamstats
where -g 4 requests 4 GB of memory for each command in the swarm file and --module pysamstats loads the pysamstats module for each subjob.