pysamstats: extracting simple statistics against genome positions based on sequence alignments from a SAM or BAM file
Pysamstats is a fast Python and command-line utility for extracting simple statistics against genome positions based on sequence alignments from a SAM or BAM file.
Documentation
Important Notes
- Module Name: pysamstats (see the modules page for more information)
- Unusual environment variables set
- PYSAMSTATS_HOME installation directory
- PYSAMSTATS_BIN executable directory
- PYSAMSTATS_DATA sample data dorectory
- PYSAMSTATS_SRC sourfce code directory
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=4g [user@@cn3200 ~]$ module load pysamstats [+] Loading java 1.8.0_11 ... [+] Loading samtools 1.9 ... [+] Loading pysamstats 0.24.3 [user@biowulf]$ pysamstats --help age: pysamstats [options] FILE Calculate statistics against genome positions based on sequence alignments from a SAM or BAM file and print them to stdout. Options: -h, --help show this help message and exit -t TYPE, --type=TYPE Type of statistics to print, one of: alignment_binned, baseq, baseq_ext, baseq_ext_strand, baseq_strand, coverage, coverage_binned, coverage_ext, coverage_ext_binned, coverage_ext_strand, coverage_gc, coverage_strand, mapq, mapq_binned, mapq_strand, tlen, tlen_binned, tlen_strand, variation, variation_strand. -c CHROMOSOME, --chromosome=CHROMOSOME Chromosome name. -s START, --start=START Start position (1-based). -e END, --end=END End position (1-based). -z, --zero-based Use zero-based coordinates (default is false, i.e., use one-based coords). -u, --truncate Truncate pileup-based stats so no records are emitted outside the specified position range. -d, --pad Pad pileup-based stats so a record is emitted for every position (default is only covered positions). -D MAX_DEPTH, --max-depth=MAX_DEPTH Maximum read depth permitted in pileup-based statistics. The default limit is 8000. -f FASTA, --fasta=FASTA Reference sequence file, only required for some statistics. -o, --omit-header Omit header row from output. -p N, --progress=N Report progress every N rows. --window-size=N Size of window for binned statistics (default is 300). --window-offset=N Window offset to use for deciding which genome position to report binned statistics against. The default is 150, i.e., the middle of 300bp window. --format=FORMAT Output format, one of {tsv, csv, hdf5} (defaults to tsv). N.B., hdf5 requires PyTables to be installed. --output=OUTPUT Path to output file. If not provided, write to stdout. --fields=FIELDS Comma-separated list of fields to output (defaults to all fields). --hdf5-group=HDF5_GROUP Name of HDF5 group to write to (defaults to the root group). --hdf5-dataset=HDF5_DATASET Name of HDF5 dataset to create (defaults to "data"). --hdf5-complib=HDF5_COMPLIB HDF5 compression library (defaults to zlib). --hdf5-complevel=HDF5_COMPLEVEL HDF5 compression level (defaults to 5). --hdf5-chunksize=HDF5_CHUNKSIZE Size of chunks in number of bytes (defaults to 2**20). Pileup-based statistics types (each row has statistics over reads in a pileup column): * coverage - Number of reads aligned to each genome position (total and properly paired). * coverage_strand - As coverage but with forward/reverse strand counts. * coverage_ext - Various additional coverage metrics, including coverage for reads not properly paired (mate unmapped, mate on other chromosome, ...). * coverage_ext_strand - As coverage_ext but with forward/reverse strand counts. * coverage_gc - As coverage but also includes a column for %GC. * variation - Numbers of matches, mismatches, deletions, insertions, etc. * variation_strand - As variation but with forward/reverse strand counts. * tlen - Insert size statistics. * tlen_strand - As tlen but with statistics by forward/reverse strand. * mapq - Mapping quality statistics. * mapq_strand - As mapq but with statistics by forward/reverse strand. * baseq - Base quality statistics. * baseq_strand - As baseq but with statistics by forward/reverse strand. * baseq_ext - Extended base quality statistics, including qualities of bases matching and mismatching reference. * baseq_ext_strand - As baseq_ext but with statistics by forward/reverse strand. Binned statistics types (each row has statistics over reads aligned starting within a genome window): * coverage_binned - As coverage but binned. * coverage_ext_binned - As coverage_ext but binned. * mapq_binned - Similar to mapq but binned. * alignment_binned - Aggregated counts from cigar strings. * tlen_binned - As tlen but binned. Examples: pysamstats --type coverage example.bam > example.coverage.txt pysamstats --type coverage --chromosome Pf3D7_v3_01 --start 100000 --end 200000 example.bam > example.coverage.txt Version: 0.24.3 (pysam 0.8.4)Copy pysamstats sample data to your currend directory:
[user@cn3200 ~]$ cp $PYSAMSTATS_DATA/* .Run pysamstats on the sample data:
[user@cn3200 ~]$ pysamstats -t coverage test.bam chrom pos reads_all reads_pp Pf3D7_01_v3 925 12 9 Pf3D7_01_v3 926 21 15 Pf3D7_01_v3 927 32 21 Pf3D7_01_v3 928 44 26 Pf3D7_01_v3 929 61 39 Pf3D7_01_v3 930 76 53 Pf3D7_01_v3 931 90 65 Pf3D7_01_v3 932 107 76 Pf3D7_01_v3 933 114 80 Pf3D7_01_v3 934 129 90 Pf3D7_01_v3 935 140 96 Pf3D7_01_v3 936 160 108 Pf3D7_01_v3 937 170 117 Pf3D7_01_v3 938 185 123 ... [user@cn3200 ~]$ pysamstats -t alignment_binned test.bam chrom pos reads_all bases_all M I D N S H P=X Pf3D7_01_v3 151 0 0 0 0 0 0 0 0 0 00 Pf3D7_01_v3 451 0 0 0 0 0 0 0 0 0 00 Pf3D7_01_v3 751 0 0 0 0 0 0 0 0 0 00 Pf3D7_01_v3 1051 1228 93328 92479 37 42 0 812 0 0 00 Pf3D7_01_v3 1351 497 37772 37700 1 4 0 71 0 0 00 Pf3D7_01_v3 1651 218 16568 16472 2 2 0 94 0 0 00 Pf3D7_01_v3 1951 118 8968 8944 2 2 0 22 0 0 00 ... [user@cn3200 ~]$ pysamstats -t baseq test.bam chrom pos reads_all reads_pp rms_baseq rms_baseq_pp Pf3D7_01_v3 925 12 9 32 32 Pf3D7_01_v3 926 21 15 28 27 Pf3D7_01_v3 927 32 21 30 30 Pf3D7_01_v3 928 44 26 29 28 Pf3D7_01_v3 929 61 39 32 32 ... [user@cn3200 ~]$ python >>> import pysam >>> import pysamstats >>> import matplotlib.pyplot as plt >>> mybam = pysam.AlignmentFile('rna.bam') >>> a = pysamstats.load_coverage(mybam, chrom='Pf3D7_01_v3', start=100, end=2000) >>> plt.plot(a.pos, a.reads_all) >>> plt.show()

End the interactive session:
[user@cn3200 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. pysamstats.sh). For example:
#!/bin/bash #SBATCH --mem=4g module load pysamstats pysamstats [options] file1.bam > output1 pysamstats [options] file2.bam > output2 ...
Submit this job using the Slurm sbatch command.
sbatch pysamstats.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. pysamstats.swarm). For example:
#!/bin/bash cd /scratch/$USER pysamstats [options] file1.bam > output1 pysamstats [options] file2.bam > output2 ...
Submit this job using the swarm command.
swarm -f pysamstats.swarm -g 4