Biowulf High Performance Computing at the NIH
pycoQC: interactive quality control for Oxford Nanopore.

pycoQC generates interactive quality control metrics and plots from basecalled nanopore reads or from summary files produced by the basecallers Albacore, Guppy or MinKNOW. Its features include: 1) a Python API for creating dynamic D3.js visualizations and exploring data interactively in Jupyter Notebooks; 2) a simple command line interface that generates customizable interactive HTML reports; and 3) a multiprocessing FAST5 feature extraction program that generates a summary file directly from FAST5 files.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive 
[user@cn4471 ~]$ module load pycoQC 
[+] Loading pycoQC 2.5.2  ...
Display the usage message:
[user@cn4471 ~]$ pycoQC -h
usage: pycoQC [-h] [--version]
              [--summary_file [SUMMARY_FILE [SUMMARY_FILE ...]]]
              [--barcode_file [BARCODE_FILE [BARCODE_FILE ...]]]
              [--bam_file [BAM_FILE [BAM_FILE ...]]]
              [--html_outfile HTML_OUTFILE] [--json_outfile JSON_OUTFILE]
              [--min_pass_qual MIN_PASS_QUAL] [--min_pass_len MIN_PASS_LEN]
              [--filter_calibration] [--filter_duplicated]
              [--min_barcode_percent MIN_BARCODE_PERCENT]
              [--report_title REPORT_TITLE] [--template_file TEMPLATE_FILE]
              [--config_file CONFIG_FILE] [--skip_coverage_plot]
              [--sample SAMPLE] [--default_config] [-v | -q]

pycoQC computes metrics and generates interactive QC plots from the sequencing summary
report generated by Oxford Nanopore technologies basecallers

* Minimal usage
    pycoQC -f sequencing_summary.txt -o pycoQC_output.html
* Including Guppy barcoding file + html output + json output
    pycoQC -f sequencing_summary.txt -b barcoding_sequencing.txt -o pycoQC_output.html -j pycoQC_output.json
* Including Bam file + html output
    pycoQC -f sequencing_summary.txt -a alignment.bam -o pycoQC_output.html

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v, --verbose         Increase verbosity
  -q, --quiet           Reduce verbosity

Input/output options:
  --summary_file [SUMMARY_FILE [SUMMARY_FILE ...]], -f [SUMMARY_FILE [SUMMARY_FILE ...]]
                        Path to a sequencing_summary generated by Albacore
                        1.0.0+ / Guppy 2.1.3+
                        (guppy_basecaller). One can also pass multiple space
                        separated file paths or a UNIX style regex matching
                        multiple files (Required)
  --barcode_file [BARCODE_FILE [BARCODE_FILE ...]], -b [BARCODE_FILE [BARCODE_FILE ...]]
                        Path to the barcode_file generated by Guppy 2.1.3+
                        (guppy_barcoder) or Deepbinner 0.2.0+. This is not a
                        required file. One can also pass multiple space
                        separated file paths or a UNIX style regex matching
                        multiple files (optional)
  --bam_file [BAM_FILE [BAM_FILE ...]], -a [BAM_FILE [BAM_FILE ...]]
                        Path to a Bam file corresponding to reads in the
                        summary_file. Preferably aligned with Minimap2 One can
                        also pass multiple space separated file paths or a
                        UNIX style regex matching multiple files (optional)
  --html_outfile HTML_OUTFILE, -o HTML_OUTFILE
                        Path to an output html file report (required if
                        json_outfile not given)
  --json_outfile JSON_OUTFILE, -j JSON_OUTFILE
                        Path to an output json file report (required if
                        html_outfile not given)

Filtering options:
  --min_pass_qual MIN_PASS_QUAL
                        Minimum quality to consider a read as 'pass' (default:
                        7)
  --min_pass_len MIN_PASS_LEN
                        Minimum read length to consider a read as 'pass'
                        (default: 0)
  --filter_calibration  If given, reads flagged as calibration strand by the
                        basecaller are removed (default: False)
  --filter_duplicated   If given, duplicated read_ids are removed but the
                        first occurence is kept (Guppy sometimes outputs the
                        same read multiple times) (default: False)
  --min_barcode_percent MIN_BARCODE_PERCENT
                        Minimal percent of total reads to retain barcode
                        label. If below, the barcode value is set as
                        `unclassified` (default: 0.1)

HTML report options:
  --report_title REPORT_TITLE
                        Title to use in the html report (default: PycoQC
  --template_file TEMPLATE_FILE
                        Jinja2 html template for the html report (default: )
  --config_file CONFIG_FILE
                        Path to a JSON configuration file for the html report.
                        If not provided, looks for it in ~/.pycoQC and
                        ~/.config/pycoQC/config. If it's still not found,
                        falls back to default parameters. The first level keys
                        are the names of the plots to be included. The second
                        level keys are the parameters to pass to each plotting
                        function (default: )")
  --skip_coverage_plot  Skip the coverage plot in HTML report. Useful when
                        using a reference file containing many sequences, i.e.
                        transcriptome (default: False)

Other options:
  --sample SAMPLE       If not None a n number of reads will be randomly
                        selected instead of the entire dataset for ploting
                        function (deterministic sampling) (default: 100000)
  --default_config, -d  Print default configuration file. Can be used to
                        generate a template JSON file (default: False)
Copy sample data to your current folder:
[user@cn4471 ~]$ cp $PYCOQC_DATA/* .
[user@cn4471 ~]$ ls  
Run pycoQC on the sample data:
[user@cn4471 ~]$ pycoQC -f Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz -b Guppy-2.1.3_basecall-1D_DNA_barcoding_summary.txt.gz -o pycoQC_output.html -j pycoQC_output.json 
Check input data files
Parse data files
Merge data
Cleaning data
        Discarding lines containing NA values
                0 reads discarded
        Filtering out zero length reads
                0 reads discarded
        Sorting run IDs by decreasing throughput
                Run-id order ['c4981b897c2bb47fed99916c19c9bd1bd43267a2']
        Reordering runids
                Processing reads with Run_ID c4981b897c2bb47fed99916c19c9bd1bd43267a2 / time offset: 0
        Cleaning up low frequency barcodes
                0 reads with low frequency barcode unset
        Cast value to appropriate type
        Reindexing dataframe by read_ids
                10,000 Final valid reads
Loading plotting interface
        Found 10,000 total reads
        Found 9,065 pass reads (qual >= 7.0 and length >= 0)
Generating HTML report
        Parsing html config file
        Running method run_summary
                Computing plot
        Running method basecall_summary
                Computing plot
        Running method alignment_summary
                No Alignment information available
        Running method read_len_1D
                Computing plot
        Running method align_len_1D
                No Alignment information available
        Running method read_qual_1D
                Computing plot
        Running method identity_freq_1D
                No identity frequency information available
        Running method read_len_read_qual_2D
                Computing plot
        Running method read_len_align_len_2D
                No Alignment information available
        Running method align_len_identity_freq_2D
                No identity frequency information available
        Running method read_qual_identity_freq_2D
                No identity frequency information available
        Running method output_over_time
                Computing plot
        Running method read_len_over_time
                Computing plot
        Running method read_qual_over_time
                Computing plot
        Running method align_len_over_time
                No Alignment information available
        Running method identity_freq_over_time
                No identity frequency information available
        Running method barcode_counts
                Computing plot
        Running method channels_activity
                Computing plot
        Running method alignment_reads_status
                No Alignment information available
        Running method alignment_rate
                No identity frequency information available
        Running method alignment_coverage
                No Alignment information available
        Loading HTML template
        Rendering plots in d3js
        Writing to HTML file
Generating JSON report
        Running summary_stats_dict method
        Compute overall summary statistics
        Writing to JSON file
[user@cn4471 ~]$  ls -lt *json *.html 
-rw-r--r-- 1 user  user   28658 Apr 26 11:50 pycoQC_output.json
-rw-r--r-- 1 user  user 4345264 Apr 26 11:50 pycoQC_output.html
-rw-r--r-- 1 user  user    3426 Apr 26 11:49 pycoQC_config.json
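The pass-read count in the log above ("qual >= 7.0 and length >= 0") can be roughly cross-checked against the summary file itself. The following is a minimal sketch, not part of pycoQC: `count_pass_reads` is a hypothetical helper, and it assumes the summary is tab-separated with header columns named `mean_qscore_template` and `sequence_length_template`. Those column names are an assumption about the basecaller output and are not confirmed on this page. Because pycoQC also cleans the data before counting (NA values, zero-length reads), the two numbers need not agree exactly.

```shell
# Hypothetical helper, not part of pycoQC.
# Usage: count_pass_reads FILE [MIN_QUAL] [MIN_LEN]
# Assumes a tab-separated summary whose header names the columns
# mean_qscore_template and sequence_length_template.
count_pass_reads() (
    file=$1
    min_qual=${2:-7}
    min_len=${3:-0}
    # Decompress on the fly when the file is gzipped
    case "$file" in
        *.gz) reader="gzip -dc" ;;
        *)    reader="cat" ;;
    esac
    $reader "$file" | awk -F'\t' -v q="$min_qual" -v l="$min_len" '
        NR == 1 {
            # Locate the two columns by header name
            for (i = 1; i <= NF; i++) {
                if ($i == "mean_qscore_template")     qc = i
                if ($i == "sequence_length_template") lc = i
            }
            if (!qc || !lc) {
                print "required columns not found" > "/dev/stderr"
                bad = 1
                exit 1
            }
            next
        }
        # Count reads meeting both thresholds
        ($qc + 0) >= q && ($lc + 0) >= l { n++ }
        END { if (!bad) print n + 0 }
    '
)
```

Called as `count_pass_reads Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz 7 0`, it prints a single integer. The second and third arguments correspond to the `--min_pass_qual` and `--min_pass_len` thresholds described in the usage message.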
End the interactive session:
[user@cn4471 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
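For repeated runs, the command from the sample session can be wrapped in a small function that checks its inputs before calling pycoQC. This is only a sketch under stated assumptions: `run_pycoqc` and the `DRY_RUN` variable are hypothetical names introduced here, and only flags listed in the usage message above (`-f`, `-b`, `-o`, `-j`) are used.

```shell
# Hypothetical wrapper, not part of pycoQC.
# Usage: run_pycoqc SUMMARY_FILE OUTPUT_PREFIX [BARCODE_FILE]
# Set DRY_RUN to a non-empty value to print the command instead of running it.
run_pycoqc() {
    summary=$1
    outprefix=$2
    barcode=${3:-}

    # Fail early if the required summary file is missing or unreadable
    if [ ! -r "$summary" ]; then
        echo "cannot read summary file: $summary" >&2
        return 1
    fi

    # Build the command as positional parameters so paths stay intact
    set -- pycoQC -f "$summary" -o "$outprefix.html" -j "$outprefix.json"

    # The barcode file is optional; add it only when one was given
    if [ -n "$barcode" ]; then
        if [ ! -r "$barcode" ]; then
            echo "cannot read barcode file: $barcode" >&2
            return 1
        fi
        set -- "$@" -b "$barcode"
    fi

    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

In a session where `module load pycoQC` has already been run, `run_pycoqc sequencing_summary.txt pycoQC_output` would write `pycoQC_output.html` and `pycoQC_output.json`. The same function could be placed in a script for batch submission, although this page does not state the resources such a job would need.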