pycoQC is a new tool to generate interactive quality control metrics and plots from basecalled nanopore reads or from summary files generated by the basecallers Albacore, Guppy or MinKNOW. pycoQC has several novel features, including: 1) Python support for creating dynamic D3.js visualizations and exploring data interactively in Jupyter Notebooks; 2) a simple command line interface to generate customizable interactive HTML reports; and 3) a multiprocessing FAST5 feature extraction program to generate a summary file directly from FAST5 files.
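The customizable HTML reports are driven by a JSON configuration file. The usage message further down describes it as a two-level mapping: first-level keys name the plots to include, and second-level keys are parameters passed to each plotting function. The following is a minimal sketch of trimming such a mapping in Python. The plot names are taken from the sample run log on this page; the `plot_title` parameter and the helper name `select_plots` are illustrative assumptions, not documented pycoQC settings.

```python
import json


def select_plots(config, keep):
    """Return a copy of a two-level config holding only the plots in `keep`.

    Output order follows `keep`; names absent from `config` are skipped.
    """
    return {name: dict(config[name]) for name in keep if name in config}


# Illustrative config: plot names come from the run log on this page,
# the "plot_title" parameter and its values are assumed for the example.
example_config = {
    "run_summary": {"plot_title": "Run summary"},
    "read_len_1D": {"plot_title": "Read length"},
    "read_qual_1D": {"plot_title": "Read quality"},
    "alignment_coverage": {"plot_title": "Coverage"},
}

trimmed = select_plots(example_config, ["run_summary", "read_qual_1D"])

# Serialize to JSON text; saved to a file, it could be passed via --config_file.
config_text = json.dumps(trimmed, indent=2)
print(config_text)
```

In practice the starting point would more likely be the template printed by `pycoQC --default_config` than a hand-written dictionary.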
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive
[user@cn4471 ~]$ module load pycoQC
[+] Loading pycoQC 2.5.2 ...

Display the usage message:
[user@cn4471 ~]$ pycoQC -h
usage: pycoQC [-h] [--version]
              [--summary_file [SUMMARY_FILE [SUMMARY_FILE ...]]]
              [--barcode_file [BARCODE_FILE [BARCODE_FILE ...]]]
              [--bam_file [BAM_FILE [BAM_FILE ...]]]
              [--html_outfile HTML_OUTFILE] [--json_outfile JSON_OUTFILE]
              [--min_pass_qual MIN_PASS_QUAL] [--min_pass_len MIN_PASS_LEN]
              [--filter_calibration] [--filter_duplicated]
              [--min_barcode_percent MIN_BARCODE_PERCENT]
              [--report_title REPORT_TITLE] [--template_file TEMPLATE_FILE]
              [--config_file CONFIG_FILE] [--skip_coverage_plot]
              [--sample SAMPLE] [--default_config] [-v | -q]

pycoQC computes metrics and generates interactive QC plots from the sequencing
summary report generated by Oxford Nanopore technologies basecallers

* Minimal usage
    pycoQC -f sequencing_summary.txt -o pycoQC_output.html
* Including Guppy barcoding file + html output + json output
    pycoQC -f sequencing_summary.txt -b barcoding_sequencing.txt -o pycoQC_output.html -j pycoQC_output.json
* Including Bam file + html output
    pycoQC -f sequencing_summary.txt -a alignment.bam -o pycoQC_output.html

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v, --verbose         Increase verbosity
  -q, --quiet           Reduce verbosity

Input/output options:
  --summary_file [SUMMARY_FILE [SUMMARY_FILE ...]], -f [SUMMARY_FILE [SUMMARY_FILE ...]]
                        Path to a sequencing_summary generated by Albacore
                        1.0.0 + (read_fast5_basecaller.py) / Guppy 2.1.3+
                        (guppy_basecaller). One can also pass multiple space
                        separated file paths or a UNIX style regex matching
                        multiple files (Required)
  --barcode_file [BARCODE_FILE [BARCODE_FILE ...]], -b [BARCODE_FILE [BARCODE_FILE ...]]
                        Path to the barcode_file generated by Guppy 2.1.3+
                        (guppy_barcoder) or Deepbinner 0.2.0+. This is not a
                        required file. One can also pass multiple space
                        separated file paths or a UNIX style regex matching
                        multiple files (optional)
  --bam_file [BAM_FILE [BAM_FILE ...]], -a [BAM_FILE [BAM_FILE ...]]
                        Path to a Bam file corresponding to reads in the
                        summary_file. Preferably aligned with Minimap2 One can
                        also pass multiple space separated file paths or a
                        UNIX style regex matching multiple files (optional)
  --html_outfile HTML_OUTFILE, -o HTML_OUTFILE
                        Path to an output html file report (required if
                        json_outfile not given)
  --json_outfile JSON_OUTFILE, -j JSON_OUTFILE
                        Path to an output json file report (required if
                        html_outfile not given)

Filtering options:
  --min_pass_qual MIN_PASS_QUAL
                        Minimum quality to consider a read as 'pass'
                        (default: 7)
  --min_pass_len MIN_PASS_LEN
                        Minimum read length to consider a read as 'pass'
                        (default: 0)
  --filter_calibration  If given, reads flagged as calibration strand by the
                        basecaller are removed (default: False)
  --filter_duplicated   If given, duplicated read_ids are removed but the
                        first occurence is kept (Guppy sometimes outputs the
                        same read multiple times) (default: False)
  --min_barcode_percent MIN_BARCODE_PERCENT
                        Minimal percent of total reads to retain barcode
                        label. If below, the barcode value is set as
                        `unclassified` (default: 0.1)

HTML report options:
  --report_title REPORT_TITLE
                        Title to use in the html report (default: PycoQC
                        report)
  --template_file TEMPLATE_FILE
                        Jinja2 html template for the html report (default: )
  --config_file CONFIG_FILE
                        Path to a JSON configuration file for the html report.
                        If not provided, looks for it in ~/.pycoQC and
                        ~/.config/pycoQC/config. If it's still not found,
                        falls back to default parameters. The first level keys
                        are the names of the plots to be included. The second
                        level keys are the parameters to pass to each plotting
                        function (default: )")
  --skip_coverage_plot  Skip the coverage plot in HTML report. Useful when
                        using a reference file containing many sequences, i.e.
                        transcriptome (default: False)

Other options:
  --sample SAMPLE       If not None a n number of reads will be randomly
                        selected instead of the entire dataset for ploting
                        function (deterministic sampling) (default: 100000)
  --default_config, -d  Print default configuration file. Can be used to
                        generate a template JSON file (default: False)

Copy sample data to your current folder:
[user@cn4471 ~]$ cp $PYCOQC_DATA/* .
[user@cn4471 ~]$ ls
Albacore-1.2.1_basecall-1D-DNA_sequencing_summary.txt.gz
Albacore-1.2.3_basecall-1D-RNA_sequencing_summary.txt.gz
Albacore-1.7.0_basecall-1D-DNA_sequencing_summary.txt.gz
Albacore-2.1.10_basecall-1D-DNA_sequencing_summary.txt.gz
Albacore-2.1.10_basecall-1D-RNA_sequencing_summary.txt.gz
Albacore-2.3.1_basecall-1D-RNA_sequencing_summary.txt.gz
Guppy-2.1.3_basecall-1D_DNA_barcoding_summary.txt.gz
Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz
Guppy-2.1.3_basecall-1D-RNA_sequencing_summary.txt.gz
Guppy-2.2.4-basecall-1D-DNA_sequencing_summary+barcode.txt.gz
Guppy-basecall-1D-DNA_deepbinner_barcoding_summary.txt.gz
Guppy-basecall-1D-DNA_sequencing_summary.txt.gz

Run pycoQC on the sample data:
[user@cn4471 ~]$ pycoQC -f Guppy-2.1.3_basecall-1D-DNA_sequencing_summary.txt.gz -b Guppy-2.1.3_basecall-1D_DNA_barcoding_summary.txt.gz -o pycoQC_output.html -j pycoQC_output.json
Check input data files
Parse data files
Merge data
Cleaning data
    Discarding lines containing NA values
        0 reads discarded
    Filtering out zero length reads
        0 reads discarded
    Sorting run IDs by decreasing throughput
        Run-id order ['c4981b897c2bb47fed99916c19c9bd1bd43267a2']
    Reordering runids
        Processing reads with Run_ID c4981b897c2bb47fed99916c19c9bd1bd43267a2 / time offset: 0
    Cleaning up low frequency barcodes
        0 reads with low frequency barcode unset
    Cast value to appropriate type
    Reindexing dataframe by read_ids
        10,000 Final valid reads
Loading plotting interface
    Found 10,000 total reads
    Found 9,065 pass reads (qual >= 7.0 and length >= 0)
Generating HTML report
    Parsing html config file
    Running method run_summary
        Computing plot
    Running method basecall_summary
        Computing plot
    Running method alignment_summary
        No Alignment information available
    Running method read_len_1D
        Computing plot
    Running method align_len_1D
        No Alignment information available
    Running method read_qual_1D
        Computing plot
    Running method identity_freq_1D
        No identity frequency information available
    Running method read_len_read_qual_2D
        Computing plot
    Running method read_len_align_len_2D
        No Alignment information available
    Running method align_len_identity_freq_2D
        No identity frequency information available
    Running method read_qual_identity_freq_2D
        No identity frequency information available
    Running method output_over_time
        Computing plot
    Running method read_len_over_time
        Computing plot
    Running method read_qual_over_time
        Computing plot
    Running method align_len_over_time
        No Alignment information available
    Running method identity_freq_over_time
        No identity frequency information available
    Running method barcode_counts
        Computing plot
    Running method channels_activity
        Computing plot
    Running method alignment_reads_status
        No Alignment information available
    Running method alignment_rate
        No identity frequency information available
    Running method alignment_coverage
        No Alignment information available
    Loading HTML template
    Rendering plots in d3js
    Writing to HTML file
Generating JSON report
    Running summary_stats_dict method
        Compute overall summary statistics
    Writing to JSON file
[user@cn4471 ~]$ ls -lt *json *.html
-rw-r--r-- 1 user user   28658 Apr 26 11:50 pycoQC_output.json
-rw-r--r-- 1 user user 4345264 Apr 26 11:50 pycoQC_output.html
-rw-r--r-- 1 user user    3426 Apr 26 11:49 pycoQC_config.json

End the interactive session:
[user@cn4471 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
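
The pass/fail split reported in the run log (`qual >= 7.0 and length >= 0`, matching the `--min_pass_qual` and `--min_pass_len` defaults) can be approximated outside pycoQC as a sanity check on a summary file. The sketch below uses only the Python standard library. It assumes a tab-separated summary with `mean_qscore_template` and `sequence_length_template` columns, as Albacore and Guppy 1D summaries typically have; the function name `count_pass_reads` is hypothetical. It is not pycoQC's own implementation, which also performs cleaning steps such as discarding NA lines and zero-length reads.

```python
import csv
import io


def count_pass_reads(handle, min_pass_qual=7.0, min_pass_len=0):
    """Return (total, passed) read counts from a tab-separated summary.

    Assumed column names: mean_qscore_template, sequence_length_template.
    """
    total = 0
    passed = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        total += 1
        qual = float(row["mean_qscore_template"])
        length = int(row["sequence_length_template"])
        # A read passes when it meets both thresholds (inclusive).
        if qual >= min_pass_qual and length >= min_pass_len:
            passed += 1
    return total, passed


# Tiny in-memory example with made-up reads (not the sample data).
demo = io.StringIO(
    "read_id\tsequence_length_template\tmean_qscore_template\n"
    "read_a\t1200\t9.5\n"
    "read_b\t800\t6.2\n"
    "read_c\t15000\t7.0\n"
)
total, passed = count_pass_reads(demo)
print(f"Found {total} total reads, {passed} pass reads")
```

For one of the gzip-compressed sample files, the handle would come from `gzip.open(path, "rt")` (after `import gzip`) instead of the in-memory string used here.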