kb-python: A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing

kb-python is a Python package for pre-processing single-cell RNA-sequencing data. It wraps the kallisto | bustools command-line tools to unify multiple processing workflows behind a single interface.

gget enables efficient querying of genomic reference databases to support the analysis of sequencing data.

ffq: A tool to find sequencing data and metadata from public databases.

References:

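Melsted, P., Booeshaghi, A. S., et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol 39, 813–818 (2021).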

Documentation

Important Notes

Module Name: kb-python (see the modules page for more information)
gget/0.3.11 and ffq/0.3.0 are installed with the module.

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=4g --gres=lscratch:10 
[user@cn3144 ]$ module load kb-python 
[+] Loading singularity
[+] Loading kb-python  0.27.3
[user@cn3144 ]$ kb
usage: kb [-h] [--list]  ...

kb_python 0.27.3

positional arguments:
  
    info      Display package and citation information
    compile   Compile `kallisto` and `bustools` binaries from source
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files

options:
  -h, --help  Show this help message and exit
  --list      Display list of supported single-cell technologies

[user@cn3144 ]$ kb info
kb_python 0.27.3
kallisto: 0.48.0 (/opt/conda/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto)
bustools: 0.41.0 (/opt/conda/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools)
kb is a python package for rapidly pre-processing single-cell RNA-seq data. It
is a wrapper for the methods described in [2].

The goal of the wrapper is to simplify downloading and running of the kallisto
[1] and bustools [2] programs. It was inspired by Sten Linnarsson’s loompy
fromfq command (http://linnarssonlab.org/loompy/kallisto/index.html)

The kb program consists of two parts:

The `kb ref` command builds or downloads a species-specific index for
pseudoalignment of reads. This command must be run prior to `kb count`, and it
runs the `kallisto index` [1].

The `kb count` command runs the kallisto [1] and bustools [2] programs. It can
be used for pre-processing of data from a variety of single-cell RNA-seq
technologies, and for a number of different workflows (e.g. production of gene
count matrices, RNA velocity analyses, etc.). The output can be saved in a
variety of formats including mix and loom. Examples are provided below.

Examples are available at: https://www.kallistobus.tools/tutorials

References
==========
[1] Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal
probabilistic RNA-seq quantification. Nature biotechnology, 34(5), 525.
[2] Melsted, P., Booeshaghi, A. S., Liu, L., Gao, F., Lu, L., Min, K. H., da
Veiga Beltrame, E., Hjorleifsson, K. E., Gehring, J., & Pachter, L. (2021).
Modular and efficient pre-processing of single-cell RNA-seq. Nature
Biotechnology.

[user@cn3144 ]$ gget
usage: gget [-h] [-v] {ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb} ...

gget v0.3.11

positional arguments:
  {ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb}
    ref                 Fetch FTPs for reference genomes and annotations by species.
    search              Fetch gene and transcript IDs from Ensembl using free-form search terms.
    info                Fetch gene and transcript metadata using Ensembl IDs.
    seq                 Fetch nucleotide or amino acid sequence (FASTA) of a gene (and all isoforms) or transcript by Ensembl, WormBase or FlyBase ID.
    muscle              Align multiple nucleotide or amino acid sequences against each other (using the Muscle v5 algorithm).
    blast               BLAST a nucleotide or amino acid sequence against any BLAST database.
    blat                BLAT a nucleotide or amino acid sequence against any BLAT UCSC assembly.
    enrichr             Perform an enrichment analysis on a list of genes using Enrichr.
    archs4              Find the most correlated genes or the tissue expression atlas of a gene using data from the human and mouse RNA-seq database
                        ARCHS4 (https://maayanlab.cloud/archs4/).
    setup               Install third-party dependencies for a specified gget module.
    alphafold           Predicts the structure of a protein using a slightly simplified version of AlphaFold v2.1.0
                        (https://doi.org/10.1038/s41586-021-03819-2).
    pdb                 Query RCSB PDB for the protein structutre/metadata of a given PDB ID.

options:
  -h, --help            Print manual.
  -v, --version         Print version.

[user@cn3144 ]$ ffq
usage: ffq [-h] [-o OUT] [-l LEVEL] [--ftp] [--aws] [--gcp] [--ncbi] [--split] [--verbose] [--version] IDs [IDs ...]

ffq 0.3.0: A command line tool to find sequencing data from SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ / Biosample.

positional arguments:
  IDs         One or multiple SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ / Biosample accessions, DOIs, or paper titles

options:
  -h, --help  Show this help message and exit
  -o OUT      Path to write metadata (default: standard out)
  -l LEVEL    Max depth to fetch data within accession tree
  --ftp       Return FTP links
  --aws       Return AWS links
  --gcp       Return GCP links
  --ncbi      Return NCBI links
  --split     Split output into separate files by accession (`-o` is a directory)
  --verbose   Print debugging information
  --version   show program's version number and exit

[user@cn3144 ]$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf drosophila_melanogaster)
[user@cn3144 ]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
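
The kb info text in the session above notes that kb ref must be run before kb count. A minimal sketch of a guard that a job script could use between the two steps is shown here. The function name check_kb_ref_outputs and its messages are hypothetical and not part of kb-python; the file names in the usage comment are the ones passed to kb ref above.

```shell
# Return 0 only if every file given as an argument exists and is non-empty.
# Hypothetical helper for illustration; not provided by kb-python.
check_kb_ref_outputs() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            # Report each problem file on stderr and remember the failure.
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use in a job script, after kb ref and before kb count:
#   check_kb_ref_outputs index.idx t2g.txt || exit 1
```

The check only tests for presence and non-zero size; it does not validate that the index itself is well formed.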

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. kb-python.sh) similar to the following example:

#!/bin/bash

module load kb-python
kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=6 --mem=4g kb-python.sh
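
The sbatch command above requests 6 CPUs, while kb count picks its own thread count unless one is given. A minimal sketch of a wrapper that passes the Slurm allocation to kb count through its -t (threads) option follows. The function name run_kb_count and the KB override variable are hypothetical and exist only so the command line can be inspected without running kb; the index, mapping, and technology arguments are copied from the batch script above.

```shell
# Run kb count with a thread count taken from the Slurm allocation.
# Falls back to 1 thread when SLURM_CPUS_PER_TASK is unset (e.g. outside a job).
# KB is a hypothetical override: KB=echo prints the arguments instead of running kb.
run_kb_count() {
    local outdir="$1"
    shift
    "${KB:-kb}" count -t "${SLURM_CPUS_PER_TASK:-1}" \
        -i index.idx -g t2g.txt -x 10xv3 -o "$outdir" "$@"
}

# Example use inside kb-python.sh, after module load kb-python:
#   run_kb_count out sample_R1.fastq.gz sample_R2.fastq.gz
```

Keeping the thread count tied to SLURM_CPUS_PER_TASK means the script does not need editing when the value of --cpus-per-task changes.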

Swarm of Jobs

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. kb-python.swarm). For example:

kb count -i index.idx -g t2g.txt -o out1/ -x 10xv3 read1_R1.fastq.gz read1_R2.fastq.gz
kb count -i index.idx -g t2g.txt -o out2/ -x 10xv3 read2_R1.fastq.gz read2_R2.fastq.gz

Submit this job using the swarm command.

swarm -f kb-python.swarm -g 4 -t 6 --module kb-python

where

-g #	Number of gigabytes of memory required for each process (one line in the swarm command file).
-t #	Number of threads/CPUs required for each process (one line in the swarm command file).
--module kb-python	Loads the kb-python module for each subjob in the swarm.
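
With more than a few samples, the swarmfile can be generated rather than typed by hand. The sketch below prints one kb count line per sample in the same form as the swarmfile example above. The function name make_swarm_lines is hypothetical, and the code assumes FASTQ files are named <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz and that output directories are numbered out1/, out2/, and so on.

```shell
# Print one kb count line per sample prefix, numbered in the order given.
# Assumes paired files named <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz.
make_swarm_lines() {
    local n=0 prefix
    for prefix in "$@"; do
        n=$((n + 1))
        printf 'kb count -i index.idx -g t2g.txt -o out%d/ -x 10xv3 %s_R1.fastq.gz %s_R2.fastq.gz\n' \
            "$n" "$prefix" "$prefix"
    done
}

# Example: write a swarmfile for two samples.
#   make_swarm_lines read1 read2 > kb-python.swarm
```

The resulting file can then be submitted with the swarm command shown above.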