kb-python is a python package for processing single-cell RNA-sequencing. It wraps the kallisto | bustools single-cell RNA-seq command line tools in order to unify multiple processing workflows.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=4g --gres=lscratch:10 [user@cn3144 ]$ module load kb-python [+] Loading singularity [+] Loading kb-python 0.27.3 [user@cn3144 ]$ kb usage: kb [-h] [--list]... kb_python 0.27.3 positional arguments: info Display package and citation information compile Compile `kallisto` and `bustools` binaries from source ref Build a kallisto index and transcript-to-gene mapping count Generate count matrices from a set of single-cell FASTQ files options: -h, --help Show this help message and exit --list Display list of supported single-cell technologies [user@cn3144 ]$ kb info kb_python 0.27.3 kallisto: 0.48.0 (/opt/conda/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto) bustools: 0.41.0 (/opt/conda/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools) kb is a python package for rapidly pre-processing single-cell RNA-seq data. It is a wrapper for the methods described in [2]. The goal of the wrapper is to simplify downloading and running of the kallisto [1] and bustools [2] programs. It was inspired by Sten Linnarsson’s loompy fromfq command (http://linnarssonlab.org/loompy/kallisto/index.html) The kb program consists of two parts: The `kb ref` command builds or downloads a species-specific index for pseudoalignment of reads. This command must be run prior to `kb count`, and it runs the `kallisto index` [1]. The `kb count` command runs the kallisto [1] and bustools [2] programs. It can be used for pre-processing of data from a variety of single-cell RNA-seq technologies, and for a number of different workflows (e.g. production of gene count matrices, RNA velocity analyses, etc.). The output can be saved in a variety of formats including mix and loom. Examples are provided below. Examples are available at: https://www.kallistobus.tools/tutorials References ========== [1] Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature biotechnology, 34(5), 525. [2] Melsted, P., Booeshaghi, A. S., Liu, L., Gao, F., Lu, L., Min, K. H., da Veiga Beltrame, E., Hjorleifsson, K. E., Gehring, J., & Pachter, L. (2021). Modular and efficient pre-processing of single-cell RNA-seq. Nature Biotechnology. [user@cn3144 ]$ gget usage: gget [-h] [-v] {ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb} ... gget v0.3.11 positional arguments: {ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb} ref Fetch FTPs for reference genomes and annotations by species. search Fetch gene and transcript IDs from Ensembl using free-form search terms. info Fetch gene and transcript metadata using Ensembl IDs. seq Fetch nucleotide or amino acid sequence (FASTA) of a gene (and all isoforms) or transcript by Ensembl, WormBase or FlyBase ID. muscle Align multiple nucleotide or amino acid sequences against each other (using the Muscle v5 algorithm). blast BLAST a nucleotide or amino acid sequence against any BLAST database. blat BLAT a nucleotide or amino acid sequence against any BLAT UCSC assembly. enrichr Perform an enrichment analysis on a list of genes using Enrichr. archs4 Find the most correlated genes or the tissue expression atlas of a gene using data from the human and mouse RNA-seq database ARCHS4 (https://maayanlab.cloud/archs4/). setup Install third-party dependencies for a specified gget module. alphafold Predicts the structure of a protein using a slightly simplified version of AlphaFold v2.1.0 (https://doi.org/10.1038/s41586-021-03819-2). pdb Query RCSB PDB for the protein structutre/metadata of a given PDB ID. options: -h, --help Print manual. -v, --version Print version. [user@cn3144 ]$ ffq usage: ffq [-h] [-o OUT] [-l LEVEL] [--ftp] [--aws] [--gcp] [--ncbi] [--split] [--verbose] [--version] IDs [IDs ...] ffq 0.3.0: A command line tool to find sequencing data from SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ / Biosample. positional arguments: IDs One or multiple SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ / Biosample accessions, DOIs, or paper titles options: -h, --help Show this help message and exit -o OUT Path to write metadata (default: standard out) -l LEVEL Max depth to fetch data within accession tree --ftp Return FTP links --aws Return AWS links --gcp Return GCP links --ncbi Return NCBI links --split Split output into separate files by accession (`-o` is a directory) --verbose Print debugging information --version show program's version number and exit [user@cn3144 ]$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf drosophila_melanogaster) [user@cn3144 ]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Create a batch input file (e.g. kb-python.sh) similar to the following example:
#!/bin/bash module load kb-python kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens) kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=6 --mem=4g kb-python.sh
Create a swarmfile (e.g. kb-python.swarm). For example:
kb count -i index.idx -g t2g.txt -o out1/ -x 10xv3 read1_R1.fastq.gz read1_R2.fastq.gz kb count -i index.idx -g t2g.txt -o out2/ -x 10xv3 read2_R1.fastq.gz read2_R2.fastq.gz
Submit this job using the swarm command.
swarm -f kb-python.swarm -g 4 -t 6 --module kb-pythonwhere
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module kb-python | Loads the kb-python module for each subjob in the swarm |