High-Performance Computing at the NIH
kallisto on Biowulf

kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment. On benchmarks with standard RNA-Seq data, kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. Pseudoalignment of reads preserves the key information needed for quantification, and kallisto is therefore not only fast, but also as accurate as existing quantification tools. In fact, because the pseudoalignment procedure is robust to errors in the reads, in many benchmarks kallisto significantly outperforms existing tools.


Loading the kallisto module environment

You must load the kallisto module before running any kallisto commands, as illustrated below:

biowulf$
biowulf$ module purge                         ## clear module environment [optional]
biowulf$ module list
No modules loaded 
biowulf$ module available kallisto            ## show available versions of kallisto
------------------------ /usr/local/lmod/modulefiles -------------------------
  kallisto/0.42.4  (D)
biowulf$
biowulf$ module load kallisto/0.42.4          ## load default version of kallisto
modulename is gcc/4.9.1
[+] Loading gcc 4.9.1 ...
modulename is hdf5/1.8.13
biowulf$
biowulf$ module list 

Currently Loaded Modules:
  1) gcc/4.9.1   2) hdf5/1.8.13   3) kallisto/0.42.4
biowulf$

Kallisto Environment Variables

To see which environment variables are set by the command "module load kallisto", use:

biowulf$
biowulf$ module whatis kallisto
kallisto/0.42.4     : Sets up kallisto/0.42.4 with gcc/4.9.1 and hdf5/1.8.13 with environment variables:
kallisto/0.42.4     : KALLISTO_TEST => Directory for test input files
biowulf$ 

While the kallisto module is loaded, you can display the value of this environment variable with the command,

biowulf$
biowulf$ echo $KALLISTO_TEST
/usr/local/apps/kallisto/0.42.4/test
biowulf$

Simple Kallisto Example

The kallisto release provides a simple example for each of the main kallisto commands.

The input data files are in the directory "$KALLISTO_TEST".

Instructions used to run the examples

Note that if you want to run the examples yourself, there is no need to submit them as batch jobs on the Biowulf cluster, because their computational footprint is very small.

1. Make directories for your example test output

For shell programs in the /bin/bash family:

biowulf$ 
biowulf$ export KALLISTO_TEST_OUTPUT=$HOME/Kallisto_Test_Output
biowulf$ mkdir -p $KALLISTO_TEST_OUTPUT/Index
biowulf$ mkdir -p $KALLISTO_TEST_OUTPUT/Quant
biowulf$ mkdir -p $KALLISTO_TEST_OUTPUT/H5dump
biowulf$  

For shell programs in the /bin/tcsh family:

biowulf$ 
biowulf$ setenv KALLISTO_TEST_OUTPUT $HOME/Kallisto_Test_Output
biowulf$ mkdir -p $KALLISTO_TEST_OUTPUT/Index
biowulf$ mkdir -p $KALLISTO_TEST_OUTPUT/Quant
biowulf$ mkdir -p $KALLISTO_TEST_OUTPUT/H5dump
biowulf$  
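Whichever shell you use, the three "mkdir -p" calls can also be collapsed into a single loop (a minor convenience, equivalent to the commands above):

```shell
# KALLISTO_TEST_OUTPUT as set in step 1 above.
KALLISTO_TEST_OUTPUT=${KALLISTO_TEST_OUTPUT:-$HOME/Kallisto_Test_Output}

# Create the three output subdirectories in one loop.
for d in Index Quant H5dump; do
    mkdir -p "$KALLISTO_TEST_OUTPUT/$d"
done
```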


2. Run example command "kallisto index"

$ kallisto index

kallisto 0.42.4
Builds a kallisto index

Usage: kallisto index [arguments] FASTA-file

Required argument:
-i, --index=STRING          Filename for the kallisto index to be constructed 

Optional argument:
-k, --kmer-size=INT         k-mer (odd) length (default: 31, max value: 31)

biowulf$
biowulf$ target_index_file=$KALLISTO_TEST_OUTPUT/Index/transcripts.index
biowulf$ transcript_fasta_file=$KALLISTO_TEST/transcripts.fasta.gz
biowulf$ kallisto index --index=$target_index_file $transcript_fasta_file

[build] loading fasta file /usr/local/apps/kallisto/0.42.4/test/transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 23 contigs and contains 18902 k-mers 
biowulf$
biowulf$ ll $target_index_file
-rw-r--r-- 1 sandor staff 474446 Jul 16 14:15 /home/sandor/Kallisto_Test_Output/Index/transcripts.index
biowulf$


3. Run example command "kallisto quant"    

$ kallisto quant
kallisto 0.42.4
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-l, --fragment-length=DOUBLE  Estimated average fragment length
                              (default: value is estimated from the input data)
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5

biowulf$ 
biowulf$ index_file=$KALLISTO_TEST_OUTPUT/Index/transcripts.index
biowulf$ ll $index_file
-rw-r--r-- 1 sandor staff 474446 Jul 16 14:15 /home/sandor/Kallisto_Test_Output/Index/transcripts.index
biowulf$ paired_end_reads="$KALLISTO_TEST/reads_1.fastq.gz $KALLISTO_TEST/reads_2.fastq.gz"
biowulf$ ll $paired_end_reads
-rw-r--r-- 1 sandor staff 209968 May  8 23:14 /usr/local/apps/kallisto/0.42.4/test/reads_1.fastq.gz
-rw-r--r-- 1 sandor staff 210499 May  8 23:14 /usr/local/apps/kallisto/0.42.4/test/reads_2.fastq.gz
biowulf$ output_dir=$KALLISTO_TEST_OUTPUT/Quant
biowulf$ kallisto quant --index=$index_file --output-dir=$output_dir $paired_end_reads

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 15
[index] number of k-mers: 18902
[index] number of equivalence classes: 22
[quant] finding pseudoalignments for the reads ... done
[quant] estimated average fragment length: 178.097
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 41 rounds

biowulf$ ll $KALLISTO_TEST_OUTPUT/Quant
total 40
-rw-r--r-- 1 sandor staff 25696 Jul 16 14:31 abundance.h5
-rw-r--r-- 1 sandor staff   589 Jul 16 14:31 abundance.tsv
-rw-r--r-- 1 sandor staff   424 Jul 16 14:31 run_info.json
biowulf$ 


4. Run example command "kallisto h5dump"  

$ kallisto h5dump
kallisto 0.42.4
Converts HDF5-formatted results to plaintext

Usage:  kallisto h5dump [arguments] abundance.h5

Required argument:
-o, --output-dir=STRING       Directory to write output to

The "kallisto h5dump" command can be run in the same way as the other two kallisto commands. We skip that step here because, as shown above, "kallisto quant" already writes the plaintext file "abundance.tsv" alongside the HDF5 file "abundance.h5"; running "kallisto h5dump" would produce the same plaintext output.
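For reference, a run of "kallisto h5dump" would look like the following sketch (paths assume the directories from step 1; the command is guarded so the snippet is a no-op on machines where kallisto is not on the PATH):

```shell
# Convert the HDF5 output from step 3 back to plaintext.
KALLISTO_TEST_OUTPUT=${KALLISTO_TEST_OUTPUT:-$HOME/Kallisto_Test_Output}
mkdir -p "$KALLISTO_TEST_OUTPUT/H5dump"

# Run only where the kallisto module is loaded.
if command -v kallisto >/dev/null 2>&1; then
    kallisto h5dump --output-dir="$KALLISTO_TEST_OUTPUT/H5dump" \
        "$KALLISTO_TEST_OUTPUT/Quant/abundance.h5"
fi
```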

Figure 1. Example "abundance.tsv"
target_id	length	eff_length	est_counts	tpm
--------------------------------------------------------------------
NM_001168316	2283	2105.9		 164.133	 12856.9
NM_174914	2385	2207.9		1495.6		111741
NR_031764	1853	1675.9		 104.27		 10263.4
NM_004503	1681	1503.9		 332.001	 36416.5
NM_006897	1541	1363.9		 664		 80308.9
NM_014212	2037	1859.9		  55		  4878.11
NM_014620	2300	2122.9		 592.584	 46046.7
NM_017409	1959	1781.9		  47		  4351.04
NM_017410	2396	2218.9		  42		  3122.41
NM_018953	1612	1434.9		 227.995	 26210.8
NM_022658	2288	2110.9		4881		381434
NM_153633	1666	1488.9		 359.898	 39874.2
NM_153693	2072	1894.9		  72.5147	  6312.74
NM_173860	 849	 671.903	 962		236182
NR_003084	1640	1462.9		   0.00787013	     0.887453
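In "abundance.tsv", the tpm column is normalized so that it sums to roughly one million, which makes a quick sanity check possible with awk. The sketch below runs against a small stand-in file with the same layout, since the real output lives under $KALLISTO_TEST_OUTPUT on Biowulf:

```shell
# Stand-in abundance.tsv with the same five-column layout as Figure 1.
cat > abundance.tsv <<'EOF'
target_id	length	eff_length	est_counts	tpm
NM_001168316	2283	2105.9	164.133	400000
NM_174914	2385	2207.9	1495.6	600000
EOF

# Sum the tpm column, skipping the header; the total should be ~1e6.
awk 'NR > 1 { total += $5 } END { print total }' abundance.tsv
```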


Running a single batch job on Biowulf

Set up a batch script along the following lines.

#!/bin/bash

echo "Running on $SLURM_CPUS_PER_TASK cores"

cd /data/$USER/mydir
module load kallisto

kallisto quant --index=transcripts.index --output-dir=out reads_1.fastq.gz reads_2.fastq.gz

Submit to the batch system with:
$ sbatch --cpus-per-task=4 --mem=10g  myscript

The command above allocates 4 cores and 10 GB of memory to the job. You would, of course, modify these values to match the needs of your job.

Running a swarm of jobs on Biowulf

Set up a swarm command file (eg /data/$USER/cmdfile). Here is a sample file:

cd /data/$USER/mydir1; kallisto quant --index=transcripts.index --output-dir=out reads_1.fastq.gz reads_2.fastq.gz
cd /data/$USER/mydir2; kallisto quant --index=transcripts.index --output-dir=out reads_1.fastq.gz reads_2.fastq.gz
cd /data/$USER/mydir3; kallisto quant --index=transcripts.index --output-dir=out reads_1.fastq.gz reads_2.fastq.gz
[...]   

Submit this job with

$ swarm -f cmdfile -t 4 -g 10 --module kallisto

By default, each line of the command file above runs on one core of a node using up to 1 GB of memory. If each command requires more than 1 GB of memory, tell swarm how much with the '-g #' flag; if each command requires more than 1 core, use the '-t #' flag. The example above requests 10 GB and 4 cores per command.
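For many sample directories, the command file can also be generated with a short loop rather than written by hand (a sketch; the directory and file names are illustrative):

```shell
# Write one kallisto quant line per sample directory into cmdfile.
> cmdfile
for n in 1 2 3; do
    echo "cd /data/$USER/mydir$n; kallisto quant --index=transcripts.index --output-dir=out reads_1.fastq.gz reads_2.fastq.gz" >> cmdfile
done

wc -l cmdfile   # one line per sample directory
```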

Running an interactive job on Biowulf

Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below, and run the interactive job there.

[user@biowulf]$ sinteractive -M 10 -c 4 
      salloc.exe: Granted job allocation 1528
slurm stepprolog here!

[user@pXXXX]$ cd /data/$USER/myruns

[user@pXXXX]$ module load kallisto

[user@pXXXX]$ kallisto command

[user@pXXXX]$ exit
slurm stepepilog here!
                      salloc.exe: Relinquishing job allocation 1528
salloc.exe: Job allocation 1528 has been revoked.

[user@biowulf]$ 

The command 'sinteractive' has several options:

$ sinteractive -h
Usage: sinteractive [-J job_name] [-p partition] [-c cpus] [-M mem | -m mem_per_cpu] [-x]

Optional arguments:
-J job name (default: my_interactive_job)
-p partition to run job in (default: interactive)
-c number of CPU cores required (default: 1)
-M memory (GB) required (default: 1)
-m memory (GB) per core required
-x enable X11 forwarding (default: disabled)


Documentation