From the KneadData user manual:
KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments. In these experiments, samples are typically taken from a host in hopes of learning something about the microbial community on the host. However, sequencing data from such experiments will often contain a high ratio of host to bacterial reads. This tool aims to perform principled in silico separation of bacterial reads from these "contaminant" reads, be they from the host, from bacterial 16S sequences, or other user-defined sources. Additionally, KneadData can be used for other filtering tasks. For example, if one is trying to clean data derived from a human sequencing experiment, KneadData can be used to separate the human and the non-human reads.
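For orientation, the simplest invocation takes an input fastq file, a reference database of contaminant sequences, and an output directory. This is a minimal sketch assuming single-end reads in sample.fastq and a prebuilt Bowtie2 host database (the database path is a placeholder):

kneaddata --input sample.fastq \
    --reference-db /path/to/host_bowtie2_db \
    --output sample_out

Reads that align to the reference database are separated from those that do not, and both sets are written to the output directory.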
Environment variables set: $KNEADDATA_TEST_DATA (sample data for testing) and $KNEADDATA_DB (reference databases)
Allocate an interactive session with sinteractive and run the program as shown below. In this case we use test data that is an artificial mixture of 1M human exome reads and 1M environmental metagenomic reads. The 50% human reads are treated as artificial contamination and removed:
[user@biowulf]$ sinteractive --mem=12g --cpus-per-task=8 --gres=lscratch:10
salloc.exe: Pending job allocation 33247354
salloc.exe: job 33247354 queued and waiting for resources
salloc.exe: job 33247354 has been allocated resources
salloc.exe: Granted job allocation 33247354
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn3144]$ module load kneaddata
[user@cn3144]$ cd /lscratch/${SLURM_JOB_ID}
[user@cn3144]$ # copy test data
[user@cn3144]$ cp $KNEADDATA_TEST_DATA/* .
[user@cn3144]$ # the test data is 50% human, 50% environmental metagenomic data
[user@cn3144]$ # and the read names are labelled accordingly
[user@cn3144]$ zcat test_R1.fastq.gz \
    | awk '/@meta/ {m++} /@human/ {h++} END {printf("human: %i\nmeta: %i\n", h, m)}'
human: 1000000
meta: 1000000
[user@cn3144]$ # run kneaddata on the paired end test data
[user@cn3144]$ kneaddata -i test_R1.fastq.gz -i test_R2.fastq.gz \
    --reference-db $KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
    --output-prefix test --output test_out \
    -p 2 -t 4 --run-fastqc-end
[...snip...]
Final output files created:
/lscratch/33247354/test_out/test_paired_1.fastq
/lscratch/33247354/test_out/test_paired_2.fastq
/lscratch/33247354/test_out/test_unmatched_1.fastq
/lscratch/33247354/test_out/test_unmatched_2.fastq
[user@cn3144]$ # check the composition of human/metagenome reads in the cleaned data
[user@cn3144]$ cat test_out/test_paired_1.fastq \
    | awk '/@meta/ {m++} /@human/ {h++} END {printf("human: %i\nmeta: %i\n", h, m)}'
human: 18641
meta: 951886
[user@cn3144]$ exit
[user@biowulf]$
So the 50% artificial human contamination was reduced to about 2%.
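The percentage can be recomputed from the cleaned output at any time; a one-line sketch using the same @human/@meta read-name labels as the test data above:

awk '/@human/ {h++} /@meta/ {m++} END {printf("%.1f%% human\n", 100*h/(h+m))}' test_out/test_paired_1.fastq

With the counts above (18641 human, 951886 meta) this prints "1.9% human".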
Create a batch input file (e.g. kneaddata.sh) similar to the following example:
#!/bin/bash
module load kneaddata/0.7.0 || exit 1
cd /lscratch/${SLURM_JOB_ID} || exit 1
if [[ ! -e test_R1.fastq.gz ]]; then
    cp $KNEADDATA_TEST_DATA/* .
fi
rm -rf test_out
kneaddata -i test_R1.fastq.gz -i test_R2.fastq.gz \
    --reference-db $KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
    --output-prefix test --output test_out \
    -p 2 -t 4 --run-fastqc-end
cp -r test_out /data/$USER/important_project
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=8 --mem=12g --gres=lscratch:10 kneaddata.sh
Create a swarmfile (e.g. kneaddata.swarm). For example:
kneaddata -i sample1_R1.fastq.gz -i sample1_R2.fastq.gz \
    --reference-db $KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
    --output-prefix sample1 --output sample1_out -p 2 -t 4 --run-fastqc-end
kneaddata -i sample2_R1.fastq.gz -i sample2_R2.fastq.gz \
    --reference-db $KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
    --output-prefix sample2 --output sample2_out -p 2 -t 4 --run-fastqc-end
kneaddata -i sample3_R1.fastq.gz -i sample3_R2.fastq.gz \
    --reference-db $KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
    --output-prefix sample3 --output sample3_out -p 2 -t 4 --run-fastqc-end
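For a larger number of samples the swarmfile can be generated with a short loop. This is a sketch that assumes paired files named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz in the current directory; $KNEADDATA_DB is escaped so that it is expanded inside each subjob, after --module has loaded kneaddata:

for r1 in *_R1.fastq.gz; do
    sample=${r1%_R1.fastq.gz}
    echo "kneaddata -i ${sample}_R1.fastq.gz -i ${sample}_R2.fastq.gz" \
         "--reference-db \$KNEADDATA_DB/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens" \
         "--output-prefix ${sample} --output ${sample}_out -p 2 -t 4 --run-fastqc-end"
done > kneaddata.swarm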
Submit this job using the swarm command.
swarm -f kneaddata.swarm -g 12 -t 8 --module kneaddata/0.7.0
where
-g # | Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module kneaddata | Loads the kneaddata module for each subjob in the swarm