CLARK uses discriminative k-mers to classify sequences. The classification is done by first analyzing a set of sequences with known classes and finding k-mers that allow discrimination between the known classes. The resulting database can then be used to classify new sequences. This can be used, for example, to classify short reads taxonomically at different levels (genus, species, ...) or to assign BACs to chromosome arms.
Three different algorithms are included in this package:
(1) CLARK - the original algorithm. Can be quite memory hungry for large databases and k-mer sizes.
(2) CLARK-l - requires less memory at the cost of slightly reduced power.
(3) CLARK-S - uses spaced k-mers. Requires the most RAM and has higher sensitivity.
In addition to the basic executables, a number of convenience scripts for carrying out common analyses are provided. See the README below and the CLARK documentation.
The CLARK README is included here since it contains a lot of useful information that is not immediately available on the CLARK home page.
Readme (1.2.6.1)
== AUTHORS
Rachid Ounit (1), Steve Wanamaker (2), Timothy J. Close (2) and Stefano Lonardi (1).
1: Computer Science and Engineering department, University of California, Riverside CA 92521
2: Botany and Plant Sciences department, University of California, Riverside CA 92521

== ABOUT
The problem of DNA sequence classification is central to several application domains in molecular biology, genomics, metagenomics and genetics. The problem is computationally challenging due to the size of datasets generated by modern sequencing instruments and the growing size of reference sequence databases. We present a novel method, called CLARK (CLAssifier based on Reduced K-mers), for supervised sequence classification based on discriminative k-mers. Somewhat unique among metagenomic and genomic classification methods, CLARK provides a confidence score for its assignments which can be used in downstream analysis. We demonstrated the utility of CLARK on two distinct classification problems, namely (1) the assignment of metagenomic reads to known bacterial genomes, and (2) the assignment of BAC clones and transcripts to chromosome arms (in the absence of a finished assembly for the reference genome). Three classifiers or variants in the CLARK framework are provided:
CLARK (default): created for powerful workstations, it can require a significant amount of RAM to run with large databases (e.g., all bacterial genomes from NCBI/RefSeq). This classifier queries k-mers with exact matching;
CLARK-l: created for workstations with limited memory (i.e., "l" for light), this software tool provides precise classification on small metagenomes. Indeed, for metagenomics analysis, CLARK-l works with a sparse or "light" database (up to 4 GB of RAM) that is built using distant and non-overlapping k-mers.
This classifier queries k-mers with exact matching;
CLARK-S: created for powerful workstations exploiting spaced k-mers (i.e., "S" for spaced), this classifier requires higher RAM usage than CLARK or CLARK-l, but it does offer higher sensitivity. CLARK-S completes the CLARK series of classifiers.

== GENERAL NOTES
The performance of CLARK was introduced, as a poster presentation, at the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), September 20-23, 2014 at Newport Beach, CA. CLARK-S was presented at the 14th Workshop on Algorithms in Bioinformatics, Atlanta, GA, September 2015.

== LICENCE
The source code of CLARK is distributed under the GNU GPL License. CLARK and its variants (such as CLARK-l) are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. CLARK and its variants are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details at http://www.gnu.org/licenses/ Copyright 2013-2017, Rachid Ounit (clark.ucr.help at gmail.com).

== PUBLICATIONS
The classifiers in the CLARK framework are part of the following peer-reviewed publications. Please cite these two papers to give full credit:
Ounit, R., Wanamaker, S., Close, T. J., & Lonardi, S. (2015). CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16(1), 236.
Ounit, R., & Lonardi, S. (2016). Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics, 32(24), 3823-3825.

== RELEASE
On 09/01/2014, version 1.0 became available: This version of CLARK is written in C++ and shell script.
It is designed for 64-bit OS, and can be executed on the latest Linux and Mac systems.
On 02/20/2015, version 1.1 became available: Compared to v1.0, this version stores the database on disk more efficiently (some unnecessary data were pruned, and thus a relative reduction of about 15% can be observed in the disk space needed). The user can also observe that the program loads the database into RAM much faster than before. This version also allows loading the database using multiple threads. Finally, v1.1 uses different settings for the sampling method used in the default mode, which leads to an increase in classification speed of about 5%.
On 04/15/2015, version 1.1.1 became available: Compared to v1.1, this version loads the database from disk to RAM more efficiently. The relative reduction is about 16% on average. This improvement also implies an increase in classification speed. This release also includes scripts to facilitate the classification of metagenomic samples (see section "CLASSIFICATION OF METAGENOMIC SAMPLES").
On 04/22/2015, version 1.1.2 became available: This release includes scripts to produce the abundance estimation per target (i.e., the count and proportion of objects assigned to a target), with filtering possible on the confidence score. Also, the user can get the density of assigned objects per confidence score. The user can also pass objects to CLARK as compressed files (GZ format) directly to "classify_metagenome.sh". Finally, bug fixes and code improvements are made.
On 06/03/2015, version 1.1.3 became available: This release extends existing features and simplifies the output produced by the full mode (to reduce the disk space needed for the results file). Bug fixes and code improvements are included.
On 06/15/2015, version 1.2.1-beta became available: This release contains the discriminative spaced k-mers (using multiple spaced seeds).
This version is still in beta while features and code improvements on the RAM usage are ongoing. Bug fixes and code improvements are included.
On 12/11/2015, version 1.2.2-beta became available: The speed and RAM usage of the full and spaced modes are significantly improved: i) the RAM usage is now scalable and the upper bound is lower than the maximum usage needed in previous versions; ii) the speed is increased by a factor of 5 to 9 in single-threaded tasks. To analyze results, a script to plot the distribution of the assignments per gamma score (as done with the confidence score) is provided. In addition, the option to create Krona-compatible input files ("http://www.biomedcentral.com/1471-2105/12/385") is available.
On 04/07/2016, version 1.2.3 became available: The classifier CLARK-S is finalized. Scripts to generate the targets definition using the accession number instead of the GI number have been updated. Additional scripts have been added to facilitate the creation and modification of customized databases. Code improvements and bug fixes are included.
On 04/20/2017, version 1.2.3.1 became available: A new script to summarize results from several report files is provided. Copyright and email info are updated.
On 08/21/2017, version 1.2.3.2 became available: The script to summarize results (per report and per taxon identified) across several report files has been improved/extended. Paired-end reads in Fasta files can be passed (not only in fastq format anymore). Code improvements and bug fixes are included.
On 11/25/2017, version 1.2.4 became available: In addition to the bacteria/archaea, viruses and human databases, a database of fungi is made available. A list of 29 reference genomes from RefSeq corresponding to known hospital-acquired infections and other nosocomial diseases represents the first basis of this new database. Code improvements and bug fixes are included.
On 02/13/2018, version 1.2.5 became available: In addition to the bacteria/archaea, viruses, fungi and human databases, we extended the fungi database and added the plasmid, plastid and protozoa databases (from RefSeq complete genomes). Copyright update, code improvements and bug fixes are included.
On 09/20/2018, version 1.2.5.1 became available: A new script is added to print the distribution of the target-specific k-mers. An option is added to allow processing of very long reads in the full mode and in CLARK-S. Code improvements and bug fixes are included.
On 02/20/2019, version 1.2.6 became available: A new script is added to extract sequences identified as belonging to a specific taxon. The script "estimate_abundance.sh" now allows output of the results in the mpa format (tab-delimited format from MetaPhlAn). Code improvements and bug fixes are included.
This versatile method allows the classification of object sequences against a database of targets. Sequences of targets must be given as standard fasta and/or fastq files, or possibly a text file (two-column format), where each line contains two columns; for example, in the case of the file "target.txt" (using 6-mers):
$ cat target.txt
ATATAT 1234
GCGCGC 435
ATTTTT 234
...
$

== COMPATIBILITY BETWEEN VERSIONS
Between v1.* and v1.0: Version 1.1 (or higher) uses a different algorithm for storing and loading the database, compared to v1.0. Therefore, database files (TSK files) produced by version 1.0 are not compatible with v1.1. This is why the database must be rebuilt using v1.1 (or a newer version). When using v1.1 (or newer), three files are created for the database. Discriminative k-mers and their target IDs are stored respectively in the files with extensions *.ky and *.lb. Sizes of all buckets of the hashtable are stored in the file with extension *.sz.
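To avoid rebuilding a database that is already in the v1.1+ on-disk format, one could check for the three file extensions described above before running the classifier. A minimal sketch, not part of CLARK itself: the function name "db_is_current" and the "./DBD" path are my own placeholders.

```shell
# Check whether a CLARK database directory contains the three file types
# written by v1.1 and later: *.ky (k-mers), *.lb (target IDs), *.sz (bucket sizes).
db_is_current() {
  dir="$1"
  for ext in ky lb sz; do
    # At least one file per extension must exist for the database to load.
    ls "$dir"/*."$ext" >/dev/null 2>&1 || return 1
  done
  return 0
}

if db_is_current "./DBD"; then
  echo "database looks like v1.1+ format"
else
  echo "database missing or pre-v1.1: rebuild with the current version"
fi
```

If the check fails, rebuilding with the newer version (as described above) regenerates all three files.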
== SOFTWARE & SYSTEM REQUIREMENTS
1) EXTERNAL TOOLS & C++ COMPILER VERSION
Unlike most other metagenomic and genomic classifiers, CLARK does not require any tool from the BLAST family, Jellyfish or other external tools. The main requirements are a 64-bit operating system (Linux or Mac) and the GNU GCC compiler, version 4.4 or higher. Multithreaded operations are provided by the OpenMP libraries. If these libraries are not installed, CLARK will run single-threaded. CLARK is expected to run successfully on recent Mac and Linux OS.
2) MEMORY
CLARK can be RAM consuming, especially when loading large database(s). Versions since v1.1.1 require about 58 GB of RAM to load the database in default mode (built from bacterial genomes, default NCBI/RefSeq) and about 156 GB to build that database, using 31-mers. Using 20-mers instead, it requires 49 GB to load the database (and 107 GB to build the database). For CLARK-S, the memory usage during classification is the highest (~101 GB) because the program loads several sets of spaced k-mers (from several spaced seeds) into memory. For the space required on disk, CLARK needs about 36 GB to store the database built for bacteria (at the genus level). If memory is a concern to you and you have small metagenomes, then please consider using the RAM-light version, CLARK-l. This tool is designed to run on a 4 GB RAM laptop, so low memory consumption is the main objective of this tool. CLARK-l can load data from the bacteria and viruses databases within 4 GB. If it can't, then the user should change the parameter value for the gap (see the "-g" option in "MANUAL & OPTIONS") and set it to a value higher than the default, 4.
IMPORTANT NOTE: Results given by CLARK-l should be seen as a draft (or first-order approximation) of the results you would get by running CLARK or CLARK-S.
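Given these figures, a rough pre-flight check can help pick the right variant before a long run. The sketch below is my own, not shipped with CLARK: it reads MemAvailable from /proc/meminfo (Linux only) and compares it against the ~58 GB quoted above for the default bacterial database with 31-mers.

```shell
# Rough pre-flight RAM check (Linux): suggest CLARK or CLARK-l based on
# available memory vs. the ~58 GB needed for the default database (31-mers).
need_gb=58

if [ -r /proc/meminfo ]; then
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
else
  avail_kb=0   # non-Linux system: no check, fall through to the light tool
fi
avail_kb=${avail_kb:-0}
avail_gb=$((avail_kb / 1024 / 1024))

if [ "$avail_gb" -ge "$need_gb" ]; then
  echo "about $avail_gb GB available: CLARK (default) should fit"
else
  echo "about $avail_gb GB available: consider CLARK-l"
fi
```

The threshold would of course change with the k-mer length and the database in use (e.g., ~49 GB for 20-mers, ~101 GB for CLARK-S).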
== INSTALLATION
First, download the zipped package of the latest version (v1.2.5.1), available from the CLARK webpage ("Download" tab), http://clark.cs.ucr.edu. Second, uncompress the tar.gz file ("tar -xvf CLARKV1.2.5.1.tar.gz"), then go to the sub-directory "CLARKSCV1.2.5.1" and execute the installation script ("./install.sh"). The installation is done! You can now run CLARK and any of the provided scripts. The installer builds the binaries (CLARK, CLARK-l and CLARK-S, in the folder "exe" in CLARKSCV1.2.5.1).

== SCRIPTS
After the installation, you will also notice that several scripts are available. Especially:
- set_targets.sh and classify_metagenome.sh: They allow you to define your database and to classify your metagenomes against it. set_targets.sh can be called indefinitely to build several database(s) (downloaded from NCBI or available "locally" on your disk). Note that the last call of set_targets.sh defines the "working database", the database that classify_metagenome.sh and other scripts use as the reference for their computations.
- estimate_abundance.sh: It computes the abundance estimation (count/proportion of objects assigned to targets).
- evaluate_density_confidence.sh and evaluate_density_gamma.sh: These two scripts take as input one or several CLARK results (containing confidence/gamma scores) and output/plot the density of assignments per confidence score (or gamma score).
- buildSpacedDB.sh: It creates the sets of discriminative spaced k-mers from the selected databases.
- clean.sh: This script will permanently delete all related data (generated and downloaded) in the database directory defined in set_targets.sh.
- resetCustomDB.sh: It resets the targets definition with sequences (newly added/modified) of the customized database. Any call of this script must be followed by a run of set_targets.sh.
- updateTaxonomy.sh: To download the latest taxonomy data (taxonomy id, accession numbers, etc.) from the NCBI website.
- makeSummaryTables.sh: To build three tables summarizing results from one or several CLARK report files. Given a rank r and a minimum abundance level l (i.e., the ratio between the number of reads classified to a taxon and the total number of reads in the report file), expressed in percent, passed as parameters, this script produces the following reports:
=> "TableSummary_per_Report.csv": For each report file, this table indicates the number of reads, the number of assigned or classified reads, the alpha-diversity (computed using the Shannon index on the relative proportions), the minimum abundance level used, the proportion of reads classified in percent, and the r most dominant taxa/organisms with abundance higher than l (ordered by decreasing abundance). The taxa are reported with scientific name and NCBI taxonomy id, in decreasing order of proportion.
=> "TableSummary_per_Taxon.csv": For each taxon/organism identified across all the report files, this table indicates the taxonomy id of the taxon, the domain of the taxon, the minimum abundance level used, the number of reports the taxon was found in, and some statistics about the abundance of the taxon (i.e., minimum, maximum, average and standard deviation in percent).
=> "TableSummary_HitCount.csv": This table provides the hit count distribution of the targets identified per report, as well as the total number of reads, the total number of reads classified and the alpha-diversity.
- getTargetsKmers_distribution.sh: To print out the distribution of the target-specific k-mers in a file entitled "targets.distribution.csv" in the working database. The parameters for this script are the key parameters used for creating the database, i.e., the k-mer length and the minimum k-mer frequency (default 0).
- extractSequences.sh: To extract/filter from the input data the sequences identified as belonging to a specific taxon once the classification is done.
This can be used for downstream analysis (analysis of contaminants, genome assembly, etc.). Parameters are: the taxonomy id (of the taxon to look up), the address of the file containing sequences (fasta/fastq format), the address of the CLARK results file (csv file), and the minimum thresholds for the gamma value and the confidence score (in case the CLARK results file contains these statistics).

== INTRODUCTION ON CLARK USAGE
To introduce how to use CLARK, we first begin with some definitions:
- target: a reference sequence or a set of reference sequences (e.g., reads, contigs, assemblies, etc.) of any length. In metagenomics, for a particular taxonomy rank r, any target can be a genome, or genomes of a particular taxon defined at the level r (see examples in the next sections). In genomics, a target can be reads or the assemblies of a chromosome (or any other genomic region of a genome). Thus, a target is defined by the context of the classification.
- object: a sequence (e.g., a read, a contig, a scaffold/assembly, etc.), of any length, to be classified against a set of targets defined at the same taxonomy rank (see examples in the next sections).
In general, target or object sequences can be fasta or fastq files (or even a spectrum), see details in the next sections. We provide examples of how to classify a set of objects against a set of targets. The general rule is that the user must define the targets definition (i.e., the (reference sequence) <-> (target ID) association), pass objects to be classified, files to store results and a directory to store the database (of target-specific k-mers). This database is needed because time-consuming computations are required to find target-specific k-mers and save them to disk. Once these data are saved to disk, they will be loaded directly and quickly if CLARK needs them again. We describe this in detail and also provide examples in the next sections. All options/parameters available are detailed in the section "MANUAL & OPTIONS".
For metagenomics, scripts are provided to download/preprocess databases from NCBI (i.e., genomes related to Bacteria, Viruses and Human), see section "CLASSIFICATION OF METAGENOMIC SAMPLES".

== HOW TO CHOOSE THE K-MER LENGTH?
CLARK is a k-mer based classification method, and the value of k is critical for the classification accuracy and speed. The user should use any value between 19 and 23 in order to get high sensitivity. For high sensitivity, we recommend k = 20 or 21 (along with the full mode). However, if precision and speed are the main concern, the user should use any value between 26 and 32. However, the higher k is, the higher the RAM usage. As a good tradeoff between speed, precision and RAM usage, we recommend k = 31 (along with the default/express mode). The default k-mer length is 31. In version 1.*, CLARK supports any k-mer length up to 32.

== MANUAL & OPTIONS
In this section, we describe the options directly available with the CLARK executable (binary file in the "exe" directory). CLARK offers several options to run the classification. A typical command line to run CLARK (or CLARK-l/CLARK-S) looks like:
$ ./CLARK -k <kmerSize> -T <fileTargets> -t <minFreqTarget> -D <directoryDB> -O <fileObjects> -o <minFreqObject> -R <fileResults> -m <mode> -n <numberofthreads> ...
Definitions of parameters:
-k <kmerSize>, k-mer length: integer, >= 2 and <= 32 (in version 1.*). The default value for this parameter is 31, except for CLARK-l (it is 27). For CLARK-S, k is set to 31 and the maximum number of mismatches is set to 9.
-T <fileTargets>, the targets definition is written in <fileTargets>: filename. This is a two-column file (separated by space, comma or tab), such that for each line: column 1 is the filename of a reference sequence, and column 2 is the target ID (taxon name, or taxonomy ID, ...) of the reference sequence (see example below). This parameter is mandatory.
-t <minFreqTarget>, minimum k-mer frequency/occurrence for the discriminative k-mers: integer, >= 0. The default value is 0. For example, for 1 (or 2), the program will discard any discriminative k-mer that appears only once (or less than twice).
-D <directoryDB>, directory of the database: pathname. This parameter is mandatory.
-O <fileObjects>, file containing objects: filename. This parameter is mandatory.
-P <file1> <file2>, paired-end fastq files: filenames.
-o <minFreqObject>, minimum k-mer frequency or occurrence in objects: integer, >= 0. This option is only available in the spectrum mode (-m 3).
-R <fileResults>, file to store results: filename. Results are stored in CSV format in the file <fileResults>.csv (the extension ".csv" is automatically added to the filename). This parameter is mandatory.
-m <mode>, mode of execution: 0, 1, 2 or 3.
0 (full): To get detailed results, confidence scores and other statistics. This mode loads all discriminative (contiguous/spaced) k-mers. This mode is available for CLARK, CLARK-S and CLARK-l.
1 (default): To get a results summary and the best trade-off between classification speed, accuracy and RAM usage. This mode loads half of the discriminative (contiguous/spaced) k-mers. This mode is available for CLARK and CLARK-l.
2 (express): To get a results summary with the highest speed possible. This mode loads all discriminative (contiguous/spaced) k-mers. This mode is available for CLARK, CLARK-S and CLARK-l.
3 (spectrum): To classify data that are given in spectrum form (i.e., sequence data are in a two-column file). This mode is available for CLARK and CLARK-l.
-n <numberofthreads>, number of threads: integer >= 1. The program runs in parallel for the classification and, with the option '--ldm' on, for the loading of the database into RAM.
-g <gap>, "gap" or number of non-overlapping k-mers to skip when creating the database (for CLARK-l only). The bigger the gap, the lower the RAM usage. The default value is 4; the user can increase this value to reduce the RAM usage, but this will degrade the sensitivity.
-s <factor>, sampling factor value (for CLARK/CLARK-S only). If you want the program to load into memory only half of the discriminative (contiguous/spaced) k-mers, then use "-s 2". If you want a third of these k-mers, then use "-s 3", etc.
The higher this factor, the lower the RAM usage, and the higher the classification speed/precision. However, our experiments show that the sensitivity can degrade quickly, especially for values higher than 3. In the default mode, this factor is set to 2 because it represents a good trade-off between speed, accuracy and RAM usage.
--long, to indicate that the objects files contain very long/large sequences (e.g., long contigs from genome assembly, long sequencing reads from Nanopore or PacBio, etc.) and to allocate more memory accordingly. This option is only for running the full mode or running CLARK-S, in the case of sequences up to 50 Mbp.
--tsk, to request a detailed creation of the database (target-specific k-mers files). For each target, the program creates a file containing all its specific k-mers. BEWARE: This option requires a large amount of disk space to complete. This option is no longer supported.
--ldm, to request loading of the database file by memory-mapped file. This option accelerates the loading time, but it will require a significant additional amount of RAM. This option also allows loading the database using multiple threads.
--kso, to request a preliminary k-spectrum analysis of each object (for mode 3 only). This analysis gives intervals of k-mer frequency for which k-mers are assumed to be informative of the object. This interval is reported in the results file.
--extended, to request an extended output (results file), only for the full mode.
When a filename is required, we recommend absolute paths.

== LOW-LEVEL DESCRIPTION & EXAMPLES
This section describes the main features and options for directly running the executable CLARK (or CLARK-l/CLARK-S) built (in the folder "exe") after the installation. If you want to use CLARK as a metagenome classifier, we recommend you use the provided scripts for that purpose, detailed in the section "CLASSIFICATION OF METAGENOMIC SAMPLES".
However, we recommend the reader go through this section to become familiar with CLARK's options/parameters. If you work directly with the CLARK and CLARK-l files (e.g., to classify BACs/transcripts to chromosomes), then we recommend you copy them to your working directory. In this section, the working directory is /home/user1/bin/ and the binaries CLARK, CLARK-l and CLARK-S have been copied into it. Indeed:
$ pwd
/home/user1/bin/
$ ls
CLARK CLARK-l CLARK-S
$
We describe in the following paragraphs 1) how to create the targets definition, with examples, in the case where CLARK is set to classify metagenomic samples. Then, we describe how to set different modes, use paired-end reads and run multiple threads. Finally, we explain how to use CLARK for assigning BACs/transcripts to chromosome arms/centromeres.
1) Generalities, and Default Mode:
- Say, for simplicity, the user has ten genomes in their database, and each genome is stored in one fasta file located in the current working directory (say, genome_1.fa, genome_2.fa, ..., genome_10.fa). Say that the first three genomes belong to the same species S1, the next three genomes belong to the same species S2, and the last four genomes belong to S3. And finally, say that the first six genomes belong to the same genus G1, and the last four genomes belong to the same genus G2.
- Sequences (or objects) to be classified are three short single-reads, all stored in one fasta file, "objects.fa".
First, the user must define targets for the classification (the "targets_addresses.txt" file). For metagenomics, for a classification at the species level, the user must define targets accordingly.
So, at the species level, the targets_addresses.txt file is: $ cat targets_addresses.txt /home/user1/bin/genome_1.fa S1 /home/user1/bin/genome_2.fa S1 /home/user1/bin/genome_3.fa S1 /home/user1/bin/genome_4.fa S2 /home/user1/bin/genome_5.fa S2 /home/user1/bin/genome_6.fa S2 /home/user1/bin/genome_7.fa S3 /home/user1/bin/genome_8.fa S3 /home/user1/bin/genome_9.fa S3 /home/user1/bin/genome_10.fa S3 $ In this case, there are three distinct targets (S1, S2 and S3) that the program will use for the classification. Defining targets at the genus-level would be: $ cat targets_addresses.txt /home/user1/bin/genome_1.fa G1 /home/user1/bin/genome_2.fa G1 /home/user1/bin/genome_3.fa G1 /home/user1/bin/genome_4.fa G1 /home/user1/bin/genome_5.fa G1 /home/user1/bin/genome_6.fa G1 /home/user1/bin/genome_7.fa G2 /home/user1/bin/genome_8.fa G2 /home/user1/bin/genome_9.fa G2 /home/user1/bin/genome_10.fa G2 $ In this case, there are 2 distinct targets (G1 and G2). Results and performance of CLARK have been analyzed at the genus and the species level (see the manuscript). In all these examples, each genome file is given by the absolute path (recommended), however the user can also use relative path. Since files are in the working directory (where the binaries are), we can simply have: $ cat targets_addresses.txt ./genome_1.fa G1 ./genome_2.fa G1 ./genome_3.fa G1 ... $ Naturally, G1 and G2 are just labels that the user is free to choose. The user can rather use the full genus names (e.g., Staphylococcus and Streptococcus instead of G1, and G2), or, as recommended, the taxonomy ID (i.e., 1279 and 1301 resp.). The definition of targets is similar for genomics (the reader can see the paragraph "6) Chromosome arms and Centromeres"). Defining targets for genomics or metagenomics is equivalent. Second, based on the definition of the targets, CLARK builds a database. This database needs a directory to be created. We recommend the user to create a specific directory for each targets definition. 
For this example, we use the directory "./DBD/". This directory can be anywhere on your disk, and independently of the directory you chose for the source code. Third, objects are short reads present in the file "objects.fa". Then, we want the classification results to be stored in the file named "results". The results format is CSV (see "RESULTS FORMAT" for more detail). If a file "results.csv" already exists, then its content will be erased and replaced. Finally, to run CLARK, using 31-mers, in default mode, run:
$ ./CLARK -k 31 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results
Or using 20-mers:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results
Note that the results file will have the extension "csv". If CLARK is run for the first time on the defined targets and the given k-mer length, then CLARK first builds the database, which gathers all Target-Specific K-mers (TSK). If CLARK is run again on the same targets and same k-mer length, then it will directly load the TSK files previously produced. Once the classification is done, the file results.csv is created.
1.1) Exclusion of discriminative k-mers of low frequency in the database
The user can request the removal of discriminative k-mers that have a frequency equal to or lower than a certain threshold (like 1, or 2, ...). This can be done by using the option "-t", for which the default value is 0 (i.e., "-t 0"). For example, to exclude discriminative k-mers that "appear" only once, the command line from the last example in 1) is:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 1
To exclude discriminative k-mers that appear twice or less:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 2
See the "MANUAL & OPTIONS" section for more info.
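The default-mode walk-through above can be condensed into a small script. This is a sketch only: genome_*.fa, the S1/S2 labels and the ./CLARK path are placeholders from the running example, so the classifier call itself is left commented out.

```shell
# Sketch of the default-mode workflow: targets definition, database
# directory, then the CLARK invocation (commented until the binary and
# the reference fasta files are actually in place).
mkdir -p ./DBD

# Targets definition: one reference file and its target ID per line.
cat > targets_addresses.txt <<'EOF'
./genome_1.fa S1
./genome_2.fa S1
./genome_3.fa S2
EOF

# Default mode, 31-mers, discarding discriminative k-mers seen only once:
# ./CLARK -k 31 -T ./targets_addresses.txt -D ./DBD/ \
#         -O ./objects.fa -R ./results -t 1

echo "targets defined: $(wc -l < targets_addresses.txt)"
```

On a first run this builds the TSK database in ./DBD/; subsequent runs with the same targets and k reuse it, as described above.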
1.2) Exclusion of k-mers of low frequency in the objects
Similarly, the user can also request to ignore k-mers in objects that have a frequency equal to or lower than a certain threshold (option "-o"). In version 1.*, this option is available ONLY for the spectrum mode (mode 3), see "MANUAL & OPTIONS".
2) Fast, Full and Spectrum Mode:
The previous instructions used the default mode of CLARK (i.e., "-m 1"). To use the "express" mode to get results more quickly, simply add the option "-m 2". For the "full" mode, which will generate detailed results (including confidence scores), add "-m 0". For example, using 20-mers, to run in express mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 2
In full mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 0
Note that
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 1
is equivalent to
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results
In the case where objects are given as a spectrum (a two-column file), the user should indicate it with the option "-m 3":
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 3
IMPORTANT NOTES: Whether you use CLARK, CLARK-l or CLARK-S:
- The full mode (i.e., "-m 0") loads all discriminative k-mers into RAM and provides a confidence score for all assignments. Thus, it offers high sensitivity. If your data are in the spectrum format, then the spectrum mode (-m 3) is the solution for you.
- If your primary concern is speed, and you do not need detailed results or confidence scores, then use the "express" mode (-m 2), the fastest mode.
For CLARK or CLARK-l:
- If you do not know which mode to use, then use the default mode (-m 1). It loads into RAM about half of the discriminative k-mers stored on disk. Thus, the RAM usage is lower than that of other modes, but its sensitivity is not the highest.
However, the precision and speed are high, and the default mode is a good compromise between speed, RAM usage and high precision/sensitivity. Finally: - If you analyze a metagenomic sample from a poorly known microbial habitat (i.e., the default RefSeq database is likely not to contain a large fraction of genomes of organisms present in your sample), for example, sea water, etc. then use CLARK-S with full mode. CLARK-S uses multiple spaced k-mers instead of contiguous k-mers, and it produces detailed results. However, this variant is the most RAM-consuming. Unlike CLARK, results produced by CLARK-S are already filtered (i.e., assignments with confidence score < 0.75 or gamma score < 0.06 are rejected). However, you can use a stricter filtering to get more precise results (see option of the script estimate_abundance.sh). 2.1) Running CLARK-S: The current release exploits spaced k-mers of length 31 and weight 22. Before classifying your metagenomic sample, the database of discriminative 31-mers (e.g., from bacterial genomes) and then the database of discriminative spaced 31-mers (using the script "buildSpacedDB.sh") must be created. Step 0 (Set the database and its directory "DBD"): $ ./set_targets.sh ./DBD/ bacteria --phylum (or $ ./set_targets.sh ./DBD/ bacteria --species) where "DBD" is the directory where you want to store bacteria genomes and create the database. Step 1 (Build the database of discriminative 31-mers): If the database files of 31-mers are already created for the database and the rank specified in step 0 then you can skip this step. This can be done by running any classification with k=31, for example: $ ./classify_metagenome.sh -O sample.fa -R result where sample.fa is some fasta file data. This operation is only needed to create the database of discriminative 31-mers. 
Step 2 (Create the databases of discriminative spaced k-mers):
$ ./buildSpacedDB.sh
This task will take several hours to create the discriminative spaced k-mers for all 3 spaced seeds (i.e., about 6 to 7 hours), and will write about 60 GB of data to disk.

Finally, to classify your metagenome (for example, sample.fq), run:
$ ./exe/CLARK-S -T ./targets_addresses.txt -D ./DBD/ -O ./sample.fq -R ./result -m 0
or
$ ./classify_metagenome.sh -O sample.fq -R result --spaced
Note: If you decide to work with a different taxonomy rank and/or database, then you will need to repeat steps 0 and 1.

3) Paired-end reads:

CLARK accepts paired-end reads (fastq/fasta files) as objects. Indicate it with the option "-P" and the two filenames (CLARK will concatenate paired reads with matching IDs). An example of a command line, using the default mode, is:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -P ./sample1.fastq ./sample2.fastq -R paired.results

4) Multi-Objects (or multiple sample files):

The program can process several sample files at the same time (i.e., there is no need to reload the database or restart the program for each sample). To classify several fasta/fastq files under the same settings (e.g., mode, threads, etc.), all of them can be passed at once. Say the user has 3 fasta files, located in the current working directory: ./objects_1.fa, ./objects_2.fa and ./objects_3.fa. To pass these three files at once, create a file (e.g., "objects.txt") containing the filenames of these fasta files:
$ cat objects.txt
./objects_1.fa
./objects_2.fa
./objects_3.fa
$
The objects.txt file is like a "meta-objects" file.
Then, create a meta-results file (called "results.txt") such that the i-th line in it indicates the filename to store the results of the i-th fasta file indicated in "objects.txt":
$ cat results.txt
./results_1
./results_2
./results_3
$
./results_1.csv will contain results related to ./objects_1.fa, ..., and ./results_3.csv will contain results related to ./objects_3.fa. Indeed, once the classification is done, results_1.csv, results_2.csv and results_3.csv are created. This way, the user can pass any number of objects files in a scalable fashion.

Note: the "results.txt" file can simply be a copy of "objects.txt". Since the program creates results files with the extension "csv", you can create these results files by using the same filename as the fastq/fasta file as a prefix. For example, if you run:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./objects.txt
the program will store the results files as ./objects_1.fa.csv, ./objects_2.fa.csv and ./objects_3.fa.csv.

The same applies to paired-end reads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -P ./paired1.txt ./paired2.txt -R ./results.txt
where the files paired1.txt and paired2.txt contain the addresses of the paired-end reads:
$ cat paired1.txt
./sample1_R1.fq
./sample2_R1.fq
./sample3_R1.fq
$ cat paired2.txt
./sample1_R2.fq
./sample2_R2.fq
./sample3_R2.fq
$ cat results.txt
./results_1
./results_2
./results_3
Once the computations are done, the results for the paired-end reads of each of the three samples are stored in results_1.csv, results_2.csv and results_3.csv.

5) Multi-threading:

CLARK can exploit parallel threads to speed up the computations. The option to add is "-n", followed by the number of threads. For example, using 12 threads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./results.txt -n 12

6) Chromosome arms and Centromeres

We understand that users may want to work with chromosome arms as targets. In this case, centromeres can be inferred by CLARK.
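The meta-objects/meta-results pairing described above can be generated mechanically. Below is a small shell sketch (the filenames are hypothetical and the files are created empty just for the demonstration; this is not part of CLARK):

```shell
# Create dummy sample files for illustration (hypothetical names).
mkdir -p clark_demo && cd clark_demo
touch objects_1.fa objects_2.fa objects_3.fa

# Build the "meta-objects" file: one sample filename per line.
ls ./objects_*.fa > objects.txt

# Build the matching "meta-results" file: line i names the results
# prefix for the i-th sample in objects.txt (CLARK appends ".csv").
sed 's/objects_\([0-9]*\)\.fa/results_\1/' objects.txt > results.txt

cat objects.txt   # one sample path per line
cat results.txt   # one matching results prefix per line
```

Both files would then be passed as `-O objects.txt -R results.txt`, exactly as in the commands above.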
To allow that, the user must indicate in the targets_addresses.txt file the sequences that can be related to centromeres. For example, in the case of barley, consider 6 chromosome arms defined by 6 files (./chr_arm1L.fa, ./chr_arm1S.fa, ..., ./chr_arm3L.fa, ./chr_arm3S.fa). If the user does not want to define centromeres, then targets_addresses.txt would look like:
$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL
./chr_arm2S.fa 2HS
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$
If the user wants the program to infer the centromere of 2H only, then he/she will indicate its centromere name (2HC) on the line of each target related to chromosome 2H (so ./chr_arm2L.fa and ./chr_arm2S.fa):
$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL 2HC
./chr_arm2S.fa 2HS 2HC
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$
If all centromeres must be inferred, then targets_addresses.txt would look like:
$ cat targets_addresses.txt
./chr_arm1L.fa 1HL 1HC
./chr_arm1S.fa 1HS 1HC
./chr_arm2L.fa 2HL 2HC
./chr_arm2S.fa 2HS 2HC
./chr_arm3L.fa 3HL 3HC
./chr_arm3S.fa 3HS 3HC
$
When defining chromosome arm labels, we strongly recommend that the user follow the naming convention used in this example: for the N-th chromosome, "NS" is the label for the short arm, "NL" is the label for the long arm, and (with the same prefix) "NC" is the label for the centromere.

IMPORTANT NOTE: The example command lines above using fasta files are valid for fastq files as well, except in the paired-end reads case.

== CLASSIFICATION OF METAGENOMIC SAMPLES

We provide several scripts to facilitate classification in the context of metagenomics. CLARK can preprocess databases of bacteria, viruses, plasmid, plastid, protozoa, fungi or human (downloaded from NCBI/RefSeq), or a customized set of genomes. First, we present two scripts, "set_targets.sh" and "classify_metagenome.sh", that work together.
Second, we present the script to get the abundance estimation (count and proportion for each target identified), "estimate_abundance.sh", from one or several results file(s).

1) Setting and classification

1.1) Step I: Setting targets

After the installation (./install.sh), the user must create a directory to store all reference sequences (bacteria, viruses, plasmid, plastid, protozoa, fungi, human and custom). For all our examples below, we refer to this directory path generically as <DIR_DB> for clarity. This directory can be anywhere on your disk(s). Then, the user must indicate which database(s) to consider for the classification among bacteria, viruses, ..., fungi, human and/or custom. For example, only bacteria genomes:
$ ./set_targets.sh <DIR_DB> bacteria
To work with bacteria, viruses and human:
$ ./set_targets.sh <DIR_DB> bacteria viruses human
To work with bacteria, viruses, fungi and human:
$ ./set_targets.sh <DIR_DB> bacteria viruses fungi human
To classify against a custom database, the user will need to copy/move the sequences (fasta files with accession numbers in the header, i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ...", and one fasta file per reference sequence) into the directory "Custom" inside <DIR_DB>, and then run set_targets.sh. In other words, the user must:
(1) create the directory "Custom" inside <DIR_DB> (if it does not exist yet);
(2) copy/move the sequences of interest into Custom;
(3) run:
$ ./set_targets.sh <DIR_DB> custom
or, for example, when combining the bacteria genomes with the Custom sequences:
$ ./set_targets.sh <DIR_DB> bacteria custom
In the general case (when the user selects bacteria, viruses, ..., fungi and/or human), if the directory is empty, then the script will download all the selected database(s), along with the taxonomy tree data, from NCBI. Once the sequences are found or downloaded in the directory, the script builds the targets for a given taxonomy rank. The default taxonomy rank is species.
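Since set_targets.sh only uses fasta files whose headers carry an accession number (files without one are skipped), a quick pre-check of a Custom directory can be sketched as follows. The directory name, file and sequence below are mock values for illustration only:

```shell
# Stage a mock custom reference (hypothetical path and contents).
mkdir -p DBD_demo/Custom
cat > DBD_demo/Custom/NZ_DEMO01.fna <<'EOF'
>NZ_DEMO01.1 Demo organism, mock contig
ACGTACGTACGT
EOF

# Report any file whose first header does NOT look like ">accession.number ...";
# such files would be ignored by set_targets.sh.
for f in DBD_demo/Custom/*.fna; do
    head -1 "$f" | grep -Eq '^>[A-Za-z0-9_]+\.[0-9]+' || echo "no accession: $f"
done
```

Running the loop before set_targets.sh makes it easy to spot references that would otherwise be silently excluded from the targets definition.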
To use a different taxonomy rank, for example genus, the command line is (continuing the example selecting bacteria, viruses and human):
$ ./set_targets.sh <DIR_DB> bacteria viruses human --genus
In the current release, the user can choose between six ranks (species to phylum): --species (the default value), --genus, --family, --order, --class or --phylum. Indeed, the strength of CLARK is its ability to classify metagenomic samples quickly and accurately using only one taxonomy rank. So, as a general rule when classifying metagenomic reads: consider first the genus or species rank; then, if a high proportion of reads cannot be classified, reset your targets definition at a higher taxonomy rank (e.g., family or phylum).

Once set_targets.sh has finished, the user can proceed to step II. However, if the user wants to modify the selected databases and/or taxonomy rank, then he/she will need to run set_targets.sh again with updated parameters before proceeding to step II.

1.2) Step II: Running the classification

The script to run the classification of a metagenomic sample against the database(s) previously set in step I is "classify_metagenome.sh". For your convenience, this script runs the executable CLARK, CLARK-l or CLARK-S and allows you to pass a few parameters. For example, say the objects to be classified are reads in "sample.fa" (e.g., located in the current directory), and the results are to be stored in "result.csv". A basic command line is:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result
As explained in the section "MANUAL & OPTIONS", thanks to the identifiers "-O" and "-R", the script will pass the objects file "sample.fa" and the results will be stored in "./result.csv". Objects are classified against the targets and the taxonomy rank defined by the last execution of ./set_targets.sh.

IMPORTANT NOTES:
- The targets definition, computed by set_targets.sh, is automatically passed to CLARK in step II.
- The script set_targets.sh assumes that each reference file from the bacteria, ..., fungi or custom database contains an accession number (in the RefSeq records format, i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ..."). If an accession number is missing in a file, then this file will not be used for the classification.
- set_targets.sh also maps the accession number found in each reference sequence to its taxonomy ID based on the latest NCBI taxonomy data. If a mapping cannot be made for a given sequence, then it will NOT be counted and will be excluded from the targets definition. The total number of excluded files is printed to the standard output, and all files that have been excluded are reported in the file "files_excluded.txt" (located in the specified database directory, i.e., "./DBD/"). If some files are excluded, then it probably means that they have been removed from RefSeq, for curation for example (visit the RefSeq FAQ webpage), or that your local taxonomy information is no longer up-to-date (see next point).
- You can update your local taxonomy database with the script "updateTaxonomy.sh". You can use this script before running set_targets.sh.
- If the user wants to work with a different customized database (for example, by removing or adding sequences of interest in the Custom folder), then the targets definition must be reset. We made this simple with the script "resetCustomDB.sh": after the sequences in the Custom folder have been updated, just run:
$ ./resetCustomDB.sh
Then, run set_targets.sh with the desired settings.
- The database files (*.ky, *.lb and *.sz) will be created by classify_metagenome.sh inside a subdirectory of the database directory specified in step I (i.e., "./DBD/").
- The default values (the k-mer length, the mode, the number of threads, etc.) are used if not specified by the user, just as indicated in the previous section.
- The script classify_metagenome.sh also allows you to pass customized parameters and options, similarly to the previous section; classify_metagenome.sh follows the options defined in "MANUAL & OPTIONS" (see some examples below). So you can change the k-mer length, the number of parallel threads, the mode, etc. We present below some examples of customized classification using classify_metagenome.sh.
a) To use 20-mers (instead of 31-mers):
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -k 20
b) To request the full mode:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0
c) To classify multiple sample files (single-end reads) in full mode:
$ ./classify_metagenome.sh -O ./samples.txt -R ./samples.txt -m 0
where the file "samples.txt" contains the addresses of all the sample files to be run:
$ cat samples.txt
./sample1.fa
./sample2.fa
./sample3.fa
./sample4.fastq
./sample5.fq
./sample6.fq
...
d) To classify multiple sample files (paired-end reads) in full mode:
$ ./classify_metagenome.sh -O ./samples.R.txt ./samples.L.txt -R ./samples.R.txt -m 0
where the files "samples.R.txt" and "samples.L.txt" contain the addresses of all the fastq files (right and left) to be run:
$ cat samples.R.txt
./sample1.R1.fq
./sample2.R1.fq
./sample3.R1.fq
...
$ cat samples.L.txt
./sample1.R2.fq
./sample2.R2.fq
./sample3.R2.fq
...
Observe in this example that the order of the right and left paired-end reads of each sample must be preserved in "samples.R.txt" and "samples.L.txt".
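Because line i of the right and left list files must refer to the same sample, it is worth building and sanity-checking the two lists programmatically. A shell sketch (the filenames are hypothetical and the files are created empty just for the demonstration):

```shell
# Create dummy paired fastq files for illustration (hypothetical names).
mkdir -p paired_demo && cd paired_demo
touch sample1.R1.fq sample2.R1.fq sample3.R1.fq
touch sample1.R2.fq sample2.R2.fq sample3.R2.fq

# Globs expand in sorted order, so building both lists from the same
# sample numbering keeps line i of both files on the same sample.
ls ./sample*.R1.fq > samples.R.txt
ls ./sample*.R2.fq > samples.L.txt

# Sanity check: the sample prefixes on each line must match.
paste samples.R.txt samples.L.txt | awk '
    { r=$1; l=$2; sub(/\.R1\.fq$/,"",r); sub(/\.R2\.fq$/,"",l);
      if (r != l) { print "mismatch on line " NR; exit 1 } }'
```

If the check prints nothing, the two lists are aligned and safe to pass to classify_metagenome.sh.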
e) To request the express mode, with 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 2 -n 8
f) To request the full mode, with a gzipped objects file, using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 0 -n 8 --gzipped
Another example, in default mode, classifying paired-end reads (./sample1.fastq and ./sample2.fastq):
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result

Notes:
This script can run CLARK-l instead of CLARK, for workstations with limited RAM. The user can indicate it with the option "--light". For example:
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --light
This script can run CLARK-S instead of CLARK, if the database files of discriminative spaced k-mers have been built. To use CLARK-S, the option is "--spaced". For example:
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --spaced
g) To run CLARK-S in full mode, using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0 -n 8 --spaced
h) To run CLARK-S in express mode, using 8 threads, on a gzipped file:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 2 -n 8 --gzipped --spaced
If you want to run CLARK-S with a much lower RAM usage, you can load only half of the discriminative spaced k-mers in memory using "-s 2". For example:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result --spaced -s 2
Note: run "./classify_metagenome.sh" with no arguments to print the help/usage describing options and parameters.

2) Abundance estimation

The script "estimate_abundance.sh" can analyze the CLARK results of a metagenomic sample and provide, for each target identified, its scientific name, taxonomy ID, lineage (superkingdom, phylum, class, order, family, genus), its count and the proportion of objects assigned to it.
This script also allows applying filtering conditions (based on the confidence score or gamma score of assignments) to obtain a stricter estimation. The output format of estimate_abundance.sh is CSV.

For example, say a metagenomic sample contains 100 reads, and the results by CLARK (full mode) indicate that 20 reads (20%) are assigned to target T1, 70 (70%) are assigned to T2, and 10 (10%) are not assigned ("NA"). Then the abundance estimation program reports a count of 20 (or 22.2% of the assigned reads) for T1, 70 (or 77.8%) for T2, and a count of 10 for the category "UNKNOWN" (collecting all unassigned reads, as well as assignments that do not satisfy the filtering conditions).

We give examples of command lines below. The parameters and options of this script are:
estimate_abundance.sh -c <minConfidenceScore> -g <minGamma> -D <DIR_DB> -F <result1>.csv <result2>.csv ... <resultN>.csv -a <minAbundance>

Definition of parameters:
-c <minConfidenceScore> : filter assignments based on their confidence score (if available), using the threshold value minConfidenceScore (a value between 0.5 and 1.0). The abundance estimation for each target will count only assignments with a confidence score higher than minConfidenceScore. Assignments with a lower confidence score will be marked as unclassified and counted in the category UNKNOWN in the abundance estimation report. The default value is 0.5.
-g <minGamma> : filter assignments based on their gamma score (if available), using the threshold value minGamma (a value between 0 and 1.0). The abundance estimation for each target will count only assignments with a gamma score higher than minGamma. Assignments with a lower gamma score will be marked as unclassified and counted in the category UNKNOWN in the abundance estimation report. The default value is 0.
-D <DIR_DB> : the directory path of the database (the same one you indicated when calling set_targets.sh). This parameter is required, to load the scientific names of all targets, ONLY if you pass results of a metagenomic sample.
-F <result1>.csv ... <resultN>.csv : results file or list of results files produced by CLARK. Important note: you can pass a results file produced by any mode of execution of CLARK (full, express, spectrum, default), but if you pass several files, make sure they have all been produced under the same mode. For example, if you pass two files result1.csv and result2.csv, then both should have been produced by the same mode (e.g., both by full mode).
-a <minAbundance> : report only abundance estimations that are higher than a certain threshold, minAbundance. minAbundance is a percentage value (between 0 and 100).
--highconfidence : count only high-confidence assignments for the abundance estimation. This option is equivalent to "-c 0.75 -g 0.03".
--krona : export results in a 3-column file for the Krona tool (hierarchical data browser; Ondov et al., BMC Bioinformatics, 2011, doi: 10.1186/1471-2105-12-385, https://github.com/marbl/Krona/wiki). With this option on, the program will create the file 'results.krn' in the working directory, containing the unnormalized abundances from CLARK's assignments (with the filtering options, if any). Then, to create the Krona diagram, call the executable ktImportTaxonomy (from your Krona directory) with the option '-m 3'. For example:
$ ktImportTaxonomy -o results.html -m 3 results.krn
--mpa : export results in a tab-delimited (two-column) file, similar to MetaPhlAn's output. When this option is specified, this MetaPhlAn-like report is stored in the file "results.mpa" in the working directory.

Examples:
i) To get the abundance estimation for the results file ./result.csv:
$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB>
where <DIR_DB> is the directory of the database you have set with set_targets.sh.
To store its output in a file (e.g., "abundance.csv"), you can do so by:
$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB> > abundance.csv
ii) To keep only high-confidence assignments for the abundance estimation (low-confidence assignments will then be reported in the UNKNOWN category):
$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB> --highconfidence
iii) To filter assignments using a certain confidence score threshold (for example, 0.8):
$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB> -c 0.80
To filter assignments using a gamma score threshold (for example, 0.03):
$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB> -g 0.03
To output the results of the previous command in the mpa format:
$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB> -g 0.03 --mpa
Note: filtering on the confidence score and/or gamma score of a results file is possible only if the file was produced using the full or spectrum mode. Indeed, the default and express modes do not produce confidence scores or gamma scores.
IMPORTANT NOTE: From v1.2.3 on, results of CLARK-S are automatically filtered. Assignments with a low confidence score (i.e., < 0.75) or a low gamma value (i.e., < 0.06) are considered low-confidence assignments and thus marked as unknown ("NA"). In other words, only high-confidence assignments are kept, to guarantee both high sensitivity and high precision.
iv) If your results files are large (i.e., contain a lot of targets), then the output of the abundance estimation program can be large too. You can filter abundances by setting a minimum abundance percentage. For example, to print out only abundances higher than 2%:
$ ./estimate_abundance.sh -F ./result.csv -D <DIR_DB> -a 2
Finally, you can pass several results files at the same time:
$ ./estimate_abundance.sh -F ./result1.csv ./result2.csv ... -D <DIR_DB>
In that case, to ensure consistency, the results files SHOULD all have been produced by the same mode (see the parameters and options above). Results from different modes of execution SHOULD NOT be mixed.
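The worked example above (20 reads assigned to T1, 70 to T2, 10 unassigned) can be reproduced with a short awk sketch. This only mirrors the proportion arithmetic over assigned reads; it is not estimate_abundance.sh, and the per-read assignment list is mocked up for illustration:

```shell
# Mock per-read assignments: 20 reads -> T1, 70 -> T2, 10 unassigned (NA).
# Proportions are computed over ASSIGNED reads only (90 here), while the
# NA reads are collected into the UNKNOWN category, as described above.
{ yes T1 | head -20; yes T2 | head -70; yes NA | head -10; } |
awk '{ n[$1]++ }
     END {
       assigned = n["T1"] + n["T2"]
       printf "T1 %d %.1f%%\n", n["T1"], 100 * n["T1"] / assigned
       printf "T2 %d %.1f%%\n", n["T2"], 100 * n["T2"] / assigned
       printf "UNKNOWN %d\n", n["NA"]
     }' > abundance_demo.txt
cat abundance_demo.txt
```

This prints T1 at 22.2% and T2 at 77.8% of the 90 assigned reads, matching the example.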
We also remind the user that another script manipulating results files is available: "evaluate_density_confidence.sh" (resp. "evaluate_density_gamma.sh"). This script estimates and plots the density of assignments per confidence score (resp. gamma score). It takes as input results file(s) produced in the full/spectrum mode only.

== RESULTS FORMAT

Results are in CSV format. Confidence scores and other statistics about the assignments are computed only in the full/spectrum mode. In the full mode (extended), the results format is the following for each line:
<Object_ID>, <hit count in target 1>, ..., <hit count in target N>, <length of object>, <gamma>, <1st_assignment>, <hit count of 1st>, <2nd_assignment>, <hit count of 2nd>, <confidence score>
where:
* the "Object_ID" is the tag name indicated in the header (after ">" or "@") of each object, and N is the number of targets. In the spectrum mode, this tag name is merely the filename.
* the hit count in target i is the number of k-mers specific to target i that occur in the object (reported only if the option "--extended" is on).
* the length of the object is the number of bases (A, C, G, T, U or N) the object has.
* gamma is the ratio between the total number of hits found in the object (against all targets) and the number of k-mers in the object.
* the first assignment is the target ID of the target that obtained the highest hit count (ties are broken arbitrarily: the 1st assignment is the target having the first hit).
* the hit count of the first is the hit count of the first assignment (h1).
* the second assignment is the target ID of the target that obtained the second highest hit count (ties are broken arbitrarily).
* the hit count of the second is the hit count of the second assignment (h2).
* the confidence score is the ratio h1/(h1+h2).
In the default or express mode, the results format is the following for each line:
<Object_ID>, <length of object>, <1st_assignment>
where:
* the "Object_ID" is the tag name indicated in the header (after ">" or "@") of each object.
* the length of the object is the number of bases the object has.
* the 1st assignment is the target ID of the target that obtained the highest hit count (ties are broken arbitrarily: the 1st assignment is the target having the first hit).

== VERSIONS
Version 1.2.6.1, May 11, 2019.
Version 1.2.6, February 21, 2019.
Version 1.2.5.1, September 20, 2018.
Version 1.2.5, February 13, 2018.
Version 1.2.4, November 25, 2017.
Version 1.2.3.2, August 21, 2017.
Version 1.2.3.1, February 27, 2017.
Version 1.2.3, April 7, 2016.
Version 1.2.2-b, December 11, 2015.
Version 1.2.1-b, June 15, 2015.
Version 1.1.3, June 3, 2015.
Version 1.1.2, April 22, 2015.
Version 1.1.1, April 15, 2015.
Version 1.1, February 20, 2015.
Version 1.0, September 01, 2014.

== CONTACT/INFO
For any feedback, suggestion or question, please feel free to contact Rachid Ounit (clark.ucr.help at gmail.com).
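As a quick check of the full-mode results format described in the README, the confidence score h1/(h1+h2) can be recomputed from the hit-count columns. The sketch below uses a mock results file with N=2 targets; the header names are hypothetical and the column positions simply follow the layout above:

```shell
# Mock full-mode results file with N=2 targets ("--extended" hit counts):
# Object_ID, hits_T1, hits_T2, length, gamma, 1st, h1, 2nd, h2, confidence
cat > mock_full.csv <<'EOF'
Object_ID,T1,T2,Length,Gamma,1st,h1,2nd,h2,Confidence
read_001,55,15,101,0.85,T1,55,T2,15,0.7857
EOF

# Recompute the confidence score h1/(h1+h2) from columns 7 and 9 and
# compare it with the reported value in column 10.
awk -F',' 'NR>1 { printf "%s recomputed=%.4f reported=%s\n", $1, $7/($7+$9), $10 }' mock_full.csv
# -> read_001 recomputed=0.7857 reported=0.7857
```

The same recomputation works on real full-mode output, as long as the column offsets are adjusted for the actual number of targets N.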
Environment variables set: $CLARK_TEST_DATA (sample data used in the sessions below)
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=40g --cpus-per-task=4 --gres=lscratch:50
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load clark
Copy a file that links sequences to known classifications. In this case the known sequences are 10 randomly selected bacterial genomes.
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ cp $CLARK_TEST_DATA/simple_species .
[user@cn3144]$ head -2 simple_species
/usr/local/apps/clark/TEST_DATA/simple_targets/GCF_001534785.1_ASM153478v1_genomic.fna B_subtilis_1423
/usr/local/apps/clark/TEST_DATA/simple_targets/GCF_001831455.1_ASM183145v1_genomic.fna B_pertussis_520
Next, copy a set of randomly generated 101 nt sequences drawn from 5 of these genomes:
[user@cn3144]$ cp $CLARK_TEST_DATA/simple_reads.fa .
[user@cn3144]$ head -2 simple_reads.fa
>B_subtilis_1423|NZ_CP013984.1:1531267-1531367
ACATATGCACCAGTATACACAACGATATAGGAATATATTAAAATCCCTATCATGTGAAATTGCATTGTTTTGCCGATTTGAAGCGGCTTAACCAGTTTTCT
Now classify the reads using the default method
[user@cn3144]$ mkdir -p dbd
[user@cn3144]$ # run CLARK using 31-mers in default mode
[user@cn3144]$ CLARK -k 31 -T simple_species -D ./dbd -O simple_reads.fa -R simple
The program did not find the database files for the provided settings and reference sequences. The program will build them.
Starting the creation of the database of targets specific 31-mers from input files...
Progress report: (10/10)
37671705 nt read in total.
Mother Hashtable successfully built. 36326866 31-mers stored.
Hashtable sorting done: maximum number of collisions: 4
Removal of common k-mers done: 36249376 specific 31-mers found.
Creating database in disk...
36249376 31-mers successfully stored in database.
Loading database [./dbd/db_central_k31_t10_s1610612741_m0.tsk.*] ...
Loading done (database size: 1828 MB read, with sampling factor 2)
Mode: Default, Processing file: simple_reads.fa, using 1 CPU.
 - Assignment time: 0.00893 s. Speed: 6718924 objects/min. (1000 objects).
 - Results stored in simple.csv
[user@cn3144]$ # quick % correct calculation
[user@cn3144]$ tr '|' ',' < simple.csv \
    | awk -F',' '$1==$4{c++}END{printf("correct id: %.1f%%\n", c / (NR-1) * 100)}'
correct id: 100.0%
[user@cn3144]$ # run CLARK using 20-mers in default mode
[user@cn3144]$ CLARK -k 20 -T simple_species -D ./dbd -O simple_reads.fa -R simple2
[...snip...]
Using express and full mode:
[user@cn3144]$ # run CLARK using 20-mers in express mode
[user@cn3144]$ CLARK -k 20 -T simple_species -D ./dbd -O simple_reads.fa -R simple3 -m 2
CLARK version 1.2.3 (UCR CSE. Copyright 2013-2016 Rachid Ounit, rouni001@cs.ucr.edu)
Loading database [./dbd/db_central_k20_t10_s1610612741_m0.tsk.*] ...
Loading done (database size: 1754 MB read, with sampling factor 1)
Mode: Express, Processing file: simple_reads.fa, using 1 CPU.
 - Assignment time: 0.001295 s. Speed: 46332046 objects/min. (1000 objects).
 - Results stored in simple3.csv
[user@cn3144]$ # run CLARK using 20-mers in full mode with n threads
[user@cn3144]$ CLARK -k 20 -T simple_species -D ./dbd -O simple_reads.fa \
    -n $SLURM_CPUS_PER_TASK -R simple4 -m 0
CLARK version 1.2.3 (UCR CSE. Copyright 2013-2016 Rachid Ounit, rouni001@cs.ucr.edu)
Loading database [./dbd/db_central_k20_t10_s1610612741_m0.tsk.*] ...
Loading done (database size: 1754 MB read, with sampling factor 1)
Mode: Full, Processing file: simple_reads.fa, using 4 CPU.
 - Assignment time: 0.043682 s. Speed: 1373563 objects/min. (1000 objects).
 - Results stored in simple4.csv
In the following example we'll use more reads, simulated from a slightly more complex set of genomes (21 genomes from 5 genera). We'll also use some of CLARK's helper scripts.
[user@cn3144]$ mkdir -p dbd2/Custom
[user@cn3144]$ cp $CLARK_TEST_DATA/complex_targets/*.fna dbd2/Custom
[user@cn3144]$ cp $CLARK_TEST_DATA/complex_reads.fa .
[user@cn3144]$ # build a custom database. Replace 'custom' with 'bacteria'
[user@cn3144]$ # to build a database covering bacteria. Also available: 'viruses'
[user@cn3144]$ # and 'human'. More than one can be given.
[user@cn3144]$ set_targets.sh dbd2 custom --species
[user@cn3144]$ # run classification
[user@cn3144]$ classify_metagenome.sh -O complex_reads.fa -R complex \
    -n $SLURM_CPUS_PER_TASK
[user@cn3144]$ # estimate abundance and create the file results.krn for krona
[user@cn3144]$ estimate_abundance.sh -F complex.csv -D dbd2 --krona
[...snip...]
[user@cn3144]$ # load the krona module and create a krona visualization
[user@cn3144]$ module load kronatools
[user@cn3144]$ ktImportTaxonomy -i -m 3 -o complex.html results.krn
The last line creates a hierarchical visualization in a self-contained html file.
Create a batch input file (e.g. clark.sh). For example:
#!/bin/bash
module load clark/1.2.6.1 || exit 1
mkdir -p clark_db
# run CLARK using 31-mers in default mode
CLARK -k 31 -T targets.species -D ./clark_db -O reads.fa -R results
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=40g clark.sh
Create a swarmfile (e.g. clark.swarm). Assuming that a database has been created with set_targets.sh already:
classify_metagenome.sh -O sample1.fa -R sample1 -n $SLURM_CPUS_PER_TASK
classify_metagenome.sh -O sample2.fa -R sample2 -n $SLURM_CPUS_PER_TASK
classify_metagenome.sh -O sample3.fa -R sample3 -n $SLURM_CPUS_PER_TASK
Submit this job using the swarm command.
swarm -f clark.swarm -g 40 -t 4 --module clark/1.2.6.1
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module clark | Loads the clark module for each subjob in the swarm |