CLARK uses discriminative k-mers to classify sequences. The classification is done by first analyzing a set of sequences with known classes and finding k-mers that allow discrimination between the know classes. The resulting database can then be used to classify new sequences. This can be used to, for example, classify short reads taxonomically at different levels (genus, species, ...) or to assign BACs to chromosome arms.

Three different algorithms are included in this package: (1) CLARK - the original algorithm. Can be quite memory hungry for large databases and kmer sizes. (2) CLARK-l - requires less memory at the cost of slightly reduced power. (3) CLARK-S - uses spaced kmers. Requires the most RAM and has higher sensitivity.

In addition to the basic executables, a number of convenience scripts for carrying out common analyses are provided as well. See the README below as well as the CLARK documentation.

The Clark Readme is included here since it contains a lot of useful information that is not immediately available on the clark home page.

Readme (1.2.6.1)

== AUTHORS

Rachid Ounit (1), Steve Wanamaker(2), Timothy J Close (2) and
Stefano Lonardi (1).
1: Computer Science and Engineering department, University of California,
Riverside CA 92521
2: Botany and Plant Sciences department, University of California,
Riverside CA 92521

== ABOUT

The problem of DNA sequence classification is central to several application
domains in molecular biology, genomics, metagenomics and genetics. The problem
is computationally challenging due to the size of datasets generated by modern
sequencing instruments and the growing size of reference sequence databases.
We present a novel method, called CLARK (CLAssifier based on Reduced K-mers),
for supervised sequence classification based on discriminative k-mers. Somewhat
unique among other metagenomic and genomic classification methods, CLARK provides
a confidence score for its assignments which can be used in downstream analysis.
We demonstrated the utility of CLARK on two distinct specific classification
problems, namely (1) the assignment of metagenomic reads to known bacterial
genomes, and (2) the assignment of BAC clones and transcript to chromosome arms
(in the absence of a finished assembly for the reference genome).

Three classifiers or variants in the CLARK framework are provided :
CLARK (default): created for powerful workstation, it can require a significant
amount of RAM to run with large database (e.g., all bacterial genomes from
NCBI/RefSeq). This classifier queries k-mers with exact matching;

CLARK-l : created for workstations with limited memory (i.e., "l" for light),
this software tool provides precise classification on small metagenomes. Indeed,
for metagenomics analysis, CLARK-l works with a sparse or ''light'' database
(up to 4 GB of RAM) that is built using distant and non-overlapping k-mers.
This classifier queries k-mers with exact matching;

CLARK-S : created for powerful workstation exploiting spaced k-mers
(i.e., "S" for spaced), this classifier requires a higher RAM usage than
CLARK or CLARK-l, but it does offer a higher sensitivity.
CLARK-S completes the CLARK series of classifiers.

== GENERAL NOTES

The performance of CLARK was introduced, as a poster presentation, at the 5th
ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
(ACM-BCB), September 20-23, 2014 at Newport Beach, CA.

CLARK-S was presented at the 14th Workshop on Algorithms in Bioinformatics, Atlanta,
GA, September 2015.

== LICENCE

The source code of CLARK is distributed under the GNU GPL License. CLARK and its
variants (such as CLARK-l) are a free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.
CLARK and its variants are distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details at http://www.gnu.org/licenses/
Copyright 2013-2017, Rachid Ounit (clark.ucr.help at gmail.com).

== PUBLICATIONS

The classifiers in the CLARK framework are part of the following peer-reviewed
publications. Please cite these two papers to give full credit:

Ounit, R., Wanamaker, S., Close, T. J., & Lonardi, S. (2015). CLARK: fast and
accurate classification of metagenomic and genomic sequences using discriminative
k-mers. BMC genomics, 16(1), 236.

Ounit, R., & Lonardi, S. (2016). Higher classification sensitivity of short
metagenomic reads with CLARK-S. Bioinformatics, 32(24), 3823-3825.

== RELEASE

On 09/01/2014, the version 1.0 is available:
This version of CLARK is written in C++ and shell script. It is designed for
64-bit OS, and can be executed on the latest Linux and Mac systems.
On 02/20/2015, the version 1.1 is available:
Compared to v1.0, this version is more efficient to store the database in disk
(some non-necessary data were pruned, and thus a relative reduction of about 15%
can be observed for the disk space needed). The user can also observe the program
loads the database in RAM much faster than before. This version also allows to load
the database in multithreaded-task. Finally, the v1.1 uses different settings for
the sampling method used for the default mode, which leads to an increase of the
classification speed of about 5%.
On 04/15/2015, the version 1.1.1 is available:
Compared to v1.1, this version is more efficient to load the database from disk to
RAM. The relative reduction is about 16% in average. This improvement also implies
an increase of the classification speed. This release also includes scripts to
facilitate the classification of metagenomic samples (see section "CLASSIFICATION OF
METAGENOMIC SAMPLES").
On 04/22/2015, the version 1.1.2 is available:
This release includes scripts to produce the abundance estimation per target (i.e.,
the count and proportion of objects assigned to a target) with filtering possible
on the confidence score. Also, the user can get the density of assigned objects per
confidence score. The user can also pass to CLARK objects as compressed
files (GZ format) directly to "classify_metagenome.sh". Finally, bugs fixes and
code improvement are made.
On 06/03/2015, the version 1.1.3 is available:
This release extends existing features, simplifies output produced by the full mode
(to reduce the disk spaced needed for the results file). Bugs fixes and
code improvement are included.
On 06/15/2015, the version 1.2.1-beta is available:
This release contains the discriminative spaced k-mers (using multiple spaced-seed).
This version is still in the beta version while features and code improvements on
the RAM usage are on-going.
Bug fixes and code improvement are included.
On 12/11/2015, the version 1.2.2-beta is available:
The speed and RAM usage of the full and spaced mode are significantly improved:
i) the RAM usage is now scalable and the upper-bound is lower than the maximum
usage needed in previous versions;
ii) the speed is increased by a factor of 5 to 9, in singlethreaded tasks.
To analyze results, a script to plot the distribution of the assignments per gamma
score (as done with confidence score) is provided.
In addition, the option to create Krona-compatible input files
("http://www.biomedcentral.com/1471-2105/12/385") is available.
On 04/07/2016, the version 1.2.3 is available:
The classifier CLARK-S is finalized. Scripts to generate the targets definition
using the accession number instead of the GI number have been updated.
Additional scripts have been added to facilitate the creation and changes
of the customized databases. Code improvements and bug fixes are included.
On 04/20/2017, the version 1.2.3.1 is available:
A new script to summarize results from several reports files is provided.
Copyright and email info are updated.
On 08/21/2017, the version 1.2.3.2 is available:
The script to summarize results (per report and per taxon identified) across
several reports files has been improved/extended. Paired-end reads in Fasta
files can be passed (not only in fastq format anymore).
Code improvements and bug fixes are included.
On 11/25/2017, the version 1.2.4 is available:
In addition to the bacteria/archaea, viruses and human, a database of fungi is
made available. A list of 29 reference genomes from RefSeq corresponding to
known hospital-acquired infections and other nosocomial diseases represents the
first basis of this new database. Code improvements and bug fixes are included.
On 02/13/2018, the version 1.2.5 is available:
In addition to the bacteria/archaea, viruses, fungi and human database, we
extended the fungi database and added the plasmid, plastid and protozoa databases
(from RefSeq complete genomes). Copyright update, code improvements and bug fixes
are included.
On 09/20/2018, the version 1.2.5.1 is available:
A new script is added to print the distibution of the target-specific k-mers.
An option is added to allow processing of very long reads for the full mode and
for CLARK-S. Code improvements and bug fixes are included.
On 02/20/2019, the version 1.2.6 is available:
A new script is added to extract sequences identified to a specific taxon. The
script "estimate_abundance.sh" allows now to output the results in the mpa format
(tab-delimited format from MetaPhlAn).Code improvements and bug fixes are included.

This versatile method allows the classification of objects sequences against a
database of targets. Sequences of targets must be given by standard fasta and/or
fastq files, or eventually text file (two-column format, where each line is
, for example in the case of the file "target.txt" (using 6-mers):

$ cat target.txt
ATATAT 1234
GCGCGC 435
ATTTTT 234
...
)

== COMPATIBILITY BETWEEN VERSIONS

Between v1.* and v1.0:
The version 1.1 (or higher) does use a different algorithm when storing and loading
the database, compared to v1.0. Therefore, databases files (TSK files) produced by
the version 1.0 are not compatible with v1.1. This is why it is needed to rebuild
the database using v1.1 (or newer version). When using v1.1 (or newer), three files
are created for the database.
Discriminative k-mers and their targets ID are stored respectively in the files of
extension *.ky and *.lb. Sizes of all buckets of the hashtable are stored in the
file of extension *.sz.

== SOFTWARE & SYSTEM REQUIREMENTS

1) EXTERNAL TOOLS & C++ COMPILER VERSION
unlike most of other metagenomic and genomic classifiers, CLARK does not require
Uny tool from the BLAST family, Jellyfish or other external tool. The main
requirement is a 64-bit operating system (Linux or Mac), and the GNU GCC to
compile version 4.4 or higher. Multithreading operations are assured by the openmp
libraries. If these libraries are not installed, CLARK will run in single-threaded
task. CLARK is expected to successfully run on recent Mac and Linux OS.

2) MEMORY
CLARK can be RAM consuming, especially when loading large database(s). Latest versions
since v1.1.1 require about 58 GB in RAM to load the database in default mode (built
from bacteria genomes, default NCBI/RefSeq) and about 156 GB to build that database,
using 31-mers. Using 20-mers instead, it requires 49 GB to load the database
(and 107 GB to build the database). About CLARK-S, the memory usage during the
classification is the highest (~101 GB) because the program loads in memory several
sets of spaced k-mers (from several spaced seed).

For the space required in disk, CLARK needs about 36 GB to store the database built
for bacteria (at the genus level).

If memory is a concern to you and you have small metagenomes then please consider
using the RAM-light version, CLARK-l. This tool is designed to run on 4 GB RAM
laptop, thus the memory consumption is the main objective of this tool.
CLARK-l can load data from bacteria and viruses databases within 4 GB.
If it can't then the user should change the parameter value for the gap
(see "-g " option, in "MANUAL & OPTIONS") and set the iteration to a
value higher than the default, 4.
IMPORTANT NOTE: Results given by CLARK-l should be seen as a draft (or first order
approximation) of results you would get by running CLARK or CLARK-S.

== INSTALLATION

First, download the zipped package of the latest version (v1.2.5.1), available from
the CLARK webpage ("Download tab"), http://clark.cs.ucr.edu.
Second, uncompress the tar.gz file ("tar -xvf CLARKV1.2.5.1.tar.gz"), then go to
the sub-directory "CLARKSCV1.2.5.1" and execute the installation script ("./install.sh").

The installation is done! You can now run CLARK and any of the provided scripts.

The installer builds binaries (CLARK, CLARK-l and CLARK-S, in the folder "exe" in
CLARKSCV1.2.5.1).

== SCRIPTS

After the installation, you can also notice that several scripts are available.
Especially:

- set_targets.sh and classify_metagenome.sh: They allow you to define your database
and to classify your metagenomes against it. set_targets.sh can be called
indefinitely to build several database(s) (downloaded from NCBI or available
"locally" in your disk).
Note the last call of set_targets.sh defines the "working database",
the database that classify_metagenome.sh and other scripts use as reference for their
computations.

- estimate_abundance.sh: It computes the abundance estimation (count/proportion
of objects assigned to targets).

- evaluate_density_confidence.sh and evaluate_density_gamma.sh: These two scripts
take in input one or several CLARK results (containing confidence/gamma scores)
and output/plot the density of assignments per confidence score (or gamma score).

- buildSpacedDB.sh: It creates the sets of discriminate spaced k-mers from the
selected databases.

- clean.sh: This script will delete permanently all data related (generated and
downloaded) of the database directory defined in set_targets.h.

- resetCustomDB.sh: It resets the targets definition with sequences (newly
added/modified) of the customized database. Any call of this script must be
followed by a run of set_target.sh.

- updateTaxonomy.sh: To download the latest taxonomy data (taxonomy id,
accession numbers, etc.) from the NCBI website.

- makeSummaryTables.sh: To build three tables summarizing results from one or several
CLARK reports file. Given a rank r and a minimum abundance level (i.e., the ratio
between number of reads classified to that taxon and the total number of reads in
the report file) l, expressed in percent, passed in parameters, this script produces
the following reports:
=> "TableSummary_per_Report.csv":
For each report file, this table indicates the number of reads, the number of
assigned or classified reads, the alpha-diversity (computed using the Shannon-Index
on the relative proportion), the minimum abundance level used, the proportion of reads
classified in percentage, and the r most dominant taxa/organisms with abundance higher
than l (they are ordered by decreasing order of the abundance). The taxa are reported
with scientific name and NCBI taxonomy id and per decreasing order of the proportion.
=> "TableSummary_per_Taxon.csv":
For each taxon/organism identified across all the report files, this table indicates
the taxonomy id of the taxon, the domain of the taxon, the minimum abundance level used,
the number of reports the taxon was found, some statistics about the abundance of the
taxon (i.e., minimum, maximum, average and standard-deviation in percent).
=> "TableSummary_HitCount.csv":
This table provides the hit count distribution of the targets identified per report,
as well as the total number of reads, the total number of reads classified and the
alpha-diversity.

- getTargetsKmers_distribution.sh: To print out the distribution of the target-specific
k-mers in a file entitled "targets.distribution.csv" in the working database.
The parameters for this script are the key parameters used for creating the database,
i.e., the k-mer length and the minimum k-mers frequency (default 0).

- extractSequences.sh: To extract/filter from the input data, the sequences identified
to a specific taxon once the classification is done. This can be used for downstream
analysis (analysis of contaminants, genome assembly, etc.).
Parameters are: The taxonomy id (of the taxon to look up), the address of the file
containing sequences (fasta/fastq format), the address of the CLARK results file
(csv file), and the minimum thresholds for the gamma value and the confidence score
(in the case the CLARK results file contains these statistics).

== INTRODUCTION ON CLARK USAGE

To introduce how to use CLARK, we first begin with some definitions:

- target: a reference sequence or a set of reference sequences (e.g., reads, contigs,
assemblies, etc.) of any length. In metagenomics, for a particular taxonomy rank r,
any target can be a genome, or genomes of a particular taxon defined at the level r.
see examples in next sections). In genomics, a target can be reads or the assemblies
of a chromosome (or any other genomic region of a genome).
Thus, a target is defined by the context of the classification.

- object: a sequence (e.g., a read, a contig, a scaffold/assembly, etc.), of any length,
to be classified against a set of targets defined at the same taxonomy rank (see
examples in next sections).

In general, target or object, sequences can be fasta or fastq files (or even spectrum),
see details in next sections.

We provide examples of how to classify a set of objects against a set of targets.
The general rule is the user must define the targets definitions (i.e., the
(reference sequence) <-> (target ID) association), pass objects to be classified,
files to store results and a directory to store the database (of target-specific k-mers).
This database is needed, because time-consuming computations are needed to
find and saved in disk target-specific k-mers. Once these data are saved in disk,
it will be loaded directly and quickly if CLARK needs them again.

We describe in detail and also provide examples in next sections. All options/parameters
available are detailed in the section "MANUAL & OPTIONS".

For metagenomics, scripts are provided to download/preprocess databases
from NCBI (i.e., genomes related to Bacteria, Viruses and Human), see section
"CLASSIFICATION OF METAGENOMIC SAMPLES".

== HOW TO CHOOSE THE K-MER LENGTH ?

CLARK is a k-mer based classification method, and the value of k is critical for the
classification accuracy and speed. The user should use any value between 19 and 23
in order to get high sensitivity. For high sensitivity, we recommend k = 20 or 21
(along with the full mode).

However, if the precision and the speed are the main concern, the user should use
any value between 26 and 32. However, the highest is k, the higher is the RAM usage.
As a good tradeoff between speed, precision, and RAM usage, we recommend k = 31
(along with the default/express mode). The default k-mer length is 31.

In the version 1.*, CLARK supports any k-mer length up to 32.

== MANUAL & OPTIONS

In this section, we describe options directly available with the CLARK executable
(binary file in the "exe" directory).

CLARK offers several options to run the classification.

A typical command line to run CLARK (or CLARK-l/CLARK-S) looks like:
$ ./CLARK -k -T -t -D -O -o -R -m -n ...

Definitions of parameters:

-k , k-mer length: integer, >= 2 and <= 32 (in version 1.*).
The default value for this parameter is 31, except for CLARK-l (it is 27).
For CLARK-S, k is set to 31 and the maximum number of mismatches is set to 9.

-T , The targets definition is written in fileTargets: filename.
This is a two-column file (separated by space, comma or tab), such that for each line:
column 1: the filename of a reference sequence
column 2: the target ID (taxon name, or taxonomy ID, ...) of the reference sequence
(See example below).
This parameter is mandatory.

-t , minimum of k-mer frequency/occurence for the discriminative k-mers: integer, >=0.
The default value is 0. For example, for 1 (or, 2), the program will discard any
discriminative k-mer that appears only once (or, less than twice).

-D , Directory of the database : pathname.
This parameter is mandatory.

-O , file containing objects: filename.
This parameter is mandatory.

-P , Paired-end fastq files: filenames.

-o , minimum of k-mer frequency or occurence in objects: integer, >=0.
This option is only available in the spectrum mode (-m 3).

-R , file to store results: filename.
Results are stored in CSV format in the file .csv (the extension
".csv" is automatically added to the filename).
This parameter is mandatory.

-m , mode of execution: 0,1,2 or 3
0 (full): To get detailed results, confidence scores and other statistics.
This mode loads all discriminative (contiguous/spaced) k-mers.
This mode is available for CLARK, CLARK-S and CLARK-l.
1 (default): To get results summary and perform best trade-off between
classification speed, accuracy and RAM usage.
This mode loads half discriminative (contiguous/spaced) k-mers.
This mode is available for CLARK and CLARK-l.
2 (express): To get results summary with the highest speed possible.
This mode loads all discriminative (contiguous/spaced) k-mers.
This mode is available for CLARK, CLARK-S and CLARK-l.
3 (spectrum):To classify data that are given in spectrum form (i.e.,
sequence data are in a two-column file, ).
This mode is available for CLARK and CLARK-l.

-n , number of threads: integer >= 1.
The program runs in parallel for the classification and, with the option
'--ldm' on, the loading of the database into RAM.

-g , "gap" or number of non-overlapping k-mers to pass when creating the
database (for CLARK-l only). The bigger the gap is, the lower the RAM
usage is. The default value is 4, but the user can increase this value if
needed to reduce the RAM usage but this will degrade the sensitivity.

-s , sampling factor value (for CLARK/CLARK-S only). If you want the program to load
in memory only half the discriminative (contiguous/spaced) k-mers then use
"-s 2". If you want a third of these k-mers then use "-s 3", etc.
The higher this factor is, the lower the RAM usage is.
The higher this factor is, the higher the classification speed/precision is.
However, our experiments show that the sensitivity can be quickly
degraded, especially for values higher than 3.
In the default mode, this factor is set to 2 because it represents a good
trade-off between speed, accuracy and RAM usage.

--long, to indicate that the objects files contains very long/large sequences (e.g.,
long contig from genome assembly, long sequencing reads from Nanopore or Pacbio,
etc.) and allocated more memory accordingly.
This option is only for running the full mode or running CLARK-S,
in the case of sequences up to 50 Mbp.

--tsk, (to request a detailed creation of the database (target specific k-mers files).
For each target, the program creates a file containing all its specific k-mers.
BEWARE: This option requires a high amount of disk space to complete.)
This option is no more supported.

--ldm, to request the loading of database file by memory mapped-file.
This option accelerates the loading time but it will require an
additional amount of RAM significant. This option also allows to
load the database in multithreaded-task.

--kso, to request a preliminary k-spectrum analysis of each object
(for mode 3 only). This analysis gives intervals of k-mers
frequency for which k-mers are assumed to be informative
of the object. This interval is reported in the results file.

--extended to request an extended output (result file), only for the full mode.

When a filename is required, we recommend absolute paths.

== LOW-LEVEL DESCRIPTION & EXAMPLES

This section describes main features and options for running directly the executable
CLARK (or CLARK-l/CLARK-S) built (in the folder "exe") after the installation. If you want
to use CLARK as a metagenome classier, we recommend you use the provided scripts
for that purpose and detailed in the section "CLASSIFICATION OF METAGENOMIC SAMPLES".
However, we recommend the reader to go through this section to become familiar with
CLARK's options/parameters.

If you work directly with CLARK and CLARK-l files (e.g., to classify BAC/transcript
to chromosomes) then we recommend you copy them to your working directory.

In this section, the working directory is /home/user1/bin/ and the binaries CLARK, CLARK-l
and CLARK-S have been copied in it. Indeed,

$ pwd
/home/user1/bin/
$ ls
CLARK CLARK-l CLARK-S
$

We describe in the following paragraph 1) how to create the targets definition with
examples for it, in the case CLARK is set to classify metagenomic samples. Then,
we describe how to set different modes, use paired-end reads and run multiple threads.
Finally, we explain how to use CLARK for assignning BAC/transcript to chromosome-arms/
centromeres.

1) Generalities, and Default Mode:

- Say, for simplicity, the user have ten genomes in its database and each genome is
stored into one fasta file and is located in the current working directory (say,
genome_1.fa, genome_2.fa, ..., genome_10.fa). Say that the first three genomes belong
to the same species S1, the next three genomes belong to the same species S2, and
the last four genomes belong to S3. And finally, say that the first six genomes belong
to the same genus G1, and the last four genomes belong to the same genus G2.

- Sequences (or objects) to be classified are three short single-reads all stored in
one fasta file, "objects.fa".

First, the user must define targets for the classification (the "targets_addresses.txt"
file). For metagenomics, for a classification at the species-level, then the user
must define targets accordingly. So, at the species level, the targets_addresses.txt
file is:

$ cat targets_addresses.txt
/home/user1/bin/genome_1.fa S1
/home/user1/bin/genome_2.fa S1
/home/user1/bin/genome_3.fa S1
/home/user1/bin/genome_4.fa S2
/home/user1/bin/genome_5.fa S2
/home/user1/bin/genome_6.fa S2
/home/user1/bin/genome_7.fa S3
/home/user1/bin/genome_8.fa S3
/home/user1/bin/genome_9.fa S3
/home/user1/bin/genome_10.fa S3
$

In this case, there are three distinct targets (S1, S2 and S3) that the program
will use for the classification. Defining targets at the genus-level would be:

$ cat targets_addresses.txt
/home/user1/bin/genome_1.fa G1
/home/user1/bin/genome_2.fa G1
/home/user1/bin/genome_3.fa G1
/home/user1/bin/genome_4.fa G1
/home/user1/bin/genome_5.fa G1
/home/user1/bin/genome_6.fa G1
/home/user1/bin/genome_7.fa G2
/home/user1/bin/genome_8.fa G2
/home/user1/bin/genome_9.fa G2
/home/user1/bin/genome_10.fa G2
$

In this case, there are 2 distinct targets (G1 and G2). Results and performance of
CLARK have been analyzed at the genus and the species level (see the manuscript).

In all these examples, each genome file is given by the absolute path (recommended),
however the user can also use relative path. Since files are in the working
directory (where the binaries are), we can simply have:

$ cat targets_addresses.txt
./genome_1.fa G1
./genome_2.fa G1
./genome_3.fa G1
...
$

Naturally, G1 and G2 are just labels that the user is free to choose. The user
can rather use the full genus names (e.g., Staphylococcus and Streptococcus instead of
G1, and G2), or, as recommended, the taxonomy ID (i.e., 1279 and 1301 resp.).

The definition of targets is similar for genomics (the reader can see the paragraph
"6) Chromosome arms and Centromeres"). Defining targets for genomics or metagenomics
is equivalent.

Second, based on the definition of the targets, CLARK builds a database. This database
needs a directory to be created. We recommend the user to create a specific
directory for each targets definition. For this example, we use the directory "./DBD/".
This directory can be anywhere in your disk, and independantly to the directory you
chose for the source code.

Third, objects are short reads present in the file "objects.fa". Then, we want
the classification results to be stored in the file named "results". The results
format is CSV (see "RESULTS FORMAT" for more detail). If a file "results.csv" already
exists then its content will be erased and replaced.

Finally, to run CLARK, using 31-mers, in default mode, run:
$ ./CLARK -k 31 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results

Or using 20-mers:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results

Note that the results file will have the extension "csv".

If CLARK is run for the first time on the defined targets and the given k-mer length,
then CLARK builds first the database, which gathers all Target Specific K-mers (TSK).
If CLARK is run again on the same targets and same k-mer length, then it will directly
load TSK files previously produced. Once the classification is done,
the file results.csv is created.

1.1) Exclusion of discriminative k-mers of low frequency in the database

The user can request the removal of discriminative k-mers that have a frequency equal
or lower than a certain threshold (like 1, or 2, or, ...).
This can be done by using the option "-t", for which the default value is 0
(i.e., "-t 0"). For example, to exclude discriminative k-mers that "appear" only once,
the command line from the last example in 1) is :
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 1

To exclude discriminative k-mers that appear twice or less:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -t 2

See the "MANUAL & OPTIONS" section for more info.

1.2) Exclusion of k-mers of low frequency in the objects

Similarly, the user can also request to ignore k-mers in objects that have a frequency
equal or lower than a certain threshold (option "-o").
In version 1*, this option is available ONLY for the spectrum mode (mode 3), see
"MANUAL & OPTIONS".

2) Fast, Full and Spectrum Mode:

Previous instructions used the default mode of CLARK (i.e., "-m 1"). To use the "express"
mode to get results more quickly, then simply add the option "-m 2". For the "full" mode
that will generate detailed results (including confidence scores), add "-m 0".

For example, using 20-mers, to run in express mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 2

In full mode:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 0

Note that
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 1
is equivalent to
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results

In the case objects are given as spectrum (a two-column file), the user should
indicate it with the option "-m 3":
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 3

IMPORTANT NOTES:
Whatever you use CLARK, CLARK-l or CLARK-S:
- The full mode (i.e., "-m 0") loads all discriminative k-mers in RAM and provides
confidence score for all assignments. Thus, it offers high sensitivity.
If your data are in the spectrum format, then the spectrum mode (-m 3) is a
solution for you.
- If your primary concern is the speed, and do not need detailed results, nor
confidence scores, then use the "express" mode (-m 2), the fastest mode.

For CLARK or CLARK-l:
- If you do not know what mode to use, then use the default mode (-m 1). It loads
in RAM about half discriminative k-mers stored in disk. Thus, the RAM usage is
lower than that of other modes, but its sensitivity is not the highest. However,
the precision and speed are high, and the default mode is a good compromise between
speed, RAM usage and high precision/sensitivity.

Finally:
- If you analyze a metagenomic sample from a poorly known microbial habitat (i.e.,
the default RefSeq database is likely not to contain a large fraction of genomes of
organisms present in your sample), for example, sea water, etc. then use CLARK-S
with full mode. CLARK-S uses multiple spaced k-mers instead of contiguous k-mers,
and it produces detailed results. However, this variant is the most RAM-consuming.
Unlike CLARK, results produced by CLARK-S are already filtered (i.e., assignments
with confidence score < 0.75 or gamma score < 0.06 are rejected).
However, you can use a stricter filtering to get more precise results
(see option of the script estimate_abundance.sh).

2.1) Running CLARK-S:

The current release exploits spaced k-mers of length 31 and weight 22. Before
classifying your metagenomic sample, the database of discriminative 31-mers (e.g.,
from bacterial genomes) and then the database of discriminative spaced 31-mers
(using the script "buildSpacedDB.sh") must be created.

Step 0 (Set the database and its directory "DBD"):

$ ./set_targets.sh ./DBD/ bacteria --phylum
(or $ ./set_targets.sh ./DBD/ bacteria --species)

where "DBD" is the directory where you want to store bacteria genomes and
create the database.

Step 1 (Build the database of discriminative 31-mers):

If the database files of 31-mers are already created for the database and the rank
specified in step 0 then you can skip this step.
This can be done by running any classification with k=31, for example:
$ ./classify_metagenome.sh -O sample.fa -R result

where sample.fa is some fasta file data. This operation is only needed to create the
database of discriminative 31-mers.

Step 2 (Create the databases of discriminative spaced k-mers):

$ ./buildSpacedDB.sh

This task will take several hours to create discriminative spaced k-mers for all the
3 spaced seeds (i.e., about 6 to 7 hours). This operation will write about 60GB
of data on disk.

Finally, to classify your metagenome (for example, sample.fq) run:
$ ./exe/CLARK-S -T ./targets_addresses.txt -D ./DBD/ -O ./sample.fq -R ./result -m 0
or
$ ./classify_metagenome.sh -O sample.fq -R result --spaced

Note: If you decide to work with a different taxonomy rank and/or database then
you will need to repeat the step 0 and step 1.

3) Paired-end reads:

CLARK accepts paired-end reads (fastq/fasta files) as objects. Indicate it with the
option "-P" and the two filenames (CLARK will concatenate paired reads with
matching ID). An example of command line, using default mode, is:

$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -P ./sample1.fastq ./sample2.fastq -R paired.results

4) Multi-Objects (or multiple sample files):

The program can process several sample files at the same time (i.e., no need to
reload the database or restart the program for each sample).
To classify several fasta/fastq files under the same settings (e.g., mode, threads,
etc.), it is possible to pass all of them conveniently (i.e., without the need to
reload the database for each file). Say the user has 3 fasta files, located in the
current working directory, ./objects_1.fa, ./objects_2.fa and ./objects_3.fa.
To pass these three files at once, create a file (e.g., "objects.txt") containing
filenames of these fasta files:

$ cat objects.txt
./objects_1.fa
./objects_2.fa
./objects_3.fa
$

The objects.txt file is like a "meta-objects" file. Then, create a meta-results
file (called "results.txt") such that the i-th line in it indicates the filename
to store results
of the i-th fasta file indicated in "objects.txt":

$ cat results.txt
./results_1
./results_2
./results_3
$
./results_1.csv will contain results related to ./objects_1.fa, ..., and
./results_3.csv will contain results related to ./objects_3.fa.

Indeed, once the classification is done, results_1.csv, results_2.csv and results_3.csv
are created. This way assures that the user can pass any amount of objects files
in a scalable fashion.

Note: the "results.txt" file can be just a duplicate (or a simple copy) of "objects.txt".
Since the program creates results files with extension "csv", so you can create these results
files by using the same filename of the fastq/fasta file as prefix.

For example, if you run:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./objects.txt

The program will store the results files as ./objects_1.fa.csv, ./objects_2.fa.csv and ./objects_3.fa.csv

Same for paired-end reads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -O ./DBD/ -P ./paired1.txt ./paired2.txt -R ./results.txt

where files paired1.txt aand paired2.txt contain addresses of paired-end reads:
$ cat paired1.txt
./sample1_R1.fq
./sample2_R1.fq
./sample3_R1.fq

$ cat paired2.txt
./sample1_R2.fq
./sample2_R2.fq
./sample3_R2.fq

$ cat results.txt
./results_1
./results_2
./results_3

Once computations done, results for the paired-end reads for each of the three samples
are stored in: results_1.csv, results_2.csv and results_3.csv.

5) Multi-threading:

CLARK can exploit parallel threads to increase the computations speed.
The option to add is "-n ". For example, using 12 threads:
$ ./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.txt -R ./results.txt -n 12

6) Chromosome arms and Centromeres

We understand that users can work with chromosome arms as targets. In this case,
centromeres can be inferred by CLARK. To allow that, the user must indicate in the
targets_addresses.txt file sequences that can be related to centromeres.

For example, in the case of barley, consider 6 chromosome arms defined by 6 files
(./chr_arm1L.fa, ./chr_arm1S.fa, ..., ./chr_arm3L.fa, ./chr_arm3S.fa). If the user
does not want to define centromeres, then targets_addresses.txt would look like:

$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL
./chr_arm2S.fa 2HS
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$

If the user wants the program to infer centromere of 2H only, then it will
indicate the centromere name for it (2HC) at each target related to the chromosome
2H (so ./chr_arm2L.fa, and ./chr_arm2S.fa) on the same line:

$ cat targets_addresses.txt
./chr_arm1L.fa 1HL
./chr_arm1S.fa 1HS
./chr_arm2L.fa 2HL 2HC
./chr_arm2S.fa 2HS 2HC
./chr_arm3L.fa 3HL
./chr_arm3S.fa 3HS
$

If all centromeres must be inferred, then targets_addresses.txt would look like:

$ cat targets_addresses.txt
./chr_arm1L.fa 1HL 1HC
./chr_arm1S.fa 1HS 1HC
./chr_arm2L.fa 2HL 2HC
./chr_arm2S.fa 2HS 2HC
./chr_arm3L.fa 3HL 3HC
./chr_arm3S.fa 3HS 3HC
$

When defining chromosome arms labels, we strongly recommend that the user follows the
naming convention used of this example:
For the N-th chromosome, "NS" is the label for the short arm and "NL" is the
label for the long arm and (same prefix), so "NC" is the label for the centromere.

IMPORTANT NOTE:
Examples of command lines above using fasta files are valid as well for
fastq files, except for the paired-end reads case.

== CLASSIFICATION OF METAGENOMIC SAMPLES

We provide several scripts to facilitate the classification in the context of
metagenomics. CLARK can preprocess databases of bacteria, viruses, plasmid, plastid,
protozoa, fungi or human (downloaded from NCBI/RefSeq) or a customized set of genomes.

First, we present here two scripts, "set_targets.sh" and "./classify_metagenome.sh" that
work together.
Second, we present the script to get the abundance estimation (count and proportion for
each target identified), "estimate_abundance.sh", from one or several results file(s).

1) Setting and classification

1.1) Step I: Setting targets

After the installation (./install.sh), the user must create a directory to store all
reference sequences (bacteria, viruses, plasmid, plastid, protozoa, fungi, human and custom).
For all our examples below, we name this directory path in a generic way for clarity.
This directory can be anywhere in your disk(s).

Then, the user must indicate what database(s) to consider for the classification among
bacteria, viruses, ..., fungi, human and/or custom.

For example, only bacteria genomes:
$ ./set_targets.sh bacteria

To work with bacteria, viruses and human:
$ ./set_targets.sh bacteria viruses human

To work with bacteria, viruses, fungi and human:
$ ./set_targets.sh bacteria viruses fungi human

To classify against a custom database:
The user will need to copy/move the sequences (fasta files with accession numbers in the
header, i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ...",
and one fasta file per reference sequence) in the directory "Custom", inside ,
and then run set_targets. In order words, the user must:
(1) create the directory "Custom" inside (if it does not exist yet);
(2) copy/move the sequences of interest in the Custom;
(3) run:
$ ./set_targets.sh custom
or for example, when combining the bacteria genomes with the Custom sequences:
$ ./set_targets.sh bacteria custom

In general case (when the user selects bacteria, viruses, ..., fungi and/or human),
if the directory is empty, then the script will download all
the selected database(s) and also data of the taxonomy tree, from NCBI.
Once the sequences are found or downloaded in the directory, the script will build
the targets for a given taxonomy rank.

The default taxonomy rank is species. To use a different taxonomy rank, for example,
genus, the command line is (from the example selecting bacteria, viruses and human):

$ ./set_targets.sh bacteria viruses human --genus

In the current release, the user can choose between six ranks (species to phylum):
--species (the default value), --genus, --family, --order, --class or --phylum.

Indeed, the strength of CLARK is to be able to classify quickly and accurately
metagenomic samples by using only one taxonomy rank. So as a general rule when
classifying metagenomic reads:
Consider first the genus or species rank, then if a high proportion of reads
cannot be classified, reset your targets definition at a higher taxonomy rank
(e.g., family or phylum).

Once set_targets.sh is finished, the user can proceed to the step II. However, if
the user wants to modify the selected databases and/or taxonomy rank, then he/she will
need to run again set_targets.sh with updated parameters before proceeding to step II.

1.2) Step II: Running the classification

The script to run the classification of a metagenomic sample against the database(s)
previously set in step I is "classify_metagenome.sh".

For your convenience, this script runs the executable CLARK, CLARK-l or CLARK and allows
you to pass few parameters.

For example, say objects to be classified are reads in "sample.fa" (e.g., located in the
current directory), and results to be stored in "result.csv". A basic command line is:

$ ./classify_metagenome.sh -O ./sample.fa -R ./result

As explained in the section "MANUAL & OPTIONS", thanks to identifiers "-O" and "-R",
the script will pass the objects file "sample.fa" and results will be stored
in "./result.csv". Objects are classified against the targets and the taxonomy rank
defined by the last execution of ./set_targets.sh.

IMPORTANT NOTES:
- The targets definition is automatically passed to CLARK in step II. The file has been
computed by set_targets.sh.

- The script set_targets.sh assumes that each reference file from bacteria, ..., fungi
or custom database contains an accession numberr (in the RefSeq records format:
i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ..." ).
If an accession number is missing in a file then this file will not be used
for the classification.

- set_targets.sh also maps the accession number found in each reference sequence to its
taxonomy ID based on the latest NCBI taxonomy data. If a mapping cannot be made for a
given sequence, then it will NOT be counted and excluded from the targets definition.
The total number of excluded files is prompted in the standard output, and all files that
have been excluded are reported in the file "files_excluded.txt" (located in the specified
database directory (i.e., "./DBD/").
If some files are excluded, then it probably means that they have been removed for
curations for example (visit the RefSeq FAQ webpage) or maybe because your local taxonomy
information are no nore up-to-date (see next point).

- You can update your local taxonomy database thanks to the script "updateTaxonomy.sh"
You can use this script before running set_targets.sh.

- If the user wants to work with a different customized database (for example, by removing
or adding more sequences of interest in the Custom folder) then the targets definition
must be reset. We made it simple with the script "resetCustomDB.sh":
After the sequences in the Custom folder have been updated, just run:
$ ./resetCustomDB.sh
Then, run set_target.sh with the desired settings.

- The database files (*.ky, *.lb and *.sz) will be created inside a subdirectory of the
specified database directory in step I (i.e., "./DBD/") by classify_metagenome.sh.

- The default values (the k-mer length, the mode, the number of threads, etc.) are used
if not specified by the user, just like indicated in the previous section.

- The script classify_metagenome.sh still allows you to pass customized parameters and
options, similarly to the previous section. classify_metagenome.sh follows options defined
in "MANUAL & OPTIONS"(see below some examples). So you can change the k-mer length,
the number of parallel threads, mode, etc.

We present below some examples of customized classification using classify_metagenome.sh.

a) To use 20-mers (instead of 31-mers):
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -k 20

b) To request the full mode:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0

c) To classify in full mode multiple sample files (single-end reads):
$ ./classify_metagenome.sh -O ./samples.txt -R ./samples.txt -m 0

where, the file "samples.txt" contains the addresses of all the sample files to be run:
$ cat samples.txt
./sample1.fa
./sample2.fa
./sample3.fa
./sample4.fastq
./sample5.fq
./sample6.fq
...

d) To classify in full mode multiple sample files (paired-end reads):
$ ./classify_metagenome.sh -O ./samples.R.txt ./samples.L.txt -R ./samples.R.txt -m 0

where, files "samples.R.txt" and "samples.L.txt" contain the addresses of all the fastq
files (right and left) to be run:
$ cat samples.R.txt
./sample1.R1.fq
./sample2.R1.fq
./sample3.R1.fq
...

$ cat samples.L.txt
./sample1.R2.fq
./sample2.R2.fq
./sample3.R2.fq
...

Observe in this example that the order of right and left paired-end reads of each sample
must be preserved in "samples.R.txt" and "samples.L.txt" .

e) To request the express mode, and 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 2 -n 8

f) To request the full mode, with gzipped objects file, and using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 0 -n 8 --gzipped

Another example, in default mode, for classifying paired-end reads (./sample1.fastq
and ./sample2.fastq):
$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result

Notes:
This script can run CLARK-l instead of CLARK, for workstations with limited RAM.
Then, the user can indicate it with the option "--light". For example:

$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --light

This script can run CLARK-S instead of CLARK, if the database files of discriminative
of spaced k-mers have been built. To use CLARK-S, the option is "--spaced".
For example:

$ ./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --spaced

g) To run CLARK-S with full mode and using 8 threads:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result -m 0 -n 8 --gzipped --spaced

h) To run CLARK-S with express mode and using 8 threads on a gzipped file:
$ ./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 2 -n 8 --spaced

If you want to run CLARK-S but with a much lower RAM usage then you can decide
to download only half the discriminative spaced k-mers in memory using "-s 2".
For example:
$ ./classify_metagenome.sh -O ./sample.fa -R ./result --spaced -s 2

Note:
run "./classify_metagenome.sh" in the terminal to prompt the help/usage describing
options and parameters.

2) Abundance estimation

The script "estimate_abundance.sh" can analyze CLARK results of a metagenomic sample,
and can provide for each target identified, its scientific name, taxonomy id, lineage
(superkingdom, phylum, class, order, family, genus), its count and proportion of objects
assigned to it. This script also allows to apply some filtering conditions (based on
the confidence score or gamma score of assignments) to obtain a stricter estimation.
The output format of estimate_abundance.sh is CSV.

For example, say a metagenomic sample contains 100 reads, results by CLARK (full mode)
indicates that 20 reads (20%) are assigned to the target T1, 70 (70%) are assigned
to T2, and 10 (10%) are not assigned ("NA"). Thus, the abundance estimation program
reports a count of 20 (or 22.2% of the assigned reads) for T1, 70 (or 77.7%) for T2,
and a count of 10 for the category "UNKNOWN" (collecting all unassigned reads or
assignments that do not satisfy filtering conditions).
We give below examples of command lines.

Parameters and options of this script are:

estimate_abundance.sh -c -g -D -F .csv .csv ... .csv -a ...

Definition of parameters:

-c To filter assignments based on their confidence score (if available)
using the threshold value minConfidenceScore (a value between 0.5 and 1.0).
The abundance estimation for each target will count only
assignments with a confidence score higher than minConfidenceScore.
Assignments that have a confidence score lower than minConfidenceScore
will be marked as unclassified and so counted in the
category UNKNOWN in the abundance estimation report.
The default value is 0.5.

-g To filter assignments based on their gamma score (if available) using the
threshold value minGamma (a value between 0 and 1.0).
The abundance estimation for each target will count only
assignments with a gamma score higher than minGamma.
Assignments that have a gamma score lower than minGamma
will be marked as unclassified and so counted in the
category UNKNOWN in the abundance estimation report.
The default value is 0.

-D The directory path of the database (the same you indicated when calling
set_targets.sh). This parameter is required to load scientific names of
all targets ONLY if you pass results of a metagenomic sample.

-F .csv ... .csv results file or list of results files produced by CLARK.
Important Note: You can pass a results file produced by any mode of
execution of CLARK (full, express, spectrum, default),
but if you pass several files, make sure they all have
been produced under the same mode. For example, if you
pass two files result1.csv and result2.csv then result1.csv
and result2.csv should be produced through the same mode
(e.g., both by full mode).

-a To filter abundance estimations that are higher than a certain threshold,
minAbundance. minAbundance is a percentage value (between 0 and 100).

--highconfidence To automatically count only high confidence assignments for the
abundance estimation. This option is equivalent to "-c 0.75 -g 0.03".

--krona To export results in a 3-column file for the Krona tool
(Hierarchical data browser, Ondov et al, BMC Bioinformatics, 2011,
doi: 10.1186/1471-2105-12-385. https://github.com/marbl/Krona/wiki).
With this option on, the program will create the file 'results.krn' in the
working directory containing the unnormalized abundance from CLARK's
assignments (with the filtering options if any).
Then, to create the Krona diagram, call the executable ktImportTaxonomy (from
your Krona directory) with the option '-m 3'. For example:
$ ktImportTaxonomy -o results.html -m 3 results.krn

--mpa To export results in a tab-delimited (two-column) file, similar to
MetaPhlAn's output. When this option is specified this MetaPhlAn-like reporting
is stored in the file "results.mpa" in the working directory.

Examples:

i) To get the abundance estimation of the results file, ./result.csv:

$ ./estimate_abundance.sh -F ./result.csv -D

where is the directory of the database you have set with set_targets.sh.

To store its output in a file (e.g., "abundance.csv"), you can do by:

$ ./estimate_abundance.sh -F ./result.csv -D > abundance.csv

ii) To filter high confidence assignments for the abundance estimation (i.e.,
low confidence assignments will then be reported in the UNKNOWN category):

$ ./estimate_abundance.sh -F ./result.csv -D --highconfidence

iii) To filter assignnments by using a certain confidence score threshold
(for example, 0.8) :

$ ./estimate_abundance.sh -F ./result.csv -D -c 0.80

To filter assignnments using a gamma score threshold (for example, -g 0.03)

$ ./estimate_abundance.sh -F ./result.csv -D -g 0.03

To output the results of the previous command in the mpa format:

$ ./estimate_abundance.sh -F ./result.csv -D -g 0.03 --mpa

Note:
Filtering based on the confidence score and/or gamma on a results file is possible
only if the file was built using the full or spectrum mode. Indeed, default
and express mode do not produce confidence scores or gamma scores.

IMPORTANT NOTE:
From the v1.2.3, results of CLARK-S are automatically filtered. Assignments with
low confidence score (i.e., < 0.75) or low gamma value (i.e., < 0.06) are considered
as low confidence assignments and thus marked as unknown or "NA".
In other words, only high confidence assignments are kept to guarantee both
high sensitivity and high precision.

iv) If your result files are large (i.e., contain a lot of targets) then the
output of the abundance estimation program can be large too. You can filter
abundances by setting a minimum abundance percentage.

For example, to print out abundances higher than 2% only, then:

$ ./estimate_abundance.sh -F ./result.csv -D -a 2

Finally, you can pass several results files at the same time:

$ ./estimate_abundance.sh -F ./result1.csv ./result2.csv ... -D

In that case, to assure consistency, the results files SHOULD be produced by
the same mode (see parameters and options above). Results from different
modes of execution SHOULD NOT be mixed.

We also remind the user that another script manipulating results files
is offered, "evaluate_density_confidence.sh" (resp. "evaluate_density_gamma.sh").
This script estimates and plots the density of assignments per confidence score
(resp. gamma score). It takes in input results file(s) produced
in the full/spectrum mode only.

== RESULTS FORMAT

Results are in CSV format. Confidence scores and other statistics about assignments
are computed only in the full/spectrum mode.

In the full mode (extended), the results format is the following for each line:
,,...,,,,,,,, where :
* the "Object_ID" is the tag name indicated in the header (after ">" or "@") for each
object, and N is the number of targets.
In the spectrum mode, this tag name is merely the filename.
* hit count in target i is the number of k-mers specific to target i that are in the object
(if the option "--extended" is on).
* Length of object is the number of bases (A,C,G,T,U or N) the object has.
* Gamma is the ratio between the total number of hit found in the object
(against all targets) and the number of k-mers in the object.
* first assignment is the target ID of the target that obtained the highest hit count
(ties are broken arbitrarily: the 1st assignment is the target
having the first hit).
* hit count of first is the number of hit count for the first assignment (h1).
* second assignment is the target ID of the target that obtained the second highest hit
count (ties are broken arbitrarily).
* hit count of second is the number of hit count for the second assignment (h2).
* confidence score is the ratio : h1/(h1+h2).

In the default or express mode, the results format is the following for each line:
,,<1st_assignment>, where :

* the "Object_ID" is the tag name indicated in the header
(after ">" or "@") for each object.
* Length of object is the number of bases the object has.
* 1st assignment is the target ID of the target that obtained the highest hit count
(ties are broken arbitrarily: the 1st assignment is the
target having the first hit).

== VERSIONS

Version 1.2.6.1 May 11, 2019.
Version 1.2.6 February 21, 2019.
Version 1.2.5.1 September 20, 2018.
Version 1.2.5 February 13, 2018.
Version 1.2.4 November 25, 2017.
Version 1.2.3.2 August 21, 2017.
Version 1.2.3.1 February 27, 2017.
Version 1.2.3 April 7, 2016.
Version 1.2.2-b December 11, 2015.
Version 1.2.1-b June 15, 2015.
Version 1.1.3 June 3, 2015.
Version 1.1.2 April 22, 2015.
Version 1.1.1 April 15, 2015.
Version 1.1. February 20, 2015.
Version 1.0. September 01, 2014.

== CONTACT/INFO

For any feedback, suggestion, or question, please feel free to contact Rachid Ounit
(clark.ucr.help at gmail.com).

References:

Copy a file that links sequences to known classifications. In this case the known sequences are 10 randomly selected bacterial genomes.

In the following example we'll use more reads simulated from slighly more complex set of genomes (21 genomes from 5 genera). We'll also use some CLARK helper scripts

The last line creates a hierachical visualization in a self contained html file.

Create a batch input file (e.g. clark.sh), which uses the input file 'clark.in'. For example:

Create a swarmfile (e.g. clark.swarm). Assuming that a database has been created with set_targets.sh already:

-g #	Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #	Number of threads/CPUs required for each process (1 line in the swarm command file).
--module clark	Loads the clark module for each subjob in the swarm