Frequently Asked Questions
Vclust is a tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenomically assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards set by the International Committee on Taxonomy of Viruses (ICTV) and the Minimum Information about an Uncultivated Virus Genome (MIUViG).
Vclust can therefore be used for:
- taxonomic classification of viruses into putative species and genera
- assignment of contigs to virus operational taxonomic units (vOTUs)
- dereplication of virome datasets (i.e., identifying groups of highly similar genomes and reducing each group to a single representative genome).
Vclust is an alignment-based tool that aligns pairs of genome sequences, calculates their sequence similarity measures (e.g., ANI, alignment fraction), and clusters the genomes. Specifically, this is done in three steps:
- Prefilter: reduces the number of potential genome pairs to only those with sufficient k-mer-based sequence similarity (i.e., minimum number of common k-mers and/or the minimum sequence identity of the shorter sequence)
- Alignment: performs pairwise sequence alignments and computes sequence similarity measures between genome pairs identified in the prefilter step.
- Cluster: groups viral genomes based on defined sequence similarity cutoffs
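The idea behind the prefilter step can be illustrated with a minimal Python sketch. It is a toy illustration, not Vclust's implementation: the function names (`kmers`, `prefilter`), the parameters `k` and `min_common`, and the example sequences are all assumptions made for this example. It keeps only those genome pairs that share a minimum number of distinct k-mers, so that the more expensive alignment step runs on fewer pairs.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(genomes, k=4, min_common=3):
    """Keep genome pairs sharing at least `min_common` distinct k-mers.

    `genomes` maps a genome name to its sequence (hypothetical toy input).
    """
    kmer_sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    pairs = []
    for a, b in combinations(sorted(genomes), 2):
        if len(kmer_sets[a] & kmer_sets[b]) >= min_common:
            pairs.append((a, b))
    return pairs


# Toy sequences: g1 and g2 differ in one base, g3 is unrelated.
genomes = {
    "g1": "ACGTACGTTGCA",
    "g2": "ACGTACGTTGCC",
    "g3": "TTTTTTTTTTTT",
}
candidate_pairs = prefilter(genomes)
print(candidate_pairs)
```

Only the pair (g1, g2) passes the filter here, so only that pair would be aligned in the next step.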
Vclust provides six similarity measures between two genome sequences:
- ANI: number of identical bases across local alignments divided by the total length of the alignments.
- Global ANI (gANI): number of identical bases across local alignments divided by the length of the query/target genome.
- Total ANI (tANI): number of identical bases in the query-to-target and target-to-query alignments divided by the summed length of both genomes. tANI is equivalent to VIRIDIC's intergenomic similarity.
- Coverage (alignment fraction): proportion of query sequence that is aligned with target sequence.
- Number of local alignments
- Ratio between query and target genome lengths
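The definitions above can be made concrete with a short Python sketch. It is only an illustration of the formulas, not Vclust's code: it assumes that each local alignment is given as a hypothetical `(aligned_length, identical_bases)` tuple, that local alignments do not overlap, and that the aligned length is measured on the query genome. All function names and numbers are made up for the example.

```python
def ani(alignments):
    """Identical bases divided by the total length of the local alignments."""
    aligned = sum(length for length, _ in alignments)
    matches = sum(m for _, m in alignments)
    return matches / aligned if aligned else 0.0


def gani(alignments, genome_len):
    """Identical bases divided by the length of the query genome."""
    return sum(m for _, m in alignments) / genome_len


def coverage(alignments, genome_len):
    """Fraction of the query genome covered by local alignments."""
    return sum(length for length, _ in alignments) / genome_len


def tani(q2t, t2q, query_len, target_len):
    """Identical bases in both directions divided by the summed genome length."""
    matches = sum(m for _, m in q2t) + sum(m for _, m in t2q)
    return matches / (query_len + target_len)


# Toy example: a 1000-bp query and an 800-bp target with two local alignments.
query_len, target_len = 1000, 800
q2t = [(400, 380), (200, 190)]  # (aligned length, identical bases)
t2q = [(400, 380), (200, 190)]

print(round(ani(q2t), 4))
print(round(gani(q2t, query_len), 4))
print(round(coverage(q2t, query_len), 4))
print(round(tani(q2t, t2q, query_len, target_len), 4))
```

Under these assumptions, gANI equals ANI multiplied by coverage, which is why a pair of genomes can have a high ANI but a low gANI when only a small part of the query aligns.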
Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.
- Single-linkage
- Complete-linkage
- UCLUST
- CD-HIT (Greedy incremental)
- Greedy set cover (adopted from MMseqs2)
- Leiden algorithm
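As an illustration of the simplest algorithm in the list above, here is a minimal single-linkage sketch in Python. It is not Vclust's implementation: the function name `single_linkage`, the input layout (a dictionary of pairwise ANI values), and the 0.95 threshold are assumptions chosen for the example. Genomes are merged with a union-find structure whenever a pair meets the similarity cutoff.

```python
def single_linkage(ids, pair_ani, threshold):
    """Group genomes connected by any chain of pairs with ANI >= threshold."""
    parent = {g: g for g in ids}

    def find(g):
        # Follow parent links to the cluster root, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in pair_ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in ids:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Toy pairwise ANI values (hypothetical).
ids = ["g1", "g2", "g3", "g4"]
pair_ani = {
    ("g1", "g2"): 0.97,
    ("g2", "g3"): 0.96,
    ("g1", "g3"): 0.90,
    ("g3", "g4"): 0.70,
}
clusters = single_linkage(ids, pair_ani, threshold=0.95)
print(clusters)
```

In this toy data, g1 and g3 end up in the same cluster even though their own ANI (0.90) is below the cutoff, because both are linked through g2. Complete-linkage would not merge them, since it requires every pair within a cluster to meet the threshold.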
Yes, Vclust accurately estimates ANI. In contrast to other methods, Vclust uses a sensitive alignment method based on Lempel-Ziv parsing. It is more accurate than BLAST-based tools such as VIRIDIC on both simulated and real virus genomes, and its results agree more closely with the ICTV taxonomy.
Zielezinski A, Gudys A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv. 2024. [paper]