What analyses does it do?

Pipeline overview

CompareM2 runs a directed acyclic graph (DAG) of interdependent analysis rules. Each rule performs a specific bioinformatic task on one or more input genomes. Rules form chains where outputs feed into downstream analyses.

[Figure: directed acyclic graph (DAG) of the CompareM2 rules]

The pipeline flows from top (copying input genomes) to bottom (collecting results into the report).

N-dependent output selection

CompareM2 automatically selects which analyses to run based on the number of input genomes (N):

N	Analyses enabled
0	Database downloads only
1+	Per-genome analyses: annotation, QC, functional annotation, metabolic modeling
2+	Pairwise comparisons: panaroo, snp-dists, mashtree, treecluster
3+	Phylogenetics: fasttree, iqtree, bootstrap mashtree
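The thresholds in the table can be read as a simple lookup. The following is a minimal sketch of that logic, not CompareM2's actual implementation; the function name `enabled_analyses` and the `THRESHOLDS` structure are hypothetical, and the names in the lists are the rule names used in the sections below.

```python
# Hypothetical lookup mirroring the table above: (minimum N, analyses unlocked).
THRESHOLDS = [
    (1, ["annotation", "QC", "functional annotation", "metabolic modeling"]),
    (2, ["panaroo", "snp_dists", "mashtree", "treecluster"]),
    (3, ["fasttree", "iqtree", "bootstrap_mashtree"]),
]


def enabled_analyses(n_genomes):
    """Return the analyses unlocked for a given number of input genomes."""
    if n_genomes < 0:
        raise ValueError("number of genomes cannot be negative")
    enabled = []
    for minimum, analyses in THRESHOLDS:
        if n_genomes >= minimum:
            enabled.extend(analyses)
    return enabled  # an empty list corresponds to "database downloads only"


print(enabled_analyses(2))
```

With two genomes the pairwise comparisons are included, but the phylogenetic rules that need three genomes are not.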

Running specific analyses

Use --until to run only specific rules and their dependencies:

comparem2 --until panaroo

This runs only what is needed to produce the panaroo output: copy, annotate, and panaroo itself.
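To illustrate how a target rule pulls in its upstream rules, here is a small sketch of dependency resolution on a DAG. The `DEPENDS_ON` mapping is a hypothetical, simplified slice of the rule graph built from the dependencies stated on this page (the edge from `mashtree` to `copy` is an assumption), and `rules_needed` is an illustrative helper, not part of CompareM2.

```python
# Hypothetical, simplified slice of the rule graph: rule -> rules it depends on.
DEPENDS_ON = {
    "copy": [],
    "annotate": ["copy"],
    "panaroo": ["annotate"],
    "fasttree": ["panaroo"],
    "iqtree": ["panaroo"],
    "snp_dists": ["panaroo"],
    "mashtree": ["copy"],  # assumed edge, for illustration only
    "treecluster": ["mashtree"],
}


def rules_needed(targets):
    """Return the targets plus all upstream rules in a valid execution order."""
    order, seen = [], set()

    def visit(rule):
        if rule in seen:
            return
        seen.add(rule)
        for dependency in DEPENDS_ON[rule]:
            visit(dependency)
        order.append(rule)  # appended only after all of its dependencies

    for target in targets:
        visit(target)
    return order


print(rules_needed(["panaroo"]))  # ['copy', 'annotate', 'panaroo']
```

Requesting several targets at once works the same way: shared upstream rules appear only once in the result.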

Note

Use comparem2 --until <rule> [<rule2>...] to run one or more specific analyses. Rule names are listed below.

Included analyses

Quality control

`assembly_stats` — Assembly-stats

Computes assembly statistics including N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length), total assembly length, number of contigs, and other summary metrics.

Requires: N ≥ 1. No database download needed.
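As a worked example of the N50 definition, the sketch below computes it from a list of contig lengths. It is an illustration of the metric only, not the code used by assembly-stats; the function name `n50` is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Here the total length is 290 bp; the two longest contigs (80 + 70 = 150 bp) already cover at least half of it, so the N50 is 70.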

`sequence_lengths` — SeqKit

Extracts per-contig lengths and GC content from each input genome. The report visualizes each contig as a bar colored by GC content, giving a quick overview of assembly fragmentation and composition bias.

Requires: N ≥ 1. No database download needed.
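The per-contig values shown in the report can be pictured with the following sketch, which derives length and GC fraction from FASTA text. It is a plain-Python illustration under simple assumptions (GC fraction is computed over all characters of the sequence, including ambiguous bases) and does not reproduce SeqKit itself; `contig_stats` is a hypothetical helper.

```python
def contig_stats(fasta_text):
    """Return (name, length, gc_fraction) for each record in FASTA text."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # identifier up to first space
            records[name] = []
        elif line and name is not None:
            records[name].append(line.upper())
    stats = []
    for name, chunks in records.items():
        seq = "".join(chunks)
        gc = seq.count("G") + seq.count("C")
        stats.append((name, len(seq), gc / len(seq) if seq else 0.0))
    return stats


example = ">c1 first contig\nGGCC\nAATT\n>c2\nAT\n"
print(contig_stats(example))  # [('c1', 8, 0.5), ('c2', 2, 0.0)]
```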

`checkm2` — CheckM2

Estimates genome completeness and contamination using machine learning on a universal gene set. Essential for assessing the quality of metagenome-assembled genomes (MAGs).

Requires: N ≥ 1. Downloads the CheckM2 DIAMOND database (~3.5 GB) on first run.

Annotation

The annotation output is used by many downstream tools (eggNOG, dbCAN, InterProScan, Panaroo, etc.). Choose one annotator via the annotator config setting. NCBI-sourced genomes automatically use their bundled annotation instead.

`bakta` — Bakta (default)

Rapid and standardized annotation of bacterial genomes. Bakta uses a comprehensive pre-built database and produces consistent locus tags suitable for comparative analyses.

Requires: N ≥ 1. Downloads the Bakta database on first run.

Parameter	Default	Description
`set_bakta--translation-table`	`11`	Genetic code translation table
`set_bakta--gram`	`"?"`	Gram type for signal peptide prediction (`+`, `-`, or `?`)
`set_bakta--meta`	(unset)	Enable metagenome mode (flag)

`prokka` — Prokka

Whole genome annotation for bacteria and archaea. Alternative to Bakta — select with annotator: "prokka" in config.

Requires: N ≥ 1. No database download needed.

Parameter	Default	Description
`set_prokka--compliant`	(flag set)	Force GenBank/ENA/DDBJ compliance
`set_prokka--kingdom`	`bacteria`	Annotation kingdom (`archaea`, `bacteria`, `mitochondria`, `viruses`)
`set_prokka--gram`	(unset)	Gram type (`neg`, `pos`)
`set_prokka--rfam`	(unset)	Enable Rfam search for ncRNAs (flag)

Functional annotation

`eggnog` — eggNOG-mapper

Functional annotation through orthology assignment. Maps predicted proteins against a database of pre-computed orthologous groups to transfer functional information including COG categories, KEGG orthologs, and Gene Ontology terms.

Requires: N ≥ 1. Downloads the eggNOG database on first run.

Parameter	Default	Description
`set_eggnog-m`	`diamond`	Search mode (`diamond`, `mmseqs`, `hmmer`)

`interproscan` — InterProScan

Classifies proteins into families and predicts domains and important sites using multiple member databases (TIGRFAM, Hamap, Pfam by default).

Requires: N ≥ 1. No database download needed (InterProScan bundles its own data).

Parameter	Default	Description
`set_interproscan--applications`	`TIGRFAM,Hamap,Pfam`	Comma-separated list of member databases to run
`set_interproscan--goterms`	(flag set)	Include Gene Ontology terms
`set_interproscan--pathways`	(flag set)	Include pathway annotations

`dbcan` — dbCAN

Annotates carbohydrate-active enzymes (CAZymes) by searching against the dbCAN HMM and DIAMOND databases. Useful for studying carbohydrate metabolism, degradation, and biosynthesis capabilities.

Requires: N ≥ 1. Downloads the dbCAN database on first run.

`kegg_pathway` — clusterProfiler

KEGG pathway enrichment analysis. Predicted proteins are searched against the UniRef100-KO database (≥85% coverage, ≥50% identity), and clusterProfiler's enricher function computes Benjamini-Hochberg adjusted p-values for pathway enrichment per genome.

Requires: N ≥ 1. Uses the CheckM2 DIAMOND database (shared download).
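For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, the following sketch shows the standard procedure on a list of raw p-values. It illustrates the correction step only and is not clusterProfiler's code; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.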

`amrfinder` — AMRFinderPlus

Identifies antimicrobial resistance genes, point mutations, and virulence and stress resistance genes in assembled nucleotide and protein sequences using NCBI's curated reference database.

Requires: N ≥ 1. No separate database download needed.

`mlst` — MLST

Multi-locus sequence typing using the PubMLST database. Automatically detects the appropriate MLST scheme for each genome and assigns a sequence type.

Requires: N ≥ 1. No separate database download needed.

Parameter	Default	Description
`set_mlst--scheme`	(auto-detect)	Force a specific MLST scheme (e.g., `efaecium`, `saureus`)

`gapseq_find` / `gapseq_fill` — gapseq

Predicts metabolic pathways (gapseq_find) and reconstructs gap-filled genome-scale metabolic models (gapseq_fill). The two-step process first identifies pathways and transporters, then fills gaps in the metabolic network to produce a functional model.

Requires: N ≥ 1. No separate database download needed.

Parameter	Default	Description
`set_gapseq_find-t`	`auto`	Taxonomic range for reference sequences (`Bacteria`, `Archaea`, `auto`)
`set_gapseq_fill_draft-b`	`auto`	Biomass reaction to use

`antismash` — antiSMASH

Detects and characterizes biosynthetic gene clusters (BGCs) for secondary metabolites including antibiotics, siderophores, and terpenes.

Requires: N ≥ 1. Downloads the antiSMASH database on first run.

`carveme` — CarveMe

Automated reconstruction of genome-scale metabolic models from annotated genomes. Produces SBML models suitable for flux balance analysis.

Requires: N ≥ 1. No separate database download needed.

Parameter	Default	Description
`set_carveme--gapfill`	`LB`	Growth medium for gap-filling (`M9`, `LB`, or a comma-separated list of media)
`set_carveme--solver`	`scip`	LP solver to use

Core and pan genomes

`panaroo` — Panaroo

Computes the pan and core genome across input genomes. The core genome contains genes conserved across all (or nearly all) samples, while the pan genome is the union of all genes. Panaroo also produces a core genome alignment used by downstream phylogenetic tools.

Requires: N ≥ 2.

Parameter	Default	Description
`set_panaroo--clean-mode`	`sensitive`	Error-correction stringency (`strict`, `moderate`, `sensitive`)
`set_panaroo--core_threshold`	`0.95`	Fraction of samples a gene must appear in to be considered "core"
`set_panaroo--threshold`	`0.98`	Sequence identity threshold for clustering
`set_panaroo-a`	`core`	Alignment output type (`core`, `pan`)
`set_panaroo-f`	`0.7`	Protein family sequence identity threshold
`set_panaroo--remove-invalid-genes`	(flag set)	Exclude genes with unusual length or premature stop codons
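The effect of the core threshold can be sketched as a presence/absence calculation. This is an illustration under the assumption that a gene counts as core when the fraction of samples containing it is at least the threshold; it is not Panaroo's implementation, and `classify_genes` is a hypothetical helper.

```python
def classify_genes(presence, n_samples, core_threshold=0.95):
    """Split genes into core and accessory lists.

    presence maps each gene name to the set of samples that contain it.
    """
    core, accessory = [], []
    for gene, samples in presence.items():
        if len(samples) / n_samples >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


presence = {
    "dnaA": {"g1", "g2", "g3", "g4"},  # present in every genome
    "blaTEM": {"g1", "g3", "g4"},      # present in 3 of 4 genomes
}
print(classify_genes(presence, n_samples=4))  # (['dnaA'], ['blaTEM'])
```

With only a few genomes, a threshold of 0.95 effectively requires a gene to be present in every sample.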

Phylogenetics and taxonomy

`mashtree` — Mashtree

Computes an approximation of ANI using the MinHash distance measure and builds a neighbor-joining tree. Fast enough for hundreds of genomes. The resulting tree is unrooted.

Requires: N ≥ 2.

Parameter	Default	Description
`set_mashtree--genomesize`	`5000000`	Expected genome size (bp)
`set_mashtree--mindepth`	`5`	Minimum k-mer depth
`set_mashtree--kmerlength`	`21`	K-mer length
`set_mashtree--sketch-size`	`10000`	Sketch size for MinHash
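The link between the MinHash Jaccard estimate and ANI can be sketched with the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index and k the k-mer length. The code below is a minimal illustration of that formula, clamped to the range 0 to 1; it is not Mashtree's code, and the function names are hypothetical.

```python
import math


def mash_distance(jaccard, k=21):
    """Convert a Jaccard index between two k-mer sketches to a Mash distance."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard index must be between 0 and 1")
    if jaccard == 0.0:
        return 1.0  # no shared k-mers: maximum distance
    distance = -math.log(2 * jaccard / (1 + jaccard)) / k
    return min(1.0, max(0.0, distance))


def approximate_ani(jaccard, k=21):
    """Approximate ANI as one minus the Mash distance."""
    return 1.0 - mash_distance(jaccard, k)


print(round(approximate_ani(0.5), 4))  # 0.9807
```

Two sketches that share half of their k-mers (j = 0.5) therefore correspond to roughly 98% ANI at k = 21, which is why small differences in the Jaccard index matter for closely related genomes.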

`bootstrap_mashtree` — Mashtree

Mashtree with bootstrap support values. Inherits all set_mashtree parameters from above.

Requires: N ≥ 3.

Parameter	Default	Description
`set_bootstrap_mashtree--reps`	`100`	Number of bootstrap replicates

`fasttree` — FastTree

Builds an approximately-maximum-likelihood phylogenetic tree from the core genome alignment produced by Panaroo. Faster than IQ-TREE but with less rigorous statistical support.

Requires: N ≥ 3. Depends on Panaroo core genome alignment.

Parameter	Default	Description
`set_fasttree-gtr`	(flag set)	Use the generalized time-reversible (GTR) model

`iqtree` — IQ-TREE

Maximum-likelihood phylogenetic inference with bootstrap support from the core genome alignment. More thorough than FastTree, providing formal model selection and statistical branch support.

Requires: N ≥ 3. Depends on Panaroo core genome alignment.

Parameter	Default	Description
`set_iqtree--boot`	`100`	Number of bootstrap replicates
`set_iqtree-m`	`GTR`	Substitution model

`gtdbtk` — GTDB-Tk

Taxonomic classification using the Genome Taxonomy Database (GTDB). Assigns species names by measuring average nucleotide identity (ANI) and relative evolutionary divergence (RED) against reference sequences.

Requires: N ≥ 1. Downloads the GTDB database (~85 GB) on first run.

Parameter	Default	Description
`set_gtdbtk--keep_intermediates`	(flag set)	Retain intermediate files

`snp_dists` — snp-dists

Counts pairwise SNP differences on the core genome alignment. Note: SNP distances are not adjusted for transition/transversion bias, so they give a ballpark indication of divergence rather than a true evolutionary distance. The counts are also highly sensitive to the core/pan genome size ratio.

Requires: N ≥ 2. Depends on Panaroo core genome alignment.
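The counting described above can be sketched as follows for two aligned sequences. The sketch assumes that only columns where both sequences carry an unambiguous base (A, C, G, or T) are compared, so gaps and N are ignored; it is an illustration, not the snp-dists source, and `snp_distance` is a hypothetical helper.

```python
def snp_distance(seq_a, seq_b):
    """Count alignment columns where two sequences differ in an A/C/G/T base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    bases = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in bases and b in bases
    )


print(snp_distance("ACGT-N", "ACCTAN"))  # 1
```

Because the count is a raw number of differing columns, a longer core alignment yields larger distances for the same level of divergence.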

`treecluster` — TreeCluster

Clusters genomes on a phylogenetic tree using a distance threshold. Useful for defining operational taxonomic units or outbreak clusters.

Requires: N ≥ 2. Runs on the Mashtree output.

Parameter	Default	Description
`set_treecluster--method`	`max_clade`	Clustering method (see TreeCluster docs for all options)
`set_treecluster--threshold`	`0.05`	Distance threshold for cluster assignment

Dynamic report

The report is always generated and collects results from all completed analyses. Only sections for tools that ran successfully are included.

report — A portable HTML report with interpretable results and publication-ready graphics. See demo reports below.

Passthrough parameters

Any tool parameter can be forwarded from the CompareM2 config using the set_ prefix. The naming convention is set_<tool><flag>: <value>. Flag-only arguments (no value) use an empty string "".

For example, to change the IQ-TREE substitution model and increase bootstrap replicates:

# In config/config.yaml or via --config
set_iqtree-m: GTR+G
set_iqtree--boot: 1000

Or on the command line:

comparem2 --config set_iqtree-m=GTR+G set_iqtree--boot=1000

To add a flag argument (no value), set it to an empty string:

comparem2 --config 'set_prokka--rfam=""'

To remove a default passthrough parameter, delete (or comment out) the corresponding line in config/config.yaml. Default values for all passthrough parameters are listed in the tool sections above.
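The naming convention can be made concrete with a short sketch that turns `set_` keys into command-line arguments for one tool. This is a hypothetical helper written for illustration, assuming the tool name ends at the first dash of the key; it is not how CompareM2 necessarily implements passthrough internally.

```python
def passthrough_args(config, tool):
    """Build the argument list for one tool from set_<tool><flag> keys."""
    prefix = f"set_{tool}-"  # the flag starts at the first dash after the tool
    args = []
    for key, value in config.items():
        if not key.startswith(prefix):
            continue
        flag = key[len(prefix) - 1:]  # keep the leading dash or dashes
        args.append(flag)
        if value != "":  # an empty string marks a flag without a value
            args.append(str(value))
    return args


config = {
    "set_iqtree-m": "GTR+G",
    "set_iqtree--boot": 1000,
    "set_prokka--rfam": "",
}
print(passthrough_args(config, "iqtree"))  # ['-m', 'GTR+G', '--boot', '1000']
print(passthrough_args(config, "prokka"))  # ['--rfam']
```

Matching on the full `set_<tool>-` prefix keeps similarly named rules apart, so keys for `bootstrap_mashtree` are not picked up when collecting arguments for `mashtree`.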

Note

Check each tool's own documentation (linked above) for the full list of available flags.

Pseudo-rules

Pseudo-rules are shortcuts to run curated subsets of the pipeline:

Pseudo-rule	Included analyses
`fast`	sequence_lengths, assembly_stats, mashtree
`meta`	annotation, assembly_stats, sequence_lengths, checkm2, eggnog, kegg_pathway, dbcan, interproscan, gtdbtk, mashtree
`isolate`	annotation, assembly_stats, sequence_lengths, eggnog, kegg_pathway, gtdbtk, mlst, amrfinder, panaroo, fasttree, snp_dists, mashtree
`downloads`	All database download rules
`report`	Re-render the report

Hint

Run a pseudo-rule like any other rule: comparem2 --until meta or comparem2 --until isolate

Rendered report

These demo reports are available for download:

report_strachan_campylo.html — 32 Campylobacter genomes from Strachan et al. (Nature 2022, doi.org/10.1038/s41564-022-01300-y). Metagenome and genome sequencing from the rumen epithelial wall of dairy cattle.
report_Methanoflorens.html — 6 Methanoflorens (archaeal) genomes. Representatives of Bog-38 from GTDB.
