Workflow

This section describes the methodology used by SigProfilerClusters to identify and classify clustered mutations. The tool follows a four-stage process that distinguishes genuinely clustered mutations from those distributed randomly across the genome.

Minimum mutational burden

Reliable analysis requires a mutational burden greater than 0.005 mutations per megabase. Samples below this threshold may not produce meaningful clustering results, because the method relies on inter-mutational distances (IMDs).
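As a quick illustration of the burden check, the sketch below converts a mutation count into mutations per megabase and compares it with the 0.005 threshold. The function names and the 3,000,000,000 bp genome length used in the example are assumptions made for illustration; they are not part of SigProfilerClusters.

```python
MIN_BURDEN = 0.005  # threshold stated above, in mutations per megabase


def mutations_per_megabase(n_mutations: int, genome_length_bp: int) -> float:
    """Return the mutational burden in mutations per megabase."""
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    return n_mutations / (genome_length_bp / 1_000_000)


def meets_minimum_burden(n_mutations: int, genome_length_bp: int) -> bool:
    """True when the burden is strictly greater than MIN_BURDEN."""
    return mutations_per_megabase(n_mutations, genome_length_bp) > MIN_BURDEN


# Hypothetical genome length, used only for this example.
ASSUMED_GENOME_BP = 3_000_000_000
print(meets_minimum_burden(30, ASSUMED_GENOME_BP))
```

With the assumed genome length, the threshold corresponds to roughly 15 mutations genome-wide, so very sparse samples fall below it.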

Stage 1: Background Model Generation

A per-sample background model is created using SigProfilerSimulator. This tool randomizes the position of each mutation across the genome while preserving:

The desired trinucleotide sequence context
Transcriptional strand bias ratios
The mutational burden on each chromosome

A minimum of 100 simulations is recommended to ensure a robust background distribution.
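To make the idea of a context-preserving randomization concrete, here is a toy sketch in plain Python. It does not call SigProfilerSimulator and is not its implementation: the tiny `genome` dictionary, the function names, and the 0-based positions are all assumptions for illustration, and transcriptional strand bias is ignored. Each mutation is moved to a random site on the same chromosome that shares its trinucleotide context, so the per-chromosome burden and the context counts stay fixed.

```python
import random
from collections import defaultdict


def index_contexts(sequence):
    """Map each trinucleotide to the 0-based indexes of its central base."""
    index = defaultdict(list)
    for i in range(1, len(sequence) - 1):
        index[sequence[i - 1:i + 2]].append(i)
    return index


def simulate_once(genome, mutations, rng):
    """Relocate every (chrom, pos) to a random same-context site on the same chromosome."""
    indexes = {chrom: index_contexts(seq) for chrom, seq in genome.items()}
    simulated = []
    for chrom, pos in mutations:
        context = genome[chrom][pos - 1:pos + 2]
        # The original site is always a candidate, so the list is never empty.
        simulated.append((chrom, rng.choice(indexes[chrom][context])))
    return sorted(simulated)


# Toy data (hypothetical): two short "chromosomes" and three mutations.
genome = {"chr1": "ACGTACGTACGTACGT", "chr2": "TTGACCTTGACCTTGACC"}
mutations = [("chr1", 5), ("chr1", 9), ("chr2", 3)]

# 100 independent simulations, matching the recommended minimum above.
simulations = [simulate_once(genome, mutations, random.Random(seed)) for seed in range(100)]
```

Seeding one `random.Random` instance per run keeps each simulation reproducible, which is convenient when comparing background distributions across runs.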

Stage 2: IMD Threshold Calculation

SigProfilerClusters compares the distribution of real inter-mutational distances (IMDs) with the simulated background to compute a sample-specific IMD threshold. The threshold is the IMD cutoff at which at least 90% of the mutations falling below it are unlikely to have occurred by chance, i.e., their spacing cannot be explained by the background model.

This approach ensures that the clustering criterion adapts to the mutation burden and genomic context of each individual sample.
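The comparison of real and simulated IMDs can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used by SigProfilerClusters: the function names, the fixed list of candidate thresholds, and the use of the mean simulated count as the chance expectation are all hypothetical choices. For each candidate, the sketch estimates the fraction of real IMDs at or below it that exceeds what the simulations produce, and keeps the largest candidate where that fraction is at least 90%.

```python
from bisect import bisect_right


def inter_mutational_distances(positions):
    """Distances between consecutive mutations on one chromosome."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def imd_threshold(real_imds, simulated_imds_per_run, candidates, min_fraction=0.9):
    """Largest candidate where >= min_fraction of real IMDs at or below it
    are in excess of the mean count seen in the simulations."""
    real = sorted(real_imds)
    runs = [sorted(run) for run in simulated_imds_per_run]
    best = None
    for t in sorted(candidates):
        observed = bisect_right(real, t)  # real IMDs <= t
        if observed == 0:
            continue
        expected = sum(bisect_right(run, t) for run in runs) / len(runs)
        if (observed - expected) / observed >= min_fraction:
            best = t
    return best


# Toy data (hypothetical): five tight real IMDs, none that tight in the background.
real_imds = [5, 8, 10, 12, 20, 5000, 8000, 12000, 15000, 20000]
simulated = [
    [900, 4000, 9000, 13000, 16000, 21000, 30000, 42000, 50000],
    [700, 3000, 7000, 11000, 14000, 19000, 25000, 33000, 47000],
]
threshold = imd_threshold(real_imds, simulated, candidates=[10, 100, 1000, 10000])
```

Because the expectation comes from simulations of the same sample, a sample with a high burden naturally yields shorter background IMDs and therefore a smaller threshold.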

Stage 3: Regional Correction

The sample-specific IMD threshold is further refined by a localized correction applied in 1-megabase genomic windows. This step accounts for the uneven distribution of mutations across the genome: the threshold is adjusted in proportion to the mutation density of each region, which prevents over- or under-calling of clustered events in heterogeneous mutational landscapes.
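The windowing behind this correction can be illustrated with a small sketch that bins mutations into 1-megabase windows and reports each window's density relative to the chromosome-wide mean. The exact formula SigProfilerClusters uses to adjust the threshold is not given here, so the sketch stops at the density estimate; the function names and the assumption of 1-based positions are illustrative only.

```python
from collections import Counter

WINDOW_BP = 1_000_000  # 1-megabase windows, as described above


def window_counts(positions, window_bp=WINDOW_BP):
    """Count mutations per window; positions are assumed to be 1-based."""
    return Counter((pos - 1) // window_bp for pos in positions)


def relative_density(positions, chrom_length_bp, window_bp=WINDOW_BP):
    """Per-window mutation count divided by the mean count per window."""
    n_windows = -(-chrom_length_bp // window_bp)  # ceiling division
    if not positions:
        return {w: 0.0 for w in range(n_windows)}
    counts = window_counts(positions, window_bp)
    mean = len(positions) / n_windows
    return {w: counts.get(w, 0) / mean for w in range(n_windows)}


# Toy data (hypothetical): six mutations on a 4 Mb chromosome.
positions = [100, 200_000, 999_999, 1_000_000, 1_000_001, 2_500_000]
density = relative_density(positions, chrom_length_bp=4_000_000)
```

A value above 1 marks a window that is denser than average, which is the kind of region where a single sample-wide threshold would be most likely to miscall clustered events.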

Stage 4: Mutation Classification

After thresholding and correction, mutations are assigned to one of the following categories based on their inter-mutational spacing and, when available, variant allele frequencies (VAF):

Class	Category	Description
Non-clustered	—	Mutations whose spacing is consistent with random expectation
Clustered	DBS (Doublet Base Substitution)	Exactly 2 adjacent SNVs on consecutive chromosomal positions
Clustered	MBS (Multi-Base Substitution)	3 or more adjacent SNVs on consecutive chromosomal positions
Clustered	Omikli	2–3 clustered SNVs with IMD pattern consistent with a localized event
Clustered	Kataegis	4 or more clustered SNVs with IMD pattern consistent with a regional hypermutation event
Clustered	Other	Clustered mutations with inconsistent allele frequencies; only identified when VAF or cancer cell fraction (CCF) data are available

Indels are classified as clustered or non-clustered but are not further subclassified into the categories above.
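The size-based part of the SNV categories in the table can be sketched as a small grouping routine. This is a simplified illustration, not the tool's classifier: it works on one chromosome, uses a single IMD threshold without regional correction, ignores VAF (so the "Other" class is never assigned), and labels a group by adjacency and size only. The function name `classify` and the example threshold of 500 bp are assumptions.

```python
def classify(positions, imd_threshold):
    """Label each SNV position on one chromosome by cluster type."""
    ordered = sorted(positions)
    groups, current = [], ordered[:1]
    # Chain consecutive mutations whose spacing is within the threshold.
    for prev, pos in zip(ordered, ordered[1:]):
        if pos - prev <= imd_threshold:
            current.append(pos)
        else:
            groups.append(current)
            current = [pos]
    if current:
        groups.append(current)

    labels = {}
    for group in groups:
        adjacent = all(b - a == 1 for a, b in zip(group, group[1:]))
        if len(group) == 1:
            label = "non-clustered"
        elif adjacent:
            label = "DBS" if len(group) == 2 else "MBS"
        elif len(group) <= 3:
            label = "omikli"
        else:
            label = "kataegis"
        for pos in group:
            labels[pos] = label
    return labels


# Toy data (hypothetical positions and threshold).
snvs = [100, 101, 5000, 5001, 5002, 20000, 20150,
        40000, 40100, 40250, 40300, 90000]
labels = classify(snvs, imd_threshold=500)
```

In this toy input, positions 100 and 101 form a doublet, 5000 to 5002 a multi-base substitution, 20000 and 20150 an omikli event, the four mutations near 40000 a kataegis event, and 90000 is left non-clustered.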