Workflow


This section describes the methodology used by SigProfilerClusters to identify and classify clustered mutations. The tool follows a four-stage process that distinguishes genuinely clustered mutations from those distributed randomly across the genome.

Minimum mutational burden

Reliable analysis requires a mutational burden greater than 0.005 mutations per megabase. Samples below this threshold may not produce meaningful clustering results, because too few inter-mutational distances (IMDs) are available to compare against the background.
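As a quick illustration of the burden check, the sketch below converts a mutation count into mutations per megabase and compares it with the 0.005 threshold stated above. The function names and the genome size used in the example (about 3.1 billion bases) are assumptions made for illustration; they are not part of SigProfilerClusters.

```python
MIN_BURDEN = 0.005  # threshold from the text, in mutations per megabase


def mutations_per_megabase(n_mutations: int, genome_size_bp: int) -> float:
    """Return the mutational burden in mutations per megabase (1 Mb = 1,000,000 bp)."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return n_mutations / (genome_size_bp / 1_000_000)


def meets_minimum_burden(n_mutations: int, genome_size_bp: int,
                         min_burden: float = MIN_BURDEN) -> bool:
    """Hypothetical helper: True if the sample exceeds the minimum burden."""
    return mutations_per_megabase(n_mutations, genome_size_bp) > min_burden


# Assumed genome size of ~3.1 Gb, used only for this example.
GENOME_SIZE_BP = 3_100_000_000
print(meets_minimum_burden(16, GENOME_SIZE_BP))
print(meets_minimum_burden(15, GENOME_SIZE_BP))
```

Under this assumed genome size, the threshold corresponds to roughly 15.5 mutations genome-wide, so a sample needs at least 16 mutations to pass.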


Stage 1: Background Model Generation

A per-sample background model is created using SigProfilerSimulator. This tool randomizes the position of each mutation across the genome while preserving:

  • The desired trinucleotide sequence context
  • Transcriptional strand bias ratios
  • The mutational burden on each chromosome

A minimum of 100 simulations is recommended to ensure a robust background distribution.
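The idea of a shuffled background can be sketched with a deliberately simplified toy model. The code below is not SigProfilerSimulator and does not call it: it only preserves the number of mutations on each chromosome, and it ignores sequence context and transcriptional strand bias, both of which the real tool maintains. The function name, input layout, and chromosome length are assumptions for illustration.

```python
import random
from collections import Counter


def simulate_background(mutations, chrom_lengths, n_simulations=100, seed=0):
    """Toy background model (hypothetical helper).

    mutations: list of (chrom, pos) tuples with 1-based positions.
    chrom_lengths: dict mapping chrom -> length in bp.
    Returns a list of simulated mutation lists, one per simulation.
    """
    rng = random.Random(seed)
    per_chrom = Counter(chrom for chrom, _ in mutations)
    simulations = []
    for _ in range(n_simulations):
        sim = []
        for chrom, count in per_chrom.items():
            # Draw distinct positions so the per-chromosome burden is preserved.
            positions = rng.sample(range(1, chrom_lengths[chrom] + 1), count)
            sim.extend((chrom, p) for p in sorted(positions))
        simulations.append(sim)
    return simulations


real = [("chr1", 100), ("chr1", 110), ("chr1", 250_000), ("chr2", 5_000)]
lengths = {"chr1": 1_000_000, "chr2": 500_000}  # made-up lengths
background = simulate_background(real, lengths, n_simulations=100)
print(len(background), Counter(c for c, _ in background[0]))
```

Each simulated sample has the same per-chromosome counts as the real one, which is the only property this toy version shares with the real background model.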

Stage 2: IMD Threshold Calculation

SigProfilerClusters compares the distribution of real inter-mutational distances (IMD) with the simulated background to compute a sample-specific IMD threshold. The threshold is chosen so that at least 90% of mutations with an IMD at or below it are unlikely to have occurred by chance; that is, their spacing cannot be explained by the background model.

This approach ensures that the clustering criterion adapts to the mutation burden and genomic context of each individual sample.
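A minimal sketch of this comparison follows. It assumes that the share of short IMDs attributable to chance can be estimated as the average simulated count divided by the observed count, and it picks the largest candidate threshold at which at least 90% of observed IMDs are not explained by chance. This is one simple reading of the criterion; the statistical procedure used by SigProfilerClusters may differ, and all function names here are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def inter_mutational_distances(mutations):
    """Return the sorted distances between adjacent mutations on each chromosome."""
    by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)
    imds = []
    for positions in by_chrom.values():
        positions.sort()
        imds.extend(b - a for a, b in zip(positions, positions[1:]))
    return sorted(imds)


def imd_threshold(real, simulations, min_clustered_fraction=0.90):
    """Largest IMD t such that at least `min_clustered_fraction` of real IMDs <= t
    exceed what the simulations produce on average. Returns None if no t qualifies."""
    if not simulations:
        raise ValueError("at least one simulation is required")
    real_imds = inter_mutational_distances(real)
    sim_imds = [inter_mutational_distances(s) for s in simulations]
    best = None
    for t in sorted(set(real_imds)):
        observed = bisect_right(real_imds, t)  # real IMDs <= t (always >= 1)
        expected = sum(bisect_right(s, t) for s in sim_imds) / len(sim_imds)
        if 1 - expected / observed >= min_clustered_fraction:
            best = t
    return best


real = [("chr1", 100), ("chr1", 110), ("chr1", 120),
        ("chr1", 500_000), ("chr1", 900_000)]
simulated = [[("chr1", 1_000), ("chr1", 300_000), ("chr1", 600_000),
              ("chr1", 800_000), ("chr1", 950_000)]]
print(imd_threshold(real, simulated))
```

In this toy input, only the two 10-bp distances are shorter than anything the single simulation produces, so the threshold settles at 10.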

Stage 3: Regional Correction

The global IMD threshold is then refined by a localized correction applied across 1-megabase genomic windows. This step accounts for the uneven distribution of mutations across the genome: the threshold is adjusted in proportion to each region's mutation density, which prevents over- or under-calling of clustered events in heterogeneous mutational landscapes.
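The windowing step can be illustrated with a short sketch that counts mutations per 1-megabase window and rescales a global threshold by each window's density relative to the mean. The scaling rule used here (dividing the threshold by the relative density, so that denser windows receive a stricter threshold) is an assumption chosen for illustration; the text above does not specify the exact formula or its direction, and the function names are hypothetical.

```python
from collections import Counter

WINDOW_SIZE = 1_000_000  # 1-megabase windows, as described in the text


def window_counts(mutations, window_size=WINDOW_SIZE):
    """Count mutations per (chrom, window_index), using 1-based positions."""
    return Counter((chrom, (pos - 1) // window_size) for chrom, pos in mutations)


def regional_thresholds(mutations, global_threshold, window_size=WINDOW_SIZE):
    """Assumed correction: threshold scales inversely with relative density.

    The mean is taken over windows that contain at least one mutation.
    """
    counts = window_counts(mutations, window_size)
    if not counts:
        return {}
    mean_count = sum(counts.values()) / len(counts)
    return {window: global_threshold * mean_count / n
            for window, n in counts.items()}


muts = [("chr1", 10), ("chr1", 20), ("chr1", 30), ("chr1", 1_500_000)]
thresholds = regional_thresholds(muts, global_threshold=1000)
for window, value in sorted(thresholds.items()):
    print(window, round(value, 1))
```

With this rule, the window holding three mutations gets a threshold below the global value, while the sparse window gets one above it.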

Stage 4: Mutation Classification

After thresholding and correction, mutations are assigned to one of the following categories based on their inter-mutational spacing and, when available, variant allele frequencies (VAF):

Class         | Category                          | Description
Non-clustered | -                                 | Mutations whose spacing is consistent with random expectation
Clustered     | DBS (Doublet Base Substitution)   | Exactly 2 adjacent SNVs on consecutive chromosomal positions
Clustered     | MBS (Multi-Base Substitution)     | 3 or more adjacent SNVs on consecutive chromosomal positions
Clustered     | Omikli                            | 2 or 3 clustered SNVs with an IMD pattern consistent with a localized event
Clustered     | Kataegis                          | 4 or more clustered SNVs with an IMD pattern consistent with a regional hypermutation event
Clustered     | Other                             | Clustered mutations with inconsistent allele frequencies; only identified when VAF or cancer cell fraction (CCF) data is available

Indels are classified as clustered or non-clustered but are not further subclassified into the categories above.
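The category rules for SNVs can be summarized in a small sketch. It takes the positions of the SNVs in a single clustered event (already selected by the IMD threshold) and returns a category name. The function name, the boolean `vaf_consistent` flag, and the precedence of the adjacency check over the VAF check are assumptions made for illustration, and events that mix adjacent and non-adjacent SNVs are treated as non-adjacent for simplicity.

```python
def classify_cluster(positions, vaf_consistent=True):
    """Hypothetical classifier for one clustered SNV event on one chromosome.

    positions: chromosomal positions of the SNVs in the event.
    vaf_consistent: False when allele frequencies within the event disagree
                    (only meaningful when VAF data is available).
    """
    positions = sorted(positions)
    n = len(positions)
    if n < 2:
        return "Non-clustered"
    # DBS/MBS: every SNV sits directly next to the previous one.
    all_adjacent = all(b - a == 1 for a, b in zip(positions, positions[1:]))
    if all_adjacent:
        return "DBS" if n == 2 else "MBS"
    if not vaf_consistent:
        return "Other"
    return "Omikli" if n <= 3 else "Kataegis"


print(classify_cluster([100, 101]))
print(classify_cluster([100, 101, 102]))
print(classify_cluster([100, 350, 900]))
print(classify_cluster([100, 350, 900, 1400]))
print(classify_cluster([100, 350], vaf_consistent=False))
```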