Using SigProfilerClusters


This section describes SigProfilerClusters' main function for clustered mutation analysis, the accepted input file formats, and all available parameters.


Function

The main function in SigProfilerClusters is analysis. It partitions somatic mutations into clustered and non-clustered groups and, optionally, subclassifies clustered SNVs into DBS, MBS, Omikli, and Kataegis events.
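
To make the clustered versus non-clustered split concrete, here is a toy sketch that flags a mutation as clustered when its inter-mutational distance (IMD) to the nearest neighbour on the same chromosome falls at or below a cutoff. The function name `flag_clustered` and the fixed `imd_threshold` are assumptions made for illustration; this is not the package's own procedure, which relies on the simulated background model rather than a hand-picked cutoff.

```python
from collections import defaultdict


def flag_clustered(mutations, imd_threshold):
    """Split (chrom, pos) pairs into clustered and non-clustered lists.

    A mutation counts as clustered here when the distance to its nearest
    neighbour on the same chromosome is <= imd_threshold. The fixed
    threshold is an illustrative assumption only.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)

    clustered, non_clustered = [], []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for i, pos in enumerate(positions):
            # Distances to the previous and next mutation, where they exist.
            gaps = []
            if i > 0:
                gaps.append(pos - positions[i - 1])
            if i + 1 < len(positions):
                gaps.append(positions[i + 1] - pos)
            is_clustered = bool(gaps) and min(gaps) <= imd_threshold
            (clustered if is_clustered else non_clustered).append((chrom, pos))
    return clustered, non_clustered


# Made-up coordinates: two close mutations, one distant, one alone on chr 2.
muts = [("1", 100), ("1", 150), ("1", 90000), ("2", 500)]
clustered, non_clustered = flag_clustered(muts, imd_threshold=1000)
print(clustered)
print(non_clustered)
```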

Input files

SigProfilerClusters accepts the four input file types listed below. All input files for a given project must be placed in the same directory (input_path).

  • VCF — one file per sample, with the sample ID as the filename.
  • MAF — standard Mutation Annotation Format file.
  • Simple text file — tab-delimited plain text format, as described in SigProfilerMatrixGenerator.
  • ICGC Format — ICGC simple somatic mutation format.

Additionally, a background model generated by SigProfilerSimulator must be present in the same project directory before running the analysis function.
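
Because VCF input uses one file per sample with the sample ID as the filename, a quick way to preview which samples a run would pick up is to list the VCF filenames in input_path. The helper below is a minimal sketch using only the standard library; `list_vcf_samples` is a hypothetical name, and the `.vcf` extension is assumed.

```python
import tempfile
from pathlib import Path


def list_vcf_samples(input_path):
    """Return sample IDs inferred from the *.vcf filenames in input_path."""
    return sorted(p.stem for p in Path(input_path).glob("*.vcf"))


# Demo on a throwaway directory holding two hypothetical samples.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleB.vcf", "sampleA.vcf", "notes.txt"):
        Path(tmp, name).write_text("")
    samples = list_vcf_samples(tmp + "/")

print(samples)
```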

Running the function

First, start a Python interactive shell and import SigProfilerClusters:

$ python
>>> from SigProfilerClusters import SigProfilerClusters as hp

Then call the analysis function with the required parameters:

>>> hp.analysis(project, genome, contexts, simContext, input_path)

You can also run SigProfilerClusters from the command line:

$ SigProfilerClusters analysis project genome contexts simContext input_path

Required parameters

| Parameter | Variable Type | Parameter Description |
| --- | --- | --- |
| project | String | Unique name for the given project |
| genome | String | Reference genome to use. Must be installed using SigProfilerMatrixGenerator. Supported genomes: GRCh37, GRCh38, mm9, mm10 |
| contexts | String | Mutational context to analyze. Accepted values: "96" (substitutions) or "ID" (indels) |
| simContext | List of Strings | Mutation context used when generating the background model with SigProfilerSimulator. For example: ["288"], ["6144"], or ["96"] |
| input_path | String | Path to the directory containing the input files. Must end with / (e.g., "path/to/the/input_file/") |
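
The constraints on the required parameters can be checked before starting a long run. The following is a sketch, not part of SigProfilerClusters: `check_required` is a hypothetical helper, the accepted values are copied from the descriptions above, and "BRCA_example" is a placeholder project name.

```python
SUPPORTED_GENOMES = {"GRCh37", "GRCh38", "mm9", "mm10"}
SUPPORTED_CONTEXTS = {"96", "ID"}


def check_required(project, genome, contexts, simContext, input_path):
    """Return a list of problems found in the required arguments."""
    problems = []
    if not isinstance(project, str) or not project:
        problems.append("project must be a non-empty string")
    if genome not in SUPPORTED_GENOMES:
        problems.append(f"unsupported genome: {genome!r}")
    if contexts not in SUPPORTED_CONTEXTS:
        problems.append(f"contexts must be '96' or 'ID', got {contexts!r}")
    if not (
        isinstance(simContext, list)
        and simContext
        and all(isinstance(c, str) for c in simContext)
    ):
        problems.append("simContext must be a non-empty list of strings")
    if not (isinstance(input_path, str) and input_path.endswith("/")):
        problems.append("input_path must be a string ending with '/'")
    return problems


ok = check_required("BRCA_example", "GRCh37", "96", ["288"], "path/to/the/input_file/")
bad = check_required("BRCA_example", "hg19", "96", "288", "path/to/the/input_file")
print(ok)
print(bad)
```

In the second call, the genome name, the bare string passed as simContext, and the missing trailing slash are each reported.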

Optional parameters

| Parameter | Variable Type | Parameter Description |
| --- | --- | --- |
| analysis | String | Analysis pipeline to run. Options: "all" (default), "subClassify", "hotspot" |
| sortSims | Boolean | Sort simulated files to ensure accuracy. Default: True |
| interdistance | String | Mutation types for which to calculate IMDs. Use only for indel analysis. Default: "ID" |
| calculateIMD | Boolean | Whether to calculate IMDs. Set to False to rerun subclassification only. Default: True |
| max_cpu | Integer | Number of CPUs to use for parallel processing. Default: all available CPUs |
| subClassify | Boolean | Subclassify clustered mutations into DBS, MBS, Omikli, and Kataegis. When includedVAFs=True, VAF scores in TCGA/Sanger format are required. Default: False |
| plotIMDfigure | Boolean | Generate IMD and mutational spectra plots for each sample. Default: True |
| plotRainfall | Boolean | Generate rainfall plots using subclassified clustered events. Default: True |
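
As an example of combining these options, the sketch below prepares the keyword arguments for two passes: a full run with subclassification, then a rerun of the subclassification only with calculateIMD=False. All argument values are placeholders, `run` is a hypothetical wrapper, and the second pass assumes the first has already completed in the same project directory.

```python
# Required arguments; every value here is a placeholder.
required = dict(
    project="my_project",
    genome="GRCh37",
    contexts="96",
    simContext=["288"],
    input_path="path/to/the/input_file/",
)

# Pass 1: full pipeline with subclassification on four CPUs.
first_pass = dict(required, subClassify=True, max_cpu=4)

# Pass 2: skip the IMD calculation and redo only the subclassification.
second_pass = dict(first_pass, calculateIMD=False)


def run(kwargs):
    """Call hp.analysis with the given keyword arguments.

    The import is deferred so this file can be loaded and inspected
    on a machine where SigProfilerClusters is not installed.
    """
    from SigProfilerClusters import SigProfilerClusters as hp

    hp.analysis(**kwargs)


print(second_pass)
```

Calling `run(first_pass)` followed by `run(second_pass)` would execute the two passes.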

Parameters used when subClassify=True

| Parameter | Variable Type | Parameter Description |
| --- | --- | --- |
| includedVAFs | Boolean | Indicates that VAF scores are present in the input files and should be used for subclassification. Default: True |
| includedCCFs | Boolean | Indicates that Cancer Cell Fraction (CCF) estimates are used instead of VAFs. When True, set includedVAFs=False. Default: True |
| variant_caller | String | Format of the VAF column in the input VCF. Accepted values: "standard" (default), "caveman", "mutect2" |
| windowSize | Integer | Window size (in bp) for calculating mutation density in rainfall plots. Default: 10000000 |
| correction | Boolean | Apply genome-wide mutational density correction to the IMD threshold. Default: False |
| probability | Boolean | Calculate the probability of observing each clustered event within its local genomic region. Results are appended as an extra column in the output VCF files under [project_path]/output/clustered/. Default: False |

VAF Formats

SigProfilerClusters uses variant allele frequencies (VAF) to subclassify clustered mutations when subClassify=True and includedVAFs=True. The VAF may be recorded in different formats depending on the variant caller used. Select the appropriate variant_caller value using the table below:

| variant_caller value | VAF column location |
| --- | --- |
| "caveman" | 11th column; last colon-delimited value (e.g., 0:0:0.25) |
| "standard" | 8th or 10th column as VAF=xx or AF=xx |
| "mutect2" | 10th or 11th column as AF=xx |

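To check which variant_caller value matches your files, it can help to pull the VAF from a single record by hand. The sketch below follows the column locations in the table above. It is not the parser SigProfilerClusters uses: `extract_vaf` is a hypothetical helper, and it assumes that tags inside a column are separated by semicolons.

```python
def _find_tag(field, tags):
    """Return the float after TAG= in a ';'-delimited field, or None."""
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        if sep and key in tags:
            return float(value)
    return None


def extract_vaf(vcf_line, variant_caller="standard"):
    """Extract the VAF from one tab-delimited VCF record, or return None."""
    cols = vcf_line.rstrip("\n").split("\t")
    if variant_caller == "caveman":
        # 11th column, last colon-delimited value.
        if len(cols) < 11:
            return None
        return float(cols[10].split(":")[-1])
    if variant_caller == "standard":
        candidates, tags = (7, 9), ("VAF", "AF")  # 8th or 10th column
    elif variant_caller == "mutect2":
        candidates, tags = (9, 10), ("AF",)  # 10th or 11th column
    else:
        raise ValueError(f"unknown variant_caller: {variant_caller!r}")
    for idx in candidates:
        if idx < len(cols):
            vaf = _find_tag(cols[idx], tags)
            if vaf is not None:
                return vaf
    return None


# Made-up records, one per format.
standard_line = "1\t12345\t.\tC\tT\t.\tPASS\tDP=50;VAF=0.31"
caveman_line = "\t".join(
    ["1", "12345", ".", "C", "T", ".", "PASS", ".", "GT:X:PM", "0:0:0.01", "0:0:0.25"]
)
mutect2_line = "\t".join(["1", "12345", ".", "C", "T", ".", "PASS", ".", "GT", "AF=0.12"])

vaf_standard = extract_vaf(standard_line, "standard")
vaf_caveman = extract_vaf(caveman_line, "caveman")
vaf_mutect2 = extract_vaf(mutect2_line, "mutect2")
print(vaf_standard, vaf_caveman, vaf_mutect2)
```

If the helper returns None for your records under every setting, the files probably fall under the "No VAF available" case.
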
No VAF available

If your input files contain no VAF information, set includedVAFs=False. With subClassify=True, SigProfilerClusters then subclassifies clusters based solely on the calculated IMD values.

Non-VCF input

If you are using a non-VCF input format (MAF, simple text, ICGC), VAFs cannot be extracted from the file. To subclassify clustered mutations from such input, set subClassify=True and includedVAFs=False.

CCF Format

As an alternative to VAFs, SigProfilerClusters accepts cancer cell fraction (CCF) estimates to correct for copy number amplifications. To use CCFs:

  • Set includedCCFs=True and includedVAFs=False.
  • Add the CCF value as a tab-separated extra column at the end of each variant row in the input VCF file.
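
The second step can be scripted. The sketch below appends a CCF value as a final tab-separated column to each variant row and passes header lines through unchanged. `append_ccf_column` and the `ccf_by_variant` lookup are hypothetical, and the CCF values themselves must come from your own estimates. Whether the column header line also needs an extra field is not covered here.

```python
def append_ccf_column(vcf_lines, ccf_by_variant):
    """Yield VCF lines with the CCF appended as a last tab-separated column.

    ccf_by_variant maps (chrom, pos, ref, alt) to a CCF estimate. Lines
    starting with '#' are yielded unchanged. A variant missing from the
    lookup raises KeyError rather than being written without a CCF.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        key = (cols[0], cols[1], cols[3], cols[4])
        yield line + "\t" + str(ccf_by_variant[key])


# Made-up record and CCF estimate.
lines = ["##fileformat=VCFv4.2", "1\t12345\t.\tC\tT\t.\tPASS\t."]
ccfs = {("1", "12345", "C", "T"): 0.85}
out = list(append_ccf_column(lines, ccfs))

# Flags from the first step, to be passed to the analysis function.
ccf_kwargs = {"subClassify": True, "includedCCFs": True, "includedVAFs": False}
print(out[1])
```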