Using SigProfilerClusters
This section describes SigProfilerClusters' main function for clustered mutation analysis, the accepted input file formats, and all available parameters.
Function
The main function in SigProfilerClusters is analysis. It partitions somatic mutations into clustered and non-clustered groups and, optionally, subclassifies clustered SNVs into DBS, MBS, Omikli, and Kataegis events.
Input files
SigProfilerClusters accepts four input file types. All input files for a given project must be placed in the same directory (input_path):
- VCF — one file per sample, with the sample ID as the filename.
- MAF — standard Mutation Annotation Format file.
- Simple text file — tab-delimited plain text format, as described in SigProfilerMatrixGenerator.
- ICGC Format — ICGC simple somatic mutation format.
Additionally, a background model generated by SigProfilerSimulator must be present in the same project directory before running the analysis function.
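The one-file-per-sample VCF convention described above can be checked before running the analysis. The following is a minimal sketch, not part of SigProfilerClusters: the helper name `sample_ids_from_vcfs` and the assumption that VCF files carry a `.vcf` extension are both hypothetical.

```python
from pathlib import Path
import tempfile


def sample_ids_from_vcfs(input_path):
    """Hypothetical helper: list sample IDs taken from VCF filenames.

    Assumes each sample is stored as <sample_id>.vcf directly inside input_path.
    """
    directory = Path(input_path)
    if not directory.is_dir():
        raise NotADirectoryError(f"{input_path} is not a directory")
    return sorted(
        p.stem
        for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".vcf"
    )


# Demonstration on a temporary directory with two VCFs and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleB.vcf", "sampleA.vcf", "notes.txt"):
        (Path(tmp) / name).write_text("")
    demo_ids = sample_ids_from_vcfs(tmp)

print(demo_ids)
```

Listing the sample IDs this way makes it easy to spot a misnamed or missing file before the tool is started.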
Running the function
First, start a Python interactive shell and import SigProfilerClusters:
$ python
>>> from SigProfilerClusters import SigProfilerClusters as hp
Then call the analysis function with the required parameters:
>>> hp.analysis(project, genome, contexts, simContext, input_path)
You can also run SigProfilerClusters from the command line:
$ SigProfilerClusters analysis project genome contexts simContext input_path
Required parameters
| Parameter | Variable Type | Parameter Description |
|---|---|---|
| project | String | Unique name for the given project |
| genome | String | Reference genome to use. Must be installed using SigProfilerMatrixGenerator. Supported genomes: GRCh37, GRCh38, mm9, mm10 |
| contexts | String | Mutational context for the analysis. Accepted values: "96" (substitutions) or "ID" (indels) |
| simContext | List of Strings | Mutation context used when generating the background model with SigProfilerSimulator. For example: ["288"], ["6144"], or ["96"] |
| input_path | String | Path to the directory containing the input files. Must end with / (e.g., "path/to/the/input_file/") |
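As a minimal sketch of how the required parameters fit together, the helper below checks a set of arguments against the constraints listed in the table before they are handed to analysis. The helper `check_required_args`, the project name, and the path are hypothetical; only the parameter names and accepted values come from the table.

```python
SUPPORTED_GENOMES = {"GRCh37", "GRCh38", "mm9", "mm10"}
ACCEPTED_CONTEXTS = {"96", "ID"}


def check_required_args(project, genome, contexts, simContext, input_path):
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems = []
    if not isinstance(project, str) or not project:
        problems.append("project must be a non-empty string")
    if genome not in SUPPORTED_GENOMES:
        problems.append(f"genome must be one of {sorted(SUPPORTED_GENOMES)}")
    if contexts not in ACCEPTED_CONTEXTS:
        problems.append(f"contexts must be one of {sorted(ACCEPTED_CONTEXTS)}")
    if not (
        isinstance(simContext, list)
        and simContext
        and all(isinstance(c, str) for c in simContext)
    ):
        problems.append("simContext must be a non-empty list of strings")
    if not isinstance(input_path, str) or not input_path.endswith("/"):
        problems.append("input_path must be a string ending with '/'")
    return problems


# Example values; the project name and path are placeholders.
args = ("example_project", "GRCh37", "96", ["288"], "path/to/the/input_file/")
problems = check_required_args(*args)

if problems:
    print("Fix before running:", problems)
else:
    print("Arguments look consistent")
    # With the package installed, the call would then be:
    # from SigProfilerClusters import SigProfilerClusters as hp
    # hp.analysis(*args)
```

The check only covers what the table states. It does not verify that the genome is installed or that a background model exists in the project directory.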
Optional parameters
| Parameter | Variable Type | Parameter Description |
|---|---|---|
| analysis | String | Analysis pipeline to run. Options: "all" (default), "subClassify", "hotspot" |
| sortSims | Boolean | Sort simulated files to ensure accuracy. Default: True |
| interdistance | String | Mutation types for which to calculate IMDs. Use only for indel analysis. Default: "ID" |
| calculateIMD | Boolean | Whether to calculate IMDs. Set to False to rerun subclassification only. Default: True |
| max_cpu | Integer | Number of CPUs to use for parallel processing. Default: all available CPUs |
| subClassify | Boolean | Subclassify clustered mutations into DBS, MBS, Omikli, and Kataegis (requires VAF scores in TCGA/Sanger format if includedVAFs=True). Default: False |
| plotIMDfigure | Boolean | Generate IMD and mutational spectra plots for each sample. Default: True |
| plotRainfall | Boolean | Generate rainfall plots using subclassified clustered events. Default: True |
Parameters used when subClassify=True
| Parameter | Variable Type | Parameter Description |
|---|---|---|
| includedVAFs | Boolean | Indicates that VAF scores are present in the input files and should be used for subclassification. Default: True |
| includedCCFs | Boolean | Indicates that Cancer Cell Fraction (CCF) estimates are used instead of VAFs. When True, set includedVAFs=False. Default: True |
| variant_caller | String | Format of the VAF column in the input VCF. Accepted values: "standard" (default), "caveman", "mutect2" |
| windowSize | Integer | Window size (in bp) for calculating mutation density in rainfall plots. Default: 10000000 |
| correction | Boolean | Apply genome-wide mutational density correction to the IMD threshold. Default: False |
| probability | Boolean | Calculate the probability of observing each clustered event within its local genomic region. Results are appended as an extra column in the output VCF files under [project_path]/output/clustered/. Default: False |
VAF Formats
SigProfilerClusters uses variant allele frequencies (VAF) to subclassify clustered mutations when subClassify=True and includedVAFs=True. The VAF may be recorded in different formats depending on the variant caller used. Select the appropriate variant_caller value using the table below:
| variant_caller value | VAF column location |
|---|---|
| "caveman" | 11th column; last colon-delimited value (e.g., 0:0:0.25) |
| "standard" | 8th or 10th column as VAF=xx or AF=xx |
| "mutect2" | 10th or 11th column as AF=xx |
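To make the column locations in the table concrete, here is an illustrative sketch of how a VAF could be read from a single tab-delimited VCF row for each variant_caller value. It is not the tool's own parsing code: the function `extract_vaf`, the regular expressions, and the example rows are assumptions made for illustration.

```python
import re

# "standard" rows carry VAF=xx or AF=xx; "mutect2" rows carry AF=xx.
_VAF_OR_AF = re.compile(r"\b(?:VAF|AF)=([0-9]*\.?[0-9]+)")
_AF_ONLY = re.compile(r"\bAF=([0-9]*\.?[0-9]+)")

# 1-based column numbers, as given in the table.
_COLUMNS = {"caveman": (11,), "standard": (8, 10), "mutect2": (10, 11)}


def extract_vaf(vcf_row, variant_caller="standard"):
    """Hypothetical parser: return the VAF as a float, or None if not found."""
    if variant_caller not in _COLUMNS:
        raise ValueError(f"unknown variant_caller: {variant_caller!r}")
    fields = vcf_row.rstrip("\n").split("\t")
    for column in _COLUMNS[variant_caller]:
        if len(fields) < column:
            continue  # row is too short to contain this column
        value = fields[column - 1]
        if variant_caller == "caveman":
            # Last colon-delimited value, e.g. "0:0:0.25" -> 0.25
            try:
                return float(value.split(":")[-1])
            except ValueError:
                continue
        pattern = _AF_ONLY if variant_caller == "mutect2" else _VAF_OR_AF
        match = pattern.search(value)
        if match:
            return float(match.group(1))
    return None


# Made-up example rows, one per format.
standard_row = "1\t100\t.\tA\tT\t.\tPASS\tDP=30;VAF=0.40"
caveman_row = "1\t100\t.\tA\tT\t.\tPASS\tDP=30\tGT:X:PM\t0|0:0:0.0\t0|1:0:0.25"
mutect2_row = "1\t100\t.\tA\tT\t.\tPASS\tDP=30\tGT\tAF=0.12"

print(extract_vaf(standard_row, "standard"))
print(extract_vaf(caveman_row, "caveman"))
print(extract_vaf(mutect2_row, "mutect2"))
```

Running a check like this on a few rows is a quick way to confirm which variant_caller value matches a given set of VCF files.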
No VAF available
If your input files contain no VAF information, set includedVAFs=False. SigProfilerClusters will still subclassify clusters based solely on the calculated IMD values (when subClassify=True).
Non-VCF input
If you are using a non-VCF input format (MAF, simple text, ICGC), VAFs cannot be extracted from the file. In this case, set subClassify=True and includedVAFs=False.
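The two cases above reduce to a small decision rule, sketched here under stated assumptions. The helper `subclassification_flags` and the format labels "vcf", "maf", "text", and "icgc" are hypothetical names chosen for this example; only subClassify and includedVAFs are actual parameters. CCF handling (includedCCFs) is deliberately left out.

```python
KNOWN_FORMATS = {"vcf", "maf", "text", "icgc"}


def subclassification_flags(input_format, has_vaf):
    """Hypothetical helper: pick subClassify/includedVAFs for a subclassification run.

    VAFs are only used when the input is VCF and the files actually contain them.
    """
    fmt = input_format.lower()
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown input format: {input_format!r}")
    use_vafs = fmt == "vcf" and has_vaf
    return {"subClassify": True, "includedVAFs": use_vafs}


print(subclassification_flags("vcf", has_vaf=True))
print(subclassification_flags("maf", has_vaf=True))
```

The returned dictionary could be passed to analysis as keyword arguments alongside the required parameters.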
CCF Format
As an alternative to VAFs, SigProfilerClusters accepts cancer cell fraction (CCF) estimates to correct for copy number amplifications. To use CCFs:
- Set includedCCFs=True and includedVAFs=False.
- Add the CCF value as a tab-separated extra column at the end of each variant row in the input VCF file.
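Appending the CCF column can be scripted. The following is a minimal sketch, assuming CCF estimates are available as a lookup keyed by (chromosome, position, ref, alt). The function `append_ccf_column`, the lookup structure, and the choice to leave header lines untouched are all assumptions, since the required header handling and number formatting are not specified here.

```python
def append_ccf_column(vcf_lines, ccf_by_variant):
    """Hypothetical helper: append a tab-separated CCF value to each variant row.

    ccf_by_variant maps (chrom, pos, ref, alt) -> CCF estimate.
    Header lines (starting with '#') and blank lines are passed through unchanged.
    """
    output = []
    for line in vcf_lines:
        row = line.rstrip("\n")
        if not row or row.startswith("#"):
            output.append(row)
            continue
        fields = row.split("\t")
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        if key not in ccf_by_variant:
            raise KeyError(f"no CCF estimate for variant {key}")
        output.append(f"{row}\t{ccf_by_variant[key]}")
    return output


# Made-up example input.
lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tT\t.\tPASS\tDP=30",
]
ccfs = {("1", 100, "A", "T"): 0.85}

with_ccf = append_ccf_column(lines, ccfs)
print(with_ccf[-1])
```

Raising an error for a variant without a CCF estimate keeps rows from silently ending up with different column counts.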