Using the Tool - SBS, ID, DBS Input

SigProfilerMatrixGenerator generates mutational matrices for Single Base Substitutions (SBS), Insertions/Deletions (ID), and Doublet Base Substitutions (DBS) from input variant files.

Python Usage

From within a Python session, generate matrices as follows:

python3
>>> from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
>>> matrices = matGen.SigProfilerMatrixGeneratorFunc("test", "GRCh37", "/Users/test/Desktop/test/")

R Usage

From within an R session:

R
> library("reticulate")
> use_python("path_to_your_python3")
> py_config()
> library("SigProfilerMatrixGeneratorR")
> matrices <- SigProfilerMatrixGeneratorR("test", "GRCh37", "/Users/test/Desktop/test/")

Function Arguments

Required Parameters

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| project | string | Project name for this instance of matrix generation | "alexandrov_lab_test_1" |
| genome | string | Reference genome to use | "GRCh37" |
| vcfFiles | string | Full path to the input files folder | "/Users/test/Desktop/test/" |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| exome | boolean | False | Downsamples mutational matrices to exome regions |
| bed_file | string | None | Path to a BED file for custom region downsampling |
| chrom_based | boolean | False | Outputs chromosome-based matrices |
| plot | boolean | False | Integrates with SigProfilerPlotting for visualizations |
| tsb_stat | boolean | False | Outputs transcriptional strand bias test results |
| seqInfo | boolean | False | Outputs the original mutations with their SigProfilerMatrixGenerator classification |
| cushion | integer | 100 | Number of base pairs added as a cushion around the exome/bed_file ranges |

Note: All string arguments must be surrounded by quotation marks (e.g., "test"), and all boolean arguments must be True or False.
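
The optional parameters above are passed as keyword arguments. As a minimal sketch, the helpers below (build_options and generate are hypothetical names, not part of the package) apply the type rules from the note before forwarding the options. The sketch assumes the keyword names match the parameter names in the table, and it defers the package import so the check itself runs without the package installed.

```python
# Defaults copied from the Optional Parameters table above.
OPTIONAL_DEFAULTS = {
    "exome": False,
    "bed_file": None,
    "chrom_based": False,
    "plot": False,
    "tsb_stat": False,
    "seqInfo": False,
    "cushion": 100,
}


def build_options(**overrides):
    """Hypothetical helper: merge overrides into the defaults and check types."""
    unknown = set(overrides) - set(OPTIONAL_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    options = {**OPTIONAL_DEFAULTS, **overrides}
    for name, value in options.items():
        if name == "bed_file":
            if value is not None and not isinstance(value, str):
                raise TypeError("bed_file must be a string path or None")
        elif name == "cushion":
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError("cushion must be an integer")
        elif not isinstance(value, bool):
            raise TypeError(f"{name} must be True or False")
    return options


def generate(project, genome, input_dir, **overrides):
    """Validate the options, then call the matrix generator."""
    options = build_options(**overrides)
    # Imported here so build_options can be used without the package installed.
    from SigProfilerMatrixGenerator.scripts import (
        SigProfilerMatrixGeneratorFunc as matGen,
    )
    return matGen.SigProfilerMatrixGeneratorFunc(project, genome, input_dir, **options)
```

With these helpers, a call such as generate("test", "GRCh37", "/Users/test/Desktop/test/", exome=True, plot=True) would fail early if a boolean were passed as the string "True".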


Input File Formats

This tool supports the following input formats:

| Format | Description | Example |
| --- | --- | --- |
| MAF | Mutation Annotation Format | example.maf |
| VCF | Variant Call Format (each sample as a separate file) | example.vcf |
| ICGC | ICGC submission format | ICGC docs |
| Simple text | Tab-delimited text file | See Simple Text File Format below |

Simple Text File Format

The simple text format requires the following columns:

- Sample name
- Chromosome
- Position
- Reference allele
- Alternate allele
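
As a rough illustration of the columns listed above, the sketch below writes a tab-delimited file with one mutation per row. The header names, the presence of a header row, and the example records are all assumptions made for illustration; the tool may expect a specific header or additional columns, so compare the result against an example file distributed with the tool before using it as input.

```python
import csv

# Column order follows the list above; the header names themselves are an
# assumption for illustration, not the tool's confirmed header.
COLUMNS = ["Sample", "Chromosome", "Position", "Ref", "Alt"]


def write_simple_text(path, rows):
    """Write mutation rows as a tab-delimited text file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        for row in rows:
            if len(row) != len(COLUMNS):
                raise ValueError(f"Expected {len(COLUMNS)} fields, got {len(row)}")
            writer.writerow(row)


# Made-up example records: (sample, chromosome, position, ref, alt).
example_rows = [
    ("sample1", "1", 12345, "C", "T"),
    ("sample1", "X", 67890, "G", "A"),
]
```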


Output Folder Structure

The final output is divided into three folders:

Input

Contains copies of the user-provided input files.

Logs

Contains error and log files for the submitted job:

- sigProfilerMatrixGenerator_[project]_[genome].err
- sigProfilerMatrixGenerator_[project]_[genome].out

Output

Contains the following subfolders:

- DBS/: Doublet base substitution matrices
- SBS/: Single base substitution matrices
- ID/: Insertion/deletion matrices
- TSB/: Transcriptional strand bias results
- plots/: Generated visualizations
- vcf_files/: Processed VCF files
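
The log file pattern and subfolder names above can be turned into a quick post-run check. This is only a sketch: the function names are hypothetical, and it assumes that which subfolders actually appear depends on the input and the options used, so a missing folder is not necessarily an error.

```python
from pathlib import Path


def expected_log_files(project, genome):
    """Log file names following the pattern listed under Logs."""
    stem = f"sigProfilerMatrixGenerator_{project}_{genome}"
    return [f"{stem}.err", f"{stem}.out"]


# Subfolder names as listed under Output.
OUTPUT_SUBFOLDERS = ["DBS", "SBS", "ID", "TSB", "plots", "vcf_files"]


def missing_output_subfolders(output_dir):
    """Return the listed subfolders that do not exist under output_dir."""
    base = Path(output_dir)
    return [name for name in OUTPUT_SUBFOLDERS if not (base / name).is_dir()]
```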


File Extensions

Output files have extensions indicating which arguments were passed:

| Extension | Description |
| --- | --- |
| .all | Default: all mutations |
| .exome | Mutations mapped to exome regions (exome=True) |
| .region | Mutations mapped to custom BED file regions (bed_file provided) |
| .chrX | Chromosome-specific mutations (chrom_based=True) |
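
When sorting through output files, the extension table can be expressed as a small lookup. The sketch below is illustrative only: describe_extension is a hypothetical helper, and it assumes the X in .chrX stands for a chromosome name appended directly after "chr".

```python
import re

# Descriptions taken from the File Extensions table above.
FIXED_EXTENSIONS = {
    "all": "all mutations (default)",
    "exome": "mutations mapped to exome regions",
    "region": "mutations mapped to custom BED file regions",
}


def describe_extension(filename):
    """Map an output file's final extension to its meaning, or None if unknown."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return None
    if ext in FIXED_EXTENSIONS:
        return FIXED_EXTENSIONS[ext]
    # Assumption: the X in ".chrX" is a chromosome name such as 1, 22, X or Y.
    match = re.fullmatch(r"chr(\w+)", ext)
    if match:
        return f"chromosome-specific mutations (chromosome {match.group(1)})"
    return None
```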