Using SigProfilerMatrixGenerator

SigProfilerMatrixGenerator works in conjunction with other SigProfiler tools but can also be run alone on input datasets. This section goes over the function arguments, input files, and all the output folders and files in detail.


From within a Python session, you can now generate the matrices as follows:

$ python3
>>> from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
>>> matrices = matGen.SigProfilerMatrixGeneratorFunc(project, genome, vcfFiles, exome=False, bed_file=None, chrom_based=False, plot=False, tsb_stat=False, seqInfo=False, cushion=100)

From within an R session, you can now generate the matrices as follows:

$ R
> library("reticulate")
> use_python("path_to_your_python3")
> py_config()
> library("SigProfilerMatrixGeneratorR")
> matrices <- SigProfilerMatrixGeneratorR("BRCA", "GRCh37", "/Users/ebergstr/Desktop/BRCA/", plot=T, exome=F, bed_file=NULL, chrom_based=F, tsb_stat=F, seqInfo=F, cushion=100)

Function Arguments

These are the acceptable parameters that can be passed into the function call.

Required:
  • project: Project name for this instance of matrix generation.
    Type: string
    Example: "alexandrov_lab_test_1"

  • genome: Reference genome to use for the matrix generation.
    Type: string
    Example: "GRCh37"

  • vcfFiles: Full path to the folder that contains the saved input files. This folder is also used as the output folder.
    Type: string
    Example: "/Users/test/Desktop/alexandrov_lab_test_1"

Optional:
  • exome: Downsamples mutational matrices to the exome regions of the genome.
    Type: boolean
    Default: False
    Example: exome=True

  • bed_file: Downsamples mutational matrices to custom regions of the genome. Requires the full path to the BED file.
    Type: string
    Default: None
    Example: bed_file="/Users/test/Desktop/bed_files/sample_1.bed"

  • chrom_based: Outputs chromosome-based matrices.
    Type: boolean
    Default: False
    Example: chrom_based=True

  • plot: Integrates with SigProfilerPlotting to output all available visualizations for each matrix.
    Type: boolean
    Default: False
    Example: plot=True

  • tsb_stat: Outputs the results of a transcriptional strand bias test for the respective matrices.
    Type: boolean
    Default: False
    Example: tsb_stat=True

  • seqInfo: Outputs the original mutations into a text file that contains the SigProfilerMatrixGenerator classification for each mutation.
    Type: boolean
    Default: False
    Example: seqInfo=True

  • cushion: Adds a cushion of X base pairs to the exome/bed_file ranges when downsampling the mutations.
    Type: integer
    Default: 100
    Example: cushion=250

All string arguments must be surrounded by quotation marks (e.g. "test"), and all boolean arguments must be True or False.
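
To make the cushion argument more concrete, here is a minimal sketch of padding a set of ranges before checking whether a mutation falls inside them. It is not the tool's own code: the helper names `add_cushion` and `in_ranges`, the `(chrom, start, end)` tuple layout, and the choice to pad both ends of each range symmetrically are all assumptions made for illustration.

```python
def add_cushion(ranges, cushion=100):
    """Pad each (chrom, start, end) range by `cushion` bp on both sides.

    Assumption: the padding is symmetric and start positions never go below 0.
    """
    padded = []
    for chrom, start, end in ranges:
        padded.append((chrom, max(0, start - cushion), end + cushion))
    return padded


def in_ranges(chrom, pos, ranges):
    """Return True if the position lies inside any range on that chromosome."""
    return any(c == chrom and s <= pos <= e for c, s, e in ranges)


# Hypothetical exome range and mutation position, for illustration only.
exome_ranges = [("1", 1000, 2000)]
padded_ranges = add_cushion(exome_ranges, cushion=250)

print(in_ranges("1", 2200, exome_ranges))   # outside the unpadded range
print(in_ranges("1", 2200, padded_ranges))  # inside once the cushion is added
```

Under these assumptions, a larger cushion keeps more mutations that sit just outside the exome or BED ranges.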

Input File

This tool currently supports the following formats:

  • MAF
    Mutation Annotation Format [example.maf]

  • VCF
    Variant Call Format [example.vcf]
    If files are in .vcf format, each sample must be saved as a separate file.

  • ICGC

  • Simple text file [example.txt]

The user must provide variant data adhering to one of these four formats.
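
Before running the tool, it can help to check which format each input file appears to be. The sketch below guesses the format from the file extension alone, using only the example extensions listed above. The helper `guess_format` is hypothetical and is not part of SigProfilerMatrixGenerator. Since no extension is given for ICGC files, they return None here, as does any other unrecognized extension.

```python
from pathlib import PurePosixPath

# Extensions taken from the examples in the format list above.
FORMAT_BY_EXTENSION = {
    ".maf": "MAF",
    ".vcf": "VCF",
    ".txt": "simple text",
}


def guess_format(filename):
    """Guess the input format from the extension, or return None if unknown."""
    suffix = PurePosixPath(filename).suffix.lower()
    return FORMAT_BY_EXTENSION.get(suffix)


# Hypothetical file names, for illustration only.
for name in ["example.maf", "sample_1.vcf", "example.txt", "unknown.dat"]:
    print(name, "->", guess_format(name))
```

This only inspects file names. It does not verify the file contents, nor that each VCF file holds a single sample.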

Folder Structure

The final output is divided into three folders:

  • Input: Contains copies of the user-provided input files.

  • Logs: Contains the error and log files for the submitted job.
    All errors are saved in the sigProfilerMatrixGenerator_[project]_[genome].err file and all progress checkpoints are saved in the sigProfilerMatrixGenerator_[project]_[genome].out file within the specified output folder.

  • Output: Contains the DBS, SBS, INDEL, TSB, plots, and vcf_files folders. All matrices are saved in the appropriate folders.
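
The layout described above can be sketched as a small helper that builds the expected paths for a given project and genome. The function `expected_layout` is hypothetical. The folder names are copied from the text as written, so the exact spelling and case on disk may differ, and placing the two log files inside the Logs folder is an assumption.

```python
from pathlib import PurePosixPath

# Subfolders of the Output folder, as listed in the text above.
OUTPUT_SUBFOLDERS = ["DBS", "SBS", "INDEL", "TSB", "plots", "vcf_files"]


def expected_layout(base_dir, project, genome):
    """Return the paths the text describes for one matrix generation run."""
    base = PurePosixPath(base_dir)
    log_stem = f"sigProfilerMatrixGenerator_{project}_{genome}"
    return {
        "input": base / "Input",
        "logs": base / "Logs",
        # Assumption: both log files live inside the Logs folder.
        "error_log": base / "Logs" / f"{log_stem}.err",
        "progress_log": base / "Logs" / f"{log_stem}.out",
        "output": [base / "Output" / name for name in OUTPUT_SUBFOLDERS],
    }


layout = expected_layout("/Users/test/Desktop/alexandrov_lab_test_1",
                         "alexandrov_lab_test_1", "GRCh37")
print(layout["error_log"])
print(layout["progress_log"])
```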

File Extensions

All output files have a file extension that indicates which arguments were passed in as True. By default, the files have the .all file extension. The remaining file extensions are explained below.

  • .exome
    The exome argument was passed in as True; the file contains all the mutations mapped to the exome.

  • .region
    The bed_file argument was passed in as a string; the file contains all the mutations mapped to the regions in the input bed_file.

  • .chrx, where x denotes the chromosome (e.g. chr1, chrA)
    The chrom_based argument was passed in as True; each file contains all the mutations mapped to one chromosome.
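
The extension rules above can be sketched as a small lookup that reports which argument produced a given output file. The helper `explain_extension` and the file names in the example are hypothetical. Only the extensions themselves (.all, .exome, .region, .chrx) come from the text.

```python
def explain_extension(filename):
    """Map an output file's extension to the argument that produced it."""
    if "." not in filename:
        return None
    suffix = filename.rsplit(".", 1)[-1]
    if suffix == "all":
        return "default"
    if suffix == "exome":
        return "exome=True"
    if suffix == "region":
        return "bed_file=<path to BED file>"
    if suffix.startswith("chr") and len(suffix) > 3:
        # Everything after "chr" names the chromosome.
        return f"chrom_based=True (chromosome {suffix[3:]})"
    return None


# Hypothetical output file names, for illustration only.
for name in ["BRCA.SBS96.all", "BRCA.SBS96.exome",
             "BRCA.SBS96.region", "BRCA.SBS96.chr1"]:
    print(name, "->", explain_extension(name))
```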