Using SigProfilerExtractor - Input
Refer to this page to learn more about SigProfilerExtractor's functions and the different parameters that are accepted.
Functions
There are four functions supported by SigProfilerExtractor.
- importdata
- sigProfilerExtractor
- estimate_best_solution
- decompose [deprecated]
importdata
The importdata function returns the path to the example data provided with SigProfilerExtractor. This data can be used to run the example program and confirm that the environment is set up correctly.
To import the data, first import the package within your python script or from within an interactive python3 session:
$ python3
>>> from SigProfilerExtractor import sigpro as sig
Next, choose the data type that you would like to import:
data = sig.importdata(datatype)
| Parameter | Variable Type | Parameter Description |
|---|---|---|
| datatype | String | The format of example data to be imported. Accepted values include: |
sigProfilerExtractor
This function extracts mutational signatures from an array of samples. To run the function, first import the package within your python script or from within an interactive python3 session:
$ python3
>>> from SigProfilerExtractor import sigpro as sig
Now, you are able to extract signatures from your samples. The required parameters for the function are:
sig.sigProfilerExtractor(input_type, output, input_data)
The full list of parameters is included in the following table:
| Category | Parameter | Variable Type | Parameter Description |
|---|---|---|---|
| Input Data | | | |
| | input_type | String | The type of input, for example "vcf" or "table" (see input_data). |
| | output | String | The name of the output folder. The output folder will be generated in the current working directory. |
| | input_data | String | Name of the input folder (for "vcf" type input) or the input file (for "table" type input). The project file or folder should be inside the current working directory. For "vcf" type input, the project must be a folder containing the VCF files in VCF or text format. For "text" type input, the project must be a file. |
| | reference_genome | String | The name of the reference genome. The default reference genome is "GRCh37". This parameter is applicable only if the input_type is "vcf". |
| | opportunity_genome | String | The build or version of the reference signatures for the reference genome. The default opportunity genome is "GRCh37". If the input_type is "vcf", the opportunity genome automatically matches the reference_genome value. |
| | context_type | String | A string of one or more mutation context names separated by commas (","). The items in the list define the mutational contexts considered when extracting signatures. The default value is "96,DINUC,ID", where "96" is the SBS96 context, "DINUC" is the dinucleotide context, and "ID" is the indel context. |
| | exome | Boolean | Defines whether exomes will be extracted. The default value is False. |
| NMF Replicates | | | |
| | minimum_signatures | Positive Integer | The minimum number of signatures to be extracted. The default value is 1. |
| | maximum_signatures | Positive Integer | The maximum number of signatures to be extracted. The default value is 25. |
| | nmf_replicates | Positive Integer | The number of iterations to be performed for each number of signatures. The default value is 100. |
| | resample | Boolean | Default is True. If True, Poisson noise is added to the samples by resampling. |
| | seeds | String | Used to obtain reproducible resamples for the NMF replicates. Accepts the path to a tab-separated .txt file containing the replicate IDs and preset seeds in two columns. The Seeds.txt file in the results folder from a previous analysis can be passed as the seeds parameter in a new analysis. The default value is "random", in which case the resampling seeds differ between analyses. |
| NMF Engines | | | |
| | matrix_normalization | String | Method used to normalize the genome matrix before it is analyzed by NMF. Default is "gmm". Other options are "log2", "custom", or "none". |
| | nmf_init | String | The initialization algorithm for the W and H matrices of NMF. Options are "random", "nndsvd", "nndsvda", "nndsvdar", and "nndsvd_min". Default is "random". |
| | precision | String | Values should be "single" or "double". Default is "single". |
| | min_nmf_iterations | Integer | The minimum number of iterations to be completed before NMF converges. Default is 10000. |
| | max_nmf_iterations | Integer | The maximum number of iterations to be completed before NMF converges. Default is 1000000. |
| | nmf_test_conv | Integer | The number of iterations to be completed between convergence checks. Default is 10000. |
| | nmf_tolerance | Float | The tolerance that must be reached for NMF to converge. Default is 1e-15. |
| Execution | | | |
| | cpu | Integer | The number of processors to be used to extract the signatures. The default value is -1, which uses all available processors. |
| | gpu | Boolean | Defines whether GPU resources will be used if available. Default is False. If True, the GPU resources will be used in the computation. Note: All available CPU processors are used by default, which may cause a memory error. This error can be resolved by reducing the number of CPU processes through the cpu parameter. |
| | batch_size | Integer | Effective only if the GPU is used. Defines the number of NMF replicates to be performed by each CPU during parallel processing. Default is 1. |
| Solution Estimation Thresholds | | | |
| | stability | Float | Default is 0.8. The cutoff threshold of the average stability. Solutions with average stabilities below this threshold will not be considered. |
| | min_stability | Float | Default is 0.2. The cutoff threshold of the minimum stability. Solutions with minimum stabilities below this threshold will not be considered. |
| | combined_stability | Float | Default is 1.0. The cutoff threshold of the combined stability (sum of average and minimum stability). Solutions with combined stabilities below this threshold will not be considered. |
| Decomposition | | | |
| | cosmic_version | Float | Takes a positive float among 1, 2, 3, 3.1, 3.2. Default is 3.1. Defines the version of the COSMIC reference signatures. |
| | de_novo_fit_penalty | Float | Takes any positive float. Default is 0.02. Defines the weak (remove) threshold cutoff for assigning de novo signatures to a sample. |
| | nnls_add_penalty | Float | Takes any positive float. Default is 0.05. Defines the strong (add) threshold cutoff for assigning COSMIC signatures to a sample. |
| | nnls_remove_penalty | Float | Takes any positive float. Default is 0.01. Defines the weak (remove) threshold cutoff for assigning COSMIC signatures to a sample. |
| | initial_remove_penalty | Float | Takes any positive float. Default is 0.05. Defines the initial weak (remove) threshold cutoff for assigning COSMIC signatures to a sample. |
| | refit_denovo_signatures | Boolean | Default is True. If True, the de novo signatures are refit with NNLS. |
| | make_decomposition_plots | Boolean | Default is True. If True, de novo to COSMIC signature decomposition plots will be created as part of the results. |
| | collapse_to_SBS96 | Boolean | Default is True. If True, SBS288 and SBS1536 de novo signatures will be mapped to SBS96 reference signatures. If False, they will be mapped to reference signatures of the same context. |
| Others | | | |
| | get_all_signature_matrices | Boolean | If True, the Ws and Hs from all the NMF iterations are generated in the output. |
| | export_probabilities | Boolean | Default is True. If False, the probability matrix is not created. |
To get help on the parameters and outputs of the sigProfilerExtractor function, use help(sig.sigProfilerExtractor).
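As a rough illustration of how the optional parameters combine with the three required ones, the sketch below gathers a few of them into keyword arguments and checks the comma-separated context_type string before the call. The helpers build_extractor_kwargs and run_extraction are hypothetical names introduced here, not part of SigProfilerExtractor; the parameter names and default values are the ones listed in the table. The package import is deferred so that the argument handling can be checked without starting an extraction.

```python
def build_extractor_kwargs(context_type="96,DINUC,ID",
                           minimum_signatures=1,
                           maximum_signatures=25,
                           nmf_replicates=100,
                           cpu=-1):
    """Hypothetical helper: collect a subset of the documented parameters."""
    # context_type is a single string of context names separated by commas.
    contexts = [name.strip() for name in context_type.split(",")]
    if not all(contexts):
        raise ValueError("context_type must be a comma-separated list of names")
    if not 1 <= minimum_signatures <= maximum_signatures:
        raise ValueError("expected 1 <= minimum_signatures <= maximum_signatures")
    return {
        "context_type": ",".join(contexts),
        "minimum_signatures": minimum_signatures,
        "maximum_signatures": maximum_signatures,
        "nmf_replicates": nmf_replicates,
        "cpu": cpu,
    }


def run_extraction(input_type, output, input_data, **kwargs):
    """Hypothetical wrapper around the documented call."""
    # Imported lazily so this module loads even if the package is not installed.
    from SigProfilerExtractor import sigpro as sig
    sig.sigProfilerExtractor(input_type, output, input_data, **kwargs)


# Example call with placeholder folder names (assumptions, not real paths):
# run_extraction("vcf", "example_output", "my_vcf_folder",
#                **build_extractor_kwargs(maximum_signatures=5, cpu=4))
```

Lowering maximum_signatures, nmf_replicates, or cpu in this way is a common first step for a quick test run, since the defaults request up to 25 signatures with 100 replicates each on all available processors.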
estimate_best_solution
This function estimates the optimum solution (rank) among solutions with different numbers of signatures (ranks). To use the function, first import the package within your python script or from within an interactive python3 session:
$ python3
>>> from SigProfilerExtractor import estimate_best_solution as ebs
To estimate the optimum solution, run the following:
ebs.estimate_solution()
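The bare call relies on the default value of every parameter. As a hedged sketch, the same call can be written with the defaults spelled out so that individual values are easy to override. DEFAULTS, estimate_kwargs, and run_estimate are hypothetical names used only for this illustration; the keys and values are copied from the parameter table on this page.

```python
# Defaults copied from the parameter table on this page.
DEFAULTS = {
    "base_csvfile": "All_solutions_stat.csv",
    "All_solution": "All_Solutions",
    "genomes": "Samples.txt",
    "output": "results",
    "title": "Selection_Plot",
    "stability": 0.8,
    "min_stability": 0.2,
    "combined_stability": 1.0,
    "exome": False,
}


def estimate_kwargs(**overrides):
    """Hypothetical helper: merge user overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def run_estimate(**overrides):
    """Hypothetical wrapper around the documented call."""
    # Imported lazily so this module loads even if the package is not installed.
    from SigProfilerExtractor import estimate_best_solution as ebs
    return ebs.estimate_solution(**estimate_kwargs(**overrides))


# Example (placeholder output folder name): run_estimate(output="rerun_results")
```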
Reference the table below for the full list of parameters:
| Parameter | Variable Type | Parameter Description |
|---|---|---|
| base_csvfile | String | Default is "All_solutions_stat.csv". Path to a csv file that contains the statistics of all solutions. |
| All_solution | String | Default is "All_Solutions". Path to a folder that contains the results of all solutions. |
| genomes | String | Default is "Samples.txt". Path to a tab-delimited file that contains the mutation counts of all genomes for the different mutation types. |
| output | String | Default is "results". Path to the output folder. |
| title | String | Default is "Selection_Plot". Sets the title of selection_plot.pdf. |
| stability | Float | Default is 0.8. The cutoff threshold of the average stability. Solutions with average stabilities below this threshold will not be considered. |
| min_stability | Float | Default is 0.2. The cutoff threshold of the minimum stability. Solutions with minimum stabilities below this threshold will not be considered. |
| combined_stability | Float | Default is 1.0. The cutoff threshold of the combined stability (sum of average and minimum stability). Solutions with combined stabilities below this threshold will not be considered. |
| exome | Boolean | Default is False. Defines whether exome samples are used. |
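To make the three stability cutoffs concrete, here is a minimal sketch of how they would exclude candidate solutions if applied exactly as the table describes them. It is an illustration only, not the library's selection code: the function name passes_stability_thresholds is hypothetical, and the stability values for the candidate ranks are invented for the example.

```python
def passes_stability_thresholds(avg_stability, lowest_stability,
                                stability=0.8,
                                min_stability=0.2,
                                combined_stability=1.0):
    """Return True if a solution clears all three documented cutoffs."""
    return (
        avg_stability >= stability                   # average stability cutoff
        and lowest_stability >= min_stability        # minimum stability cutoff
        and avg_stability + lowest_stability >= combined_stability
    )


# Invented (average, minimum) stabilities keyed by number of signatures.
candidates = {
    2: (0.95, 0.90),  # clears every cutoff
    3: (0.85, 0.10),  # minimum stability below 0.2
    4: (0.70, 0.65),  # average stability below 0.8
}

eligible = [rank for rank, (avg, low) in candidates.items()
            if passes_stability_thresholds(avg, low)]
print(eligible)
```

With the default values, any solution that clears the average and minimum cutoffs also clears the combined cutoff, so combined_stability only becomes the deciding factor when it is raised above the sum of the other two.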
decompose
The functionality of the decompose function has been deprecated in SigProfilerExtractor and is now being actively maintained in SigProfilerAssignment. Please visit SigProfilerAssignment to learn more.