Using SigProfilerTopography - Input
Refer to this section for details about generating topography analyses for nucleosome occupancy, histone occupancy, CTCF transcription factor occupancy, replication timing, replication strand asymmetry, transcription strand asymmetry, genic versus intergenic regions, and strand-coordinated mutagenesis. These analyses can be generated with the runAnalyses function. This section covers the available functions and gives a detailed list of valid parameter values.
Functions¶
There are five functions supported by SigProfilerTopography.
- install_nucleosome
- install_atac_seq
- install_repli_seq
- install_example_data
- runAnalyses
install_nucleosome¶
The function install_nucleosome imports the nucleosome library file that is necessary for nucleosome occupancy analyses.
To install the nucleosome data, first import the package within your python script or from within an interactive python3 session:
$ python3
>>> from SigProfilerTopography import Topography as topography
Next, choose the genome that you would like to import:
By default, install_nucleosome imports nucleosome data of the K562 cell line for the GRCh37 and GRCh38 genome assemblies.
topography.install_nucleosome(genome)
| Parameter | Variable Type | Optional/Required | Parameter Description |
|---|---|---|---|
| genome | String | Required | The genome assembly of nucleosome data to be imported. Accepted values include: {"GRCh37", "GRCh38", "mm10"}. |
| biosample | String | Optional | The biosample of nucleosome data to be imported. Accepted values include: {"K562", "GM12878"}. |
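As a minimal sketch, the accepted values listed in the table above can be checked before install_nucleosome is called. The helper name check_nucleosome_args is hypothetical and is not part of SigProfilerTopography; the accepted values are copied from the table, and passing biosample by keyword in the commented-out call is an assumption based on the parameter name shown there.

```python
# Accepted values copied from the parameter table above.
ACCEPTED_GENOMES = {"GRCh37", "GRCh38", "mm10"}
ACCEPTED_NUCLEOSOME_BIOSAMPLES = {"K562", "GM12878"}


def check_nucleosome_args(genome, biosample=None):
    """Hypothetical helper: raise ValueError for values the table does not list."""
    if genome not in ACCEPTED_GENOMES:
        raise ValueError(f"Unsupported genome: {genome!r}")
    if biosample is not None and biosample not in ACCEPTED_NUCLEOSOME_BIOSAMPLES:
        raise ValueError(f"Unsupported nucleosome biosample: {biosample!r}")
    return genome, biosample


if __name__ == "__main__":
    genome, biosample = check_nucleosome_args("GRCh37", "GM12878")
    print(genome, biosample)
    # With SigProfilerTopography installed, the validated values could then be
    # passed on (keyword usage is an assumption, see the note above):
    # from SigProfilerTopography import Topography as topography
    # topography.install_nucleosome(genome, biosample=biosample)
```

Validating up front gives a clear error message before any download is attempted.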
install_atac_seq¶
The function install_atac_seq imports the open chromatin library file that is necessary for epigenomics analyses.
To install the open chromatin data, first import the package within your python script or from within an interactive python3 session:
$ python3
>>> from SigProfilerTopography import Topography as topography
Next, choose the genome that you would like to import:
By default, install_atac_seq imports open chromatin data of breast epithelium tissue for GRCh37 and left lung tissue for GRCh38.
topography.install_atac_seq(genome)
| Parameter | Variable Type | Optional/Required | Parameter Description |
|---|---|---|---|
| genome | String | Required | The genome assembly of open chromatin data to be imported. Accepted values include: {"GRCh37", "GRCh38", "mm10"}. |
install_repli_seq¶
The function install_repli_seq imports the replication timing library file that is necessary for replication timing analyses.
To install the replication data, first import the package within your python script or from within an interactive python3 session:
$ python3
>>> from SigProfilerTopography import Topography as topography
Next, choose the genome that you would like to import:
By default, install_repli_seq imports replication time data of MCF7 and IMR90 for GRCh37 and GRCh38, respectively.
topography.install_repli_seq(genome, biosample=None)
| Parameter | Variable Type | Optional/Required | Parameter Description |
|---|---|---|---|
| genome | String | Required | The genome assembly of replication time data to be imported. Accepted values include: |
| biosample | String | Optional | The biosample of replication time data to be imported. Accepted values include: {"MCF7", "HEPG2", "HELAS3", "SKNSH", "K562", "IMR90", "NHEK", "BJ", "HUVEC", "BG02ES", "GM12878", "GM06990", "GM12801", "GM12812", "GM12813", "HEK293", "HCT116", "A549", "CAKI2", "G401", "T47D", "SKNMC", "NCIH460" }. |
install_example_data¶
The function install_example_data imports the example data that is provided by SigProfilerTopography. This data can be used to run the example program and ensure that the environment is set up.
$ python3
>>> from SigProfilerTopography import Topography as topography
Next, import the example data:
topography.install_example_data()
This call imports 21BRCA.zip under the current working directory. Once 21BRCA.zip has been downloaded, unzip the file. The unzipped 21BRCA folder contains two folders: 21BRCA_vcfs and 21BRCA_probabilities. The folder 21BRCA_vcfs contains 21 VCF files (one per breast cancer sample), and 21BRCA_probabilities contains probability matrix files for single base substitutions and doublet base substitutions.
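The unzip step can also be done from Python with the standard library. The sketch below is illustrative only: the function name extract_example_data is hypothetical, and it assumes the archive unpacks into a top-level 21BRCA folder, as the description above suggests.

```python
import zipfile
from pathlib import Path


def extract_example_data(zip_path, dest_dir="."):
    """Unzip the archive and return the sorted sub-folder names of the 21BRCA folder."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Assumption: the archive contains a top-level folder named 21BRCA.
    root = dest / "21BRCA"
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Only runs when the archive has already been downloaded.
    if Path("21BRCA.zip").exists():
        print(extract_example_data("21BRCA.zip"))
```

If the returned list contains 21BRCA_probabilities and 21BRCA_vcfs, the example data is in the layout described above.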
runAnalyses¶
$ python3
>>> from SigProfilerTopography import Topography as topography
Now, you are able to run topography analyses for your samples. Here is an example of a call to runAnalyses that generates all of the different analyses.
topography.runAnalyses(genome,
inputDir,
outputDir,
jobname,
numofSimulations,
epigenomics=True,
nucleosome=True,
replication_time=True,
strand_bias=True,
processivity=True)
SigProfilerTopography's runAnalyses function makes it possible to produce different analyses in the same run. Depending on which parameters are provided, the function will generate a combination of analyses from: nucleosome occupancy, epigenomics occupancy, replication timing, replication strand asymmetry, transcription strand asymmetry, genic versus intergenic regions, and strand-coordinated mutagenesis (processivity).
The full list of parameters is detailed in the following table.
| Parameter Name | Variable Type | Optional/Required | Parameter Description |
|---|---|---|---|
| genome | String | Required | The reference genome used for the topography analyses. Accepted values include: {"GRCh37", "GRCh38", "mm10"}. |
| inputDir | String | Required | The path to the directory containing the input files. SigProfilerTopography accepts all input files that SigProfilerMatriXGenerator can process. |
| outputDir | String | Required | The path of the directory where the output will be saved. If this directory doesn't exist, a new one will be created. |
| jobname | String | Required | The name of the directory containing all of the outputs under outputDir/jobname. If this directory doesn't exist, a new one will be created. |
| numofSimulations | Integer | Required | The number of simulations to be created. |
| epigenomics | Boolean | Optional | Generate epigenomics analysis when True. By default, this is set to False. |
| nucleosome | Boolean | Optional | Generate nucleosome occupancy analysis when True. By default, this is set to False. |
| replication_time | Boolean | Optional | Generate replication timing analysis when True. By default, this is set to False. |
| strand_bias | Boolean | Optional | Generate replication and transcription strand asymmetry analysis when True. By default, this is set to False. |
| replication_strand_bias | Boolean | Optional | Generate replication strand asymmetry analysis when True. By default, this is set to False. |
| transcription_strand_bias | Boolean | Optional | Generate transcription strand asymmetry analysis (including genic versus intergenic regions) when True. By default, this is set to False. |
| processivity | Boolean | Optional | Generate strand-coordinated mutagenesis when True. By default, this is set to False. |
| epigenomics_files | List of Strings | Optional | Python list of paths for each epigenomics library file utilized in the epigenomics analysis. By default, epigenomics files of open chromatin, CTCF and histone modifications attained from "breast_epithelium" and "lung" tissue are utilized for GRCh37 and GRCh38, respectively. |
| epigenomics_dna_elements | List of Strings | Optional | Python list of unique DNA element names for the epigenomics files utilized in the epigenomics analysis. Each DNA element name must be contained in at least one epigenomics library filename. E.g., the DNA element is 'CTCF' for the epigenomics file 'ENCFF782GCQ_breast_epithelium_Normal_CTCF-human.bed'. By default, the DNA elements ['H3K27me3', 'H3K36me3', 'H3K9me3', 'H3K27ac', 'H3K4me1', 'H3K4me3', 'CTCF', 'ATAC'] are utilized for GRCh37 and GRCh38. If epigenomics_files is provided by the user, then epigenomics_dna_elements is mandatory. |
| epigenomics_biosamples | List of Strings | Optional | Python list of unique biosample names for the epigenomics files utilized in the epigenomics analyses. Each biosample name must be contained in at least one epigenomics library filename. E.g., biosample is 'breast_epithelium' for the epigenomics file of 'ENCFF782GCQ_breast_epithelium_Normal_CTCF-human.bed'. By default, "breast_epithelium" and "lung" biosamples are utilized for GRCh37 and GRCh38, respectively. Biosamples are shown in the epigenomics heatmaps if plot_detailed_epigemomics_heatmaps is set to True. |
| nucleosome_biosample | String | Optional | Biosample that will be used for nucleosome occupancy analysis. Analysis can be done by using either K562 or GM12878 cell line from ENCODE. By default, the K562 cell line is used for GRCh37 and GRCh38. |
| nucleosome_file | String | Optional | The path to the nucleosome occupancy library file that will be used for the analysis. By default, nucleosome occupancy file (MNase-seq) of K562 cell line is used for GRCh37 and GRCh38. |
| replication_time_biosample | String | Optional | Biosample that will be used to carry out replication timing and replication strand asymmetry analyses. By default, MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively. For the complete list of available replication time biosamples, refer to the Replication Time Biosamples table below. |
| replication_time_signal_file | String | Optional | The path to the replication time signal file. By default, replication time signal file (wig file) of MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively. |
| replication_time_valley_file | String | Optional | The path to the replication time valley file. By default, replication time valley file (bed file) of MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively. |
| replication_time_peak_file | String | Optional | The path to the replication time peak file. By default, replication time peak file (bed file) of MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively. |
| samples_of_interest | List of Strings | Optional | Conduct topography analyses for these samples of interest only. By default, it is set to None and topography analyses are carried out for all samples. |
| discreet_mode | Boolean | Optional | Each mutation contributes to the topography analyses either with 1 or 0 when True; otherwise, each mutation contributes with its probability when False. By default, this is set to True. |
| average_probability | Float | Optional | The average probability of the mutations assigned to an SBS, DBS, or ID signature. By default, it is set to 0.90. The average_probability applies when discreet_mode is True. Signature-specific cutoffs are set such that the mutations satisfying mutation_signature_probability >= cutoff have an average probability of at least average_probability (0.90 by default). |
| num_of_sbs_required | Integer | Optional | The minimum required number of mutations for an SBS signature. The num_of_sbs_required applies when discreet_mode is True or when discreet_mode is False and show_all_signatures is False. By default, it is set to 2000. |
| num_of_dbs_required | Integer | Optional | The minimum required number of mutations for a DBS signature. The num_of_dbs_required applies when discreet_mode is True or when discreet_mode is False and show_all_signatures is False. By default, it is set to 200. |
| num_of_id_required | Integer | Optional | The minimum required number of mutations for an ID signature. The num_of_id_required applies when discreet_mode is True or when discreet_mode is False and show_all_signatures is False. By default, it is set to 1000. |
| exceptional_signatures | Dictionary | Optional | The dictionary of exceptional signatures. The exceptional_signatures applies when discreet_mode is True. E.g., exceptional_signatures = {"SBS32" : 0.63} is a Python dictionary where key is a mutational signature and value is an average probability. Exceptional signatures are included in the topography analyses if they satisfy num_of_sbs_required, num_of_dbs_required, and num_of_id_required constraints with average_probability >= given average probability. |
| default_cutoff | Float | Optional | The default_cutoff applies for all signatures when discreet_mode is False. Mutations satisfying mutation_signature_probability >= default_cutoff are considered in the topography analyses with their probability. By default, it is set to 0.5. |
| show_all_signatures | Boolean | Optional | The show_all_signatures applies when discreet_mode is False. All signatures are considered in the topography analyses when True, otherwise signatures satisfying num_of_sbs_required, num_of_dbs_required, and num_of_id_required are considered in the topography analyses when False. By default, it is set to True. |
| plot_figures | Boolean | Optional | Generate plots displaying the results of all topography analyses when True. By default, this is set to True. |
| plot_epigenomics | Boolean | Optional | Generate epigenomics heatmaps and occupancy plots when True. By default, this is set to False. |
| plot_nucleosome | Boolean | Optional | Generate nucleosome occupancy plots when True. By default, this is set to False. |
| plot_replication_time | Boolean | Optional | Generate replication timing plots when True. By default, this is set to False. |
| plot_strand_bias | Boolean | Optional | Generate replication strand asymmetry, transcription strand asymmetry, genic versus intergenic regions plots when True. By default, this is set to False. |
| plot_replication_strand_bias | Boolean | Optional | Generate replication strand asymmetry plots when True. By default, this is set to False. |
| plot_transcription_strand_bias | Boolean | Optional | Generate transcription strand asymmetry and genic versus intergenic regions plots when True. By default, this is set to False. |
| plot_processivity | Boolean | Optional | Generate strand-coordinated mutagenesis plots when True. By default, this is set to False. |
| step1_matgen_real_data | Boolean | Optional | Run SigProfilerMatrixGenerator to generate matrices for the real mutations when True. By default, this is set to True. |
| step2_gen_sim_data | Boolean | Optional | Run SigProfilerSimulator to generate simulated mutations when True. By default, this is set to True. |
| step3_matgen_sim_data | Boolean | Optional | Run SigProfilerMatrixGenerator to generate matrices for the simulated mutations when True. By default, this is set to True. |
| step4_merge_prob_data | Boolean | Optional | Merge real and simulated mutations with the probabilities files when True. By default, this is set to True. |
| step5_gen_tables | Boolean | Optional | Generate tables for providing information on mutational signatures, cutoffs, number of mutations and average probability when True. By default, this is set to True. |
| sbs_probabilities | String | Optional | The path to the SBS probabilities matrix file. The probabilities matrix includes the probabilities of each mutation type in each sample. The first column lists all the samples, the second column lists all the mutation types, and the following columns list the calculated probability value for the respective SBS signatures where the sum of each row is 1. The probabilities file can be in SBS_6, SBS_24, SBS_96, SBS_192, SBS_288, SBS_384, SBS_1536, or SBS_6144 context, as produced by a mutational signature extractor. |
| dbs_probabilities | String | Optional | The path to the DBS probabilities matrix file. The probabilities matrix includes the probabilities of each mutation type in each sample. The first column lists all the samples, the second column lists all the mutation types, and the following columns list the calculated probability value for the respective DBS signatures where the sum of each row is 1. The probabilities file is expected in DBS-78 context, as produced by a mutational signature extractor. |
| id_probabilities | String | Optional | The path to the ID probabilities matrix file. The probabilities matrix includes the probabilities of each mutation type in each sample. The first column lists all the samples, the second column lists all the mutation types, and the following columns list the calculated probability value for the respective ID signatures where the sum of each row is 1. The probabilities file is expected in ID-83 context, as produced by a mutational signature extractor. |
| sbs_signatures | String | Optional | The path to the signatures matrix file. The signatures matrix contains the distribution of mutation types in the SBS mutational signatures. The first column lists all of the mutation types. e.g., There are 96 possible mutations that are considered for the SBS-96 context. The following columns are the SBS signatures. The sum of each column is 1, and each value in a column indicates the proportion of a mutational context in the signature. |
| dbs_signatures | String | Optional | The path to the signatures matrix file. The signatures matrix contains the distribution of mutation types in the DBS mutational signatures. The first column lists all of the mutation types. E.g., there are 78 possible mutations that are considered for the DBS-78 context. The following columns are the DBS signatures. The sum of each column is 1, and each value in a column indicates the proportion of a mutational context in the signature. |
| id_signatures | String | Optional | The path to the signatures matrix file. The signatures matrix contains the distribution of mutation types in the ID mutational signatures. The first column lists all of the mutation types. e.g., There are 83 possible mutations that are considered for the ID-83 context. The following columns are the ID signatures. The sum of each column is 1, and each value in a column indicates the proportion of a mutational context in the signature. |
| sbs_activities | String | Optional | The path to the activities matrix file. The activity matrix for the selected SBS signatures. The first column lists all of the samples and the second and the following columns list the calculated activity value (number of mutations) for the respective SBS signatures. |
| dbs_activities | String | Optional | The path to the activities matrix file. The activity matrix for the selected DBS signatures. The first column lists all of the samples and the second and the following columns list the calculated activity value (number of mutations) for the respective DBS signatures. |
| id_activities | String | Optional | The path to the activities matrix file. The activity matrix for the selected ID signatures. The first column lists all of the samples and the second and the following columns list the calculated activity value (number of mutations) for the respective ID signatures. |
| verbose | Boolean | Optional | Set to True for detailed debugging messages. By default, this is set to False. |
| parallel_mode | Boolean | Optional | Set to True for running SigProfilerTopography using multiprocessing. By default, this is set to True. |
| plusorMinus_epigenomics | Integer | Optional | The number of bases considered before and after mutation start for epigenomics occupancy analysis. |
| plusorMinus_nucleosome | Integer | Optional | The number of bases considered before and after mutation start for nucleosome occupancy analysis. |
| epigenomics_heatmap_significance_level | Float | Optional | Corrected p-values <= epigenomics_heatmap_significance_level are considered statistically significant. By default, this is set to 0.05. |
| fold_change_window_size | Integer | Optional | In epigenomics analysis, fold change of real versus simulated mutations is calculated for the window size centered at the mutation start. E.g., for window size of 100 bases, ± 50 bases are considered before and after mutation start. By default, this is set to 100. |
| num_of_avg_overlap_required | Integer | Optional | The minimum required average number of overlaps between the mutations and the regions outlined in the epigenomics files. By default, set to 100. |
| plot_detailed_epigemomics_heatmaps | Boolean | Optional | Plot detailed epigenomics heatmaps when True. By default, set to False. |
| remove_dna_elements_with_all_nans_in_epigemomics_heatmaps | Boolean | Optional | Remove the DNA elements from the epigenomics heatmap if no result exists. By default, set to True. |
| odds_ratio_cutoff | Float | Optional | Strand asymmetries with odds ratio >= odds_ratio_cutoff are shown in the strand asymmetry circle plots. By default, set to 1.1. |
| percentage_of_real_mutations_cutoff | Float | Optional | Strand asymmetries of the SBS signatures with percentage of the mutations >= percentage_of_real_mutations_cutoff are shown in the plots. By default, set to 5. |
| ylim_multiplier | Float | Optional | Multiply the y-axis view limits with ylim_multiplier in strand asymmetry bar plots. By default, set to 1.25. |
| processivity_inter_mutational_distance | Integer | Optional | Consecutive mutations with distance <= processivity_inter_mutational_distance are considered for the strand-coordinated mutagenesis. By default, set to 10000. |
| processivity_significance_level | Float | Optional | Corrected p-values <= processivity_significance_level are considered statistically significant for strand coordinated mutagenesis. By default, this is set to 0.05. |
| delete_chrbased_files | Boolean | Optional | To reduce the disk space usage of the tool, SigProfilerTopography deletes the chrbased files under outputDir/jobname/data/chrbased. By default, set to True. |
| exome | Boolean | Optional | SigProfilerSimulator simulates on the exome of the reference genome. By default, set to None. |
| updating | Boolean | Optional | SigProfilerSimulator updates the chromosome with each mutation. By default, set to False. |
| bed_file | String | Optional | SigProfilerSimulator simulates on custom regions of the genome. Requires the full path to the BED file. By default, set to None. |
| overlap | Boolean | Optional | SigProfilerSimulator allows overlapping of mutations along the chromosome. By default, set to False. |
| gender | String | Optional | SigProfilerSimulator simulates male or female genomes. By default, set to 'female'. |
| seed_file | String | Optional | The path to a file of user-defined seeds for SigProfilerSimulator. One seed is required per processor. By default, this is set to None and a built-in seed file is used. |
| noisePoisson | Boolean | Optional | SigProfilerSimulator adds poisson noise to the simulations. By default, set to False. |
| noiseUniform | Integer | Optional | SigProfilerSimulator adds a noise dependent on a +/- allowance of noise (e.g., noiseUniform=5 allows +/-2.5% of mutations for each mutation type). By default, this is set to 0. |
| cushion | Integer | Optional | SigProfilerSimulator allows a cushion when simulating on the exome or a targeted panel. By default, this is set to 100 base pairs. |
| region | String | Optional | For SigProfilerSimulator. Path to the targeted region panel for simulating on a user-defined region. Default is whole-genome simulations. |
| vcf | Boolean | Optional | SigProfilerSimulator outputs simulated samples as vcf files with one file per iteration per sample when True. SigProfilerSimulator outputs all samples from an iteration into a single maf file when False. By default, this is set to False. |
| mask | String | Optional | For SigProfilerSimulator. Path to the probability mask file. The mask file is tab-separated and requires the following header fields: Chromosome, Start, End, Probability. Note: the mask parameter does not support exome data where the bed_file flag is set to true. By default, this is set to None. |
** If replication_time_signal_file alone, or replication_time_signal_file together with replication_time_valley_file and replication_time_peak_file, is provided, then the parameter replication_time_biosample is not used.
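To make the difference between discreet_mode=True and discreet_mode=False concrete, the sketch below computes how much a single mutation counts toward a signature. The function name mutation_contribution and its cutoff argument are illustrative assumptions: in discreet mode the cutoff stands for the signature-specific cutoff derived from average_probability, and otherwise it stands for default_cutoff. This is not the library's implementation.

```python
def mutation_contribution(probability, discreet_mode=True, cutoff=0.5):
    """Return how much one mutation counts toward a signature.

    probability: probability that the mutation is assigned to the signature.
    cutoff: signature-specific cutoff in discreet mode, default_cutoff otherwise.
    """
    if probability < cutoff:
        return 0.0  # the mutation is not considered for this signature
    # Discreet mode counts the mutation as 1; otherwise it counts with its probability.
    return 1.0 if discreet_mode else probability


if __name__ == "__main__":
    probabilities = [0.95, 0.70, 0.40]
    print([mutation_contribution(p, discreet_mode=True, cutoff=0.9) for p in probabilities])
    print([mutation_contribution(p, discreet_mode=False, cutoff=0.5) for p in probabilities])
```

In discreet mode a mutation is either fully counted or ignored, while in probability mode every mutation above default_cutoff contributes a fractional weight.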
Replication Time Biosamples¶
Included in the table below is a list of available parameter values for replication_time_biosample.
| Biosample | Organism | Tissue | Cell Type | Diseases |
|---|---|---|---|---|
| MCF7 | human | breast | mammary | Cancer |
| HEPG2 | human | liver | liver cells (hepatocytes) | Cancer |
| HELAS3 | human | cervix | epithelial-like cervical cells | Cancer |
| SKNSH | human | brain | neuronal-like cells | Cancer |
| K562 | human | bone marrow | lymphoblast cells | Cancer |
| IMR90 | human | lung | fibroblast | Normal |
| NHEK | human | skin | keratinocyte | Normal |
| BJ | human | skin | fibroblast | Normal |
| HUVEC | human | umbilical vein | endothelial cell | Normal |
| BG02ES | human | early developmental stage of an embryo, not from a differentiated tissue | embryonic stem cell | None reported |
| GM12878 | human | blood | B-Lymphocyte | Normal |
| GM06990 | human | blood | B-Lymphocyte | Unknown |
| GM12801 | human | blood | B-Lymphocyte | Unknown |
| GM12812 | human | blood | B-Lymphocyte | Unknown |
| GM12813 | human | blood | B-Lymphocyte | Unknown |
| HEK293 | human | kidney | embryonic kidney cells | Normal |
| HCT116 | human | colon | colorectal carcinoma cell | Cancer |
| A549 | human | lung | epithelial cell | Cancer |
| CAKI2 | human | kidney | papillary renal cell carcinoma cell | Cancer |
| G401 | human | kidney | epithelial kidney cells | Cancer |
| T47D | human | breast; mammary gland | epithelial cell | Cancer |
| SKNMC | human | brain | peripheral primitive neuroectodermal | Cancer (Askin tumor) |
| NCIH460 | human | lung | lung carcinoma cell | Cancer |
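The precedence between a user-provided replication timing signal file, an explicit replication_time_biosample, and the genome-specific default can be sketched as follows. The helper resolve_replication_source is hypothetical and only mirrors the rules stated in this section (a provided signal file means the biosample is not used; MCF7 and IMR90 are the defaults for GRCh37 and GRCh38); it is not how SigProfilerTopography is implemented internally.

```python
# Defaults stated in the parameter table: MCF7 for GRCh37, IMR90 for GRCh38.
DEFAULT_REPLICATION_BIOSAMPLE = {"GRCh37": "MCF7", "GRCh38": "IMR90"}

# Biosample names copied from the Replication Time Biosamples table.
REPLICATION_BIOSAMPLES = {
    "MCF7", "HEPG2", "HELAS3", "SKNSH", "K562", "IMR90", "NHEK", "BJ",
    "HUVEC", "BG02ES", "GM12878", "GM06990", "GM12801", "GM12812",
    "GM12813", "HEK293", "HCT116", "A549", "CAKI2", "G401", "T47D",
    "SKNMC", "NCIH460",
}


def resolve_replication_source(genome, biosample=None, signal_file=None):
    """Hypothetical helper: report which replication timing data would be used."""
    if signal_file is not None:
        # A user-provided signal file takes precedence; the biosample is not used.
        return ("file", signal_file)
    if biosample is None:
        biosample = DEFAULT_REPLICATION_BIOSAMPLE.get(genome)
        if biosample is None:
            raise ValueError(f"No default replication biosample documented for {genome!r}")
    if biosample not in REPLICATION_BIOSAMPLES:
        raise ValueError(f"Unknown replication time biosample: {biosample!r}")
    return ("biosample", biosample)


if __name__ == "__main__":
    print(resolve_replication_source("GRCh37"))
    print(resolve_replication_source("GRCh38", biosample="HELAS3"))
    print(resolve_replication_source("GRCh37", biosample="T47D", signal_file="path/to/replication_timing_file"))
```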
- REPLICATION TIMING and REPLICATION STRAND ASYMMETRY
- By default, SigProfilerTopography carries out replication timing and replication strand asymmetry analyses using Repli-seq data of the MCF7 and IMR90 cell lines for GRCh37 and GRCh38, respectively.
- If you want to run SigProfilerTopography with Repli-seq data of another cell line, e.g., HELAS3, you may first install the replication timing data for the genome of interest, e.g., GRCh37, as follows:
$ python
>>> from SigProfilerTopography import Topography as topography
>>> topography.install_repli_seq('GRCh37', 'HELAS3')
- Then you have to include replication_time_biosample='HELAS3' in the runAnalyses call as follows:
>>> from SigProfilerTopography import Topography as topography
>>> genome = "GRCh37"
>>> inputDir = "path/to/21BRCA_vcfs"
>>> outputDir = "path/to/results"
>>> jobname = "21BRCA_SPT_with_probability_matrices"
>>> numofSimulations = 5
>>> sbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_SBS96_Decomposed_Mutation_Probabilities.txt"
>>> dbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_DBS78_Decomposed_Mutation_Probabilities.txt"
>>> if __name__ == "__main__":
...     topography.runAnalyses(
...         genome, inputDir, outputDir, jobname, numofSimulations,
...         sbs_probabilities=sbs_probability_file,
...         dbs_probabilities=dbs_probability_file,
...         replication_time_biosample='HELAS3',
...         epigenomics=True,
...         nucleosome=True,
...         replication_time=True,
...         strand_bias=True,
...         processivity=True)
- If you do not install the replication timing files before the run, SigProfilerTopography downloads the replication timing files for the replication_time_biosample of interest from ftp://alexandrovlab-ftp.ucsd.edu/ under .../SigProfilerTopography/lib/replication/ during runtime, which requires ~20-100 MB of storage.
- If you have a replication timing file, you can set replication_time_signal_file and run the replication timing and replication strand asymmetry analyses using your own replication timing file. We require a tab-separated file with four columns for replication_time_signal_file. No header line is required. The columns should contain the following information:
  1. Chromosome (e.g., chr1)
  2. Start position (e.g., 10000)
  3. End position (e.g., 15000)
  4. Signal value (e.g., 1.0343)
  Then you have to set replication_time_signal_file in the runAnalyses call as follows:
>>> from SigProfilerTopography import Topography as topography
>>> genome = "GRCh37"
>>> inputDir = "path/to/21BRCA_vcfs"
>>> outputDir = "path/to/results"
>>> jobname = "21BRCA_SPT_with_probability_matrices"
>>> numofSimulations = 5
>>> sbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_SBS96_Decomposed_Mutation_Probabilities.txt"
>>> dbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_DBS78_Decomposed_Mutation_Probabilities.txt"
>>> if __name__ == "__main__":
...     topography.runAnalyses(
...         genome, inputDir, outputDir, jobname, numofSimulations,
...         sbs_probabilities=sbs_probability_file,
...         dbs_probabilities=dbs_probability_file,
...         replication_time_signal_file="path/to/replication_timing_file",
...         epigenomics=True,
...         nucleosome=True,
...         replication_time=True,
...         strand_bias=True,
...         processivity=True)
- SigProfilerTopography annotates each mutation with its replication strand. The replication strand can be one of the following:
  A: Lagging
  E: Leading
  U: Unknown
  B: Bidirectional (both lagging and leading can happen for long indels)
  You can find these annotations under outputDir/jobname/data/chrbased if you set delete_chrbased_files=False as follows:
>>> from SigProfilerTopography import Topography as topography
>>> genome = "GRCh37"
>>> inputDir = "path/to/21BRCA_vcfs"
>>> outputDir = "path/to/results"
>>> jobname = "21BRCA_SPT_with_probability_matrices"
>>> numofSimulations = 5
>>> sbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_SBS96_Decomposed_Mutation_Probabilities.txt"
>>> dbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_DBS78_Decomposed_Mutation_Probabilities.txt"
>>> if __name__ == "__main__":
...     topography.runAnalyses(
...         genome, inputDir, outputDir, jobname, numofSimulations,
...         sbs_probabilities=sbs_probability_file,
...         dbs_probabilities=dbs_probability_file,
...         replication_time_biosample="T47D",
...         epigenomics=True,
...         nucleosome=True,
...         replication_time=True,
...         strand_bias=True,
...         processivity=True,
...         delete_chrbased_files=False)
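The four-column, tab-separated, header-free layout required above for replication_time_signal_file can be produced and checked with the standard library. This is a minimal sketch: the function names write_signal_file and read_signal_file are hypothetical, and the example rows are made up for illustration, not real Repli-seq values.

```python
import csv


def write_signal_file(rows, path):
    """Write (chromosome, start, end, signal) rows as tab-separated text without a header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, start, end, signal in rows:
            writer.writerow([chrom, int(start), int(end), float(signal)])


def read_signal_file(path):
    """Read the file back, checking that every line has exactly four columns."""
    records = []
    with open(path, newline="") as handle:
        for line_number, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(fields) != 4:
                raise ValueError(f"Line {line_number}: expected 4 columns, found {len(fields)}")
            chrom, start, end, signal = fields
            records.append((chrom, int(start), int(end), float(signal)))
    return records


if __name__ == "__main__":
    # Illustrative values only.
    example_rows = [("chr1", 10000, 15000, 1.0343), ("chr1", 15000, 20000, -0.52)]
    write_signal_file(example_rows, "replication_timing_example.txt")
    print(read_signal_file("replication_timing_example.txt"))
```

Reading the file back before a run is a cheap way to catch a stray header line or a missing column.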
-