Quick Start Example
This section provides an example to help users quickly get started with the SigProfilerTopography tool. The example uses somatic mutational data from 21 breast cancer samples and shows how to start from VCF input files.
Prerequisites
This tutorial requires that you have completed all steps in the installation guide:
- Python version >= 3.4.0
- Internet Connection (for initial installation)
- WGET version 1.9 or RSYNC if you have a firewall
- SigProfilerMatrixGenerator reference genome download
  - GRCh37, GRCh38, mm9, mm10, rn6, or 288c.
  - Because the example data (21 breast cancers) is in GRCh37, installing only the SigProfilerMatrixGenerator GRCh37 reference genome is sufficient.
- Other dependencies and necessary packages are downloaded during the installation.
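Some of the prerequisites above can be checked from Python before you continue. The following is a minimal sketch and is not part of SigProfilerTopography: the function name `check_prerequisites` is hypothetical, and it only checks the Python version and whether `wget` or `rsync` can be found on the PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 4, 0)  # minimum version stated in the prerequisites


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    if version_info is None:
        version_info = sys.version_info
    problems = []
    if tuple(version_info[:3]) < MIN_PYTHON:
        problems.append("Python >= 3.4.0 is required")
    # Either download tool is enough to fetch the reference genome.
    if which("wget") is None and which("rsync") is None:
        problems.append("neither wget nor rsync was found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment; calling `check_prerequisites()` with no arguments inspects the running interpreter.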
Downloading Input Example Data
This example uses somatic mutational data from 21 breast cancer genomes. Download the example dataset 21BRCA.zip from the following location, or use the command line:
ftp://alexandrovlab-ftp.ucsd.edu/pub/tools/SigProfilerTopography/Example_data/
If using the command line, then enter the following command in bash on OS X or Unix systems:
$ wget ftp://alexandrovlab-ftp.ucsd.edu/pub/tools/SigProfilerTopography/Example_data/21BRCA.zip
Once 21BRCA.zip has been downloaded, unzip the file. The unzipped 21BRCA folder contains two folders: 21BRCA_vcfs and 21BRCA_probabilities. The folder 21BRCA_vcfs contains 21 VCF files (one per breast cancer sample), and 21BRCA_probabilities contains probability files for single base substitutions and doublet base substitutions.
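If you prefer to unzip from Python, the step above can be sketched with the standard library as follows. This is an illustration only: the helper name `unzip_and_count_vcfs` is hypothetical, and it assumes that the VCF files in the archive end in `.vcf`. The count it returns can be compared with the 21 VCF files described above.

```python
import zipfile
from pathlib import Path


def unzip_and_count_vcfs(zip_path, dest_dir="."):
    """Extract zip_path into dest_dir and return how many VCF files it held."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = archive.namelist()
    # Assumption: VCF files end in ".vcf"; skip macOS resource-fork entries.
    return sum(
        1 for name in names
        if name.lower().endswith(".vcf") and not name.startswith("__MACOSX/")
    )


if __name__ == "__main__":
    if Path("21BRCA.zip").is_file():
        print("VCF files extracted:", unzip_and_count_vcfs("21BRCA.zip"))
```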
Running SigProfilerTopography
You will be using the 21 VCF files located in the subfolder 21BRCA_vcfs as input for this example.
First, start a Python interactive shell and import the SigProfilerTopography library.
$ python3
>>> from SigProfilerTopography import Topography as topography
Second, install the library files required for topography analyses for GRCh37, because the 21 BRCA VCF files are in GRCh37.
>>> from SigProfilerTopography import Topography as topography
>>> topography.install_nucleosome("GRCh37")
>>> topography.install_atac_seq("GRCh37")
>>> topography.install_repli_seq("GRCh37")
If you have not already done so, import SigProfilerMatrixGenerator and install the reference genome for GRCh37.
>>> from SigProfilerMatrixGenerator import install as genInstall
>>> genInstall.install("GRCh37", rsync=False, bash=True)
Next, conduct topography analyses by running the following command. Note: Update "path/to/21BRCA_vcfs" with the actual path to the 21BRCA_vcfs folder. Similarly, update "path/to/results" with the actual path to the results folder.
>>> from SigProfilerTopography import Topography as topography
>>> genome = "GRCh37"
>>> inputDir = "path/to/21BRCA_vcfs"
>>> outputDir = "path/to/results"
>>> jobname = "21BRCA_SPT"
>>> numofSimulations = 5
>>> if __name__ == "__main__":
...     topography.runAnalyses(genome,
...                            inputDir,
...                            outputDir,
...                            jobname,
...                            numofSimulations,
...                            epigenomics=True,
...                            nucleosome=True,
...                            replication_time=True,
...                            strand_bias=True,
...                            processivity=True)
If probability files are not provided, SigProfilerTopography utilizes SigProfilerAssignment by default to attribute the activities of known reference mutational signatures from the Catalogue Of Somatic Mutations In Cancer (COSMIC) database to each examined sample.
After the program has finished running, the resulting plots will be in an output directory named figure under path/to/results/21BRCA_SPT. If path/to/results is a relative path, it is resolved relative to the directory where the Python instance was started. To learn more about the output produced by SigProfilerTopography, refer to Using the Tool - Output.
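To confirm that plots were written, you can list the contents of the figure directory. The snippet below is a small sketch using only the standard library; the helper name `list_figures` is hypothetical, and it makes no assumption about the file names or formats that SigProfilerTopography produces beyond the figure directory described above.

```python
from pathlib import Path


def list_figures(output_dir, jobname):
    """Return sorted paths of all files under <output_dir>/<jobname>/figure."""
    figure_dir = Path(output_dir) / jobname / "figure"
    if not figure_dir.is_dir():
        return []  # the run has not finished, or the paths are wrong
    return sorted(p for p in figure_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    for path in list_figures("path/to/results", "21BRCA_SPT"):
        print(path)
```

An empty result usually means that the output path or job name does not match the values passed to runAnalyses.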
Running SigProfilerTopography (with probability files)
You will be using the 21 VCF files located in the subfolder 21BRCA_vcfs as input and providing the probability files in the subfolder 21BRCA_probabilities for this example.
First, start a Python interactive shell and import the SigProfilerTopography library.
$ python3
>>> from SigProfilerTopography import Topography as topography
Next, conduct the topography analyses by running the following command. Note: Update "path/to/21BRCA_vcfs" with the actual path to the 21BRCA_vcfs folder. Similarly, update "path/to/results" and "path/to/21BRCA_probabilities" with the actual paths.
>>> from SigProfilerTopography import Topography as topography
>>> genome = "GRCh37"
>>> inputDir = "path/to/21BRCA_vcfs"
>>> outputDir = "path/to/results"
>>> jobname = "21BRCA_SPT_with_probability_matrices"
>>> numofSimulations = 5
>>> sbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_SBS96_Decomposed_Mutation_Probabilities.txt"
>>> dbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_DBS78_Decomposed_Mutation_Probabilities.txt"
>>> if __name__ == "__main__":
...     topography.runAnalyses(genome,
...                            inputDir,
...                            outputDir,
...                            jobname,
...                            numofSimulations,
...                            sbs_probabilities=sbs_probability_file,
...                            dbs_probabilities=dbs_probability_file,
...                            epigenomics=True,
...                            nucleosome=True,
...                            replication_time=True,
...                            strand_bias=True,
...                            processivity=True)
You can also run a mutational signature extraction tool (e.g., SigProfilerExtractor) yourself to derive the probability of each mutational signature generating each type of somatic mutation, and then provide the resulting probability matrix files as input to SigProfilerTopography.
After the program has finished running, the resulting plots will be in an output directory named figure under path/to/results/21BRCA_SPT_with_probability_matrices. If path/to/results is a relative path, it is resolved relative to the directory where the Python instance was started. To learn more about the output produced by SigProfilerTopography, refer to Using the Tool - Output.
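The same run can also be saved as a script instead of being typed into the interactive shell. The sketch below reuses the runAnalyses call and arguments from this example unchanged. The `missing_inputs` helper and the constant names are hypothetical additions that check the placeholder paths before the analysis starts, so that a mistyped path is reported immediately.

```python
from pathlib import Path

GENOME = "GRCh37"
INPUT_DIR = "path/to/21BRCA_vcfs"
OUTPUT_DIR = "path/to/results"
JOBNAME = "21BRCA_SPT_with_probability_matrices"
NUM_SIMULATIONS = 5
SBS_FILE = "path/to/21BRCA_probabilities/COSMIC_SBS96_Decomposed_Mutation_Probabilities.txt"
DBS_FILE = "path/to/21BRCA_probabilities/COSMIC_DBS78_Decomposed_Mutation_Probabilities.txt"


def missing_inputs(input_dir, probability_files):
    """Return the input paths that do not exist (hypothetical helper)."""
    missing = []
    if not Path(input_dir).is_dir():
        missing.append(str(input_dir))
    missing.extend(str(f) for f in probability_files if not Path(f).is_file())
    return missing


def main():
    missing = missing_inputs(INPUT_DIR, [SBS_FILE, DBS_FILE])
    if missing:
        print("Update these placeholder paths first:", missing)
        return
    # Imported here so the path check runs even if the package is not installed.
    from SigProfilerTopography import Topography as topography
    topography.runAnalyses(GENOME, INPUT_DIR, OUTPUT_DIR, JOBNAME, NUM_SIMULATIONS,
                           sbs_probabilities=SBS_FILE,
                           dbs_probabilities=DBS_FILE,
                           epigenomics=True, nucleosome=True,
                           replication_time=True, strand_bias=True,
                           processivity=True)


if __name__ == "__main__":
    main()
```

With the placeholder paths left as they are, the script only prints which paths need updating and exits without starting the analysis.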
Supported Genomes
SigProfilerTopography currently supports these genomes:
- GRCh38.p12 [GRCh38] (Genome Reference Consortium Human Reference 38), INSDC Assembly GCA_000001405.27, Dec 2013. Released July 2014. Last updated January 2018. This genome was downloaded from ENSEMBL database version 93.38.
- GRCh37.p13 [GRCh37] (Genome Reference Consortium Human Reference 37), INSDC Assembly GCA_000001405.14, Feb 2009. Released April 2011. Last updated September 2013. This genome was downloaded from ENSEMBL database version 93.37.
Additional Information
In the examples above, all unspecified parameters take their default values. All of the function arguments and their types are explained in detail in the Using the Tool - Input section. To learn more about the files that were produced, refer to Using the Tool - Output.