Data Preparation Modules: panoply_normalize_ms_data
wcorinne edited this page Aug 28, 2025 · 4 revisions
This module normalizes proteomics (protein/PTM site) data. The normalization methods, especially two-component normalization, assume that the input data are log ratios to a common reference.
Normalization methods available are:
- Z-scoring (`mean`): mean-centering followed by standard deviation scaling
- Median-MAD normalization (`median`): median-centering followed by median absolute deviation (MAD) scaling
- Two-component mixture model-based normalization (`2comp`): This method assumes that every sample contains a set of unregulated proteins or PTM sites whose log ratios should be centered at zero in the normalized sample, alongside proteins or PTM sites that are either up- or downregulated. The normalization scheme attempts to identify the unregulated proteins or PTM sites and centers the distribution of their log ratios around zero, in order to nullify the effect of differential protein loading and/or systematic MS variation. A two-component Gaussian mixture model is used to achieve this effect. For a sample i, the two Gaussians N(μi1, σi1) and N(μi2, σi2) are fitted and used in the normalization process as follows: the mode mi of the log-ratio distribution is determined using kernel density estimation with a Gaussian kernel and Sheather-Jones bandwidth. A two-component Gaussian mixture model is then fit with the means of both Gaussians constrained to be mi, i.e., μi1 = μi2 = mi. The Gaussian with the smaller estimated standard deviation, σi = min(σi1, σi2), is assumed to represent the unregulated component of proteins/PTM sites, and is used to normalize the sample by subtracting the mean mi from each protein/PTM site and dividing by the standard deviation σi. See (Mertins et al., 2016) and (Gillette et al., 2020).
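The two simpler schemes in the list above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the module's own code: the function name `normalize_columns` is hypothetical, the matrix is assumed to be features by samples, the sample standard deviation (`ddof=1`) is assumed for Z-scoring, and the MAD is used raw, without a consistency constant.

```python
import numpy as np

def normalize_columns(data, method="median"):
    """Center and scale each sample (column) of a features x samples matrix.

    method="mean":   subtract the column mean, divide by the column standard deviation.
    method="median": subtract the column median, divide by the column MAD.
    Missing values (NaN) are ignored when computing the statistics.
    """
    data = np.asarray(data, dtype=float)
    if method == "mean":
        center = np.nanmean(data, axis=0)
        # Assumption: sample standard deviation (ddof=1).
        scale = np.nanstd(data, axis=0, ddof=1)
    elif method == "median":
        center = np.nanmedian(data, axis=0)
        # Assumption: raw MAD, with no consistency constant applied.
        scale = np.nanmedian(np.abs(data - center), axis=0)
    else:
        raise ValueError(f"unsupported method: {method}")
    # Broadcasting applies each column's center and scale to that column.
    return (data - center) / scale
```

After either transformation every sample has its center at zero and unit spread in the chosen statistic, which is what makes the before and after profile plots directly comparable.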
CAVEAT: The two-component mixture model-based normalization (2comp) method has been tuned for log-transformed ratio (to a common reference) data, and hence cannot be used with intensity-based (e.g., label-free) data.
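The two-component procedure described above (find the mode, fit two Gaussians sharing that mean, scale by the narrower one) can be sketched for a single sample as follows. This is a minimal sketch under stated assumptions, not the module's implementation: the function names are hypothetical, Silverman's rule-of-thumb bandwidth is swapped in for the bandwidth selector named in the text to keep the code short, and the mixture is fit with a plain EM loop that updates only the weights and standard deviations.

```python
import numpy as np

def kde_mode(x, grid_size=512):
    """Mode of a Gaussian-kernel density estimate of x.

    Assumption: Silverman's rule-of-thumb bandwidth is used here as a
    stand-in for the bandwidth selector named in the text.
    """
    n = x.size
    sd = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    bw = 0.9 * spread * n ** (-0.2)
    grid = np.linspace(x.min(), x.max(), grid_size)
    # Sum of Gaussian kernels centred on each observation, evaluated on the grid.
    dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bw) ** 2).sum(axis=1)
    return grid[np.argmax(dens)]

def two_comp_normalize(sample, n_iter=200, tol=1e-8):
    """Normalize one sample (1-D array of log ratios; NaN allowed)."""
    sample = np.asarray(sample, dtype=float)
    x = sample[~np.isnan(sample)]
    m = kde_mode(x)                      # common mean of both Gaussians
    d2 = (x - m) ** 2
    s = x.std(ddof=1)
    sigma = np.array([0.5 * s, 2.0 * s])  # initial guess: one narrow, one wide
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value (means fixed at m).
        dens = w * np.exp(-0.5 * d2[:, None] / sigma**2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and standard deviations only.
        new_w = resp.mean(axis=0)
        new_sigma = np.sqrt((resp * d2[:, None]).sum(axis=0) / resp.sum(axis=0))
        converged = np.max(np.abs(new_sigma - sigma)) < tol
        w, sigma = new_w, new_sigma
        if converged:
            break
    sigma_min = sigma.min()              # narrower Gaussian = unregulated component
    return (sample - m) / sigma_min, m, sigma_min
```

Because both components share the mean, the fit only has to separate a narrow core from a wide tail. That separation presupposes a distribution centered near a reference level, which is consistent with the caveat that the method is tuned for log ratios.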
The module also creates:
- Profile plots (density plots) for each sample before and after normalization for comparison.
- Plots summarizing the normalization statistics.
Required inputs:
- `inputData`: (`.tar` file) tarball from `panoply_parse_sm_table` or un-normalized input data in `gct` format (when `standalone` is `true`)
- `type`: (String) proteomics data type
- `standalone`: (String) set to `true` to run as a self-contained module; if `true` the `analysisDir` input is required
- `yaml`: (`.yaml` file) parameters in `yaml` format
- `analysisDir`: (String) name of analysis directory
Optional inputs:
- `normalizeProteomics`: (String, default chosen in startup notebook) when 'true', normalization will be applied; when 'false', normalization is skipped
- `normMethod`: (String, default = '2comp') normalization method; options are '2comp', 'median', 'mean'
- `altMethod`: (String, default = 'median') alternate normalization method for comparison with `normMethod`; downstream modules typically do not generate analyses for the data normalized using `altMethod`
- `ndigits`: (Int, default = 5) number of decimal digits to use in output tables
- `outTar`: (String, default = "panoply_normalize_ms_data-output.tar") output `.tar` file name
- `outTable`: (String, default = "normalized_table-output.gct") output `.gct` normalized file name
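The input rules listed above (documented defaults, the allowed `normMethod` values, and the dependency between `standalone` and `analysisDir`) can be captured in a small pre-flight check. The helper below is hypothetical and not part of PANOPLY; it only encodes what this page states, and it leaves `normalizeProteomics` alone because its default is chosen in the startup notebook.

```python
ALLOWED_METHODS = {"2comp", "median", "mean"}

# Defaults as listed on this page for the optional inputs.
DEFAULTS = {
    "normMethod": "2comp",
    "altMethod": "median",
    "ndigits": 5,
    "outTar": "panoply_normalize_ms_data-output.tar",
    "outTable": "normalized_table-output.gct",
}

def resolve_inputs(user_inputs):
    """Merge user-supplied inputs with the documented defaults and check them.

    Hypothetical helper for illustration; returns the merged parameter dict.
    """
    params = {**DEFAULTS, **user_inputs}
    for key in ("inputData", "type", "standalone", "yaml"):
        if key not in params:
            raise ValueError(f"missing required input: {key}")
    # analysisDir is only mandatory when running as a self-contained module.
    if params["standalone"] == "true" and not params.get("analysisDir"):
        raise ValueError("analysisDir is required when standalone is 'true'")
    if params["normMethod"] not in ALLOWED_METHODS:
        raise ValueError(f"normMethod must be one of {sorted(ALLOWED_METHODS)}")
    return params
```

For example, `resolve_inputs({"inputData": "data.tar", "type": "proteome", "standalone": "false", "yaml": "params.yaml"})` returns a dict in which `normMethod` falls back to '2comp'; the file names and the `type` value in this call are placeholders.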
Outputs:
- `output_tar`: Tarball including the following files in the `normalized-data` subdirectory:
  - Normalized data files:
    - normalized data table (`*-ratio-norm.gct`)
    - normalized data table using alternate normalization method specified in `altMethod` (`*-ratio-*-norm.gct`)
  - Plots and normalization statistics:
    - profile plot (density of log ratio values for each sample) showing distribution for all samples in input data, before normalization (`*-ratio-profile-plot.pdf`)
    - profile plot showing distribution for samples after normalization, using primary normalization method (`*-ratio-norm-profile-plot.pdf`) and alternate normalization method (`*-ratio-median-norm-profile-plot.pdf`)
    - normalization statistics table showing centering and scaling factors for each sample, using primary normalization method (`*-ratio-norm-stats.csv`) and alternate normalization method (`*-ratio-median-norm-stats.csv`)
    - boxplot of normalization statistics for `QC.pass` and `QC.fail` samples (`*-ratio-norm-stats.pdf` and `*-ratio-median-norm-stats.pdf`)
- `outputs`: (`.gct` file) normalized data table (equivalent to `*-ratio-norm.gct`)
- `output_yaml`: finalized parameter file
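To check which of the primary-method output files listed above are present in an output tarball, the file-name patterns can be matched against the archive members. The sketch below uses only the Python standard library; the function name is hypothetical, and because the directory layout above `normalized-data` is not specified on this page, the code accepts that directory at any depth.

```python
import fnmatch
import tarfile

# File-name patterns for the primary normalization method, as listed on this page.
PATTERNS = {
    "normalized table": "*-ratio-norm.gct",
    "profile plot (before)": "*-ratio-profile-plot.pdf",
    "profile plot (after)": "*-ratio-norm-profile-plot.pdf",
    "normalization statistics": "*-ratio-norm-stats.csv",
    "statistics boxplot": "*-ratio-norm-stats.pdf",
}

def find_normalized_outputs(tar_path):
    """Map each output kind to the matching member paths under normalized-data/."""
    found = {kind: [] for kind in PATTERNS}
    with tarfile.open(tar_path) as tar:
        for name in tar.getnames():
            parts = name.split("/")
            # Only consider files that sit inside a normalized-data directory.
            if "normalized-data" not in parts[:-1]:
                continue
            for kind, pattern in PATTERNS.items():
                if fnmatch.fnmatch(parts[-1], pattern):
                    found[kind].append(name)
    return found
```

An empty list for any kind signals a missing file, which is a quick sanity check before handing the tarball to downstream modules.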
References:
- Mertins, P., Mani, D., Ruggles, K., Gillette, M., Clauser, K., Wang, P., Wang, X., Qiao, J., Cao, S., Petralia, F., et al. (2016). Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534(7605), 55-62. https://dx.doi.org/10.1038/nature18003
- Gillette, M., Satpathy, S., Cao, S., Dhanasekaran, S., Vasaikar, S., Krug, K., Petralia, F., Li, Y., Liang, W., Reva, B., et al. (2020). Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma. Cell 182(1), 200-225.e35. https://dx.doi.org/10.1016/j.cell.2020.06.013