MSPPlots¶

import with:

from mspypeline import BasePlotter, MaxQuantPlotter

BasePlotter¶

class mspypeline.BasePlotter(start_dir, reader_data=None, intensity_df_name='', interesting_proteins=None, go_analysis_gene_names=None, configs=None, required_reader=None, intensity_entries=(), loglevel=10)¶

Base plotter to create plots.

The BasePlotter provides two main groups of methods: “get_” functions, which calculate and provide the data, and “plot_” functions. The latter combine the “get_” functions with functions from the matplotlib backend, so that data calculation, plotting and saving of the results happen in one method.

__init__(start_dir, reader_data=None, intensity_df_name='', interesting_proteins=None, go_analysis_gene_names=None, configs=None, required_reader=None, intensity_entries=(), loglevel=10)¶

Parameters

start_dir (str) – location to save results
reader_data (Optional[Dict[str, Dict[str, pandas.core.frame.DataFrame]]]) – mapping to provide input data
intensity_df_name (str) – name/key to input data
interesting_proteins (Optional[Dict[str, pandas.core.series.Series]]) – mapping with pathway proteins to analyze
go_analysis_gene_names (Optional[Dict[str, pandas.core.series.Series]]) – mapping with go terms to analyze
configs (Optional[dict]) – mapping of configuration
required_reader (Optional[str]) – name of the file reader
intensity_entries (Tuple[str, str, str]) – tuple of (key in all_tree_dict, prefix in data, name in plot). See add_intensity_column().
loglevel (int) – level of the logger

add_intensity_column(option_name, name_in_file, name_in_plot, scale='normal', df=None)¶

Adds two options to all_intensities_dict and all_tree_dict, called option_name and option_name_log2.

Parameters

option_name (str) – the name that the added data has internally, can be referred to via the df_to_use option e.g. lfq or ibaq
name_in_file (str) – prefix of the columns e.g. Intensity or LFQ intensity
name_in_plot (str) – shown name in the plots e.g. LFQ Intensity or “iBAQ values”
scale (str) – is the data in “normal” or in “log2” scale
df (Optional[pandas.core.frame.DataFrame]) – can be passed to use instead of BasePlotter.intensity_df
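
As a rough illustration of what the two added options might contain, the sketch below selects columns by their prefix and derives a log2 variant with pandas. It does not call mspypeline: the helper `select_intensity_columns`, the sample data, and the choice to treat 0 as a missing value before taking the logarithm are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical input: intensity columns carry the prefix "LFQ intensity ".
df = pd.DataFrame(
    {
        "LFQ intensity A_1": [8.0, 0.0, 32.0],
        "LFQ intensity A_2": [16.0, 4.0, 0.0],
        "Peptides": [3, 1, 5],
    },
    index=["P1", "P2", "P3"],
)


def select_intensity_columns(data, option_name, name_in_file):
    """Return {option_name: normal-scale frame, option_name + '_log2': log2 frame}."""
    prefix = name_in_file + " "
    cols = [c for c in data.columns if c.startswith(prefix)]
    # Strip the prefix so that only the sample names remain as column names.
    normal = data[cols].rename(columns=lambda c: c[len(prefix):])
    # Assumption: 0 means "not detected", so it becomes NaN instead of -inf.
    log2 = np.log2(normal.replace(0, np.nan))
    return {option_name: normal, option_name + "_log2": log2}


options = select_intensity_columns(df, "lfq", "LFQ intensity")
print(options["lfq_log2"])
```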

add_normalized_option(df_to_use, normalizer, norm_option_name)¶

Adds a new option/key of available data sets in all_intensities_dict and all_tree_dict by taking the data set all_intensities_dict[df_to_use], performing the normalization on the data and then adding the new option with add_intensity_column().

Parameters

df_to_use (str) – data set that should be normalized
normalizer (Union[Type[mspypeline.modules.Normalization.BaseNormalizer], Any]) – normalizer either derived from BaseNormalizer or a class with a fit_transform()
norm_option_name (str) – suffix of the new option name
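
The normalizer only needs a fit_transform() method. A minimal sketch of such a class is shown below, assuming that fit_transform() receives and returns a DataFrame with samples as columns. The class `MedianShiftNormalizer` is hypothetical and not shipped with mspypeline.

```python
import pandas as pd


class MedianShiftNormalizer:
    """Hypothetical normalizer: shifts every sample so that all medians are equal."""

    def fit_transform(self, data):
        medians = data.median(axis=0)
        # Subtract each sample's median, then add back the mean of all medians.
        return data - medians + medians.mean()


log2_df = pd.DataFrame(
    {"A_1": [10.0, 12.0, 14.0], "A_2": [11.0, 13.0, 15.0]},
    index=["P1", "P2", "P3"],
)
normalized = MedianShiftNormalizer().fit_transform(log2_df)
print(normalized.median(axis=0))
```

Under these assumptions, such a class could be handed to add_normalized_option() as the normalizer argument; whether the method expects the class itself or an instance should be checked against the installed version.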

create_results()¶

Creates all plots that were chosen/set to True in the settings “create plot” (see Analysis settings).

classmethod from_MSPInitializer(mspinit_instance, **kwargs)¶

Creates a BasePlotter from a MSPInitializer.

Parameters

mspinit_instance (mspypeline.core.MSPInitializer.MSPInitializer) – instance of a MSPInitializer used to get correct inputs for the plotter.
kwargs – all kwargs, which are passed to the BasePlotter.__init__() can be overwritten by passing as kwargs.

Returns

functional plotter

Return type

BasePlotter

classmethod from_file_reader(reader_instance, **kwargs)¶

Creates a BasePlotter from a BaseReader (BasePlotter or MaxQuantPlotter).

Parameters

reader_instance (mspypeline.file_reader.BaseReader.BaseReader) – instance of a BaseReader used to get correct inputs for the plotter.
kwargs – all kwargs, which are passed to the BasePlotter.__init__() can be overwritten by passing as kwargs.

Returns

functional plotter

Return type

BasePlotter

get_boxplot_data(df_to_use, level, **kwargs)¶

Gets protein intensities for all samples per group of the selected level and then sorts the samples by their median intensity.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

Dictionary with key “protein_intensities” to a DataFrame containing the protein intensities per group sorted by median intensity

Return type

Dict
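
The sorting step can be sketched with plain pandas. The data below is made up, and the ascending sort direction is an assumption for this example.

```python
import pandas as pd

# Hypothetical log2 intensities: rows are proteins, columns are samples.
intensities = pd.DataFrame(
    {"S1": [5.0, 7.0, 9.0], "S2": [1.0, 2.0, 3.0], "S3": [4.0, 4.5, 20.0]},
    index=["P1", "P2", "P3"],
)
# Order the samples by their median intensity (lowest median first).
order = intensities.median(axis=0).sort_values().index
sorted_intensities = intensities[order]
print(list(sorted_intensities.columns))
```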

get_detected_proteins_per_replicate_data(df_to_use, level, **kwargs)¶

Counts the number of protein intensity values greater than 0 (number of detected proteins) per sample of a group from the selected level.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

Dictionary with key “all_height” to a mapping of protein counts as Series per group

Return type

Dict
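
A minimal sketch of the counting described above, using made-up data. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical intensities of one group: rows are proteins, columns are replicates.
intensities = pd.DataFrame(
    {"A_1": [3.0, 0.0, 2.0, 0.0], "A_2": [1.0, 4.0, 0.0, 0.0]},
    index=["P1", "P2", "P3", "P4"],
)
detected = intensities > 0
# Number of detected proteins per replicate.
detected_per_sample = detected.sum(axis=0)
# Number of proteins detected in at least one replicate of the group.
detected_in_group = int(detected.any(axis=1).sum())
print(detected_per_sample.to_dict(), detected_in_group)
```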

get_detection_counts_data(df_to_use, level, **kwargs)¶

Counts the number of intensity values greater than 0 per protein (number of samples that the protein is detected in) per group of the selected level.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

Dictionary with key “counts” to a DataFrame containing the counts of proteins detected in a sample

Return type

Dict
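
The per-protein counting can be illustrated as follows. This is a sketch on made-up data, not the library's implementation.

```python
import pandas as pd

# Hypothetical intensities of one group: rows are proteins, columns are samples.
intensities = pd.DataFrame(
    {"A_1": [3.0, 0.0, 2.0, 0.0], "A_2": [1.0, 4.0, 0.0, 0.0]},
    index=["P1", "P2", "P3", "P4"],
)
# In how many samples is each protein detected?
samples_per_protein = (intensities > 0).sum(axis=1)
# How many proteins are detected in 0, 1, 2, ... samples?
counts = samples_per_protein.value_counts().sort_index()
print(counts.to_dict())
```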

get_experiment_comparison_data(df_to_use, full_name1, full_name2)¶

Gets protein intensities for all samples of a given group, then calculates the proteins that can be compared between groups and those that are unique for each group (see Thresholds and Comparisons) and takes the mean intensity of these proteins.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
full_name1 (str) – name of the first data node/group that should be compared to ‘full_name2’
full_name2 (str) – name of the second data node/group that should be compared to ‘full_name1’

Returns

Dictionary with keys “protein_intensities_sample1” and “protein_intensities_sample2” to Series containing the mean protein intensities of sample 1 and sample 2 and “exclusive_sample1” and “exclusive_sample2” to Series containing the mean intensities of unique proteins for sample 1 and sample 2.

Return type

Dict

get_go_analysis_data(df_to_use, level)¶

Calculates an enrichment analysis for all samples per group of the selected level and for each given GO list (see plot_go_analysis()). Significances are calculated with Fisher’s exact test.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared

Returns

Dictionary with keys “heights” to a ddict containing the counts of proteins per sample of each given GO list, “test_results” to a ddict containing the corresponding Fisher’s exact test results and “go_length” to a list containing the total number of proteins of each chosen GO list

Return type

Dict

get_intensity_heatmap_data(df_to_use, level, sort_index=False, sort_index_by_missing=True, sort_columns_by_missing=True, **kwargs)¶

Get the protein intensities for all samples per group of the selected level and sorts samples and proteins according to settings.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
sort_index (bool) – should proteins be sorted alphanumerically
sort_index_by_missing (bool) – should proteins be sorted by number of missing values across samples
sort_columns_by_missing (bool) – should samples be sorted by number of missing values
kwargs – accepts kwargs

Returns

Dictionary with key “intensities” to a DataFrame containing protein intensities of samples

Return type

Dict

get_intensity_histograms_data(df_to_use, level, **kwargs)¶

Get protein intensity values for each sample per group of the selected level.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

Dictionary with key “hist_data” to a DataFrame containing the protein intensity values per group

Return type

Dict

get_kde_data(df_to_use, level, **kwargs)¶

Gets the protein intensities for all samples per group of the selected level.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

Dictionary with key “intensities” to a DataFrame containing the protein intensities per group

Return type

Dict

get_n_protein_vs_quantile_data(df_to_use, level, quantile_range=None, **kwargs)¶

Gets protein intensities for all samples per group, counts the number of intensity values greater than 0 (total number of detected proteins) and the quantiles per sample.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
quantile_range (Optional[numpy.array]) – which quantile range should be used for analysis
kwargs – accepts kwargs

Returns

Dictionary with keys “quantiles” to a DataFrame of calculated quantiles per sample and “n_proteins” to a Series of total number of identified proteins per sample

Return type

Dict
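
The two returned quantities can be approximated with pandas as shown below. The quantile range and the data are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical intensities: rows are proteins, columns are samples.
intensities = pd.DataFrame(
    {"S1": [1.0, 2.0, 3.0, 4.0, 5.0], "S2": [2.0, 4.0, 6.0, 8.0, np.nan]}
)
quantile_range = np.array([0.25, 0.5, 0.75])
# One row per quantile, one column per sample; NaN values are ignored.
quantiles = intensities.quantile(quantile_range)
# Total number of detected proteins per sample (NaN > 0 evaluates to False).
n_proteins = (intensities > 0).sum(axis=0)
print(quantiles)
print(n_proteins)
```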

get_pathway_analysis_data(df_to_use, level, pathway, equal_var=True, **kwargs)¶

Filters out all proteins of the given pathways for all samples per group of the selected level, then calculates the pairwise significances between the groups with an independent t-test (see plot_pathway_analysis()) for all those proteins that can be compared (see Thresholds and Comparisons).

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
pathway (str) – which pathway should be analysed
equal_var – should equal variance be assumed
kwargs – accepts kwargs

Returns

Dictionary with keys “protein_intensities” to a DataFrame containing the protein intensities of detected proteins from all given pathways per group and “significances” to a DataFrame containing the calculated significances between groups for each protein of all given pathways

Return type

Dict

get_pca_data(df_to_use, level, n_components=2, fill_value=0, no_missing_values=True, fill_na_before_norm=False, **kwargs)¶

Gets protein intensities for all samples per group, processes the data according to the given arguments and then performs a dimensionality reduction (PCA) using sklearn.decomposition.PCA.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
level (int) – at which level of the data tree should the data be compared
n_components (int) – how many principal components should be calculated
fill_value (float) – if data should be interpolated, which fill value should be used
no_missing_values (bool) – should missing values be neglected
fill_na_before_norm (bool) – if data should be interpolated, should this be done before normalisation
kwargs – accepts kwargs

Returns

Dictionary with keys “pca_data” to a DataFrame containing the output of a PCA using sklearn.decomposition.PCA and “pca_fit” to a PCA object that was fitted to normalized input data

Return type

Dict
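
To show what the dimensionality reduction does, the sketch below computes principal components with a singular value decomposition in NumPy. This is a stand-in for sklearn.decomposition.PCA, not the call the method makes. The data layout (samples as rows, no missing values) is an assumption.

```python
import numpy as np

# Hypothetical log2 intensities: rows are samples, columns are proteins.
X = np.array(
    [
        [1.0, 2.0, 3.0],
        [2.0, 4.0, 6.5],
        [3.0, 6.0, 9.0],
        [4.0, 8.0, 12.5],
    ]
)
n_components = 2
# Center each protein (column) before the decomposition.
centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
# Coordinates of every sample on the first n_components principal axes.
pca_data = U[:, :n_components] * S[:n_components]
# Share of the total variance explained by each component.
explained_variance_ratio = S**2 / np.sum(S**2)
print(pca_data.shape, explained_variance_ratio[:n_components])
```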

get_r_volcano_data(g1, g2, df_to_use)¶

Gets the protein intensities for all samples of the two given groups, then calculates the proteins that can be compared between groups and those unique for each group (see Thresholds and Comparisons). Hands over the protein intensities to be compared to the R package limma, which outputs the logFC, p-value, adjusted p-value (Benjamini-Hochberg) and other data calculated based on a moderated t-statistic. P-value calculations are corrected for the intensity-variance relationship. Results are converted back to Python format afterwards.

Note

This function uses the R package limma which is automatically downloaded the first time this analysis is performed.

Parameters

g1 (str) – first sample that should be analysed (downregulated)
g2 (str) – second sample that should be analysed (upregulated)
df_to_use (str) – which dataframes/intensities should be analysed

Returns

Dictionary with keys “volcano_data” to a DataFrame containing processed output of the limma.eBayes analysis, “unique_g1” and “unique_g2” to Series containing the unique protein intensities per group

Return type

Dict

get_rank_data(df_to_use, full_name, **kwargs)¶

Get protein intensity values of the selected group and rank the proteins by their intensity value.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
full_name (str) – which data node/group of samples should be compared

Returns

Dictionary with key “rank_data” to Series containing the protein intensities of the group ranked by intensity value

Return type

Dict

get_relative_std_data(df_to_use, full_name, **kwargs)¶

Calculates which proteins of a group can be used for the analysis (see Thresholds and Comparisons) and filters out the proteins below the threshold.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
full_name (str) – which data node/group of samples should be compared

Returns

Dictionary with key “intensities” to a DataFrame containing the protein intensities of the group

Return type

Dict

get_scatter_replicates_data(df_to_use, full_name)¶

Get protein intensity values for each sample of a selected group.

Parameters

df_to_use (str) – which dataframes/intensities should be analysed
full_name (str) – which data node/group of samples should be compared

Returns

Dictionary with key “scatter_data” to a DataFrame containing the protein intensity values per replicate

Return type

Dict

get_venn_data_per_key(df_to_use, key)¶

Counts the protein intensity values greater than 0 (number of detected proteins) for each replicate of a group from the selected level.

Parameters

df_to_use (str) – which dataframe/intensity should be analysed
key (str) – which data node/group of samples should be compared

Returns

Dictionary containing the proteins detected per sample

Return type

Dict

get_venn_group_data(df_to_use, level, non_na_function=<function get_number_of_non_na_values>)¶

Calculates which proteins can be compared between groups or are unique for a group of the selected level (see Thresholds and Comparisons) and then counts these proteins per group.

Parameters

df_to_use (str) – which dataframe/intensity should be analysed
level (int) – at which level of the data tree should the data be compared
non_na_function – threshold function to determine if proteins can be compared, default: get_number_of_non_na_values()

Returns

Dictionary containing the proteins that can be compared per group

Return type

Dict

plot_all_normalizer_overview(dfs_to_use, levels, plot_function, file_name, normalizers=None, **kwargs)¶

Helper method to create a multi-paged file containing one plot per normalization option.

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use – which dataframes/intensities should be plotted
levels – at which level of the data tree should the data be compared
plot_function – which plot should be created
file_name – name of the file that is created and saved
normalizers – normalizers either derived from BaseNormalizer or a class with a fit_transform()
kwargs – accepts kwargs

Returns

A list of all created plots

Return type

list

plot_boxplot(dfs_to_use, levels, **kwargs)¶

A standard boxplot displaying the five quantile distribution per group of the selected level and ranking the groups by median intensity from the bottom of the graph to the top.

The plot is created by applying get_boxplot_data() to get protein intensities for all samples per group of the selected level and then sort the samples by their median intensity. Data is plotted and saved using save_boxplot_results().

The boxplot is part of the Normalization overview.

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_boxplot_results()

plot_detected_proteins_per_replicate(dfs_to_use, levels, **kwargs)¶

Uses get_detected_proteins_per_replicate_data() to count the number of protein intensity values greater than 0 (number of detected proteins) per sample of a group from the selected level.

The data is plotted and saved using save_detected_proteins_per_replicate_results() as bar diagram showing the number of detected proteins per sample as well as the total number of detected proteins for each group of a selected level.

The average number of detected proteins per group is indicated as gray dashed line.

To view adjustable parameters see “plot_detected_proteins_per_replicate_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

plot_detection_counts(dfs_to_use, levels, **kwargs)¶

Uses get_detection_counts_data() to count the number of intensity values > 0 per protein (number of samples that the protein is detected in) per group of the selected level.

The data is plotted and saved using save_detection_counts_results() as a bar diagram showing how often proteins are detected in a number of samples/replicates for each group.

To view adjustable parameters see “plot_detection_counts_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_detection_counts_results()

plot_experiment_comparison(dfs_to_use, levels, **kwargs)¶

To generate the experiment comparison plot, the function get_experiment_comparison_data() is used to retrieve protein intensity values for all samples of a given group and to classify those proteins that can be compared between groups and those that are unique for each group (see Thresholds and Comparisons). Then the mean intensity of these proteins is calculated.

For all groups of the selected level, pairwise comparisons of the protein intensities are plotted and their Pearson’s correlation coefficient r^2 is calculated.
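
The correlation between two groups can be sketched with NumPy. The values below are made up, and the sketch computes both Pearson's r and its square, since the text above reports the coefficient as r^2.

```python
import numpy as np

# Hypothetical mean log2 intensities of the comparable proteins in two groups.
group1 = np.array([20.0, 22.0, 24.0, 26.0])
group2 = np.array([21.0, 23.5, 24.5, 27.0])
# Off-diagonal entry of the 2x2 correlation matrix is Pearson's r.
r = np.corrcoef(group1, group2)[0, 1]
r_squared = r**2
print(round(r, 4), round(r_squared, 4))
```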

Unique proteins per group are shown at the bottom and right side of the graph (substitution of missing values by the minimum value of the data set).

The calculated Pearson’s correlation coefficient r^2 is additionally visualized in form of a correlation heatmap.

For every pairwise comparison of the groups from the selected level, one scatter plot is created and the results of all pairwise comparisons together are visualized in one combined correlation heatmap using save_experiment_comparison_results().

To view adjustable parameters see “plot_experiment_comparison_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Note

To determine which proteins can be compared between the two groups and which are unique for one group an internal threshold function is applied.

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_experiment_comparison_results()

plot_go_analysis(dfs_to_use, levels, **kwargs)¶

In the GO analysis, an enrichment analysis is performed for each selected GO term file (based on protein counts = proteins with intensity value > 0). For this analysis, get_go_analysis_data() is used to calculate the number of detected proteins from a GO term that are found in each group of the selected level. The data is illustrated as the length of the corresponding bar. P-values shown at the end of a bar indicate the calculated significance. Samples referred to as “Total” represent the complete data set, and the numbers at the top of the graph correspond to the count of detected proteins in all samples over the total number of proteins in the GO term. The data of all chosen pathways is plotted and saved in one graph using save_go_analysis_results().

For p-value calculation, first, for each GO term, a list “pathway_genes” is created by taking the intersection of the proteins from the GO list and the total detected proteins.

Secondly, a list of “non_pathway_genes” is created, which comprises all detected proteins except those in “pathway_genes”.

Third, lists of “experiment_genes” and “non_experiment_genes” are created in a similar fashion, where an experiment refers to a sample/group of samples of the data set.

Lastly, a one-tailed Fisher’s exact test is calculated to retrieve statistical significances based on the following contingency table:

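=================  ====================================  ========================================
                   in pathway                            not in pathway
=================  ====================================  ========================================
in experiment      experiment_genes & pathway_genes      experiment_genes & not_pathway_genes
not in experiment  not_experiment_genes & pathway_genes  not_experiment_genes & not_pathway_genes
=================  ====================================  ========================================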

The resulting p-value thus also depends on the overall protein count of the sample/group of samples. A sample is considered significant if the p-value is < 0.05.
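
The p-value calculation described above can be sketched with the standard library only. The gene names are made up, the function `fisher_exact_greater` is a hypothetical helper, and the direction of the one-tailed test (enrichment, i.e. the "greater" alternative) is an assumption.

```python
from math import comb

detected = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
go_list = {"g1", "g2", "g3", "g4", "g99"}  # g99 was never detected
experiment_genes = {"g1", "g2", "g3", "g5"}

# Build the four gene lists as described in the text.
pathway_genes = go_list & detected
not_pathway_genes = detected - pathway_genes
not_experiment_genes = detected - experiment_genes

table = [
    [len(experiment_genes & pathway_genes), len(experiment_genes & not_pathway_genes)],
    [len(not_experiment_genes & pathway_genes), len(not_experiment_genes & not_pathway_genes)],
]


def fisher_exact_greater(tab):
    """One-tailed Fisher's exact test: P(X >= a) under the hypergeometric null."""
    (a, b), (c, d) = tab
    row1, col1, total = a + b, a + c, a + b + c + d
    # Sum the probabilities of all tables at least as extreme as the observed one.
    favourable = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return favourable / comb(total, row1)


p_value = fisher_exact_greater(table)
print(table, p_value)
```

In practice, scipy.stats.fisher_exact with alternative="greater" performs the same test. The manual version is shown only to make the contingency table explicit.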

To view adjustable parameters see “plot_go_analysis_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_go_analysis_results()

plot_heatmap_overview_all_normalizers(dfs_to_use, levels, **kwargs)¶

Creates the intensity heatmap overview for all normalization methods.
The intensity heatmap demonstrates protein intensities, where samples are given in rows on the y axis and
proteins on the x axis. Missing values are colored in gray.
The heatmap can be used to spot patterns in the different normalization methods and to
understand how different intensity types affect the data.

To view adjustable parameters see “plot_heatmap_overview_all_normalizers_settings:” in the Adjustable Options Configs
For overview of plots see analysis options
For exemplary plot see gallery

Parameters

dfs_to_use – which dataframes/intensities should be plotted
levels – at which level of the data tree should the data be compared
kwargs – accepts kwargs

plot_intensity_heatmap(dfs_to_use, levels, **kwargs)¶

The intensity heatmap demonstrates protein intensities (derived from get_intensity_heatmap_data()), where samples are given in rows on the y axis and proteins on the x axis. Missing values are colored in gray. The data is plotted and saved using save_intensities_heatmap_result().

The heatmap can be used to spot patterns in the different normalization methods and to understand how different intensity types affect the data.

The Heatmap overview is created from a series of intensity heatmap plots.

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_intensities_heatmap_result()

plot_intensity_histograms(dfs_to_use, levels, **kwargs)¶

Uses get_intensity_histograms_data() to get protein intensity values for each sample per group of the selected level.

The intensity values of each sample are binned (default = 25) and the data of each sample from a group of the selected level is plotted and saved in one histogram using save_intensity_histogram_results().
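
The binning step can be sketched with NumPy. The simulated intensities below are not real data, and only the bin count of 25 is taken from the text above.

```python
import numpy as np

# Simulated log2 intensities of one sample (hypothetical values).
rng = np.random.default_rng(0)
sample = rng.normal(loc=25.0, scale=2.0, size=500)
# 25 equal-width bins between the minimum and maximum intensity.
counts, bin_edges = np.histogram(sample, bins=25)
# Position of the optional mean line.
mean_intensity = sample.mean()
print(len(counts), len(bin_edges))
```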

If the parameter “show_mean” is set to True in the configs the mean intensity of the plotted samples of a group is shown as gray dashed line.

To view adjustable parameters see “plot_intensity_histograms_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_intensity_histogram_results()

plot_kde(dfs_to_use, levels, **kwargs)¶

In the kernel density estimate (KDE) plot, one density graph per sample is plotted indicating the intensity (derived from get_kde_data()) on the x axis and the density on the y axis. The data is plotted and saved using save_kde_results().

These plots should be presented on a log2 scale.

The KDE is well suited to study the influence of different normalization methods and protein intensities on the data, which is why it is part of the Normalization overview.

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_kde_results()

plot_n_proteins_vs_quantile(dfs_to_use, levels, **kwargs)¶

Plots the quantile protein intensities against the number of identified proteins per sample. get_n_protein_vs_quantile_data() is used to get protein intensities for all samples per group and subsequently count the number of intensity values > 0 (total number of detected proteins) and the quantiles per sample. The data is visualized and saved by save_n_proteins_vs_quantile_results().

Samples are indicated as a horizontal line of scatter dots where the color and x position of a dot indicate the intensity value of the respective quantile. The y position of the dots of a sample indicates the total number of detected proteins in that sample.

Solid, rather vertical lines indicate a linear fit of each quantile for all the samples.

This plot is part of the Normalization overview.

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_n_proteins_vs_quantile_results()

plot_normalization_overview(dfs_to_use, levels, **kwargs)¶

The Normalization overview offers the opportunity to examine different aspects of the data in three distinct plots. For each normalization method provided an additional page is attached to the resulting pdf file starting with the raw or not normalized data. That way it is possible to get a better understanding of the effects of the normalization methods on the data, to inspect the different approaches and to find the best suitable normalization for the data.

The normalization overview combines the plots plot_kde() (see KDE example), plot_n_proteins_vs_quantile() (see proteins vs quantiles example) and plot_boxplot() (see boxplot example).

To view adjustable parameters see “plot_normalization_overview_all_normalizers_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_normalization_overview_results()

plot_normalization_overview_all_normalizers(dfs_to_use, levels, **kwargs)¶

Creates the plot_normalization_overview() for all normalization methods.

To view adjustable parameters see “plot_normalization_overview_all_normalizers_settings:” in the Adjustable Options Configs
For overview of plots see analysis options
For exemplary plot see gallery

Parameters

dfs_to_use – which dataframes/intensities should be plotted
levels – at which level of the data tree should the data be compared
kwargs – accepts kwargs

plot_pathway_analysis(dfs_to_use, levels, **kwargs)¶

In the pathway analysis, for each protein of a desired pathway a subplot is created displaying the intensities of the protein for all groups of the selected level.

First, get_pathway_analysis_data() is used to filter out all proteins of the desired pathways for all samples per group of the selected level. The function then determines which of those proteins can be compared between samples (see Thresholds and Comparisons) and significances of these protein intensities are calculated for each pairwise comparison between groups with an independent t-test. P value thresholds are set to the following: * is p < 0.05, ** is p < 0.005, and *** is p < 0.0005. For every selected pathway, two figures are created and saved using save_pathway_analysis_results(), one displaying the significances and the other not displaying them.
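
The p-value thresholds listed above translate into a small lookup. The function `significance_stars` is a hypothetical helper, and returning an empty string for non-significant values is an assumption.

```python
def significance_stars(p_value):
    """Map a p-value to the star annotation described in the text."""
    if p_value < 0.0005:
        return "***"
    if p_value < 0.005:
        return "**"
    if p_value < 0.05:
        return "*"
    # Assumption: comparisons that are not significant get no annotation.
    return ""


for p in (0.2, 0.03, 0.004, 0.0001):
    print(p, repr(significance_stars(p)))
```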

For a group of multiple samples, the protein intensity is plotted for each sample (single scatter dot) which are jointly presented in uniform coloring.

To view adjustable parameters see “plot_pathway_analysis_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Note

To determine which proteins can be compared between two groups an internal threshold function is applied.

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_pathway_analysis_results()

plot_pathway_timecourse(df_to_use='raw', show_suptitle=False, levels=2, **kwargs)¶

not yet implemented

Parameters

df_to_use (str) –
show_suptitle (bool) –
levels (Iterable) –

plot_pca_overview(dfs_to_use, levels, **kwargs)¶

With the option to perform PCA, data can be studied for its variance and in doing so, parameters can be determined that have most strongly affected the variability between samples. The created PCA compares all components against each other (default = 2 components).

PCA results are calculated using get_pca_data() that gets protein intensities for all samples per group, processes data according to the given arguments, and then performs a dimensionality reduction (PCA) using sklearn.decomposition.PCA. Multiple different analysis options can be chosen to generate a PCA (see: multiple option config).

The results do not change in dependence on the chosen level, however, determining the level on which the data should be compared influences the coloring of the scatter elements. Each group of the selected level is colored differently. The data is plotted and saved using save_pca_results().

To view adjustable parameters see “plot_pca_overview_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery


Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_pca_results()

plot_r_volcano(dfs_to_use, levels, sample1=None, sample2=None, **kwargs)¶

A volcano plot illustrates the statistical inferences from a pairwise comparison of two groups. The plot shows the log2 fold change between two different conditions against the -log10(p-value) (based on protein intensities). The p-value and adjusted p-value (Benjamini-Hochberg) are determined using the R limma package (moderated t-statistic). Additionally, calculations are corrected for the intensity-variance relationship. All these parameters are calculated by get_r_volcano_data().

Dashed lines indicate the fold change cutoff (default = log2(2)) and the p-value cutoff (default = p < 0.05) by which proteins are considered significant (blue and red) or non-significant (gray). Measured intensities of unique proteins are indicated at the sides of the volcano plot for each group (light blue and orange).

Volcano plots also permit the annotation of mapped proteins. This can be achieved by labeling a number of the most significant proteins for each group or by selecting a pathway analysis protein list.

For every pairwise comparison of the groups of the selected level two volcano plots are created and saved using save_volcano_results(), where one plot has a set of proteins annotated and the other does not.
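The quantities behind a volcano plot can be sketched in plain Python. The example below computes a log2 fold change, applies the Benjamini-Hochberg adjustment and classifies proteins by the two default cutoffs. It does not reproduce limma's moderated t-statistic: the p-values are made-up inputs, and the function names, the sign convention (positive means higher in sample2) and the use of the unadjusted p-value for the cutoff are assumptions of this sketch, not the behavior of get_r_volcano_data().

```python
import math
from statistics import mean

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value to keep values monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

def classify(log2_fc, p_value, fc_cutoff=math.log2(2), p_cutoff=0.05):
    """Label a protein using the fold change and p-value cutoffs."""
    if p_value < p_cutoff and log2_fc > fc_cutoff:
        return "up"
    if p_value < p_cutoff and log2_fc < -fc_cutoff:
        return "down"
    return "non-significant"

# protein: (log2 intensities sample1, log2 intensities sample2, p-value)
# All numbers are hypothetical; real p-values would come from limma.
data = {
    "A": ([20.0, 20.2, 19.8], [22.0, 22.3, 21.9], 0.001),
    "B": ([18.0, 18.1, 17.9], [18.2, 18.1, 18.0], 0.300),
    "C": ([25.0, 25.1, 24.9], [23.0, 22.8, 22.9], 0.004),
    "D": ([21.0, 24.0, 19.0], [23.5, 26.0, 22.0], 0.200),
}

# On log2 data the fold change is a difference of group means.
log2_fc = {name: mean(s2) - mean(s1) for name, (s1, s2, _) in data.items()}
adjusted_p = dict(zip(data, benjamini_hochberg([p for _, _, p in data.values()])))
labels = {name: classify(log2_fc[name], data[name][2]) for name in data}
```

Protein D shows why both cutoffs matter: its fold change is large, but its p-value is above 0.05, so it stays in the non-significant class.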

To view adjustable parameters see “plot_r_volcano_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Note

This method should be used with log2 intensities.
A minimum of 3 samples per group is required.

Note

To determine which proteins can be compared between the two groups and which are unique for one group an internal threshold function is applied.

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
sample1 (str) – first sample that should be compared (downregulated)
sample2 (str) – second sample that should be compared (upregulated)
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

plot_rank(dfs_to_use, levels, **kwargs)¶

In the rank plot all proteins are sorted by intensity value using get_rank_data() and plotted against their rank. For every group of the selected level one plot is created and saved by save_rank_results(), averaging the protein intensities of the replicates of a group.

The highest intensity is assigned rank 0 and the lowest intensity is assigned rank (number of proteins - 1). Proteins with missing values are ignored. The median intensity of all proteins is given in the legend.

Pathway analysis protein lists can be applied to the rank plot to provide information about the median intensity or rank of pathways of interest. If a protein is part of a selected pathway it is presented in color and the median rank of all proteins of a given pathway is indicated. Multiple pathways can be selected and are then represented in the same graph as distinct groups.
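The ranking rule and the pathway median can be illustrated with a few lines of standard-library Python. The protein names, intensity values and pathway membership below are invented for the example, and the code is only a sketch of the described logic, not the implementation of get_rank_data().

```python
import math
from statistics import median

# Hypothetical mean log2 intensities of one group; NaN marks a missing value.
intensities = {"P1": 30.1, "P2": 25.4, "P3": float("nan"), "P4": 28.0, "P5": 22.3}

# Proteins with missing values are left out before ranking.
valid = {protein: value for protein, value in intensities.items()
         if not math.isnan(value)}

# Highest intensity gets rank 0, lowest gets rank len(valid) - 1.
ranked = sorted(valid, key=valid.get, reverse=True)
rank = {protein: position for position, protein in enumerate(ranked)}
median_intensity = median(valid.values())

# Median rank of the pathway members that were actually ranked.
pathway = {"P3", "P4", "P5"}
pathway_ranks = [rank[protein] for protein in pathway if protein in rank]
median_pathway_rank = median(pathway_ranks)
```

Here P3 belongs to the pathway but has no intensity, so it contributes to neither the ranking nor the pathway median.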

To view adjustable parameters see “plot_rank_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_rank_results()

plot_relative_std(dfs_to_use, levels, **kwargs)¶

Illustrates the relative standard deviation of the proteins between samples of a group which can help to understand how much fluctuation of the measured intensities is present between the replicates. Low deviation indicates that measured intensities are stable over multiple samples.

For each group of the selected level one plot is created.

The method applies get_relative_std_data() to calculate which proteins of a group can be used for the analysis (see Thresholds and Comparisons) and to filter out proteins below the threshold. Then, save_relative_std_results() is used to calculate the relative standard deviation and plot and save the data.

Lines drawn in different shades of blue indicate arbitrarily chosen thresholds of 10%, 20% and 30% relative std, together with the number of proteins with a relative std below these values.
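As a small worked example of the quantity being plotted, the sketch below computes the relative standard deviation (standard deviation divided by mean) per protein and counts how many proteins fall below each of the three thresholds. The intensity values are invented, and the use of the sample standard deviation on non-log intensities is an assumption of this sketch rather than a documented detail of save_relative_std_results().

```python
from statistics import mean, stdev

# Hypothetical intensities of four proteins across three replicates.
replicate_intensities = {
    "A": [100.0, 105.0, 95.0],
    "B": [100.0, 115.0, 85.0],
    "C": [100.0, 150.0, 50.0],
    "D": [100.0, 125.0, 75.0],
}

def relative_std(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return stdev(values) / mean(values)

rel_std = {protein: relative_std(values)
           for protein, values in replicate_intensities.items()}

# Number of proteins below each threshold line (10%, 20%, 30%).
thresholds = (0.1, 0.2, 0.3)
counts_below = {t: sum(value < t for value in rel_std.values())
                for t in thresholds}
```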

To view adjustable parameters see “plot_relative_std_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Note

To determine which proteins can be compared between the samples of a group an internal threshold function is applied.

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_relative_std_results()

plot_scatter_replicates(dfs_to_use, levels, **kwargs)¶

Uses get_scatter_replicates_data() to retrieve protein intensity values for each sample of a selected group.

For all samples/replicates per group of the selected level, pairwise comparisons of the protein intensities are plotted and the squared Pearson correlation coefficient (r^2) is calculated for each pair.

Unique proteins per replicate are shown at the bottom and right side of the graph (NA values are replaced by the minimum value of the data set).

The calculated r^2 values are additionally visualized in the form of a correlation heatmap.

For a group with more than 2 replicates, each pairwise comparison of the replicates is calculated and plotted together in one graph. For every group of the selected level one scatter plot and one correlation heatmap is created and saved using save_scatter_replicates_results().
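The pairwise comparison can be sketched as follows: every combination of two replicates is formed, and r^2 is computed from the proteins measured in both. The replicate names and values are made up, and restricting the correlation to shared proteins is an assumption of this sketch, not a statement about get_scatter_replicates_data().

```python
import math
from itertools import combinations

# Hypothetical log2 intensities per replicate; None marks a missing value.
replicates = {
    "rep1": [20.0, 22.0, 24.0, 26.0, None],
    "rep2": [21.0, 23.0, 25.0, 27.0, 19.0],
    "rep3": [26.0, 24.0, 22.0, 20.0, 18.0],
}

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

r_squared = {}
for name_a, name_b in combinations(replicates, 2):
    # Only proteins measured in both replicates enter the correlation.
    shared = [(a, b)
              for a, b in zip(replicates[name_a], replicates[name_b])
              if a is not None and b is not None]
    xs, ys = zip(*shared)
    r_squared[(name_a, name_b)] = pearson_r(xs, ys) ** 2
```

With three replicates this yields three pairs, which is also the number of off-diagonal entries a correlation heatmap would show on one side of the diagonal.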

To view adjustable parameters see “plot_scatter_replicates_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_scatter_replicates_results()

plot_venn_groups(dfs_to_use, levels, **kwargs)¶

Venn diagrams illustrate set relationships graphically. In mspypeline, the sets consist of detected proteins (proteins with an intensity value > 0), and the set relationships indicate the number of proteins that are shared between two or more sets. This allows the similarity of the detected proteins between sets to be assessed.

The function get_venn_group_data() is used to calculate which proteins can be compared between groups or are unique for a group of the selected level (see Thresholds and Comparisons) and then counts these proteins per group.

The method then creates and saves both a venn diagram using save_venn() and a bar-venn diagram using save_bar_venn(), comparing the similarity of the groups on the selected level (based on protein counts). The ordinary venn diagram is quite intuitive, but mspypeline supports it for a maximum of three sets. The bar-venn diagram has the advantage of allowing more than three comparison sets (see the note below). A bar-venn figure consists of two combined graphs: an upper bar diagram that indicates the number of unique or shared proteins of a set or of overlapping sets, and a lower graph that indicates which set or sets are being compared, i.e. which protein count (upper graph) belongs to which comparison (lower graph).
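The counts shown in such diagrams can be derived with ordinary set operations. The sketch below assigns every protein to the exact combination of groups in which it was detected and counts the proteins per combination. The group names and protein identifiers are invented, and treating each bar as an exclusive region is an assumption of this sketch rather than a documented property of save_bar_venn().

```python
from collections import Counter

# Hypothetical sets of detected proteins per group of the selected level.
detected = {
    "ctrl": {"P1", "P2", "P3", "P4"},
    "treatA": {"P2", "P3", "P5"},
    "treatB": {"P3", "P4", "P5", "P6"},
}

# Count proteins per exact combination of groups in which they occur.
region_counts = Counter()
for protein in set().union(*detected.values()):
    members = frozenset(group for group, proteins in detected.items()
                        if protein in proteins)
    region_counts[members] += 1

unique_to_ctrl = region_counts[frozenset({"ctrl"})]
shared_by_all = region_counts[frozenset(detected)]
```

Each key of `region_counts` corresponds to one comparison in the lower graph of a bar-venn diagram, and its value to the height of the bar above it.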

To view adjustable parameters see “plot_venn_groups_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Note

A venn diagram can compare a maximum of 3 samples.

A bar-venn diagram can compare more than 3 samples.

If the selected level has more than 3 groups, only the bar-venn diagram is created.

If the selected level has more than 6 groups, no diagram is created.

Note

To determine which proteins can be compared between the groups and which are unique for one group an internal threshold function is applied.

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_venn(), save_bar_venn()

plot_venn_results(dfs_to_use, levels, **kwargs)¶

Venn diagrams illustrate set relationships graphically. In mspypeline, the sets consist of detected proteins (protein counts greater than zero), and the set relationships indicate the number of proteins that are shared between two or more sets. This allows the similarity of the detected proteins between sets to be assessed.

The function get_venn_data_per_key() is used to count the protein intensity values > 0 (number of detected proteins) for each replicate of a group from the selected level.

The method creates and saves both a venn diagram using save_venn() and a bar-venn diagram using save_bar_venn(), comparing the similarity of the replicates of each group from the selected level (based on protein counts). The ordinary venn diagram is quite intuitive, but mspypeline supports it for a maximum of three sets. The bar-venn diagram has the advantage of allowing more than three comparison sets (see the note below). A bar-venn figure consists of two combined graphs: an upper bar diagram that indicates the number of unique or shared proteins of a set or of overlapping sets, and a lower graph that indicates which set or sets are being compared, i.e. which protein count (upper graph) belongs to which comparison (lower graph).

To view adjustable parameters see “plot_venn_results_settings:” in the Adjustable Options Configs

For overview of plots see analysis options

For exemplary plot see gallery

Note

A venn diagram can compare a maximum of 3 samples.

A bar-venn diagram can compare more than 3 samples.

If a group of the selected level has more than 3 replicates, only the bar-venn diagram is created.

If the selected level has more than 6 groups, no diagram is created.
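The detection criterion and the limits listed in this note can be summarized in a short sketch. Both function names are hypothetical helpers written for this example and are not part of mspypeline. The behavior for fewer than two sets is not described in the note and is not handled specially here.

```python
def detected_proteins(intensities):
    """Proteins with an intensity > 0; NaN and 0 count as not detected."""
    return {protein for protein, value in intensities.items() if value > 0}

def diagrams_to_create(n_sets):
    """Diagram types for a given number of sets, following the note."""
    if n_sets > 6:
        return []
    if n_sets > 3:
        return ["bar-venn"]
    return ["venn", "bar-venn"]

# Hypothetical intensities of one replicate.
replicate = {"P1": 1.2e6, "P2": 0.0, "P3": float("nan"), "P4": 3.4e5}
found = detected_proteins(replicate)
```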

Parameters

dfs_to_use (Union[str, Iterable[str]]) – which dataframes/intensities should be plotted
levels (Union[int, Iterable[int]]) – at which level of the data tree should the data be compared
kwargs – accepts kwargs

Returns

A list of all created plots.

Return type

List

See also

save_venn(), save_bar_venn()

MaxQuantPlotter¶

class mspypeline.MaxQuantPlotter(start_dir, reader_data, intensity_df_name='proteinGroups', interesting_proteins=None, go_analysis_gene_names=None, configs=None, required_reader='mqreader', intensity_entries=(('raw', 'Intensity ', 'Intensity'), ('lfq', 'LFQ intensity ', 'LFQ intensity'), ('ibaq', 'iBAQ ', 'iBAQ intensity')), loglevel=10)¶

MaxQuant Plotter is a child class of the BasePlotter and inherits all functionality to get data and generate plots.

__init__(start_dir, reader_data, intensity_df_name='proteinGroups', interesting_proteins=None, go_analysis_gene_names=None, configs=None, required_reader='mqreader', intensity_entries=(('raw', 'Intensity ', 'Intensity'), ('lfq', 'LFQ intensity ', 'LFQ intensity'), ('ibaq', 'iBAQ ', 'iBAQ intensity')), loglevel=10)¶

Parameters

start_dir (str) – location to save results
reader_data (dict) – mapping to provide input data
intensity_df_name (str) – name/key to input data
interesting_proteins (dict) – mapping with pathway proteins to analyze
go_analysis_gene_names (dict) – mapping with go terms to analyze
configs (dict) – mapping of configuration
required_reader – name of the file reader
intensity_entries – tuple of (key in all_tree_dict, prefix in data, name in plot). See add_intensity_column().
loglevel – level of the logger
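To make the role of intensity_entries more concrete, the sketch below treats the default value as three (option name, column prefix, plot name) triples and selects the matching columns from a made-up list of column names. Grouping the defaults into triples follows the parameter description. Selecting columns with a plain prefix match, as well as the column names themselves, are assumptions of this example and not the code of add_intensity_column().

```python
# Default entries read as (key in all_tree_dict, prefix in data, name in plot).
INTENSITY_ENTRIES = (
    ("raw", "Intensity ", "Intensity"),
    ("lfq", "LFQ intensity ", "LFQ intensity"),
    ("ibaq", "iBAQ ", "iBAQ intensity"),
)

# Hypothetical column names of a proteinGroups table.
columns = [
    "Protein IDs", "Intensity", "Intensity ctrl_1", "Intensity treat_1",
    "LFQ intensity ctrl_1", "LFQ intensity treat_1",
    "iBAQ", "iBAQ ctrl_1", "iBAQ treat_1",
]

def sample_columns(prefix, column_names):
    """Map sample name to column name for all columns starting with prefix."""
    return {name[len(prefix):]: name
            for name in column_names if name.startswith(prefix)}

selected = {option: sample_columns(prefix, columns)
            for option, prefix, _ in INTENSITY_ENTRIES}

# Each entry is described as adding a normal and a log2 option.
option_names = [name
                for option, _, _ in INTENSITY_ENTRIES
                for name in (option, f"{option}_log2")]
```

The trailing space in each prefix matters in this sketch: it keeps the summary columns "Intensity" and "iBAQ" out of the per-sample selection.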

create_report(target_dir=None)¶

Creates a MaxQuantReport.pdf, which can be used as quality control.

For overview of plots see analysis options

For exemplary plot see gallery

Parameters

target_dir (str) – directory where report will be written

classmethod from_MSPInitializer(mspinit_instance, **kwargs)¶

Creates a BasePlotter from an MSPInitializer.

Parameters

mspinit_instance (mspypeline.core.MSPInitializer.MSPInitializer) – instance of an MSPInitializer used to get correct inputs for the plotter.
kwargs – any argument passed to BasePlotter.__init__() can be overridden by passing it as a keyword argument.

Returns

functional plotter

Return type

BasePlotter

classmethod from_file_reader(reader_instance, **kwargs)¶

Creates a BasePlotter from a BaseReader (BasePlotter or MaxQuantPlotter).

Parameters

reader_instance (mspypeline.core.MSPPlots.MaxQuantPlotter.MQReader) – instance of a BaseReader used to get correct inputs for the plotter.
kwargs – any argument passed to BasePlotter.__init__() can be overridden by passing it as a keyword argument.

Returns

functional plotter

Return type

BasePlotter