Modules

Import with:

from mspypeline import MedianNormalizer, QuantileNormalizer, TailRobustNormalizer, interpolate_data
from mspypeline.modules.Normalization import BaseNormalizer
from mspypeline import DataNode, DataTree

Normalization

class mspypeline.modules.Normalization.BaseNormalizer(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Abstract base class for normalizers. Derived normalizers should implement the fit() and transform() methods.
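
A minimal sketch of a custom normalizer, assuming only the fit()/transform() contract described here; the class name and the mean-centering logic are made-up examples and not part of mspypeline:

from mspypeline.modules.Normalization import BaseNormalizer

class MeanCenterNormalizer(BaseNormalizer):  # hypothetical example class
    def fit(self, data):
        # remember the column-wise (per-sample) means
        self.col_means_ = data.mean(axis=0)
        return self

    def transform(self, data):
        # subtract the stored per-sample mean from every column
        return data - self.col_means_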

__init__(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

abstract fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

fit_transform(data)

Chains the fit and transform method.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

abstract transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

class mspypeline.MedianNormalizer(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Median normalizer. The median protein intensity of each sample (column) is calculated, and the mean of all sample medians is subtracted from each sample median to obtain a per-sample correction factor. This correction factor is then subtracted from each protein intensity of the respective sample.
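
A rough pandas sketch of the calculation described above (not the library's internal code), assuming data is a DataFrame of log2 intensities with samples as columns:

sample_medians = data.median(axis=0)                  # median intensity per sample
correction = sample_medians - sample_medians.mean()   # per-sample correction factor
normalized = data - correction                        # subtract it from every intensity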

__init__(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

class mspypeline.QuantileNormalizer(missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Quantile normalizer. Proteins are first ranked by their intensity value within each sample (column). The mean protein intensity per quantile (rank) across all samples is calculated and assigned to every protein of the corresponding rank in each sample, and the data is then rearranged back into the original order of the intensity values for each sample. For a more in-depth description see: https://en.wikipedia.org/wiki/Quantile_normalization
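
A compact pandas sketch of plain quantile normalization as described above (not the library's internal code), assuming data has samples as columns and no missing values; the actual QuantileNormalizer additionally applies its missing_value_handler first:

rank_means = data.stack().groupby(
    data.rank(method="first").stack().astype(int)).mean()   # mean intensity per rank
normalized = (data.rank(method="min").stack().astype(int)
              .map(rank_means).unstack())                    # reassign rank means, keeping each sample's order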

__init__(missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • missing_value_handler (Optional[Callable]) – function to fill missing values

  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

class mspypeline.TailRobustNormalizer(normalizer=<class 'mspypeline.modules.Normalization.QuantileNormalizer'>, missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Tail robust normalizer. An offsetting factor is first calculated as the sample-wise mean and subtracted from each protein intensity of the respective sample (column). The wrapped normalization is then applied, and the respective offset is added back to each protein of the sample. This is an abstracted implementation of the tail robust quantile normalization described here: https://www.biorxiv.org/content/10.1101/2020.04.17.046227v1.full
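
A sketch of the wrapping described above (not the library's internal code), assuming data holds log2 intensities with samples as columns; QuantileNormalizer (imported above) stands in for the wrapped normalizer:

inner = QuantileNormalizer(input_scale="log2", output_scale="log2")  # the wrapped normalizer
offsets = data.mean(axis=0)                        # per-sample offsetting factor
centered = data - offsets                          # remove the offset from every protein
result = inner.fit_transform(centered) + offsets   # normalize, then add the offsets back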

__init__(normalizer=<class 'mspypeline.modules.Normalization.QuantileNormalizer'>, missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • normalizer (Type[mspypeline.modules.Normalization.BaseNormalizer]) – a normalizer that should be used in combination with this normalizer

  • missing_value_handler (Optional[Callable]) – function to fill missing values

  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

mspypeline.interpolate_data(data)

Performs interpolation of missing values (protein intensity = 0) on the data by sampling from the distribution of the input data. Adapted from https://github.com/bmbolstad/preprocessCore, more specifically: https://github.com/bmbolstad/preprocessCore/blob/master/src/qnorm.c

Parameters

data (pandas.core.frame.DataFrame) – A DataFrame with columns being the samples and rows being the features

Returns

Filled data where all values have been replaced by interpolating from the old data column-wise. For the non-missing entries the new values are very close to the old ones, while for the missing entries a sampled value is assigned.

Return type

DataFrame
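
A hedged usage sketch: interpolate_data is the default missing_value_handler of QuantileNormalizer and TailRobustNormalizer and can also be called directly; intensities is a made-up DataFrame with samples as columns:

from mspypeline import QuantileNormalizer, interpolate_data

filled = interpolate_data(intensities)   # directly fill missing (zero) intensities
normalizer = QuantileNormalizer(missing_value_handler=interpolate_data)
normalized = normalizer.fit_transform(intensities)   # missing values handled via the provided handler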

DataStructure

class mspypeline.DataNode(name='', level=0, parent=None, data=None, children=None)
__init__(name='', level=0, parent=None, data=None, children=None)

Default parameters will return a root node

Parameters
  • name (str) – Name of the node

  • level (int) – depth of the node

  • parent (Optional[mspypeline.modules.DataStructure.DataNode]) – Parent of this node

  • data (pandas.core.series.Series) – Is None when there are nodes below this one, which were not aggregated as technical replicates

  • children (Dict[str, DataNode]) – Maps name of a child to a child node

See also

DataTree()

A class to help construct a node structure from data

aggregate(method='mean', go_max_depth=False, index=None)
Parameters
  • method (Union[None, str, Callable]) – If None no aggregation will be applied. Otherwise needs to be accepted by pd.aggregate.

  • go_max_depth (bool) – If technical replicates were aggregated, this can be specified to use the unaggregated values instead.

  • index (Union[str, pandas.core.indexes.base.Index, None]) – Index to subset the data with. If None no index is applied

Returns

Result of the aggregation

Return type

Union[pd.Series, pd.DataFrame]

get_total_number_children(go_max_depth=False)

Gets the number of children containing data below this node. If go_max_depth is True, searches for the deepest DataNodes.

Parameters

go_max_depth (bool) – default False

Returns

The number of all children below this node

Return type

int

groupby(method='mean', go_max_depth=False, index=None)

Considers each child a group, then aggregates all children.

Parameters
  • method (Union[str, Callable]) – Will be passed to aggregate.

  • go_max_depth (bool) – Will be passed to aggregate.

  • index (Union[None, str, pandas.core.indexes.base.Index]) – Will be passed to aggregate.

Returns

Result of the grouping

Return type

data

See also

aggregate()

Will be called on each of the groups
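
A hedged sketch of how aggregate() and groupby() relate, where node stands for a DataNode whose children carry data (e.g. obtained from a DataTree):

overall_mean = node.aggregate(method="mean")    # aggregate the data below this node
per_child_means = node.groupby(method="mean")   # aggregate() applied to each child group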

class mspypeline.DataTree(root)

Data structure in which the experiment is stored. Each leaf node is a DataNode.

root

The root node of the tree

Type

DataNode

level_keys_full_name

Maps each depth level to the DataNode.full_name values of all nodes at that level

Type

Dict[int, List[str]]

__init__(root)
Parameters

root (mspypeline.modules.DataStructure.DataNode) – The root node of the Tree.

add_data(data)
Parameters

data (pandas.core.frame.DataFrame) – Data which will be used to fill the nodes with a Series. The column names of the data need to be the same as the full names of the DataNodes.

aggregate(key=None, method='mean', go_max_depth=False, index=None)
Parameters
  • key (Optional[str]) –

  • method (Union[None, str, Callable]) –

  • go_max_depth (bool) –

  • index (Optional) –

Return type

Union[pandas.core.series.Series, pandas.core.frame.DataFrame]

aggregate_technical_replicates()

Aggregates the deepest level to one level above by using aggregate

classmethod from_analysis_design(analysis_design, data=None, should_aggregate_technical_replicates=True)
Parameters
  • analysis_design (dict) – nested dict representing the analysis design of the experiment

  • data (Union[None, pandas.core.frame.DataFrame]) – Will be passed to add_data. If None no data is added

  • should_aggregate_technical_replicates (bool) – If True the lowest level of the analysis design is considered as a technical replicate and averaged

Returns

Return type

cls

See also

add_data()

will be called if data is not None

aggregate_technical_replicates()

will be called if should_aggregate_technical_replicates
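
A hedged usage sketch, assuming the analysis design is a nested dict whose leaves are the full sample names and that the DataFrame columns match those names; all names and values below are made up:

import pandas as pd
from mspypeline import DataTree

analysis_design = {
    "GroupA": {"Cond1": {"Rep1": "GroupA_Cond1_Rep1",
                         "Rep2": "GroupA_Cond1_Rep2"}},
    "GroupB": {"Cond1": {"Rep1": "GroupB_Cond1_Rep1"}},
}
intensities = pd.DataFrame(
    {"GroupA_Cond1_Rep1": [20.1, 21.3],
     "GroupA_Cond1_Rep2": [20.4, 21.0],
     "GroupB_Cond1_Rep1": [19.8, 22.1]},
    index=["ProteinA", "ProteinB"])

tree = DataTree.from_analysis_design(
    analysis_design, data=intensities,
    should_aggregate_technical_replicates=True)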

groupby(key_or_index=None, new_col_name=None, method='mean', go_max_depth=False, index=None)
Parameters
  • key_or_index (Union[None, str, int]) –

  • new_col_name (str) –

  • method (Union[None, str, Callable]) –

  • go_max_depth (bool) –

  • index

Return type

Union[pandas.core.series.Series, pandas.core.frame.DataFrame]