Modules

Import with:

from mspypeline import MedianNormalizer, QuantileNormalizer, TailRobustNormalizer, interpolate_data
from mspypeline.modules.Normalization import BaseNormalizer
from mspypeline import DataNode, DataTree

Normalization

class mspypeline.modules.Normalization.BaseNormalizer(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Abstract base class for normalizers. Derived normalizers should implement the fit() and transform() methods.
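
A minimal sketch of a custom normalizer, assuming only the fit()/transform() contract described here; the class name and the mean-centering logic are made-up examples and not part of mspypeline:

from mspypeline.modules.Normalization import BaseNormalizer

class MeanCenterNormalizer(BaseNormalizer):  # hypothetical example class
    def fit(self, data):
        # remember the column-wise (per-sample) means
        self.col_means_ = data.mean(axis=0)
        return self

    def transform(self, data):
        # subtract the stored per-sample mean from every column
        return data - self.col_means_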

__init__(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

abstract fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

fit_transform(data)

Chains the fit and transform method.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

abstract transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

class mspypeline.MedianNormalizer(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Median normalizer. The median protein intensity of each sample (column) is calculated, and the mean of all sample medians is subtracted from each sample median to obtain a per-sample correction factor. This correction factor is then subtracted from each protein intensity of the respective sample.
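
A rough pandas sketch of the calculation described above (not the library's internal code), assuming data is a DataFrame of log2 intensities with samples as columns:

sample_medians = data.median(axis=0)                  # median intensity per sample
correction = sample_medians - sample_medians.mean()   # per-sample correction factor
normalized = data - correction                        # subtract it from every intensity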

__init__(input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

class mspypeline.QuantileNormalizer(missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Quantile normalizer. Proteins are first ranked by their intensity value within each sample (column). The mean protein intensity per quantile (rank) across all samples is calculated and assigned to every protein of the corresponding rank in each sample, and the data is then rearranged back into the original order of the intensity values for each sample. For a more in-depth description see: https://en.wikipedia.org/wiki/Quantile_normalization
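
A compact pandas sketch of plain quantile normalization as described above (not the library's internal code), assuming data has samples as columns and no missing values; the actual QuantileNormalizer additionally applies its missing_value_handler first:

rank_means = data.stack().groupby(
    data.rank(method="first").stack().astype(int)).mean()   # mean intensity per rank
normalized = (data.rank(method="min").stack().astype(int)
              .map(rank_means).unstack())                    # reassign rank means, keeping each sample's order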

__init__(missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • missing_value_handler (Optional[Callable]) – function to fill missing values

  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

class mspypeline.TailRobustNormalizer(normalizer=<class 'mspypeline.modules.Normalization.QuantileNormalizer'>, missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)

Tail robust normalizer. An offsetting factor is first calculated as the sample-wise mean and subtracted from each protein intensity of the respective sample (column). The wrapped normalization is then applied, and the respective offset is added back to each protein of the sample. This is an abstracted implementation of the tail robust quantile normalization described here: https://www.biorxiv.org/content/10.1101/2020.04.17.046227v1.full
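
A sketch of the wrapping described above (not the library's internal code), assuming data holds log2 intensities with samples as columns; QuantileNormalizer (imported above) stands in for the wrapped normalizer:

inner = QuantileNormalizer(input_scale="log2", output_scale="log2")  # the wrapped normalizer
offsets = data.mean(axis=0)                        # per-sample offsetting factor
centered = data - offsets                          # remove the offset from every protein
result = inner.fit_transform(centered) + offsets   # normalize, then add the offsets back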

__init__(normalizer=<class 'mspypeline.modules.Normalization.QuantileNormalizer'>, missing_value_handler=<function interpolate_data>, input_scale='log2', output_scale='normal', col_name_prefix=None, loglevel=10, **kwargs)
Parameters
  • normalizer (Type[mspypeline.modules.Normalization.BaseNormalizer]) – a normalizer that should be used in combination with this normalizer

  • missing_value_handler (Optional[Callable]) – function to fill missing values

  • input_scale (str) – Scale of the input data. Either normal or log2

  • output_scale (str) – Scale of the output data. Either normal or log2

  • col_name_prefix (Optional[str]) – If not None the prefix is added to each column name

  • loglevel (int) – loglevel of the logger

  • kwargs – accepts kwargs

fit(data)

Abstract fit method. Should return self.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

The normalizer instance.

Return type

self

transform(data)

Abstract transform method. Should return transformed data.

Parameters

data (pandas.core.frame.DataFrame) – Should be a DataFrame or ndarray.

Returns

transformed data

Return type

DataFrame

mspypeline.interpolate_data(data)

Performs interpolation of missing values (protein intensity = 0) on the data by sampling from the distribution of the input data. Adapted from https://github.com/bmbolstad/preprocessCore, more specifically: https://github.com/bmbolstad/preprocessCore/blob/master/src/qnorm.c

Parameters

data (pandas.core.frame.DataFrame) – A DataFrame with columns being the samples and rows being the features

Returns

Filled data where all values have been replaced by interpolating from the old data column-wise. For the non-missing entries the new values are very close to the old ones, while for the missing entries a sampled value is assigned.

Return type

DataFrame
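
A hedged usage sketch: interpolate_data is the default missing_value_handler of QuantileNormalizer and TailRobustNormalizer and can also be called directly; intensities is a made-up DataFrame with samples as columns:

from mspypeline import QuantileNormalizer, interpolate_data

filled = interpolate_data(intensities)   # directly fill missing (zero) intensities
normalizer = QuantileNormalizer(missing_value_handler=interpolate_data)
normalized = normalizer.fit_transform(intensities)   # missing values handled via the provided handler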

DataStructure

class mspypeline.DataNode(name='', level=0, parent=None, data=None, children=None)
__init__(name='', level=0, parent=None, data=None, children=None)

Default parameters will return a root node

Parameters
  • name (str) – Name of the node

  • level (int) – depth of the node

  • parent (Optional[mspypeline.modules.DataStructure.DataNode]) – Parent of this node

  • data (pandas.core.series.Series) – Is None when there are nodes below this one, which were not aggregated as technical replicates

  • children (Dict[str, DataNode]) – Maps name of a child to a child node

See also

DataTree()

A class to help construct a node structure from data

aggregate(method='mean', go_max_depth=False, index=None)
Parameters
  • method (Union[None, str, Callable]) – If None no aggregation will be applied. Otherwise needs to be accepted by pd.aggregate.

  • go_max_depth (bool) – If technical replicates were aggregated, this can be specified to use the unaggregated values instead.

  • index (Union[str, pandas.core.indexes.base.Index, None]) – Index to subset the data with. If None no index is applied

Returns

Result of the aggregation

Return type

Union[pd.Series, pd.DataFrame]

get_total_number_children(go_max_depth=False)

Gets the number of children containing data below this node. If go_max_depth is True, searches for the deepest DataNodes.

Parameters

go_max_depth (bool) – default False

Returns

The number of all children below this node

Return type

int

groupby(method='mean', go_max_depth=False, index=None)

Considers each child a group, then aggregates all children.

Parameters
  • method (Union[str, Callable]) – Will be passed to aggregate.

  • go_max_depth (bool) – Will be passed to aggregate.

  • index (Union[None, str, pandas.core.indexes.base.Index]) – Will be passed to aggregate.

Returns

Result of the grouping

Return type

data

See also

aggregate()

Will be called on each of the groups
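
A hedged sketch of how aggregate() and groupby() relate, where node stands for a DataNode whose children carry data (e.g. obtained from a DataTree):

overall_mean = node.aggregate(method="mean")    # aggregate the data below this node
per_child_means = node.groupby(method="mean")   # aggregate() applied to each child group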

class mspypeline.DataTree(root)

Data structure in which the experiment is stored. Each leaf node is a DataNode.

root

The root node of the tree

Type

DataNode

level_keys_full_name

Maps each depth level to the DataNode.full_name values of all nodes at that level

Type

Dict[int, List[str]]

__init__(root)
Parameters

root (mspypeline.modules.DataStructure.DataNode) – The root node of the Tree.

add_data(data)
Parameters

data (pandas.core.frame.DataFrame) – Data which will be used to fill the nodes with a Series. The column names of the data need to be the same as the full names of the DataNodes.

aggregate(key=None, method='mean', go_max_depth=False, index=None)
Parameters
  • key (Optional[str]) –

  • method (Union[None, str, Callable]) –

  • go_max_depth (bool) –

  • index (Optional) –

Return type

Union[pandas.core.series.Series, pandas.core.frame.DataFrame]

aggregate_technical_replicates()

Aggregates the deepest level to one level above by using aggregate

classmethod from_analysis_design(analysis_design, data=None, should_aggregate_technical_replicates=True)
Parameters
  • analysis_design (dict) – nested dict representing the analysis design of the experiment

  • data (Union[None, pandas.core.frame.DataFrame]) – Will be passed to add_data. If None no data is added

  • should_aggregate_technical_replicates (bool) – If True the lowest level of the analysis design is considered as a technical replicate and averaged

Returns

Return type

cls

See also

add_data()

will be called if data is not None

aggregate_technical_replicates()

will be called if should_aggregate_technical_replicates
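
A hedged usage sketch, assuming the analysis design is a nested dict whose leaves are the full sample names and that the DataFrame columns match those names; all names and values below are made up:

import pandas as pd
from mspypeline import DataTree

analysis_design = {
    "GroupA": {"Cond1": {"Rep1": "GroupA_Cond1_Rep1",
                         "Rep2": "GroupA_Cond1_Rep2"}},
    "GroupB": {"Cond1": {"Rep1": "GroupB_Cond1_Rep1"}},
}
intensities = pd.DataFrame(
    {"GroupA_Cond1_Rep1": [20.1, 21.3],
     "GroupA_Cond1_Rep2": [20.4, 21.0],
     "GroupB_Cond1_Rep1": [19.8, 22.1]},
    index=["ProteinA", "ProteinB"])

tree = DataTree.from_analysis_design(
    analysis_design, data=intensities,
    should_aggregate_technical_replicates=True)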

groupby(key_or_index=None, new_col_name=None, method='mean', go_max_depth=False, index=None)
Parameters
  • key_or_index (Union[None, str, int]) –

  • new_col_name (str) –

  • method (Union[None, str, Callable]) –

  • go_max_depth (bool) –

  • index

Return type

Union[pandas.core.series.Series, pandas.core.frame.DataFrame]