File Readers

Import with:

from mspypeline import BaseReader, MQReader

BaseReader

class mspypeline.BaseReader(start_dir, reader_config, loglevel=10)

Base reader that provides a data dictionary with keys to the data. The data is stored on disk and is therefore only loaded on demand. This is the parent class of any file reader that will be used to preprocess data into the internal format.

Example

>>> # example for a new custom reader
>>> # assumes BasePlotter is exported at the package top level, like BaseReader
>>> from mspypeline import BaseReader, BasePlotter
>>> class CustomReader(BaseReader):
...     name = "reader"  # this is the name of the reader in the yaml file
...     required_files = []  # list of strings naming all files that should be parsed
...     plotter = BasePlotter
...
...     def __init__(self, start_dir, reader_config, loglevel=10):
...         super().__init__(start_dir, reader_config, loglevel)
...         for file in self.required_files:
...             self.full_data[file] = [0, 0, 10]  # this should be the data from the file
>>> r = CustomReader("", {}, 10)
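
The on-demand loading mentioned in the class description can be pictured with a small, self-contained sketch. The LazyDataDict class and the load_example function below are hypothetical names chosen for illustration; they are not taken from mspypeline and only show one common way to defer loading until a key is first requested.

```python
class LazyDataDict(dict):
    """Hypothetical dictionary that loads a value the first time its key is requested."""

    def __init__(self, loaders):
        super().__init__()
        # maps a key to a zero-argument function that produces the data
        self._loaders = loaders

    def __missing__(self, key):
        # called by dict.__getitem__ only when the key is not stored yet
        if key not in self._loaders:
            raise KeyError(key)
        value = self._loaders[key]()
        self[key] = value  # cache the result so the loader runs at most once per key
        return value


calls = []  # records every time the loader actually runs


def load_example():
    # stand-in for reading and preprocessing a file from disk
    calls.append("example")
    return [0, 0, 10]


data = LazyDataDict({"example": load_example})
first = data["example"]   # triggers the loader
second = data["example"]  # served from the cache, loader is not called again
print(first, second, calls)
```

Requesting the same key twice runs the loader only once, which is the behaviour that makes large files cheap to reference until they are actually needed.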

__init__(start_dir, reader_config, loglevel=10)

Parameters

start_dir (str) – path to the directory (or txt folder) in which the data can be found.
reader_config (dict) – mapping of the file reader configuration (e.g. as given in the config.yml file)
loglevel (int) – level of the logger

MQReader

class mspypeline.MQReader(start_dir, reader_config, index_col='Gene name', duplicate_handling='sum', drop_columns=None, loglevel=10)

A child class of the BaseReader. The MQReader preprocesses data from MaxQuant files into the internal data format to provide the correct input for the plotters. The file required to start the MQReader is the proteinGroups.txt file from MaxQuant. Additionally, the file reader can preprocess the evidence, msmsScans, msScans, parameters, peptides and summary txt files from the MaxQuant output. The reader also recognizes a sample_mapping.txt file, if provided, and uses it to correct the sample naming, for instance when the naming convention is violated (see Analysis Design).

__init__(start_dir, reader_config, index_col='Gene name', duplicate_handling='sum', drop_columns=None, loglevel=10)

Parameters

start_dir (str) – path to the directory (or txt folder) in which the data can be found.
reader_config (dict) – mapping of the file reader configuration (e.g. as given in the config.yml file)
index_col (str) – identifier type used to index the proteins detected in the proteinGroups.txt file. If provided in reader_config, the value is taken from there.
duplicate_handling (str) – how proteins with a duplicate index_col are treated; can be “sum” or “drop”. If provided in reader_config, the value is taken from there.
drop_columns (Union[list, tuple, str]) – samples to be excluded from the analysis. If provided in reader_config, the value is taken from there.
loglevel (int) – level of the logger
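
The interplay of reader_config and duplicate_handling described above can be sketched in plain Python. This is not the MQReader implementation: resolve_option and handle_duplicates are hypothetical helpers, the configuration key is assumed to carry the same name as the keyword argument, and “drop” is assumed to discard every row that shares a duplicated index value (whether the library instead keeps one of them is not stated here).

```python
from collections import defaultdict


def resolve_option(reader_config, key, default):
    # assumption: a value present in reader_config overrides the keyword argument
    return reader_config.get(key, default)


def handle_duplicates(rows, index_col, duplicate_handling):
    """Collapse rows (dicts with one index column and numeric columns) by index_col."""
    if duplicate_handling not in ("sum", "drop"):
        raise ValueError(f"unknown duplicate_handling: {duplicate_handling!r}")

    # group all rows that share the same index value
    groups = defaultdict(list)
    for row in rows:
        groups[row[index_col]].append(row)

    result = {}
    for name, members in groups.items():
        if len(members) > 1 and duplicate_handling == "drop":
            continue  # assumption: all rows of a duplicated index are discarded
        # a single row is just the sum over one member
        summed = defaultdict(float)
        for member in members:
            for column, value in member.items():
                if column != index_col:
                    summed[column] += value
        result[name] = dict(summed)
    return result


# made-up intensities for illustration
rows = [
    {"Gene name": "ACTB", "Intensity A": 10.0, "Intensity B": 4.0},
    {"Gene name": "ACTB", "Intensity A": 5.0, "Intensity B": 1.0},
    {"Gene name": "GAPDH", "Intensity A": 7.0, "Intensity B": 3.0},
]

reader_config = {"duplicate_handling": "drop"}
mode = resolve_option(reader_config, "duplicate_handling", "sum")

summed = handle_duplicates(rows, "Gene name", "sum")
dropped = handle_duplicates(rows, "Gene name", mode)
print(mode, summed, dropped)
```

With “sum”, the two ACTB rows are merged into one entry; with the configured “drop”, only GAPDH remains.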

plotter: alias of mspypeline.core.MSPPlots.MaxQuantPlotter.MaxQuantPlotter

preprocess_contaminants()

Preprocess the proteinGroups.txt file into the internal format and return a DataFrame containing only the proteins marked as contaminants.

Contaminants are defined as proteins marked as “Only identified by site”, “Reverse” or “Potential contaminant” in the proteinGroups.txt file.

Returns: DataFrame containing preprocessed data of contaminants from proteinGroups.txt file
Return type: DataFrame
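
The contaminant definition above amounts to a row filter on three flag columns. The following sketch uses only the standard library and a made-up excerpt; it assumes that flagged rows carry a “+” in the respective column, and split_contaminants is a hypothetical helper rather than part of mspypeline. It returns both halves of the split, corresponding to what preprocess_proteinGroups and preprocess_contaminants are documented to return.

```python
import csv
import io

FLAG_COLUMNS = ("Only identified by site", "Reverse", "Potential contaminant")

# made-up excerpt; a real proteinGroups.txt has many more columns
text = (
    "Gene names\tOnly identified by site\tReverse\tPotential contaminant\n"
    "ACTB\t\t\t\n"
    "KRT1\t\t\t+\n"
    "REV__X\t\t+\t\n"
    "GAPDH\t\t\t\n"
)


def split_contaminants(handle):
    """Return (kept, contaminants) as two lists of gene names."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept, contaminants = [], []
    for row in reader:
        # a row counts as contaminant if any of the flag columns is set
        flagged = any(row[column] == "+" for column in FLAG_COLUMNS)
        (contaminants if flagged else kept).append(row["Gene names"])
    return kept, contaminants


kept, contaminants = split_contaminants(io.StringIO(text))
print(kept)
print(contaminants)
```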

preprocess_evidence()

Preprocess the evidence.txt file into the internal format and return a DataFrame containing only the peptides not marked as contaminants.

Contaminants are defined as peptides marked as “Reverse” or “Potential contaminant” in the evidence.txt file.

Returns: DataFrame containing preprocessed data from evidence.txt file
Return type: DataFrame

preprocess_msScans()

Preprocess the msScans.txt file into the internal format and return it as a DataFrame.

Only the columns “Raw file”, “Total ion current” and “Retention time” are read in.

Returns: DataFrame containing preprocessed data from msScans.txt file
Return type: DataFrame
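
Reading only a subset of columns keeps memory use low for large scan tables. A minimal standard-library sketch of the idea follows; the excerpt is made up, the extra “Scan number” column is only there to show that it is discarded, and read_columns is a hypothetical helper, not the reader's actual code.

```python
import csv
import io

WANTED = ("Raw file", "Total ion current", "Retention time")

# made-up excerpt of a tab-separated scan table
text = (
    "Raw file\tScan number\tRetention time\tTotal ion current\n"
    "run_01\t1\t0.05\t120000\n"
    "run_01\t2\t0.11\t135000\n"
)


def read_columns(handle, wanted):
    """Return one dict per row that contains only the wanted columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [column for column in wanted if column not in reader.fieldnames]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    # values stay strings here; convert them as needed downstream
    return [{column: row[column] for column in wanted} for row in reader]


rows = read_columns(io.StringIO(text), WANTED)
print(rows)
```

With pandas, the same selection is usually expressed through the usecols argument of read_csv.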

preprocess_msmsScans()

Preprocess the msmsScans.txt file into the internal format and return it as a DataFrame.

Only the columns “Raw file”, “Total ion current” and “Retention time” are read in.

Returns: DataFrame containing preprocessed data from msmsScans.txt file
Return type: DataFrame

preprocess_parameters()

Preprocess the parameters.txt file into the internal format and return it as a DataFrame.

Returns: DataFrame containing preprocessed data from parameters.txt file
Return type: DataFrame

preprocess_peptides()

Preprocess the peptides.txt file into the internal format and return a DataFrame containing only the peptides not marked as contaminants.

Contaminants are defined as peptides marked as “Reverse” or “Potential contaminant” in the peptides.txt file.

Returns: DataFrame containing preprocessed data from peptides.txt file
Return type: DataFrame

preprocess_proteinGroups()

Preprocess the proteinGroups.txt file into the internal format and return a DataFrame containing only the proteins not marked as contaminants.

Contaminants are defined as proteins marked as “Only identified by site”, “Reverse” or “Potential contaminant” in the proteinGroups.txt file.

Returns: DataFrame containing preprocessed data from proteinGroups.txt file
Return type: DataFrame

preprocess_summary()

Preprocess the summary.txt file into the internal format and return it as a DataFrame.

Returns: DataFrame containing preprocessed data from summary.txt file
Return type: DataFrame