File Readers

Import with:

from mspypeline import BaseReader, MQReader

BaseReader

class mspypeline.BaseReader(start_dir, reader_config, loglevel=10)
Base reader that provides a data dictionary with keys to the data. The data is stored on disk and is only loaded on demand. This is the parent class of any file reader that will be used to preprocess data into the internal format.

Example

>>> # example of a new custom reader
>>> from mspypeline import BaseReader, BasePlotter
>>> class CustomReader(BaseReader):
...     name = "reader"  # this is the name of the reader in the yaml file
...     required_files = []  # list of strings naming all files that should be parsed
...     plotter = BasePlotter
...
...     def __init__(self, start_dir, reader_config, loglevel=10):
...         super().__init__(start_dir, reader_config, loglevel)
...         for file in CustomReader.required_files:
...             self.full_data[file] = [0, 0, 10]  # this should be the data from the file
>>> r = CustomReader("", {}, 10)
__init__(start_dir, reader_config, loglevel=10)
Parameters
  • start_dir (str) – location of the directory/txt folder containing the data.

  • reader_config (dict) – mapping of the file reader configuration (e.g. as given in the config.yml file)

  • loglevel (int) – level of the logger
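
The on-demand loading described for the data dictionary can be pictured with a small stand-alone sketch. The class LazyData and the loader function below are hypothetical names invented for illustration; this is not the mspypeline implementation, only one plain-Python way to defer loading until a key is first accessed.

```python
from collections.abc import Mapping


class LazyData(Mapping):
    """Hypothetical read-only mapping that runs a loader on first access to a key."""

    def __init__(self, loaders):
        self._loaders = dict(loaders)  # key -> zero-argument callable
        self._cache = {}               # key -> value that was already loaded

    def __getitem__(self, key):
        if key not in self._cache:
            # An unknown key raises KeyError here, as expected from a mapping.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __iter__(self):
        return iter(self._loaders)

    def __len__(self):
        return len(self._loaders)


calls = []


def load_protein_groups():
    calls.append("proteinGroups")  # record that the (expensive) load happened
    return [0, 0, 10]              # stand-in for the parsed file contents


data = LazyData({"proteinGroups": load_protein_groups})
```

Nothing is loaded when data is created; the loader runs once on the first data["proteinGroups"] lookup, and later lookups reuse the cached value.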

MQReader

class mspypeline.MQReader(start_dir, reader_config, index_col='Gene name', duplicate_handling='sum', drop_columns=None, loglevel=10)
A child class of the BaseReader.
The MQReader preprocesses data from MaxQuant files into the internal data format to provide the correct input for the plotters. The file required to start the MQReader is the proteinGroups.txt file from MaxQuant.
Additionally, the file reader can preprocess the evidence, msmsScans, msScans, parameters, peptides and summary txt files from the MaxQuant output.
The reader also recognizes a sample_mapping.txt file if one is provided and uses it to correct the sample names, for instance when they violate the naming convention (see Analysis Design).
__init__(start_dir, reader_config, index_col='Gene name', duplicate_handling='sum', drop_columns=None, loglevel=10)
Parameters
  • start_dir (str) – location of the directory/txt folder containing the data.

  • reader_config (dict) – mapping of the file reader configuration (e.g. as given in the config.yml file)

  • index_col (str) – identifier type by which detected proteins in the proteinGroups.txt file are indexed. If provided in the reader_config, the value is taken from there.

  • duplicate_handling (str) – how proteins with a duplicate index_col are treated; can be “sum” or “drop”. If provided in the reader_config, the value is taken from there.

  • drop_columns (Union[list, tuple, str]) – samples to be excluded from the analysis. If provided in the reader_config, the value is taken from there.

  • loglevel (int) – level of the logger
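
To make the two duplicate_handling options concrete, here is a minimal plain-Python sketch. The function handle_duplicates and the sample rows are hypothetical, and the interpretation of “drop” (discarding every entry whose key occurs more than once) is an assumption, not confirmed behaviour of the MQReader, which works on DataFrames. The last lines show how a value from reader_config could take precedence over the default.

```python
def handle_duplicates(rows, index_col="Gene name", duplicate_handling="sum"):
    """Collapse rows that share the same index_col value (illustrative only).

    rows: list of dicts holding index_col plus numeric intensity columns.
    "sum":  add up the numeric columns of duplicated entries.
    "drop": discard all entries whose key occurs more than once (assumed meaning).
    """
    if duplicate_handling not in ("sum", "drop"):
        raise ValueError(f"unknown duplicate_handling: {duplicate_handling!r}")

    # Group the rows by their index value, preserving first-seen order.
    grouped = {}
    for row in rows:
        grouped.setdefault(row[index_col], []).append(row)

    result = {}
    for key, members in grouped.items():
        columns = [c for c in members[0] if c != index_col]
        if len(members) == 1:
            result[key] = {c: members[0][c] for c in columns}
        elif duplicate_handling == "sum":
            result[key] = {c: sum(m[c] for m in members) for c in columns}
        # for "drop", duplicated keys are simply left out
    return result


rows = [
    {"Gene name": "ACTB", "Intensity A": 10, "Intensity B": 5},
    {"Gene name": "ACTB", "Intensity A": 2, "Intensity B": 1},
    {"Gene name": "GAPDH", "Intensity A": 7, "Intensity B": 3},
]

summed = handle_duplicates(rows, duplicate_handling="sum")

# A value given in the reader configuration overrides the keyword default.
reader_config = {"duplicate_handling": "drop"}
mode = reader_config.get("duplicate_handling", "sum")
dropped = handle_duplicates(rows, duplicate_handling=mode)
```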

plotter

alias of mspypeline.core.MSPPlots.MaxQuantPlotter.MaxQuantPlotter

preprocess_contaminants()
Preprocess the proteinGroups.txt file to internal format and return DataFrame with all those proteins marked as contaminant.
Contaminants are defined as those proteins “Only identified by site”, marked as “Reverse” or as “Potential contaminant” in the proteinGroups.txt file.
Returns

DataFrame containing preprocessed data of contaminants from proteinGroups.txt file

Return type

DataFrame
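
The contaminant definition above can be sketched with the standard library alone. The helper split_contaminants and the sample table are made up for illustration, and the sketch assumes that flagged rows carry a “+” in the respective column, as is typical for MaxQuant output; it does not reproduce the reader's actual preprocessing.

```python
import csv
import io

# Columns that mark a protein group as a contaminant, per the definition above.
CONTAMINANT_COLUMNS = ("Only identified by site", "Reverse", "Potential contaminant")


def split_contaminants(lines):
    """Split tab-separated rows into (kept, contaminants) lists of dicts."""
    reader = csv.DictReader(lines, delimiter="\t")
    kept, contaminants = [], []
    for row in reader:
        # Assumption: a "+" in any of the three columns flags the row.
        flagged = any(row.get(col, "") == "+" for col in CONTAMINANT_COLUMNS)
        (contaminants if flagged else kept).append(row)
    return kept, contaminants


# Tiny made-up stand-in for a proteinGroups.txt file.
sample = io.StringIO(
    "Protein IDs\tOnly identified by site\tReverse\tPotential contaminant\n"
    "P1\t\t\t\n"
    "CON__P2\t\t\t+\n"
    "REV__P3\t\t+\t\n"
)

kept, contaminants = split_contaminants(sample)
```

In this sketch, kept corresponds to what preprocess_proteinGroups() is described as returning and contaminants to the result of preprocess_contaminants().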

preprocess_evidence()
Preprocess the evidence.txt file to internal format and return DataFrame with all those peptides not marked as contaminant.
Contaminants are defined as those peptides marked as “Reverse” or as “Potential contaminant” in the evidence.txt file.
Returns

DataFrame containing preprocessed data from evidence.txt file

Return type

DataFrame

preprocess_msScans()
Preprocess the msScans.txt file to internal format and return DataFrame.
Only columns “Raw file”, “Total ion current” and “Retention time” are read in.
Returns

DataFrame containing preprocessed data from msScans.txt file

Return type

DataFrame
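
Reading only a subset of columns, as described above, might look like the following sketch. The helper read_selected_columns and the sample data are hypothetical, values are kept as strings for simplicity, and the actual reader returns a DataFrame rather than a list of dicts.

```python
import csv
import io

# The three columns named in the description of preprocess_msScans().
MS_SCANS_COLUMNS = ("Raw file", "Total ion current", "Retention time")


def read_selected_columns(lines, columns=MS_SCANS_COLUMNS):
    """Return one dict per row, restricted to the requested columns."""
    reader = csv.DictReader(lines, delimiter="\t")
    available = reader.fieldnames or []
    missing = [c for c in columns if c not in available]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return [{c: row[c] for c in columns} for row in reader]


# Made-up stand-in for msScans.txt with one extra column that is ignored.
sample = io.StringIO(
    "Raw file\tScan number\tTotal ion current\tRetention time\n"
    "run1\t1\t1500.0\t0.01\n"
    "run1\t2\t1800.5\t0.02\n"
)

records = read_selected_columns(sample)
```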

preprocess_msmsScans()
Preprocess the msmsScans.txt file to internal format and return DataFrame.
Only columns “Raw file”, “Total ion current” and “Retention time” are read in.
Returns

DataFrame containing preprocessed data from msmsScans.txt file

Return type

DataFrame

preprocess_parameters()
Preprocess the parameters.txt file to internal format and return DataFrame.
Returns

DataFrame containing preprocessed data from parameters.txt file

Return type

DataFrame

preprocess_peptides()
Preprocess the peptides.txt file to internal format and return DataFrame with all those peptides not marked as contaminant.
Contaminants are defined as those peptides marked as “Reverse” or as “Potential contaminant” in the peptides.txt file.
Returns

DataFrame containing preprocessed data from peptides.txt file

Return type

DataFrame

preprocess_proteinGroups()
Preprocess the proteinGroups.txt file to internal format and return DataFrame with all those proteins not marked as contaminant.
Contaminants are defined as those proteins “Only identified by site”, marked as “Reverse” or as “Potential contaminant” in the proteinGroups.txt file.
Returns

DataFrame containing preprocessed data from proteinGroups.txt file

Return type

DataFrame

preprocess_summary()
Preprocess the summary.txt file to internal format and return DataFrame.
Returns

DataFrame containing preprocessed data from summary.txt file

Return type

DataFrame