Settings and Configurations¶
The configuration YAML file¶
mspypeline creates a configuration YAML file at the start of the analysis. This file can be edited to further individualize the results.
If the data analysis is performed via the GUI, no further interaction with the YAML file is necessary.
Analysis settings¶
Analysis Design¶
mspypeline assumes that the data consists of samples that can be arranged into a tree structure resembling the experimental setup. Samples of an experiment are arranged in groups and subgroups according to their names. This naming convention is what makes it possible to compare distinct samples of different groups or at different levels. The analysis design can have any depth.
Warning
Different levels of the analysis design must be separated by an underscore (_).
All samples must have the same number of levels (that is, the same number of underscores).
Example Analysis Design
In [1]: from mspypeline.helpers import get_analysis_design
In [2]: from pprint import pprint
In [3]: samples = [f"Cancer_Line{l}_Rep{r}" for l in range(1, 3) for r in range(1, 4)] + [
...: f"Control_Line{l}_Rep{r}" for l in range(1, 3) for r in range(1, 4)]
...:
In [4]: analysis_design = get_analysis_design(samples)
In [5]: pprint(analysis_design)
{'Cancer': {'Line1': {'Rep1': 'Cancer_Line1_Rep1',
                      'Rep2': 'Cancer_Line1_Rep2',
                      'Rep3': 'Cancer_Line1_Rep3'},
            'Line2': {'Rep1': 'Cancer_Line2_Rep1',
                      'Rep2': 'Cancer_Line2_Rep2',
                      'Rep3': 'Cancer_Line2_Rep3'}},
 'Control': {'Line1': {'Rep1': 'Control_Line1_Rep1',
                       'Rep2': 'Control_Line1_Rep2',
                       'Rep3': 'Control_Line1_Rep3'},
             'Line2': {'Rep1': 'Control_Line2_Rep1',
                       'Rep2': 'Control_Line2_Rep2',
                       'Rep3': 'Control_Line2_Rep3'}}}
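The grouping rule can also be illustrated without the library. The following is a minimal sketch, not mspypeline's implementation: `build_design` is a hypothetical helper that splits each sample name on underscores, checks that all names have the same number of levels, and nests the names into a dictionary shaped like the output above.

```python
def build_design(sample_names):
    """Nest sample names into a tree, one level per underscore-separated part.

    Hypothetical helper for illustration; not part of mspypeline.
    """
    # All samples must have the same number of underscores.
    depths = {name.count("_") for name in sample_names}
    if len(depths) > 1:
        raise ValueError(f"samples have differing numbers of levels: {sorted(depths)}")

    design = {}
    for name in sample_names:
        *groups, leaf = name.split("_")
        node = design
        for part in groups:
            node = node.setdefault(part, {})
        node[leaf] = name  # the leaf stores the full sample name
    return design


samples = [f"{group}_Line{l}_Rep{r}"
           for group in ("Cancer", "Control")
           for l in range(1, 3)
           for r in range(1, 4)]
design = build_design(samples)
```

A name such as `Cancer_Line_1_Rep3` mixed in with the others would raise the `ValueError`, because it has three underscores instead of two.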
Sample Mapping¶
Default sample_mapping_template.txt file: This file is created automatically if the sample names do not follow the naming convention. It already provides the general structure of the sample mapping, which consists of two columns: old name, which is already filled in, and new name, which must be filled in with the desired new sample name. The file then needs to be renamed to sample_mapping.txt.
Manual sample_mapping.txt file: The file must be tab-separated with two columns and saved on the same level as the config directory. The first column, named old name, should contain the sample name of the MS run. The second column, named new name, should contain a name that follows the naming convention.
old name | new name
---|---
Cancer_Line1-Rep1 | Cancer_Line1_Rep1
Cance_Line1_Rep2 | Cancer_Line1_Rep2
Cancer_Line_1_Rep3 | Cancer_Line1_Rep3
Cancer_Line2_Rep1 | Cancer_Line2_Rep1
Cancer_Line2_Rep2 | Cancer_Line2_Rep2
Cancer… | Cancer..
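As a rough illustration of the file format, the sketch below parses a tab-separated mapping with Python's standard `csv` module. It does not show how mspypeline reads the file: `read_sample_mapping` is a hypothetical helper, and the text is held in a string so the example runs without a file on disk.

```python
import csv
import io

# Stand-in for the contents of sample_mapping.txt (tab-separated, two columns).
mapping_text = (
    "old name\tnew name\n"
    "Cancer_Line1-Rep1\tCancer_Line1_Rep1\n"
    "Cance_Line1_Rep2\tCancer_Line1_Rep2\n"
    "Cancer_Line_1_Rep3\tCancer_Line1_Rep3\n"
)


def read_sample_mapping(handle):
    """Return {old name: new name} from a tab-separated mapping file.

    Hypothetical helper for illustration; not part of mspypeline.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["old name"]: row["new name"] for row in reader}


mapping = read_sample_mapping(io.StringIO(mapping_text))

# Every new name should have the same number of underscore-separated levels.
levels = {new_name.count("_") for new_name in mapping.values()}
```

If `levels` contains more than one value, at least one new name still breaks the naming convention.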
Technical Replicates¶
In [6]: import numpy as np
In [7]: import pandas as pd
In [8]: from mspypeline import DataTree
In [9]: data = pd.DataFrame(np.exp2(np.random.normal(26, 2, (3, 12))).astype(int), columns=samples)
In [10]: tree_agg = DataTree.from_analysis_design(analysis_design, data, True)
In [11]: tree_no_agg = DataTree.from_analysis_design(analysis_design, data, False)
In [12]: tree_no_agg.aggregate(None, None)
Out[12]:
Cancer_Line1_Rep1 Cancer_Line1_Rep2 ... Control_Line2_Rep2 Control_Line2_Rep3
0 702353563 20768149 ... 1140652832 8789414
1 167620185 4569444 ... 823925361 37301454
2 102764624 39249017 ... 182945746 76382219
[3 rows x 12 columns]
In [13]: tree_agg.aggregate(None, None)
Out[13]:
Cancer_Line1 Cancer_Line2 Control_Line1 Control_Line2
0 3.076896e+08 6.216679e+07 3.775543e+07 3.893132e+08
1 5.877328e+07 1.220655e+08 1.450985e+08 2.922692e+08
2 6.026846e+07 5.728456e+08 1.846741e+08 1.576707e+08
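In the session above, the tree built with `True` returns one column per `Cancer_Line…`/`Control_Line…` pair, while the tree built with `False` keeps all twelve replicate columns. The sketch below mimics that collapsing of the last name level in plain Python. It assumes that replicates are combined by their arithmetic mean, which the text does not state, and `aggregate_replicates` is a hypothetical helper rather than a mspypeline function.

```python
from collections import defaultdict
from statistics import mean


def aggregate_replicates(intensities):
    """Combine columns that differ only in their last name level.

    `intensities` maps a sample name to a list of values (one per protein).
    ASSUMPTION: replicates are combined by arithmetic mean.
    """
    grouped = defaultdict(list)
    for sample, values in intensities.items():
        parent = sample.rsplit("_", 1)[0]  # drop the replicate level
        grouped[parent].append(values)
    # zip(*columns) walks the replicate columns row by row.
    return {parent: [mean(row) for row in zip(*columns)]
            for parent, columns in grouped.items()}


data = {
    "Cancer_Line1_Rep1": [10, 40],
    "Cancer_Line1_Rep2": [20, 50],
    "Cancer_Line1_Rep3": [30, 60],
    "Control_Line1_Rep1": [1, 2],
    "Control_Line1_Rep2": [3, 4],
}
aggregated = aggregate_replicates(data)
```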
Thresholds and Comparisons¶
Unique in A: above threshold in A and completely absent in B
Unique in B: above threshold in B and completely absent in A
Can be compared: above threshold in A and B
Otherwise: not considered
In [14]: import matplotlib.pyplot as plt
In [15]: from mspypeline.helpers import get_number_of_non_na_values, get_non_na_percentage
In [16]: x_values = [x for x in range(1, 31)]
In [17]: y_values = [get_number_of_non_na_values(x) for x in x_values]
In [18]: fig, ax = plt.subplots(1,1, figsize=(7,5))
In [19]: ax.plot(x_values, y_values, marker=".");
In [20]: ax.set_xticks(x_values);
In [21]: ax.set_yticks(y_values);
In [22]: ax.set_xlabel("Number of samples");
In [23]: ax.set_ylabel("Required number of non na (missing) values");

The threshold has a minimum of 3 non-missing values and increases steadily with the number of samples per group.
An example: Group A has 7 samples, Group B has 8 samples.
Unique in A: Group A has 5 or more non-missing values and Group B has only missing values
Unique in B: Group B has 6 or more non-missing values and Group A has only missing values
Can be compared: Group A has 5 or more non-missing values and Group B has 6 or more non-missing values
Not considered: in all other cases
This threshold criterion is strict, but it makes the results more reliable.
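The four cases of the example can be written as a small decision function. This is only a sketch of the rules as stated here: `classify` is a hypothetical helper, and the thresholds are passed in as arguments (5 and 6, taken from the example) rather than computed.

```python
def classify(non_missing_a, non_missing_b, threshold_a, threshold_b):
    """Classify one protein for a comparison of group A against group B.

    Hypothetical helper for illustration; not part of mspypeline.
    """
    above_a = non_missing_a >= threshold_a
    above_b = non_missing_b >= threshold_b
    if above_a and above_b:
        return "can be compared"
    if above_a and non_missing_b == 0:
        return "unique in A"
    if above_b and non_missing_a == 0:
        return "unique in B"
    return "not considered"


# Thresholds from the example: 5 of 7 samples in A, 6 of 8 samples in B.
results = [
    classify(5, 0, 5, 6),  # enough values in A, none in B
    classify(0, 7, 5, 6),  # none in A, enough values in B
    classify(6, 6, 5, 6),  # enough values in both groups
    classify(6, 3, 5, 6),  # B is neither above threshold nor empty
]
```

The last call shows why the criterion is strict: a protein with 6 values in A and 3 in B is neither comparable nor unique, so it is not considered.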
The next plot shows the required percentage of non zero values as a function of the number of samples in a group.
In [24]: y_values = [get_non_na_percentage(x) for x in x_values]
In [25]: fig, ax = plt.subplots(1,1, figsize=(7,5))
In [26]: ax.plot(x_values, y_values, marker=".");
In [27]: ax.set_ylim(0, 1);
In [28]: ax.set_xlabel("Number of samples");
In [29]: ax.set_xticks(x_values);
In [30]: ax.set_ylabel("Required percentage/100 of non zero values");
