Analytics¶
analytics.py¶
-
unit_vector(vector)[source]¶ Returns the unit vector of the vector.
- Parameters
vector (tuple) – vector
- Returns
Unit vector (tuple).
-
flatten(t, my_list=[])[source]¶ Code from: https://gist.github.com/shaxbee/0ada767debf9eefbdb6e Acknowledgements: Zbigniew Mandziejewicz (shaxbee) Generator flattening the structure
>>> list(flatten([2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]])) [2, 2, 4, 5, 7, 2, 6, 2, 6, 6, 4, 6]
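A minimal sketch of how such a flattening generator could be written. This is not the code from the linked gist, and the name `flatten_sketch` is hypothetical; it only illustrates the recursive idea on the example above.

```python
from collections.abc import Iterable


def flatten_sketch(items):
    """Yield the leaves of an arbitrarily nested list/tuple structure."""
    for item in items:
        # Strings are iterable too, but are treated as leaves here.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten_sketch(item)
        else:
            yield item


nested = [2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]
flat = list(flatten_sketch(nested))
```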
-
angle_between(v1, v2)[source]¶ Returns the angle in radians between vectors ‘v1’ and ‘v2’
- Parameters
- Return float angle
angle between two vectors in radians
- Example::
angle = angle_between((1, 0, 0), (0, 1, 0))
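The computation behind `unit_vector` and `angle_between` can be sketched with the standard library alone. The function names below are hypothetical and the library's own implementation may differ (for example, it may rely on NumPy).

```python
import math


def unit_vector_sketch(vector):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / norm for c in vector)


def angle_between_sketch(v1, v2):
    """Angle in radians between two non-zero vectors."""
    u1, u2 = unit_vector_sketch(v1), unit_vector_sketch(v2)
    dot = sum(a * b for a, b in zip(u1, u2))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot)))


angle = angle_between_sketch((1, 0, 0), (0, 1, 0))
```

For the two orthogonal vectors used here the result is pi/2 radians.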
-
transform_into_wide_format(data, index, columns, values, extra=[])[source]¶ This function converts a Pandas DataFrame from long to wide format using pandas pivot_table() function.
- Parameters
- Returns
Wide-format pandas DataFrame
Example:
result = transform_into_wide_format(df, index='index', columns='x', values='y', extra='group')
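As a rough illustration of the long-to-wide conversion, the sketch below calls `pivot_table()` directly on a small made-up DataFrame. How the library handles the `extra` columns is an assumption here (a join on the index column); the data values are invented.

```python
import pandas as pd

long_df = pd.DataFrame({
    "index": ["s1", "s1", "s2", "s2"],
    "group": ["g1", "g1", "g2", "g2"],
    "x": ["P1", "P2", "P1", "P2"],
    "y": [10.0, 20.0, 30.0, 40.0],
})

# One row per 'index' value, one column per unique value of 'x'.
wide = long_df.pivot_table(index="index", columns="x", values="y")
wide = wide.reset_index()
wide.columns.name = None

# Re-attach the extra annotation column (here 'group') by joining on 'index'.
# This join is an assumption about how 'extra' could be handled.
extra = long_df[["index", "group"]].drop_duplicates()
wide = wide.merge(extra, on="index")
```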
-
transform_into_long_format(data, drop_columns, group, columns=['name', 'y'])[source]¶ Converts a pandas DataFrame from wide to long format using the pd.melt() function.
- Parameters
- Returns
Long-format Pandas DataFrame.
Example:
result = transform_into_long_format(df, drop_columns=['sample', 'subject'], group='group', columns=['name','y'])
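The wide-to-long step can be sketched with `pd.melt()` on an invented DataFrame. The column names mirror the example above; the exact order of operations inside the library is an assumption.

```python
import pandas as pd

wide_df = pd.DataFrame({
    "sample": ["s1", "s2"],
    "subject": ["a", "b"],
    "group": ["g1", "g2"],
    "P1": [10.0, 30.0],
    "P2": [20.0, 40.0],
})

# Drop the annotation columns that are not needed, then melt the rest,
# keeping 'group' as the identifier variable.
trimmed = wide_df.drop(columns=["sample", "subject"])
long_df = pd.melt(trimmed, id_vars="group", var_name="name", value_name="y")
```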
-
get_ranking_with_markers(data, drop_columns, group, columns, list_markers, annotation={})[source]¶ This function creates a long-format dataframe with features and values to be plotted together with disease biomarker annotations.
- Parameters
data – wide-format Pandas DataFrame with samples as rows and features as columns
drop_columns (list) – columns to be deleted
group (str) – column to use as identifier variables
columns (list) – names to use for the 1)variable column, and for the 2)value column
list_markers (list) – list of features from data, known to be markers associated to disease.
annotation (dict) – markers, from list_markers, and associated diseases.
- Returns
Long-format pandas DataFrame with group identifiers as rows and columns: ‘name’ (identifier), ‘y’ (LFQ intensity), ‘symbol’ and ‘size’.
Example:
result = get_ranking_with_markers(data, drop_columns=['sample', 'subject'], group='group', columns=['name', 'y'], list_markers=list_markers, annotation={})
-
extract_number_missing(data, min_valid, drop_cols=['sample'], group='group')[source]¶ Counts how many valid values exist in each column and filters column labels with more valid values than the minimum threshold defined.
- Parameters
data – pandas DataFrame with group as rows and protein identifier as column.
group (str) – column label containing group identifiers. If None, number of valid values is counted across all samples, otherwise is counted per unique group identifier.
min_valid (int) – minimum number of valid values to be filtered.
drop_cols (list) – column labels to be dropped.
- Returns
List of column labels above the threshold.
Example:
result = extract_number_missing(data, min_valid=3, drop_cols=['sample'], group='group')
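A small sketch of counting valid values per group with pandas. The data are invented, and the rule used to combine groups (keep a feature if any group reaches the threshold, compared with `>=`) is an assumption; the library may combine groups or compare differently.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "group": ["g1", "g1", "g2", "g2"],
    "sample": ["s1", "s2", "s3", "s4"],
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [1.0, np.nan, np.nan, np.nan],
    "P3": [np.nan, np.nan, 5.0, 6.0],
})
min_valid = 2

# Count non-missing values per group for every feature column.
counts = data.drop(columns=["sample"]).groupby("group").count()

# Keep a feature if at least one group reaches the threshold
# (an assumption made for this sketch).
keep = counts.columns[(counts >= min_valid).any(axis=0)].tolist()
```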
-
extract_percentage_missing(data, missing_max, drop_cols=['sample'], group='group', how='all')[source]¶ Extracts the ratio of missing/valid values in each column and filters column labels with a lower ratio than the maximum threshold defined (missing_max).
- Parameters
data – pandas dataframe with group as rows and protein identifier as column.
group (str) – column label containing group identifiers. If None, ratio is calculated across all samples, otherwise is calculated per unique group identifier.
missing_max (float) – maximum ratio of missing/valid values to be filtered.
how (str) – define if labels with a higher percentage of missing values than the threshold in any group (‘any’) or in all groups (‘all’) should be filtered
- Returns
List of column labels below the threshold.
- Example::
result = extract_percentage_missing(data, missing_max=0.3, drop_cols=['sample'], group='group')
-
imputation_KNN(data, drop_cols=['group', 'sample', 'subject'], group='group', cutoff=0.6, alone=True)[source]¶ k-Nearest Neighbors imputation for pandas dataframes with missing data. For more information visit https://github.com/iskandr/fancyimpute/blob/master/fancyimpute/knn.py.
- Parameters
data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
group (str) – column label containing group identifiers.
drop_cols (list) – column labels to be dropped. Final dataframe should only have gene/protein/etc identifiers as columns.
cutoff (float) – minimum ratio of missing/valid values required to impute in each column.
alone (boolean) – if True removes all columns with any missing values.
- Returns
Pandas dataframe with samples as rows and protein identifiers as columns.
Example:
result = imputation_KNN(data, drop_cols=['group', 'sample', 'subject'], group='group', cutoff=0.6, alone=True)
-
imputation_mixed_norm_KNN(data, index_cols=['group', 'sample', 'subject'], shift=1.8, nstd=0.3, group='group', cutoff=0.6)[source]¶ Missing values are replaced in two steps: 1) using k-Nearest Neighbors we impute protein columns with a higher ratio of missing/valid values than the defined cutoff, 2) the remaining missing values are replaced by random numbers that are drawn from a normal distribution.
- Parameters
data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
group (str) – column label containing group identifiers.
index_cols (list) – list of column labels to be set as dataframe index.
shift (float) – specifies the amount by which the distribution used for the random numbers is shifted downwards. This is in units of the standard deviation of the valid data.
nstd (float) – defines the width of the Gaussian distribution relative to the standard deviation of measured values. A value of 0.5 would mean that the width of the distribution used for drawing random numbers is half of the standard deviation of the data.
cutoff (float) – minimum ratio of missing/valid values required to impute in each column.
- Returns
Pandas dataframe with samples as rows and protein identifiers as columns.
Example:
result = imputation_mixed_norm_KNN(data, index_cols=['group', 'sample', 'subject'], shift = 1.8, nstd = 0.3, group='group', cutoff=0.6)
-
imputation_normal_distribution(data, index_cols=['group', 'sample', 'subject'], shift=1.8, nstd=0.3)[source]¶ Missing values will be replaced by random numbers that are drawn from a normal distribution. The imputation is done for each sample (across all proteins) separately. For more information visit http://www.coxdocs.org/doku.php?id=perseus:user:activities:matrixprocessing:imputation:replacemissingfromgaussian.
- Parameters
data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
index_cols (list) – list of column labels to be set as dataframe index.
shift (float) – specifies the amount by which the distribution used for the random numbers is shifted downwards. This is in units of the standard deviation of the valid data.
nstd (float) – defines the width of the Gaussian distribution relative to the standard deviation of measured values. A value of 0.5 would mean that the width of the distribution used for drawing random numbers is half of the standard deviation of the data.
- Returns
Pandas dataframe with samples as rows and protein identifiers as columns.
Example:
result = imputation_normal_distribution(data, index_cols=['group', 'sample', 'subject'], shift = 1.8, nstd = 0.3)
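The idea of down-shifted Gaussian imputation can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the data are invented, the per-sample mean and standard deviation are computed with pandas defaults, and a fixed seed is used only to make the sketch reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

data = pd.DataFrame(
    {"P1": [20.0, 21.0, np.nan], "P2": [22.0, np.nan, 25.0],
     "P3": [24.0, 23.0, 27.0], "P4": [np.nan, 25.0, 29.0]},
    index=["s1", "s2", "s3"],
)
shift, nstd = 1.8, 0.3

imputed = data.copy()
for sample, row in data.iterrows():
    missing = row.isna()
    if not missing.any():
        continue
    mean, std = row.mean(), row.std()  # computed on valid values only
    # Down-shifted centre and narrowed width, both in units of the sample std.
    draws = rng.normal(mean - shift * std, nstd * std, size=int(missing.sum()))
    imputed.loc[sample, data.columns[missing.to_numpy()]] = draws
```

Each sample (row) is treated separately, so the imputed values follow that sample's own intensity distribution.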
-
normalize_data_per_group(data, group, method='median', normalize=None)[source]¶ This function normalizes the data by group using the selected method
- Parameters
data – DataFrame with the data to be normalized (samples x features)
group (str) – column containing the groups
method (str) – normalization method to choose among: median_polish, median, quantile, linear
normalize (str) – whether the normalization should be done by ‘features’ (columns) or ‘samples’ (rows) (default None)
- Returns
Pandas dataframe.
Example:
result = normalize_data_per_group(data, group='group', method='median')
-
normalize_data(data, method='median', normalize=None)[source]¶ This function normalizes the data using the selected method
- Parameters
data – DataFrame with the data to be normalized (samples x features)
method (string) – normalization method to choose among: median_polish, median, quantile, linear
normalize (str) – whether the normalization should be done by ‘features’ (columns) or ‘samples’ (rows) (default None)
- Returns
Pandas dataframe.
Example:
result = normalize_data(data, method='median_polish')
-
median_zero_normalization(data, normalize='samples')[source]¶ This function normalizes each sample by using its median.
- Parameters
data –
normalize (str) – whether the normalization should be done by ‘features’ (columns) or ‘samples’ (rows)
- Returns
Pandas dataframe.
- Example::
data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = median_normalization(data, normalize='samples')
result
          a         b         c
0 -1.333333  0.666667  0.666667
1 -2.666667 -3.666667  6.333333
2 -2.000000  0.000000  2.000000
3 -2.333333 -0.333333  2.666667
4 -2.000000 -2.000000  4.000000
-
median_normalization(data, normalize='samples')[source]¶ This function normalizes each sample by using its median.
- Parameters
data –
normalize (str) – whether the normalization should be done by ‘features’ (columns) or ‘samples’ (rows)
- Returns
Pandas dataframe.
- Example::
data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = median_normalization(data, normalize='samples')
result
          a         b         c
0 -1.333333  0.666667  0.666667
1 -2.666667 -3.666667  6.333333
2 -2.000000  0.000000  2.000000
3 -2.333333 -0.333333  2.666667
4 -2.000000 -2.000000  4.000000
-
zscore_normalization(data, normalize='samples')[source]¶ This function normalizes each sample by using its mean and standard deviation (mean=0, std=1).
- Parameters
data –
normalize (str) – whether the normalization should be done by ‘features’ (columns) or ‘samples’ (rows)
- Returns
Pandas dataframe.
- Example::
data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = zscore_normalization(data, normalize='samples')
result
          a         b         c
0 -1.154701  0.577350  0.577350
1 -0.484182 -0.665750  1.149932
2 -1.000000  0.000000  1.000000
3 -0.927173 -0.132453  1.059626
4 -0.577350 -0.577350  1.154701
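The values in the example above are consistent with a row-wise z-score that uses the sample standard deviation (ddof=1, the pandas default). A minimal sketch of that computation, written directly with pandas rather than with the library function:

```python
import pandas as pd

data = pd.DataFrame({"a": [2, 5, 4, 3, 3], "b": [4, 4, 6, 5, 3], "c": [4, 14, 8, 8, 9]})

# normalize='samples': centre and scale every row with its own mean and
# sample standard deviation (ddof=1).
row_mean = data.mean(axis=1)
row_std = data.std(axis=1)
zscores = data.sub(row_mean, axis=0).div(row_std, axis=0)
```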
-
median_polish_normalization(data, max_iter=250)[source]¶ This function iteratively normalizes each sample and each feature to its median until medians converge.
- Parameters
data –
max_iter (int) – number of maximum iterations to prevent infinite loop.
- Returns
Pandas dataframe.
- Example::
data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = median_polish_normalization(data, max_iter = 10)
result
     a    b     c
0  2.0  4.0   7.0
1  5.0  7.0  10.0
2  4.0  6.0   9.0
3  3.0  5.0   8.0
4  3.0  5.0   8.0
-
quantile_normalization(data)[source]¶ Applies quantile normalization to each column in pandas dataframe.
- Parameters
data – pandas dataframe with features as columns and samples as rows.
- Returns
Pandas dataframe
- Example::
data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = quantile_normalization(data)
result
     a    b    c
0  3.2  4.6  4.6
1  4.6  3.2  8.6
2  3.2  4.6  8.6
3  3.2  4.6  8.6
4  3.2  3.2  8.6
-
linear_normalization(data, method='l1', normalize='samples')[source]¶ This function scales input data to a unit norm. For more information visit https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html.
- Parameters
data – pandas dataframe with samples as rows and features as columns.
method (str) – norm to use to normalize each non-zero sample or non-zero feature (depends on axis).
normalize (str) – axis used to normalize the data along. If ‘samples’, independently normalize each sample, if ‘features’ normalize each feature.
- Returns
Pandas dataframe
- Example::
data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = linear_normalization(data, method='l1', normalize='features')
result
          a         b         c
0  0.117647  0.181818  0.093023
1  0.294118  0.181818  0.325581
2  0.235294  0.272727  0.186047
3  0.176471  0.227273  0.186047
4  0.176471  0.136364  0.209302
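The numbers in the example above correspond to dividing each feature (column) by its L1 norm, the sum of absolute values. A pandas-only sketch of that scaling, without going through scikit-learn:

```python
import pandas as pd

data = pd.DataFrame({"a": [2, 5, 4, 3, 3], "b": [4, 4, 6, 5, 3], "c": [4, 14, 8, 8, 9]})

# L1 norm of each feature (column): the sum of absolute values.
l1_norm = data.abs().sum(axis=0)
scaled = data.div(l1_norm, axis=1)
```

After scaling, the absolute values in every column sum to 1.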
-
remove_group(data)[source]¶ Removes column with label ‘group’.
- Parameters
data – pandas dataframe with one column labelled ‘group’
- Returns
Pandas dataframe
Example:
result = remove_group(data)
-
calculate_coefficient_variation(values)[source]¶ Compute the coefficient of variation, the ratio of the biased standard deviation to the mean, in percentage. For more information visit https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.variation.html.
- Parameters
values (ndarray) – numpy array
- Returns
The calculated variation along rows.
- Return type
ndarray
Example:
result = calculate_coefficient_variation(values)
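For a single series of values, the quantity described above (biased standard deviation over the mean, as a percentage) can be sketched with the standard library. The function name is hypothetical, and the library version works on NumPy arrays along rows.

```python
import statistics


def coefficient_of_variation_sketch(values):
    """Population (biased) standard deviation over the mean, in percent."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


cv = coefficient_of_variation_sketch([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Here the mean is 5 and the population standard deviation is 2, so the coefficient of variation is 40 percent.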
-
get_coefficient_variation(data, drop_columns, group, columns=['name', 'y'])[source]¶ Extracts the coefficients of variation in each group.
- Parameters
data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_columns (list) – column labels to be dropped from the dataframe
group (str) – column label containing group identifiers.
columns (list) – names to use for the variable column(s), and for the value column(s)
- Returns
Pandas dataframe with columns ‘name’ (protein identifier), ‘x’ (coefficient of variation), ‘y’ (mean) and ‘group’.
Example:
result = get_coefficient_variation(data, drop_columns=['sample', 'subject'], group='group')
-
transform_proteomics_edgelist(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', value_col='LFQ_intensity')[source]¶ Transforms a long format proteomics matrix into a wide format
- Parameters
df – long-format pandas dataframe with columns ‘group’, ‘sample’, ‘subject’, ‘identifier’ (protein), ‘name’ (gene) and ‘LFQ_intensity’.
index_cols (list) – column labels to be kept as index identifiers.
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
identifier (str) – column label containing feature identifiers.
extra_identifier (str) – column label containing additional protein identifiers (e.g. gene names).
value_col (str) – column label containing expression values.
- Returns
Pandas dataframe with samples as rows and protein identifiers (UniprotID~GeneName) as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
- Example:
df = transform_proteomics_edgelist(original, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', value_col='LFQ_intensity')
-
get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', filter_samples=False, filter_samples_percent=0.5, imputation=True, method='distribution', missing_method='percentage', missing_per_group=True, missing_max=0.3, min_valid=1, value_col='LFQ_intensity', shift=1.8, nstd=0.3, knn_cutoff=0.6, normalize=False, normalization_method='median', normalize_group=False, normalize_by=None)[source]¶ Processes proteomics data extracted from the database: 1) filter proteins with high number of missing values (> missing_max or min_valid), 2) impute missing values.
- Parameters
df – long-format pandas dataframe with columns ‘group’, ‘sample’, ‘subject’, ‘identifier’ (protein), ‘name’ (gene) and ‘LFQ_intensity’.
index_cols (list) – column labels to be kept as index identifiers.
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
identifier (str) – column label containing feature identifiers.
extra_identifier (str) – column label containing additional protein identifiers (e.g. gene names).
filter_samples (bool) – if True filter samples with valid values below percentage (filter_samples_percent).
filter_samples_percent (float) – defines the maximum percentage of missing values allowed in a sample.
imputation (bool) – if True performs imputation of missing values.
method (str) – method for missing values imputation (‘KNN’, ‘distribution’, or ‘mixed’)
missing_method (str) – defines which expression rows are counted to determine if a column has enough valid values to survive the filtering process.
missing_per_group (bool) – if True filter proteins based on valid values per group; if False filter across all samples.
missing_max (float) – maximum ratio of missing/valid values to be filtered.
min_valid (int) – minimum number of valid values to be filtered.
value_col (str) – column label containing expression values.
shift (float) – when using distribution imputation, the down-shift
nstd (float) – when using distribution imputation, the width of the distribution
knn_cutoff (float) – when using KNN imputation, the minimum percentage of valid values for which to use KNN imputation (i.e. 0.6 -> if 60% valid values use KNN, otherwise MinProb)
normalize (bool) – whether or not to normalize the data
normalization_method (str) – method to be used to normalize the data (‘median’, ‘quantile’, ‘linear’, ‘zscore’, ‘median_polish’) (only with normalize=True)
normalize_group (bool) – normalize per group or not (only with normalize=True)
normalize_by (str) – whether the normalization should be done by ‘features’ (columns) or ‘samples’ (rows) (only with normalize=True)
- Returns
Pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
Example 1:
result = get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation=True, method = 'distribution', missing_method = 'percentage', missing_per_group=True, missing_max = 0.3, value_col='LFQ_intensity')
Example 2:
result = get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation = True, method = 'mixed', missing_method = 'at_least_x', missing_per_group=False, min_valid=5, value_col='LFQ_intensity')
-
get_clinical_measurements_ready(df, subject_id='subject', sample_id='biological_sample', group_id='group', columns=['clinical_variable'], values='values', extra=['group'], imputation=True, imputation_method='KNN', missing_method='percentage', missing_max=0.3, min_valid=1)[source]¶ Processes clinical data extracted from the database by converting dataframe to wide-format and imputing missing values.
- Parameters
df – long-format pandas dataframe with columns ‘group’, ‘biological_sample’, ‘subject’, ‘clinical_variable’, ‘value’.
subject_id (str) – column label containing subject identifiers.
sample_id (str) – column label containing biological sample identifiers.
group_id (str) – column label containing group identifiers.
columns (list) – column name whose unique values will become the new column names
values (str) – column label containing clinical variable values.
extra (list) – additional column labels to be kept as columns
imputation (bool) – if True performs imputation of missing values.
imputation_method (str) – method for missing values imputation (‘KNN’, ‘distribution’, or ‘mixed’).
missing_method (str) – defines which expression rows are counted to determine if a column has enough valid values to survive the filtering process.
missing_max (float) – maximum ratio of missing/valid values to be filtered.
min_valid (int) – minimum number of valid values to be filtered.
- Returns
Pandas dataframe with samples as rows and clinical variables as columns (with additional columns ‘group’, ‘subject’ and ‘biological_sample’).
Example:
result = get_clinical_measurements_ready(df, subject_id='subject', sample_id='biological_sample', group_id='group', columns=['clinical_variable'], values='values', extra=['group'], imputation=True, imputation_method='KNN')
-
get_summary_data_matrix(data)[source]¶ Returns some statistics on the data matrix provided.
- Parameters
data – pandas dataframe.
- Returns
dictionary with the type of statistics as key and the statistic as value in the shape of a pandas data frame
Example:
result = get_summary_data_matrix(data)
-
check_equal_variances(data, drop_cols=['group', 'sample', 'subject'], group_col='group', alpha=0.05)[source]¶
-
check_normality(data, drop_cols=['group', 'sample', 'subject'], group_col='group', alpha=0.05)[source]¶
-
run_pca(data, drop_cols=['sample', 'subject'], group='group', annotation_cols=['sample'], components=2, dropna=True)[source]¶ Performs principal component analysis and returns the values of each component for each sample and each protein, and the loadings for each protein. For information visit https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.
- Parameters
data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
annotation_cols (list) – list of columns to be added in the scatter plot annotation
components (int) – number of components to keep.
dropna (bool) – if True removes all columns with any missing values.
- Returns
tuple: 1) three pandas dataframes: components, loadings and variance; 2) xaxis and yaxis titles with components loadings for plotly.
Example:
result = run_pca(data, drop_cols=['sample', 'subject'], group='group', components=2, dropna=True)
-
run_tsne(data, drop_cols=['sample', 'subject'], group='group', annotation_cols=['sample'], components=2, perplexity=40, n_iter=1000, init='pca', dropna=True)[source]¶ Performs t-distributed Stochastic Neighbor Embedding analysis. For more information visit https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.
- Parameters
data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
components (int) – dimension of the embedded space.
annotation_cols (list) – list of columns to be added in the scatter plot annotation
perplexity (int) – related to the number of nearest neighbors that is used in other manifold learning algorithms. Consider selecting a value between 5 and 50.
n_iter (int) – maximum number of iterations for the optimization (at least 250).
init (str) – initialization of embedding (‘random’, ‘pca’ or numpy array of shape n_samples x n_components).
dropna (bool) – if True removes all columns with any missing values.
- Returns
Two dictionaries: 1) pandas dataframe with embedding vectors, 2) xaxis and yaxis titles for plotly.
Example:
result = run_tsne(data, drop_cols=['sample', 'subject'], group='group', components=2, perplexity=40, n_iter=1000, init='pca', dropna=True)
-
run_umap(data, drop_cols=['sample', 'subject'], group='group', annotation_cols=['sample'], n_neighbors=10, min_dist=0.3, metric='cosine', dropna=True)[source]¶ Performs Uniform Manifold Approximation and Projection. For more information visit https://umap-learn.readthedocs.io.
- Parameters
data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
annotation_cols (list) – list of columns to be added in the scatter plot annotation
n_neighbors (int) – number of neighboring points used in local approximations of manifold structure.
min_dist (float) – controls how tightly the embedding is allowed to compress points together.
metric (str) – metric used to measure distance in the input space.
dropna (bool) – if True removes all columns with any missing values.
- Returns
Two dictionaries: 1) pandas dataframe with embedding of the training data in low-dimensional space, 2) xaxis and yaxis titles for plotly.
Example:
result = run_umap(data, drop_cols=['sample', 'subject'], group='group', n_neighbors=10, min_dist=0.3, metric='cosine', dropna=True)
-
calculate_correlations(x, y, method='pearson')[source]¶ Calculates a Spearman (nonparametric) or a Pearson (parametric) correlation coefficient and p-value to test for non-correlation.
- Parameters
x (ndarray) – array 1
y (ndarray) – array 2
method (str) – chooses which kind of correlation method to run
- Returns
Tuple with two floats, correlation coefficient and two-tailed p-value.
Example:
result = calculate_correlations(x, y, method='pearson')
-
apply_pvalue_correction(pvalues, alpha=0.05, method='bonferroni')[source]¶ Performs p-value correction using the specified method. For more information visit https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html.
- Parameters
pvalues (ndarray) – set of p-values of the individual tests.
alpha (float) – error rate.
method (str) – method of p-value correction:
    - bonferroni : one-step correction
    - sidak : one-step correction
    - holm-sidak : step down method using Sidak adjustments
    - holm : step-down method using Bonferroni adjustments
    - simes-hochberg : step-up method (independent)
    - hommel : closed method based on Simes tests (non-negative)
    - fdr_bh : Benjamini/Hochberg (non-negative)
    - fdr_by : Benjamini/Yekutieli (negative)
    - fdr_tsbh : two stage fdr correction (non-negative)
    - fdr_tsbky : two stage fdr correction (non-negative)
- Returns
Tuple with two arrays, boolean for rejecting H0 hypothesis and float for adjusted p-value.
Example:
result = apply_pvalue_correction(pvalues, alpha=0.05, method='bonferroni')
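To make two of the listed methods concrete, the sketch below implements the Bonferroni and Benjamini/Hochberg adjustments in plain Python. It is an illustration of the textbook procedures, not the statsmodels code the function delegates to, and the function names are hypothetical.

```python
def bonferroni_sketch(pvalues, alpha=0.05):
    """One-step Bonferroni: multiply each p-value by the number of tests."""
    m = len(pvalues)
    adjusted = [min(1.0, p * m) for p in pvalues]
    rejected = [p <= alpha for p in adjusted]
    return rejected, adjusted


def benjamini_hochberg_sketch(pvalues, alpha=0.05):
    """Step-up Benjamini/Hochberg adjustment (the idea behind 'fdr_bh')."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    rejected = [p <= alpha for p in adjusted]
    return rejected, adjusted


pvalues = [0.01, 0.04, 0.03, 0.20]
bonf_rejected, bonf_adjusted = bonferroni_sketch(pvalues)
bh_rejected, bh_adjusted = benjamini_hochberg_sketch(pvalues)
```

Both functions return the same kind of pair as described above: a list of booleans for rejecting H0 and a list of adjusted p-values, in the input order.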
-
apply_pvalue_fdrcorrection(pvalues, alpha=0.05, method='indep')[source]¶ Performs p-value correction for false discovery rate. For more information visit https://www.statsmodels.org/devel/generated/statsmodels.stats.multitest.fdrcorrection.html.
- Parameters
- Returns
Tuple with two arrays, boolean for rejecting H0 hypothesis and float for adjusted p-value.
Example:
result = apply_pvalue_fdrcorrection(pvalues, alpha=0.05, method='indep')
-
apply_pvalue_twostage_fdrcorrection(pvalues, alpha=0.05, method='bh')[source]¶ Iterated two stage linear step-up procedure with estimation of number of true hypotheses. For more information visit https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.fdrcorrection_twostage.html.
- Parameters
- Returns
Tuple with two arrays, boolean for rejecting H0 hypothesis and float for adjusted p-value.
Example:
result = apply_pvalue_twostage_fdrcorrection(pvalues, alpha=0.05, method='bh')
-
apply_pvalue_permutation_fdrcorrection(df, observed_pvalues, group, alpha=0.05, permutations=50)[source]¶ This function applies multiple hypothesis testing correction using a permutation-based false discovery rate approach.
- Parameters
df – pandas dataframe with samples as rows and features as columns.
observed_pvalues – pandas Series with p-values calculated on the originally measured data.
group (str) – name of the column containing group identifiers.
alpha (float) – error rate. Values below alpha are considered significant.
permutations (int) – number of permutations to be applied.
- Returns
Pandas dataframe with adjusted p-values and rejected columns.
Example:
result = apply_pvalue_permutation_fdrcorrection(df, observed_pvalues, group='group', alpha=0.05, permutations=50)
-
get_counts_permutation_fdr(value, random, observed, n, alpha)[source]¶ Calculates local FDR values (q-values) by computing the fraction of accepted hits from the permuted data over accepted hits from the measured data normalized by the total number of permutations.
- Parameters
value (float) – computed p-value on measured data for a feature.
random (ndarray) – p-values computed on the permuted data.
observed – pandas Series with p-values calculated on the originally measured data.
n (int) – number of permutations to be applied.
alpha (float) – error rate. Values below alpha are considered significant.
- Returns
Tuple with q-value and boolean for H0 rejected.
Example:
result = get_counts_permutation_fdr(value, random, observed, n=250, alpha=0.05)
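One way to read the description above is sketched below in plain Python: count permuted p-values at least as small as the observed one, average over the permutations, and divide by the number of observed p-values at least as small. The function name, the cap at 1.0 and the use of `<=` are assumptions of this sketch; the library's exact formula may differ. The p-values are invented.

```python
def permutation_qvalue_sketch(value, random, observed, n, alpha):
    """Estimate a q-value for one observed p-value from permutation results."""
    # Average number of permuted p-values at least as extreme, per permutation.
    false_hits = sum(1 for p in random if p <= value) / n
    # Number of observed p-values at least as extreme (always >= 1 when
    # 'value' itself is part of 'observed').
    observed_hits = sum(1 for p in observed if p <= value)
    qvalue = min(1.0, false_hits / observed_hits)  # cap is an assumption
    return qvalue, qvalue <= alpha


observed = [0.001, 0.02, 0.3, 0.8]
# p-values from n=2 hypothetical permutations of the group labels.
random = [0.4, 0.6, 0.015, 0.9, 0.5, 0.7, 0.2, 0.95]
qvalue, rejected = permutation_qvalue_sketch(0.02, random, observed, n=2, alpha=0.05)
```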
-
convertToEdgeList(data, cols)[source]¶ This function converts a pandas dataframe to an edge list where index becomes the source nodes and columns the target nodes.
- Parameters
data – pandas dataframe.
cols (list) – names for dataframe columns.
- Returns
Pandas dataframe with columns cols.
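A possible way to turn a matrix-like DataFrame into an edge list is sketched below with `melt()`. This is not the library's implementation; the matrix values and the column names in `cols` are invented for illustration.

```python
import pandas as pd

matrix = pd.DataFrame(
    [[1.0, 0.5], [0.5, 1.0]], index=["A", "B"], columns=["A", "B"]
)
cols = ["node1", "node2", "weight"]

# The index becomes the source node column, the former column labels become
# the target node column, and the cell values become the edge weights.
edges = (
    matrix.rename_axis(cols[0])
    .reset_index()
    .melt(id_vars=cols[0], var_name=cols[1], value_name=cols[2])
)
```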
-
run_correlation(df, alpha=0.05, subject='subject', group='group', method='pearson', correction='fdr_bh')[source]¶ This function calculates pairwise correlations between the columns of a dataframe and returns them as an edge list with ‘weight’ as the correlation score, together with the adjusted p-values.
- Parameters
df – pandas dataframe with samples as rows and features as columns.
subject (str) – name of column containing subject identifiers.
group (str) – name of column containing group identifiers.
method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).
alpha (float) – error rate. Values below alpha are considered significant.
correction (string) – type of correction see apply_pvalue_correction for methods
- Returns
Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘padj’ and ‘rejected’.
Example:
result = run_correlation(df, alpha=0.05, subject='subject', group='group', method='pearson', correction='fdr_bh')
-
run_multi_correlation(df_dict, alpha=0.05, subject='subject', on=['subject', 'biological_sample'], group='group', method='pearson', correction='fdr_bh')[source]¶ This function merges all input dataframes and calculates pairwise correlations for all columns.
- Parameters
df_dict (dict) – dictionary of pandas dataframes with samples as rows and features as columns.
subject (str) – name of the column containing subject identifiers.
group (str) – name of the column containing group identifiers.
on (list) – column names to join dataframes on (must be found in all dataframes).
method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).
alpha (float) – error rate. Values below alpha are considered significant.
correction (string) – type of correction see apply_pvalue_correction for methods
- Returns
Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘padj’ and ‘rejected’.
Example:
result = run_multi_correlation(df_dict, alpha=0.05, subject='subject', on=['subject', 'biological_sample'] , group='group', method='pearson', correction='fdr_bh')
-
calculate_rm_correlation(df, x, y, subject)[source]¶ Computes correlation and p-values between two columns, x and y, in df.
- Parameters
- Returns
Tuple with values for: feature a, feature b, correlation, p-value and degrees of freedom.
Example:
result = calculate_rm_correlation(df, x='feature a', y='feature b', subject='subject')
-
run_rm_correlation(df, alpha=0.05, subject='subject', correction='fdr_bh')[source]¶ Computes pairwise repeated measurements correlations for all columns in dataframe, and returns results as an edge list with ‘weight’ as correlation score, p-values, degrees of freedom and adjusted p-values.
- Parameters
- Returns
Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘pvalue’, ‘dof’, ‘padj’ and ‘rejected’.
Example:
result = run_rm_correlation(df, alpha=0.05, subject='subject', correction='fdr_bh')
-
run_efficient_correlation(data, method='pearson')[source]¶ Calculates pairwise correlations and returns lower triangle of the matrix with correlation values and p-values.
- Parameters
data – pandas dataframe with samples as index and features as columns (numeric data only).
method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).
- Returns
Two numpy arrays: correlation and p-values.
Example:
result = run_efficient_correlation(data, method='pearson')
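As a rough illustration of what "pairwise correlations, lower triangle only" means, the sketch below correlates every pair of features exactly once using only the standard library. The helper names `pearson` and `lower_triangle_correlations` are hypothetical, the input is a plain dict rather than a dataframe, and p-values are left out; it is not the function's actual implementation.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lower_triangle_correlations(columns):
    """Correlate each pair of features once (hypothetical helper).

    `columns` maps a feature name to its list of values, one per sample.
    """
    return {(f1, f2): pearson(columns[f1], columns[f2])
            for f1, f2 in combinations(columns, 2)}


features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
pairs = lower_triangle_correlations(features)
```

Visiting each unordered pair once is what makes the triangle view cheaper than the full matrix: three features give three correlations instead of nine.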
-
calculate_ttest_samr(df, labels, n=2, s0=0, paired=False)[source]¶ Calculates modified T-test using ‘samr’ R package.
- Parameters
df – pandas dataframe with group as columns and protein identifier as rows
labels (list) – integers reflecting the group each sample belongs to (e.g. group1 = 1, group2 = 2)
n (int) – number of samples
s0 (float) – exchangeability factor for denominator of test statistic
paired (bool) – True if samples are paired
- Returns
Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘log2FC’, ‘FC’, ‘t-statistics’, ‘p-value’.
Example:
result = calculate_ttest_samr(df, labels, n=2, s0=0.1, paired=False)
-
calculate_ttest(df, condition1, condition2, paired=False, is_logged=True, non_par=False, tail='two-sided', correction='auto', r=0.707)[source]¶ Calculates the t-test for the means of independent samples belonging to two different groups. For more information visit https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
- Parameters
df – pandas dataframe with groups and subjects as rows and protein identifier as column.
condition1 (str) – identifier of first group.
condition2 (str) – identifier of second group.
is_logged (bool) – data is log-transformed
non_par (bool) – if True, normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met
- Returns
Tuple with t-statistics, two-tailed p-value, mean of first group, mean of second group and logfc.
Example:
result = calculate_ttest(df, 'group1', 'group2')
-
calculate_THSD(df, column, group='group', alpha=0.05, is_logged=True)[source]¶ Pairwise Tukey-HSD posthoc test using pingouin stats. For more information visit https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html
- Parameters
- Returns
Pandas dataframe.
Example:
result = calculate_THSD(df, column='HBG2~P69892', group='group', alpha=0.05)
-
calculate_pairwise_ttest(df, column, subject='subject', group='group', correction='none', is_logged=True)[source]¶ Performs pairwise t-test using pingouin, as a posthoc test, and calculates fold-changes. For more information visit https://pingouin-stats.org/generated/pingouin.pairwise_ttests.html.
- Parameters
df – pandas dataframe with subject and group as rows and protein identifier as column.
column (str) – column label containing the dependent variable
subject (str) – column label containing subject identifiers
group (str) – column label containing the between factor
correction (str) – method used for testing and adjustment of p-values.
- Returns
Pandas dataframe with means, standard deviations, test-statistics, degrees of freedom and effect size columns.
Example:
result = calculate_pairwise_ttest(df, 'protein a', subject='subject', group='group', correction='none')
-
complement_posthoc(posthoc, identifier, is_logged)[source]¶ Calculates fold-changes after posthoc test.
- Parameters
posthoc – pandas dataframe from posthoc test. Should have at least columns ‘mean(group1)’ and ‘mean(group2)’.
identifier (str) – feature identifier.
- Returns
Pandas dataframe with additional columns ‘identifier’, ‘log2FC’ and ‘FC’.
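To make the role of is_logged concrete, here is a minimal sketch of one common convention for deriving ‘log2FC’ and ‘FC’ from two group means. The helper `fold_change` is hypothetical, and whether the library reports FC exactly this way (for example, how it signs down-regulation) is an assumption not confirmed by this page.

```python
import math


def fold_change(mean1, mean2, is_logged=True):
    """Return (log2FC, FC) for two group means (hypothetical helper)."""
    if is_logged:
        # Means are already on a log2 scale: subtract, then back-transform.
        log2fc = mean1 - mean2
        fc = 2 ** log2fc
    else:
        # Means are on the raw scale: take the ratio, then log-transform.
        fc = mean1 / mean2
        log2fc = math.log2(fc)
    return log2fc, fc


logged = fold_change(5.0, 3.0, is_logged=True)
raw = fold_change(8.0, 2.0, is_logged=False)
```

Both calls describe the same four-fold difference, which is why the flag matters: subtracting raw means or dividing logged means would give a misleading value.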
-
calculate_dabest(df, idx, x, y, paired=False, id_col=None, test='mean_diff')[source]¶ - Parameters
df –
idx –
x –
y –
paired –
id_col –
test –
- Returns
-
calculate_anova_samr(df, labels, s0=0)[source]¶ Calculates modified one-way ANOVA using ‘samr’ R package.
- Parameters
- Returns
Pandas dataframe with protein identifiers and F-statistics.
Example:
result = calculate_anova_samr(df, labels, s0=0.1)
-
calculate_ancova(data, column, group='group', covariates=[])[source]¶ Calculates one-way ANCOVA using pingouin.
-
calculate_repeated_measures_anova(df, column, subject='subject', group='group')[source]¶ One-way and two-way repeated measures ANOVA using pingouin stats.
- Parameters
df – pandas dataframe with samples as rows and protein identifier as column. Data must be in long-format for two-way repeated measures.
column (str) – column label containing the dependent variable
subject (str) – column label containing subject identifiers
group (str) – column label containing the within factor
- Returns
Tuple with protein identifier, t-statistics and p-value.
Example:
result = calculate_repeated_measures_anova(df, 'protein a', subject='subject', group='group')
-
get_max_permutations(df, group='group')[source]¶ Get maximum number of permutations according to number of samples.
-
run_anova(df, alpha=0.05, drop_cols=['sample', 'subject'], subject='subject', group='group', permutations=0, correction='fdr_bh', is_logged=True, non_par=False)[source]¶ Performs a statistical test for each protein in a dataset. Checks the type of the input data (paired, unpaired or repeated measurements) and performs posthoc tests for multiclass data. Multiple hypothesis correction is permutation-based if permutations>0 and Benjamini/Hochberg if permutations=0.
- Parameters
df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (list) – column labels to be dropped from the dataframe
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates.
non_par (bool) – if True, normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met
- Returns
Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘Log2FC’, ‘std_error’, ‘tail’, ‘t-statistics’, ‘posthoc pvalue’, ‘effsize’, ‘efftype’, ‘FC’, ‘rejected’, ‘F-statistics’, ‘p-value’, ‘correction’, ‘-log10 p-value’, and ‘method’.
Example:
result = run_anova(df, alpha=0.05, drop_cols=["sample",'subject'], subject='subject', group='group', permutations=50)
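The Benjamini/Hochberg correction mentioned above can be sketched in a few lines of plain Python. This is an illustrative version of the standard step-up procedure, not the code path the library uses (which presumably goes through apply_pvalue_correction); the function name `benjamini_hochberg` is an assumption.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return (adjusted p-values, rejected flags) in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    rejected = [p <= alpha for p in adjusted]
    return adjusted, rejected


padj, rejected = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each p-value is scaled by the number of tests divided by its rank, so the smallest p-values pay the largest penalty and the adjusted values never decrease with rank.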
-
run_ancova(df, covariates, alpha=0.05, drop_cols=['sample', 'subject'], subject='subject', group='group', permutations=0, correction='fdr_bh', is_logged=True, non_par=False)[source]¶ Performs a statistical test for each protein in a dataset. Checks the type of the input data (paired, unpaired or repeated measurements) and performs posthoc tests for multiclass data. Multiple hypothesis correction is permutation-based if permutations>0 and Benjamini/Hochberg if permutations=0.
- Parameters
df – pandas dataframe with samples as rows and protein identifiers and covariates as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
covariates (list) – list of covariates to include in the model (column in df)
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (list) – column labels to be dropped from the dataframe
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates.
non_par (bool) – if True, normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met
- Returns
Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘Log2FC’, ‘std_error’, ‘tail’, ‘t-statistics’, ‘posthoc pvalue’, ‘effsize’, ‘efftype’, ‘FC’, ‘rejected’, ‘F-statistics’, ‘p-value’, ‘correction’, ‘-log10 p-value’, and ‘method’.
Example:
result = run_ancova(df, covariates=['age'], alpha=0.05, drop_cols=["sample",'subject'], subject='subject', group='group', permutations=50)
-
run_repeated_measurements_anova(df, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', permutations=50, correction='fdr_bh', is_logged=True)[source]¶ Performs repeated measurements anova and pairwise posthoc tests for each protein in dataframe.
- Parameters
df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (list) – column labels to be dropped from the dataframe
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates
- Returns
Pandas dataframe
Example:
result = run_repeated_measurements_anova(df, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', permutations=50)
-
format_anova_table(df, aov_results, pairwise_results, pairwise_cols, group, permutations, alpha, correction)[source]¶ Performs p-value correction (permutation-based and FDR) and converts pandas dataframe into final format.
- Parameters
df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
aov_results (list[tuple]) – list of tuples with anova results (one tuple per feature).
pairwise_results (list[dataframes]) – list of pandas dataframes with posthoc tests results
group (str) – column with group identifiers
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates
- Returns
Pandas dataframe
-
run_ttest(df, condition1, condition2, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', paired=False, correction='fdr_bh', permutations=50, is_logged=True, non_par=False)[source]¶ Runs t-test (paired/unpaired) for each protein in dataset and performs permutation-based (if permutations>0) or Benjamini/Hochberg (if permutations=0) multiple hypothesis correction.
- Parameters
df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
condition1 (str) – first of two conditions of the independent variable
condition2 (str) – second of two conditions of the independent variable
subject (str) – column with subject identifiers
group (str) – column with group identifiers (independent variable)
drop_cols (list) – column labels to be dropped from the dataframe
paired (bool) – paired or unpaired samples
correction (str) – method of pvalue correction see apply_pvalue_correction for methods
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates.
is_logged (bool) – data is log-transformed
non_par (bool) – if True, normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met
- Returns
Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘std(group1)’, ‘std(group2)’, ‘Log2FC’, ‘FC’, ‘rejected’, ‘T-statistics’, ‘p-value’, ‘correction’, ‘-log10 p-value’, and ‘method’.
Example:
result = run_ttest(df, condition1='group1', condition2='group2', alpha = 0.05, drop_cols=['sample'], subject='subject', group='group', paired=False, correction='fdr_bh', permutations=50)
-
define_samr_method(df, subject, group, drop_cols)[source]¶ Method to identify the correct problem type to run with SAMR
- Parameters
- Returns
tuple with the method to be used (One Class, Two class paired, Two class unpaired or Multiclass) and the labels (conditions)
Example:
method, labels = define_samr_method(df, subject, group, drop_cols)
-
calculate_pvalue_from_tstats(tstat, dfn, dfk)[source]¶ Calculate two-tailed p-values from T- or F-statistics.
tstat: T/F distribution
dfn: degrees of freedom n (values) per protein (keys), i.e. number of observations - number of groups (dict)
dfk: degrees of freedom n (values) per protein (keys), i.e. number of groups - 1 (dict)
-
run_samr(df, subject='subject', group='group', drop_cols=['subject', 'sample'], alpha=0.05, s0='null', permutations=250, fc=0, is_logged=True, localfdr=False)[source]¶ Python adaptation of the ‘samr’ R package for statistical tests with permutation-based correction and s0 parameter. For more information visit https://cran.r-project.org/web/packages/samr/samr.pdf. The method only runs if R is installed and permutations is higher than 0; otherwise ANOVA is run.
- Parameters
df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (list) – column labels to be dropped from the dataframe
alpha (float) – error rate for multiple hypothesis correction
s0 (float) – exchangeability factor for denominator of test statistic
permutations (int) – number of permutations used to estimate false discovery rates. If number of permutations is equal to zero, the function will run anova with FDR Benjamini/Hochberg correction.
fc (float) – minimum fold change to define practical significance (needed when computing delta table)
- Returns
Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘Log2FC’, ‘FC’, ‘T-statistics’, ‘p-value’, ‘padj’, ‘correction’, ‘-log10 p-value’, ‘rejected’ and ‘method’
Example:
result = run_samr(df, subject='subject', group='group', drop_cols=['subject', 'sample'], alpha=0.05, s0=1, permutations=250, fc=0)
-
run_fisher(group1, group2, alternative='two-sided')[source]¶
         annotated   not-annotated
group1       a             b
group2       c             d
group1 = [a, b]
group2 = [c, d]
odds, pvalue = stats.fisher_exact([[a, b], [c, d]])
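For readers who want to see what the two-sided test computes on the 2x2 table [[a, b], [c, d]], the following is a standard-library sketch based on the hypergeometric distribution. The function name `fisher_exact_two_sided` is hypothetical; the function documented here delegates to stats.fisher_exact, and this sketch only mirrors the usual definition of the two-sided p-value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    col2 = total - col1

    def weight(k):
        # Number of tables with k in the top-left cell and the same margins.
        return comb(col1, k) * comb(col2, row1 - k)

    observed = weight(a)
    k_min, k_max = max(0, row1 - col2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(w for w in map(weight, range(k_min, k_max + 1)) if w <= observed)
    pvalue = extreme / comb(total, row1)
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, pvalue


odds, pvalue = fisher_exact_two_sided(8, 2, 1, 5)
```

Comparing integer table counts instead of floating-point probabilities avoids tie-breaking problems when two tables are equally likely.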
-
run_kolmogorov_smirnov(dist1, dist2, alternative='two-sided')[source]¶ Compute the Kolmogorov-Smirnov statistic on 2 samples. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
- Parameters
dist1 (list) – sequence of 1-D ndarray (first distribution to compare) drawn from a continuous distribution
dist2 (list) – sequence of 1-D ndarray (second distribution to compare) drawn from a continuous distribution
alternative (str) – defines the alternative hypothesis (default is ‘two-sided’): * ‘two-sided’ * ‘less’ * ‘greater’
- Returns
statistic (float): KS statistic, and pvalue (float): two-tailed p-value.
Example:
result = run_kolmogorov_smirnov(dist1, dist2, alternative='two-sided')
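The two-sample KS statistic is the largest vertical gap between the two empirical distribution functions. The sketch below computes only that statistic with the standard library; the name `ks_statistic` is hypothetical and the p-value, which scipy's ks_2samp also returns, is deliberately left out.

```python
from bisect import bisect_right


def ks_statistic(dist1, dist2):
    """Largest gap between the empirical CDFs of two samples."""
    s1, s2 = sorted(dist1), sorted(dist2)
    gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in s1 + s2:
        cdf1 = bisect_right(s1, x) / len(s1)  # share of dist1 values <= x
        cdf2 = bisect_right(s2, x) / len(s2)  # share of dist2 values <= x
        gap = max(gap, abs(cdf1 - cdf2))
    return gap


separated = ks_statistic([1, 2, 3], [4, 5, 6])
identical = ks_statistic([1, 2, 3], [1, 2, 3])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A statistic of 1 means the samples do not overlap at all, while 0 means their empirical distributions coincide.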
-
run_site_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', regex='(\\w+~.+)_\\w\\d+\\-\\w+', correction='fdr_bh')[source]¶ This function runs a simple enrichment analysis for significantly regulated protein sites in a dataset.
- Parameters
regulation_data – pandas dataframe resulting from differential regulation analysis.
annotation – pandas dataframe with annotations for features (columns: ‘annotation’, ‘identifier’ (feature identifiers), and ‘source’).
identifier (str) – name of the column from annotation containing feature identifiers.
groups (list) – column names from regulation_data containing group identifiers.
annotation_col (str) – name of the column from annotation containing annotation terms.
reject_col (str) – name of the column from regulation_data containing boolean for rejected null hypothesis.
group_col (str) – column name for new column in annotation dataframe determining if feature belongs to foreground or background.
method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).
regex (str) – how to extract the annotated identifier from the site identifier
- Returns
Pandas dataframe with columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’.
Example:
result = run_site_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', regex=r'(\w+~.+)_\w\d+\-\w+')
-
run_up_down_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.05, lfc_cutoff=1)[source]¶
-
run_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh')[source]¶ This function runs a simple enrichment analysis for significantly regulated features in a dataset.
- Parameters
regulation_data – pandas dataframe resulting from differential regulation analysis.
annotation – pandas dataframe with annotations for features (columns: ‘annotation’, ‘identifier’ (feature identifiers), and ‘source’).
identifier (str) – name of the column from annotation containing feature identifiers.
groups (list) – column names from regulation_data containing group identifiers.
annotation_col (str) – name of the column from annotation containing annotation terms.
reject_col (str) – name of the column from regulation_data containing boolean for rejected null hypothesis.
group_col (str) – column name for new column in annotation dataframe determining if feature belongs to foreground or background.
method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).
- Returns
Pandas dataframe with columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’.
Example:
result = run_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher')
-
run_enrichment(data, foreground_id, background_id, foreground_pop, background_pop, annotation_col='annotation', group_col='group', identifier_col='identifier', method='fisher', correction='fdr_bh')[source]¶ Computes enrichment of the foreground relative to a given background, using Fisher’s exact test, and corrects for multiple hypothesis testing.
- Parameters
data – pandas dataframe with annotations for dataset features (columns: ‘annotation’, ‘identifier’, ‘source’, ‘group’).
foreground_id (str) – group identifier of features that belong to the foreground.
background_id (str) – group identifier of features that belong to the background.
annotation_col (str) – name of the column containing annotation terms.
group_col (str) – name of column containing the group identifiers.
identifier_col (str) – name of column containing dependent variables identifiers.
method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).
- Returns
Pandas dataframe with annotation terms, features, number of foreground/background features in each term, p-values and corrected p-values (columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’).
Example:
result = run_enrichment(data, foreground_id='foreground', background_id='background', foreground_pop=len(foreground_list), background_pop=len(background_list), annotation_col='annotation', group_col='group', identifier_col='identifier', method='fisher')
-
run_ssgsea(data, annotation, annotation_col='annotation', identifier_col='identifier', set_index=[], outdir=None, min_size=15, scale=False, permutations=0)[source]¶ Project each sample within a data set onto a space of gene set enrichment scores using the ssGSEA projection methodology described in Barbie et al., 2009.
- Parameters
data – pandas dataframe with the quantified features (i.e. subject x proteins)
annotation – pandas dataframe with the annotation to be used in the enrichment (i.e. CKG pathway annotation file)
annotation_col (str) – name of the column containing annotation terms.
identifier_col (str) – name of column containing dependent variables identifiers.
set_index (list) – column/s to be used as index. Enrichment will be calculated for these values (i.e [“subject”] will return subjects x pathways matrix of enrichment scores)
outdir (str) – directory path where results will be stored (default None, tmp folder is used)
min_size (int) – minimum number of features (i.e. proteins) in enriched terms (i.e. pathways)
scale (bool) – whether or not to scale the data
permutations (int) – number of permutations used in the ssgsea analysis
- Returns
dictionary with two dataframes: es - enrichment scores, and nes - normalized enrichment scores.
- Example::
stproject = "P0000008"
p = project.Project(stproject, datasets={}, knowledge=None, report={}, configuration_files=None)
p.build_project(False)
p.generate_report()
proteomics_dataset = p.get_dataset("proteomics")
annotations = proteomics_dataset.get_dataframe("pathway annotation")
processed = proteomics_dataset.get_dataframe('processed')
result = run_ssgsea(processed, annotations, annotation_col='annotation', identifier_col='identifier', set_index=['group', 'sample', 'subject'], outdir=None, min_size=10, scale=False, permutations=0)
-
calculate_fold_change(df, condition1, condition2)[source]¶ Calculates fold-changes between two groups for all proteins in a dataframe.
- Parameters
- Returns
Numpy array.
Example:
result = calculate_fold_change(data, 'group1', 'group2')
-
pooled_standard_deviation(sample1, sample2, ddof)[source]¶ Calculates the pooled standard deviation. For more information visit https://www.hackdeploy.com/learn-what-is-statistical-power-with-python/.
- Parameters
sample1 (array) – numpy array with values for first group
sample2 (array) – numpy array with values for second group
ddof (int) – degrees of freedom
-
cohens_d(sample1, sample2, ddof)[source]¶ Calculates Cohen’s d effect size based on the distance between two means, measured in standard deviations. For more information visit https://www.hackdeploy.com/learn-what-is-statistical-power-with-python/.
- Parameters
sample1 (array) – numpy array with values for first group
sample2 (array) – numpy array with values for second group
ddof (int) – degrees of freedom
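The relationship between the pooled standard deviation and Cohen's d can be sketched with the standard library as follows. The helper names `pooled_sd` and `cohens_d_sketch` are hypothetical, and the sketch fixes the sample variance at ddof=1 instead of taking ddof as a parameter, which is an assumption.

```python
import math
from statistics import mean, variance


def pooled_sd(sample1, sample2):
    """Pooled standard deviation from the two sample variances (ddof=1)."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def cohens_d_sketch(sample1, sample2):
    """Difference between the means, expressed in pooled standard deviations."""
    return (mean(sample1) - mean(sample2)) / pooled_sd(sample1, sample2)


sd = pooled_sd([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
d = cohens_d_sketch([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Here both groups have a variance of 4, so the pooled standard deviation is 2 and a mean difference of 1 gives d = 0.5.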
-
hedges_g(df, condition1, condition2, ddof=0)[source]¶ Calculates Hedges’ g effect size (more accurate for sample sizes below 20 than Cohen’s d). For more information visit https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanstd.html.
- Parameters
- Returns
Numpy array.
Example:
result = hedges_g(data, 'group1', 'group2', ddof=0)
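Hedges' g is commonly obtained by multiplying Cohen's d by a small-sample correction factor. The sketch below uses the widely used approximation 1 - 3 / (4(n1 + n2) - 9) on two plain lists; the name `hedges_g_sketch` and the list-based interface are assumptions, and the library function works on a dataframe and may differ in detail.

```python
import math
from statistics import mean, variance


def hedges_g_sketch(sample1, sample2):
    """Cohen's d scaled by the usual small-sample bias correction."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    d = (mean(sample1) - mean(sample2)) / math.sqrt(pooled_var)
    # Approximate correction factor; it approaches 1 as the samples grow.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


g = hedges_g_sketch([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

With three observations per group the factor is 0.8, so a Cohen's d of 0.5 shrinks to g = 0.4, which shows how strongly the correction acts on small samples.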
-
power_analysis(data, group='group', groups=None, alpha=0.05, power=0.8, dep_var='nobs', figure=False)[source]¶
-
run_mapper(data, lenses=['l2norm'], n_cubes=15, overlap=0.5, n_clusters=3, linkage='complete', affinity='correlation')[source]¶ - Parameters
data –
lenses –
n_cubes –
overlap –
n_clusters –
linkage –
affinity –
- Returns
-
run_WGCNA(data, drop_cols_exp, drop_cols_cli, RsquaredCut=0.8, networkType='unsigned', minModuleSize=30, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.25, verbose=0, sd_cutoff=0)[source]¶ Runs an automated weighted gene co-expression network analysis (WGCNA), using input proteomics/transcriptomics/genomics and clinical variables data.
- Parameters
data (dict) – dictionary of pandas dataframes with processed clinical and experimental datasets
drop_cols_exp (list) – column names to be removed from the experimental dataset.
drop_cols_cli (list) – column names to be removed from the clinical dataset.
RsquaredCut (float) – desired minimum scale free topology fitting index R^2.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
minModuleSize (int) – minimum module size.
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
merge_modules (bool) – if True, very similar modules are merged.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
- Returns
Tuple with multiple pandas dataframes.
Example:
result = run_WGCNA(data, drop_cols_exp=['subject', 'sample', 'group', 'index'], drop_cols_cli=['subject', 'biological_sample', 'group', 'index'], RsquaredCut=0.8, networkType='unsigned', minModuleSize=30, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.25, verbose=0)
-
most_central_edge(G)[source]¶ Computes the eigenvector centrality for the graph G, and finds the highest value.
- Parameters
G (graph) – networkx graph
- Returns
Highest eigenvector centrality value.
- Return type
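Eigenvector centrality can be approximated by power iteration, as in this standard-library sketch. It works on a plain adjacency dict instead of a networkx graph, and the function name and fixed iteration count are assumptions; networkx's own routine may differ in normalisation and stopping rule.

```python
import math


def eigenvector_centrality(adjacency, iterations=100):
    """Power iteration on an undirected graph given as {node: neighbours}."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # Each node keeps its own score and adds its neighbours' scores;
        # iterating on A + I avoids oscillation on bipartite graphs.
        updated = {node: scores[node] + sum(scores[nb] for nb in neighbours)
                   for node, neighbours in adjacency.items()}
        norm = math.sqrt(sum(v * v for v in updated.values()))
        scores = {node: v / norm for node, v in updated.items()}
    return scores


# Path graph a - b - c: the middle node should come out as most central.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
scores = eigenvector_centrality(graph)
top_node = max(scores, key=scores.get)
```

Taking the maximum over the resulting scores is the "finds the highest value" step described above.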
-
get_louvain_partitions(G, weight)[source]¶ Computes the partition of the graph nodes that maximises the modularity (or tries to) using the Louvain heuristics. For more information visit https://python-louvain.readthedocs.io/en/latest/api.html.
-
get_network_communities(graph, args)[source]¶ Finds communities in a graph using different methods. For more information on the methods visit:
- Parameters
graph (graph) – networkx graph
args (dict) – config file arguments
- Returns
Dictionary of nodes and which community they belong to (from 0 to number of communities).
-
get_publications_abstracts(data, publication_col='publication', join_by=['publication', 'Proteins', 'Diseases'], index='PMID')[source]¶ Accesses NCBI PubMed over the WWW and retrieves the abstracts corresponding to a list of one or more PubMed IDs.
- Parameters
data – pandas dataframe of diseases and publications linked to a list of proteins (columns: ‘Diseases’, ‘Proteins’, ‘linkout’ and ‘publication’).
publication_col (str) – column label containing PubMed ids.
join_by (list) – column labels to be kept from the input dataframe.
index (str) – column label containing PubMed ids from the NCBI retrieved data.
- Returns
Pandas dataframe with publication information and columns ‘PMID’, ‘abstract’, ‘authors’, ‘date’, ‘journal’, ‘keywords’, ‘title’, ‘url’, ‘Proteins’ and ‘Diseases’.
Example:
result = get_publications_abstracts(data, publication_col='publication', join_by=['publication','Proteins','Diseases'], index='PMID')
-
eta_squared(aov)[source]¶ Calculates the effect size using Eta-squared.
- Parameters
aov – pandas dataframe with anova results from statsmodels.
- Returns
Pandas dataframe with additional Eta-squared column.
-
omega_squared(aov)[source]¶ Calculates the effect size using Omega-squared.
- Parameters
aov – pandas dataframe with anova results from statsmodels.
- Returns
Pandas dataframe with additional Omega-squared column.
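Both effect sizes follow directly from the sums of squares in an ANOVA table. The sketch below applies the textbook one-way formulas to plain numbers; the function `effect_sizes` and its argument names are hypothetical and do not reflect the column names of the statsmodels table the functions above expect.

```python
def effect_sizes(ss_effect, ss_error, df_effect, df_error):
    """Return (eta_squared, omega_squared) for a one-way ANOVA table."""
    ss_total = ss_effect + ss_error
    ms_error = ss_error / df_error  # mean square of the residuals
    eta_sq = ss_effect / ss_total
    # Omega-squared subtracts the variance expected by chance from the effect.
    omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
    return eta_sq, omega_sq


eta_sq, omega_sq = effect_sizes(ss_effect=20.0, ss_error=80.0,
                                df_effect=2, df_error=40)
```

Omega-squared comes out smaller than Eta-squared for the same table because it corrects for the upward bias of Eta-squared.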
-
run_two_way_anova(df, drop_cols=['sample'], subject='subject', group=['group', 'secondary_group'])[source]¶ Run a 2-way ANOVA when data[‘secondary_group’] is not empty
- Parameters
- Returns
Two dataframes, anova results and residuals.
Example:
result = run_two_way_anova(data, drop_cols=['sample'], subject='subject', group=['group', 'secondary_group'])
-
merge_for_polar(regulation_data, regulators, identifier_col='identifier', group_col='group', theta_col='modifier', aggr_func='mean', normalize=True)[source]¶
-
run_qc_markers_analysis(data, qc_markers, sample_col='sample', group_col='group', drop_cols=['subject'], identifier_col='identifier', qcidentifier_col='identifier', qcclass_col='class')[source]¶
-
get_snf_clusters(data_tuples, num_clusters=None, metric='euclidean', k=5, mu=0.5)[source]¶ Cluster samples based on Similarity Network Fusion (SNF) (ref: https://www.ncbi.nlm.nih.gov/pubmed/24464287)
- Parameters
data_tuples – list of (dataset, metric) tuples
index – how the datasets can be merged (common columns)
num_clusters – number of clusters to be identified, if None, the algorithm finds the best number based on SNF algorithm (recommended)
metric – distance metric used to calculate the sample similarity network
k – number of neighbors used to measure local affinity (KNN)
mu – normalization factor to scale similarity kernel when constructing affinity matrix
- Return tuple
1) fused_aff: affinity clustered samples, 2) fused_labels: cluster labels, 3) num_clusters: number of clusters, 4) silhouette: average silhouette score
-
run_snf(df_dict, index, num_clusters=None, distance_metric='euclidean', k_affinity=5, mu_affinity=0.5)[source]¶ Runs Similarity Network Fusion: integration of multiple omics datasets to identify similar samples (clusters) (ref: https://www.ncbi.nlm.nih.gov/pubmed/24464287). We make use of the Python version SNFpy (https://github.com/rmarkello/snfpy)
- Parameters
df_dict – dictionary of datasets to be used (i.e {‘rnaseq’: rnaseq_data, ‘proteomics’: proteomics_data})
index – how the datasets can be merged (common columns)
num_clusters – number of clusters to be identified, if None, the algorithm finds the best number based on SNF algorithm (recommended)
distance_metric – distance metric used to calculate the sample similarity network
k_affinity – number of neighbors used to measure local affinity (KNN)
mu_affinity – normalization factor to scale similarity kernel when constructing affinity matrix
- Return tuple
1) feature_df: SNF features and mutual information score (MIscore), 2) fused_aff: adjacent similarity matrix, 3)fused_labels: cluster labels, 4) silhouette: silhouette score
wgcnaAnalysis.py¶
-
get_data(data, drop_cols_exp=['subject', 'group', 'sample', 'index'], drop_cols_cli=['subject', 'group', 'biological_sample', 'index'], sd_cutoff=0)[source]¶ This function cleans up and formats experimental and clinical data into similarly shaped dataframes.
- Parameters
- Returns
Dictionary with experimental and clinical dataframes (keys are the same as in the input dictionary).
-
get_dendrogram(df, labels, distfun='euclidean', linkagefun='ward', div_clusters=False, fcluster_method='distance', fcluster_cutoff=15)[source]¶ This function calculates the distance matrix and performs hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.
- Parameters
df – pandas dataframe with samples/subjects as index and features as columns.
labels (list) – labels for the leaves of the tree.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
div_clusters (bool) – dividing dendrogram leaves into clusters (True or False).
fcluster_method (str) – criterion to use in forming flat clusters.
fcluster_cutoff (int) – maximum cophenetic distance between observations in each cluster.
- Returns
Dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’. If div_clusters is used, it will also return a dictionary of each cluster and respective leaves.
-
get_clusters_elements(linkage_matrix, fcluster_method, fcluster_cutoff, labels)[source]¶ This function implements the generation of flat clusters from a hierarchical clustering with the same interface as scipy.cluster.hierarchy.fcluster.
- Parameters
linkage_matrix (ndarray) – hierarchical clustering encoded with a linkage matrix.
fcluster_method (str) – criterion to use in forming flat clusters (‘inconsistent’, ‘distance’, ‘maxclust’, ‘monocrit’, ‘maxclust_monocrit’).
fcluster_cutoff (float) – maximum cophenetic distance between observations in each cluster.
labels (list) – labels for the leaves of the dendrogram.
- Returns
A dictionary where keys are the cluster numbers and values are the dendrogram leaves.
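The returned structure can be pictured with a small standard-library sketch. It assumes the flat cluster numbers have already been computed (for instance by scipy.cluster.hierarchy.fcluster) and only shows the grouping step; the helper `group_leaves_by_cluster` is hypothetical.

```python
from collections import defaultdict


def group_leaves_by_cluster(cluster_ids, labels):
    """Map each cluster number to the dendrogram leaves assigned to it."""
    clusters = defaultdict(list)
    # cluster_ids[i] is the flat cluster number of the leaf labels[i].
    for cluster_id, label in zip(cluster_ids, labels):
        clusters[cluster_id].append(label)
    return dict(clusters)


clusters = group_leaves_by_cluster([1, 2, 1, 3], ["s1", "s2", "s3", "s4"])
```

Leaves keep their input order within each cluster, which makes the result easy to compare against the dendrogram labels.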
-
filter_df_by_cluster(df, clusters, number)[source]¶ Select only the members of a defined cluster.
- Parameters
- Returns
Pandas dataframe with all the features (columns) and samples/subjects belonging to the defined cluster (index).
-
df_sort_by_dendrogram(df, Z_dendrogram)[source]¶ Reorders pandas dataframe by index and according to the dendrogram list of leaf nodes labels.
- Parameters
df – pandas dataframe with the labels to be reordered as index.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.
- Returns
Reordered pandas dataframe.
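A minimal sketch of the reordering, assuming the dendrogram dictionary comes from scipy.cluster.hierarchy.dendrogram and that its 'ivl' entry holds the leaf labels in plotting order. The toy dataframe is an assumption.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

df = pd.DataFrame(
    {"f1": [0.0, 5.0, 0.1, 5.1], "f2": [0.0, 5.0, 0.0, 5.0]},
    index=["a", "b", "c", "d"],
)

Z = linkage(df.values, method="average", metric="euclidean")
# no_plot=True only computes the data structures and draws nothing.
dendro = dendrogram(Z, labels=df.index.tolist(), no_plot=True)

# Reindex the rows so they follow the dendrogram leaf order.
ordered = df.reindex(dendro["ivl"])
```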
-
get_percentiles_heatmap(df, Z_dendrogram, bydendro=True, bycols=False)[source]¶ This function transforms the absolute values in each row or column (option ‘bycols’) into relative values.
- Parameters
df – pandas dataframe with samples/subjects as index and features as columns.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.
bydendro (bool) – if labels should be ordered according to dendrogram list of leaf nodes labels set to True, otherwise set to False.
bycols (bool) – set to False to calculate relative values across rows (samples), or to True to calculate them across columns (features).
- Returns
Pandas dataframe.
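The exact transformation is not spelled out above. One plausible reading, shown here purely as an assumption, is a percentile rank computed within each row (bycols=False) or within each column (bycols=True). The helper name `to_percentile_ranks` is hypothetical.

```python
import pandas as pd


def to_percentile_ranks(df, bycols=False):
    """Hypothetical helper: percentile rank within each column or within each row."""
    axis = 0 if bycols else 1
    return df.rank(axis=axis, pct=True)


df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [40.0, 30.0, 20.0, 10.0]],
    index=["s1", "s2"],
    columns=["f1", "f2", "f3", "f4"],
)

by_rows = to_percentile_ranks(df)               # ranks within each sample
by_cols = to_percentile_ranks(df, bycols=True)  # ranks within each feature
```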
-
get_miss_values_df(data)[source]¶ Processes a pandas dataframe so that missing values can be plotted in a heatmap with a specific color.
- Parameters
data – pandas dataframe.
- Returns
Pandas dataframe with missing values as integer 1, and originally valid values as NaN.
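The described output can be sketched with a boolean mask. This is an illustration under assumed toy data, not the function's actual code. Because NaN forces a floating-point dtype, the marker is stored as 1.0 in this sketch.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"f1": [1.0, np.nan], "f2": [np.nan, 3.5]},
    index=["s1", "s2"],
)

missing = data.isnull()
# Missing cells become 1.0; cells that originally held a value become NaN.
miss_values = missing.astype(float).where(missing)
```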
-
paste_matrices(matrix1, matrix2, rows, cols)[source]¶ Takes two matrices with matching shapes and concatenates each value in matrix 1 with the corresponding value in matrix 2, returning a single pandas dataframe.
- Parameters
matrix1 (ndarray) – input 1
matrix2 (ndarray) – input 2
- Returns
Pandas dataframe.
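A minimal sketch of element-wise pasting, assuming that `rows` and `cols` supply the index and column labels of the result. The helper name `paste`, the two-decimal formatting and the separator are assumptions chosen for illustration, for example to annotate a heatmap with a correlation and its p-value.

```python
import numpy as np
import pandas as pd


def paste(matrix1, matrix2, rows, cols, sep="\n"):
    """Hypothetical helper: join matching cells of two matrices into strings."""
    m1 = np.asarray(matrix1, dtype=float)
    m2 = np.asarray(matrix2, dtype=float)
    if m1.shape != m2.shape:
        raise ValueError("matrices must have the same shape")
    text = [
        [f"{a:.2f}{sep}({b:.2f})" for a, b in zip(row1, row2)]
        for row1, row2 in zip(m1, m2)
    ]
    return pd.DataFrame(text, index=rows, columns=cols)


correlations = np.array([[0.8, -0.3], [0.1, 0.5]])
pvalues = np.array([[0.01, 0.2], [0.7, 0.04]])
pasted = paste(correlations, pvalues, rows=["MEblue", "MEbrown"], cols=["age", "bmi"])
```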
-
cutreeDynamic(distmatrix, linkagefun='average', minModuleSize=50, method='hybrid', deepSplit=2, pamRespectsDendro=False, distfun=None)[source]¶ This function implements the R cutreeDynamic wrapper in Python, providing an access point for methods of adaptive branch pruning of hierarchical clustering dendrograms.
- Parameters
distmatrix – pandas dataframe.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
minModuleSize (int) – minimum module size.
method (str) – method to use (‘hybrid’ or ‘tree’).
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
- Returns
Numpy array of numerical labels giving assignment of objects to modules. Unassigned objects are labeled 0, the largest module has label 1, next largest 2 etc.
-
build_network(data, softPower=6, networkType='unsigned', linkagefun='average', method='hybrid', minModuleSize=50, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.4, verbose=0)[source]¶ Weighted gene network construction and module detection. Calculates co-expression similarity and adjacency, topological overlap matrix (TOM) and clusters features in modules.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
softPower (int) – soft-thresholding power.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
method (str) – method to use (‘hybrid’ or ‘tree’).
minModuleSize (int) – minimum module size.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
merge_modules (bool) – if True, very similar modules are merged.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
- Returns
Tuple with the TOM dissimilarity pandas dataframe and a numpy array with module colors per experimental feature.
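The adjacency and topological overlap steps can be sketched in NumPy for the 'unsigned' network type. This is an illustrative reimplementation of the standard formulas, not the code this function runs. The helper names and the random toy data are assumptions, and module detection is left out.

```python
import numpy as np
import pandas as pd


def unsigned_adjacency(data, soft_power=6):
    """Adjacency as |correlation| ** soft_power between features (columns)."""
    cor = np.corrcoef(data.values, rowvar=False)
    return np.abs(cor) ** soft_power


def tom_similarity(adjacency):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    a = adjacency.copy()
    np.fill_diagonal(a, 0.0)          # self-connections do not count
    k = a.sum(axis=1)                 # connectivity of each feature
    shared = a @ a                    # strength of shared neighbours
    tom = (shared + a) / (np.minimum.outer(k, k) + 1.0 - a)
    np.fill_diagonal(tom, 1.0)
    return tom


rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 5)), columns=[f"f{i}" for i in range(5)])

adjacency = unsigned_adjacency(data, soft_power=6)
tom = tom_similarity(adjacency)
# The dissimilarity (1 - TOM) is what gets clustered into modules.
diss_tom = pd.DataFrame(1.0 - tom, index=data.columns, columns=data.columns)
```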
-
pick_softThreshold(data, RsquaredCut=0.8, networkType='unsigned', verbose=0)[source]¶ Analysis of scale free topology for multiple soft thresholding powers. Aids the user in choosing a proper soft-thresholding power for network construction.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
RsquaredCut (float) – desired minimum scale free topology fitting index R^2.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
- Returns
Estimated appropriate soft-thresholding power: the lowest power for which the scale free topology fit R^2 exceeds RsquaredCut.
- Return type
-
identify_module_colors(matrix, linkagefun='average', method='hybrid', minModuleSize=30, deepSplit=2, pamRespectsDendro=False)[source]¶ Identifies co-expression modules and converts the numeric labels into colors.
- Parameters
matrix – dissimilarity structure as produced by R.stats dist.
minModuleSize (int) – minimum module size.
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
- Returns
Numpy array of strings with module color of each experimental feature.
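The conversion from numeric labels to colors can be sketched as a simple lookup. The palette below is an assumption (the first few entries of a WGCNA-style palette, with label 0 mapped to grey for unassigned features), and cycling through the palette for larger labels is a simplification made for this example.

```python
import numpy as np

# Assumed palette for illustration; label 0 (unassigned) maps to grey.
PALETTE = ["turquoise", "blue", "brown", "yellow", "green", "red"]


def labels_to_colors(labels):
    """Hypothetical helper: map numeric module labels to color names."""
    colors = []
    for label in labels:
        label = int(label)
        if label == 0:
            colors.append("grey")
        else:
            colors.append(PALETTE[(label - 1) % len(PALETTE)])
    return np.array(colors)


module_colors = labels_to_colors(np.array([1, 1, 2, 0, 3]))
```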
-
calculate_module_eigengenes(data, modColors, softPower=6, dissimilarity=True)[source]¶ Calculates module eigengenes to quantify co-expression similarity of entire modules.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
softPower (int) – soft-thresholding power.
dissimilarity (bool) – if True, also calculates the dissimilarity of the module eigengenes.
- Returns
Pandas dataframe with calculated module eigengenes. If dissimilarity is set to True, returns a tuple with two pandas dataframes, the first with the module eigengenes and the second with the eigengenes dissimilarity.
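A module eigengene is commonly defined as the first principal component of the standardized expression of the features in a module. The sketch below follows that definition with NumPy's SVD. It is an assumption about the underlying computation, the helper name and toy data are hypothetical, and `softPower` is not used.

```python
import numpy as np
import pandas as pd


def module_eigengenes(data, mod_colors):
    """Hypothetical helper: first principal component per module."""
    colors = np.asarray(mod_colors)
    eigengenes = {}
    for color in np.unique(colors):
        block = data.loc[:, colors == color]
        scaled = (block - block.mean()) / block.std(ddof=1)
        u, _, _ = np.linalg.svd(scaled.values, full_matrices=False)
        eigengene = u[:, 0]
        # Orient the eigengene so it follows the module's average expression.
        if np.corrcoef(eigengene, scaled.mean(axis=1))[0, 1] < 0:
            eigengene = -eigengene
        eigengenes["ME" + str(color)] = eigengene
    return pd.DataFrame(eigengenes, index=data.index)


rng = np.random.default_rng(1)
signal_a = rng.normal(size=30)
signal_b = rng.normal(size=30)
data = pd.DataFrame({
    "a1": signal_a + 0.1 * rng.normal(size=30),
    "a2": signal_a + 0.1 * rng.normal(size=30),
    "a3": signal_a + 0.1 * rng.normal(size=30),
    "b1": signal_b + 0.1 * rng.normal(size=30),
    "b2": signal_b + 0.1 * rng.normal(size=30),
})
mod_colors = np.array(["blue", "blue", "blue", "brown", "brown"])

MEs = module_eigengenes(data, mod_colors)
ME_diss = 1 - MEs.corr()  # eigengene dissimilarity as 1 - correlation
```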
-
merge_similar_modules(data, modColors, MEDissThres=0.4, verbose=0)[source]¶ Merges modules in co-expression network that are too close as measured by the correlation of their eigengenes.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
- Returns
Tuple containing a pandas dataframe with the eigengenes of the new merged modules, and an array with the module colors of each experimental feature.
-
calculate_ModuleTrait_correlation(df_exp, df_traits, MEs)[source]¶ Correlates eigengenes with external traits in order to identify the most significant module-trait associations.
- Parameters
df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
MEs – pandas dataframe with module eigengenes.
- Returns
Tuple with two pandas dataframes: first, the correlation between all module eigengenes and all clinical traits; second, a dataframe with concatenated correlation and p-value, used for heatmap annotation.
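The module-trait correlation step can be sketched with scipy.stats.pearsonr, assuming Pearson correlation and complete data. The helper name `correlate_frames` and the random toy inputs are assumptions. The sketch returns the p-values as a separate table.

```python
import numpy as np
import pandas as pd
from scipy import stats


def correlate_frames(left, right):
    """Hypothetical helper: Pearson r and p-value for every column pair."""
    cor = pd.DataFrame(index=left.columns, columns=right.columns, dtype=float)
    pval = cor.copy()
    for left_name in left.columns:
        for right_name in right.columns:
            r, p = stats.pearsonr(left[left_name], right[right_name])
            cor.loc[left_name, right_name] = r
            pval.loc[left_name, right_name] = p
    return cor, pval


rng = np.random.default_rng(2)
samples = [f"s{i}" for i in range(12)]
MEs = pd.DataFrame(rng.normal(size=(12, 2)), index=samples, columns=["MEblue", "MEbrown"])
df_traits = pd.DataFrame(
    {"age": rng.normal(50, 10, size=12), "bmi": rng.normal(25, 3, size=12)},
    index=samples,
)

cor, pval = correlate_frames(MEs, df_traits)
```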
-
calculate_ModuleMembership(data, MEs)[source]¶ For each module, calculates the correlation of the module eigengene and the feature expression profile (quantitative measure of module membership (MM)).
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
MEs – pandas dataframe with module eigengenes.
- Returns
Tuple with two pandas dataframes, one with module membership correlations and another with p-values.
-
calculate_FeatureTraitSignificance(df_exp, df_traits)[source]¶ Quantifies associations of individual experimental features with the measured clinical traits, by defining Feature Significance (FS) as the absolute value of the correlation between the feature and the trait.
- Parameters
df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
- Returns
Tuple with two pandas dataframes, one with feature significance correlations and another with p-values.
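Feature significance for all feature-trait pairs can be computed at once with matrix operations. The sketch assumes Pearson correlation, no missing values, and a Student t-test for the p-values. The helper name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def feature_trait_significance(df_exp, df_traits):
    """Hypothetical helper: |Pearson r| and two-sided Student p-values."""
    n = df_exp.shape[0]
    x = (df_exp - df_exp.mean()) / df_exp.std(ddof=1)
    y = (df_traits - df_traits.mean()) / df_traits.std(ddof=1)
    cor = x.T.dot(y) / (n - 1)  # features x traits correlation matrix
    t_stat = cor.values * np.sqrt((n - 2) / (1.0 - cor.values ** 2))
    p_values = 2.0 * stats.t.sf(np.abs(t_stat), df=n - 2)
    pvalues = pd.DataFrame(p_values, index=cor.index, columns=cor.columns)
    return cor.abs(), pvalues


rng = np.random.default_rng(4)
samples = [f"s{i}" for i in range(15)]
df_exp = pd.DataFrame(rng.normal(size=(15, 3)), index=samples, columns=["f1", "f2", "f3"])
df_traits = pd.DataFrame(rng.normal(size=(15, 2)), index=samples, columns=["age", "bmi"])

FS, FS_pvalues = feature_trait_significance(df_exp, df_traits)
```

Both dataframes must share the same sample index, because the dot product aligns on it.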
-
get_FeaturesPerModule(data, modColors, mode='dictionary')[source]¶ Groups all experimental features by the co-expression module they belong to.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
mode (str) – type of the value returned by the function (‘dictionary’ or ‘dataframe’).
- Returns
Depending on selected mode, returns a dictionary or dataframe with module color per experimental feature.
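The two output shapes can be sketched with pandas. The column names 'name' and 'modColor', and the choice to key the dictionary by module color, are assumptions. The features are taken to be the columns of the experimental dataframe, in the same order as the color array.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.zeros((2, 4)), columns=["P1", "P2", "P3", "P4"])
mod_colors = np.array(["blue", "brown", "blue", "grey"])

# 'dataframe'-like view: one row per feature with its module color.
as_frame = pd.DataFrame({"name": data.columns, "modColor": mod_colors})

# 'dictionary'-like view: module color mapped to the list of its features.
as_dict = as_frame.groupby("modColor")["name"].apply(list).to_dict()
```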
-
get_ModuleFeatures(data, modColors, modules=[])[source]¶ Groups and returns a list of the experimental features clustered in specific co-expression modules.
- Parameters
data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
modules (list) – list of module colors of interest.
- Returns
List of lists with experimental features in each selected module.
-
get_EigengenesTrait_correlation(MEs, data)[source]¶ Eigengenes are used as representative profiles of the co-expression modules, and correlation between them is used to quantify module similarity. Clinical traits are added to the eigengenes to see how the traits fit into the eigengene network.
- Parameters
MEs – pandas dataframe with module eigengenes.
data – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
- Returns
Tuple with two pandas dataframes, one with the module eigengene dissimilarity recalculated for features and traits, and another with all the overall correlations.
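Adding traits to the eigengene network amounts to joining the two tables on the sample index and correlating all columns. The sketch below assumes Pearson correlation and a dissimilarity of 1 - correlation. The toy inputs are random and purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
samples = [f"s{i}" for i in range(10)]
MEs = pd.DataFrame(rng.normal(size=(10, 2)), index=samples, columns=["MEblue", "MEbrown"])
traits = pd.DataFrame({"age": rng.normal(50, 10, size=10)}, index=samples)

# Eigengenes and traits side by side, aligned on the sample index.
combined = MEs.join(traits)

correlation = combined.corr()      # overall correlations
dissimilarity = 1 - correlation    # can be clustered to place traits among modules
```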