import logging
import pickle
import sys
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Add project root to path for imports
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
sys.path.insert(0, str(project_root))
from src.eda import (
compute_hotelling_t2,
dataset_summary,
perform_pca,
plot_explained_variance,
plot_hotelling_t2,
plot_intensity_distribution,
plot_pca_scores,
plot_pca_scores_interactive,
)
# Configure logging
logging.basicConfig(level=logging.INFO, format="%(name)s — %(message)s")
# Plot style
sns.set_theme(style="whitegrid", palette="colorblind")
plt.rcParams["figure.dpi"] = 1203 Exploratory Data Analysis for Metabolomics
PCA, outlier detection, and quality control for LC-MS diphenhydramine pharmacokinetics data
3.1 Learning Objectives
- Summarize preprocessed metabolomics datasets with key quality metrics
- Visualize intensity distributions to detect batch effects or outliers
- Perform PCA for quality control and pattern discovery
- Interpret PCA score plots colored by experimental factors (sample type, subject, timepoint)
- Use Hotelling’s T2 statistic for multivariate outlier detection
- Compare forearm and forehead sampling sites systematically
3.2 Background
Before building any predictive or classification model, exploratory data analysis (EDA) is essential to:
- Summarize the dataset dimensions, intensity ranges, and experimental design
- Detect issues such as batch effects, outliers, or systematic biases
- Discover patterns in the data that inform downstream modeling choices
For metabolomics data, PCA is the workhorse of EDA. It reduces hundreds of features to a few principal components that capture the dominant sources of variation. By coloring PCA score plots with experimental metadata, we can assess whether biological factors (drug treatment, subject, timepoint) drive the variation or whether technical artifacts dominate.
Hotelling’s T2 provides a formal multivariate outlier detection criterion based on the PCA model, complementing visual inspection of score plots.
3.3 Setup
# Load preprocessed data from Phase 2
pickle_path = project_root / "data" / "preprocessing_result.pkl"
with open(pickle_path, "rb") as f:
result = pickle.load(f)
# Extract datasets
X_forearm = result["forearm"]["X_processed"]
X_forehead = result["forehead"]["X_processed"]
meta_forearm = result["forearm"]["metadata"]
meta_forehead = result["forehead"]["metadata"]
feat_forearm = result["forearm"]["feature_metadata"]
feat_forehead = result["forehead"]["feature_metadata"]
print(f"Forearm: {X_forearm.shape[0]} samples x {X_forearm.shape[1]} features")
print(f"Forehead: {X_forehead.shape[0]} samples x {X_forehead.shape[1]} features")Forearm: 84 samples x 4195 features
Forehead: 84 samples x 2924 features
3.4 Dataset Summary
A quick overview of both datasets helps verify that preprocessing completed correctly and highlights key characteristics.
summary_forearm = dataset_summary(X_forearm, meta_forearm, feat_forearm)
summary_forehead = dataset_summary(X_forehead, meta_forehead, feat_forehead)| Loading ITables v2.7.0 from the internet... (need help?) |
3.5 Intensity Distributions
Total absolute intensity per sample reveals whether any samples have unusually low or high overall signal, which could indicate injection failures, dilution errors, or contamination.
fig, axes = plt.subplots(2, 1, figsize=(7, 12))
plot_intensity_distribution(X_forearm, meta_forearm, group_col="sample_type", ax=axes[0])
axes[0].set_title("Forearm — Intensity by Sample Type")
plot_intensity_distribution(X_forehead, meta_forehead, group_col="sample_type", ax=axes[1])
axes[1].set_title("Forehead — Intensity by Sample Type")
plt.tight_layout()
plt.show()Total absolute intensity per sample by sample type (a) forearm and (b) forehead.
- QC samples should cluster tightly (low variance) — they are repeated injections of the same pooled sample and serve as a measure of analytical reproducibility
- Study samples naturally show more variation due to biological differences
- Large differences between QC and study medians may indicate normalization issues
3.6 PCA Quality Control
3.6.1 Scree Plot
The scree plot shows how much variance each principal component explains. A steep drop after the first few components suggests strong structure in the data.
# Fit PCA on both datasets
pca_forearm = perform_pca(X_forearm, n_components=10)
pca_forehead = perform_pca(X_forehead, n_components=10)# First plot - Forearm
ax = plot_explained_variance(pca_forearm)
ax.set_title("Forearm — Scree Plot")
plt.show()
# Second plot - Forehead
ax = plot_explained_variance(pca_forehead)
ax.set_title("Forehead — Scree Plot")
plt.show()3.6.2 Score Plot — Sample Type
Coloring by sample type reveals whether QC samples cluster (good analytical precision) and whether study samples separate from QCs (expected biological variation).
from plotly.subplots import make_subplots
fig_forearm = plot_pca_scores_interactive(
pca_forearm, meta_forearm, color_by="sample_type"
)
fig_forearm.update_layout(title="Forearm — by Sample Type")
fig_forearm.show()
fig_forehead = plot_pca_scores_interactive(
pca_forehead, meta_forehead, color_by="sample_type"
)
fig_forehead.update_layout(title="Forehead — by Sample Type")
fig_forehead.show()3.6.3 Score Plot — Subject
Subject-level coloring reveals inter-individual variation, which is a dominant source of variance in pharmacokinetics studies.
fig_forearm = plot_pca_scores_interactive(
pca_forearm, meta_forearm, color_by="subject"
)
fig_forearm.update_layout(title="Forearm — by Subject")
fig_forearm.show()
fig_forehead = plot_pca_scores_interactive(
pca_forehead, meta_forehead, color_by="subject"
)
fig_forehead.update_layout(title="Forehead — by Subject")
fig_forehead.show()3.6.4 Score Plot — Timepoint
Timepoint coloring shows whether the drug treatment creates a temporal trajectory in the metabolic profile. Timepoints are converted to strings for categorical coloring.
fig_forearm = plot_pca_scores_interactive(
pca_forearm, meta_forearm, color_by="timepoint"
)
fig_forearm.update_layout(title="Forearm — by Timepoint")
fig_forearm.show()
fig_forehead = plot_pca_scores_interactive(
pca_forehead, meta_forehead, color_by="timepoint"
)
fig_forehead.update_layout(title="Forehead — by Timepoint")
fig_forehead.show()- Tight QC cluster: Good analytical reproducibility — instrument drift is minimal
- Group separation: Samples separate by biological factor (subject, timepoint) — the factor drives metabolic variation
- Outliers far from center: Samples with unusual metabolic profiles — investigate with Hotelling’s T2 below
- Percentage on axes: Higher % means the component captures more variation. PC1 > PC2 by definition
- Reference lines at 0: Help identify which quadrant samples fall in, useful for interpreting loadings
3.7 Outlier Detection
Hotelling’s T2 is a multivariate distance metric based on the PCA model. Samples with T2 values exceeding the threshold are potential outliers that warrant investigation.
# Compute T2 using the first 5 components
t2_forearm = compute_hotelling_t2(pca_forearm["scores"], n_components=5)
t2_forehead = compute_hotelling_t2(pca_forehead["scores"], n_components=5)# First plot - Forearm
ax = plot_hotelling_t2(
t2_forearm["t2_scores"],
t2_forearm["t2_limit"],
sample_names=result["forearm"]["sample_names"],
)
ax.set_title("Forearm — Hotelling's T$^2$")
plt.show()
# Second plot - Forehead
ax = plot_hotelling_t2(
t2_forehead["t2_scores"],
t2_forehead["t2_limit"],
sample_names=result["forehead"]["sample_names"],
)
ax.set_title("Forehead — Hotelling's T$^2$")
plt.show()Forearm: 10 outlier(s) detected. T² limit: 5.25, threshold (2×): 10.51.
| Loading ITables v2.7.0 from the internet... (need help?) |
Forehead: 11 outlier(s) detected. T² limit: 5.25, threshold (2×): 10.51.
| Loading ITables v2.7.0 from the internet... (need help?) |
Outliers detected by Hotelling’s T2 are not automatically removed. Before excluding any sample, consider:
- Is it a QC sample? QC outliers indicate instrument issues at that injection point
- Is it a study sample? The outlier may reflect genuine biological variation (e.g., a strong drug responder)
- Does it appear in both forearm and forehead? Consistent outliers across sites suggest a subject-level issue, not a site-specific artifact
- Check the original chromatograms if possible to rule out injection failures
For this tutorial, we report outliers but do not remove them. Removal decisions should be made with domain knowledge and documented with justification.
3.8 Summary
| Loading ITables v2.7.0 from the internet... (need help?) |
- PCA captures the dominant structure in the data with the first few components explaining a substantial fraction of total variance
- QC samples provide a benchmark for analytical reproducibility — their clustering in PCA score plots confirms consistent instrument performance
- Subject and timepoint effects are visible in the PCA, supporting the pharmacokinetics study design where metabolic profiles change over time after drug application
- Hotelling’s T2 identifies samples that deviate multivariate from the population — these warrant investigation but are not automatically excluded
With EDA complete, the data is ready for:
- Supervised modeling (PLS-DA, OPLS-DA) to identify features that discriminate between timepoints or treatment conditions
- Biomarker discovery using variable importance in projection (VIP) scores
- Pathway analysis to connect discriminating features to biological pathways