3  Exploratory Data Analysis for Metabolomics

PCA, outlier detection, and quality control for LC-MS diphenhydramine pharmacokinetics data

Author

Alban Ott (Claude Code CLI)

Published

February 9, 2026

3.1 Learning Objectives

NoteWhat you will learn
  • Summarize preprocessed metabolomics datasets with key quality metrics
  • Visualize intensity distributions to detect batch effects or outliers
  • Perform PCA for quality control and pattern discovery
  • Interpret PCA score plots colored by experimental factors (sample type, subject, timepoint)
  • Use Hotelling’s T2 statistic for multivariate outlier detection
  • Compare forearm and forehead sampling sites systematically

3.2 Background

Before building any predictive or classification model, exploratory data analysis (EDA) is essential to:

  1. Summarize the dataset dimensions, intensity ranges, and experimental design
  2. Detect issues such as batch effects, outliers, or systematic biases
  3. Discover patterns in the data that inform downstream modeling choices

For metabolomics data, PCA is the workhorse of EDA. It reduces hundreds of features to a few principal components that capture the dominant sources of variation. By coloring PCA score plots with experimental metadata, we can assess whether biological factors (drug treatment, subject, timepoint) drive the variation or whether technical artifacts dominate.

Hotelling’s T2 provides a formal multivariate outlier detection criterion based on the PCA model, complementing visual inspection of score plots.

3.3 Setup

import logging
import pickle
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Add project root to path for imports
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.eda import (
    compute_hotelling_t2,
    dataset_summary,
    perform_pca,
    plot_explained_variance,
    plot_hotelling_t2,
    plot_intensity_distribution,
    plot_pca_scores,
    plot_pca_scores_interactive,
)

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(name)s%(message)s")

# Plot style
sns.set_theme(style="whitegrid", palette="colorblind")
plt.rcParams["figure.dpi"] = 120
# Load preprocessed data from Phase 2
pickle_path = project_root / "data" / "preprocessing_result.pkl"
with open(pickle_path, "rb") as f:
    result = pickle.load(f)

# Extract datasets
X_forearm = result["forearm"]["X_processed"]
X_forehead = result["forehead"]["X_processed"]
meta_forearm = result["forearm"]["metadata"]
meta_forehead = result["forehead"]["metadata"]
feat_forearm = result["forearm"]["feature_metadata"]
feat_forehead = result["forehead"]["feature_metadata"]

print(f"Forearm:  {X_forearm.shape[0]} samples x {X_forearm.shape[1]} features")
print(f"Forehead: {X_forehead.shape[0]} samples x {X_forehead.shape[1]} features")
Forearm:  84 samples x 4195 features
Forehead: 84 samples x 2924 features

3.4 Dataset Summary

A quick overview of both datasets helps verify that preprocessing completed correctly and highlights key characteristics.

summary_forearm = dataset_summary(X_forearm, meta_forearm, feat_forearm)
summary_forehead = dataset_summary(X_forehead, meta_forehead, feat_forehead)
Table 3.1: Dataset summary for forearm and forehead sampling sites.
Loading ITables v2.7.0 from the internet... (need help?)

3.5 Intensity Distributions

Total absolute intensity per sample reveals whether any samples have unusually low or high overall signal, which could indicate injection failures, dilution errors, or contamination.

fig, axes = plt.subplots(2, 1, figsize=(7, 12))

plot_intensity_distribution(X_forearm, meta_forearm, group_col="sample_type", ax=axes[0])
axes[0].set_title("Forearm — Intensity by Sample Type")

plot_intensity_distribution(X_forehead, meta_forehead, group_col="sample_type", ax=axes[1])
axes[1].set_title("Forehead — Intensity by Sample Type")

plt.tight_layout()
plt.show()
Figure 3.1: forearm

Total absolute intensity per sample by sample type (a) forearm and (b) forehead.

TipReading intensity box plots
  • QC samples should cluster tightly (low variance) — they are repeated injections of the same pooled sample and serve as a measure of analytical reproducibility
  • Study samples naturally show more variation due to biological differences
  • Large differences between QC and study medians may indicate normalization issues

3.6 PCA Quality Control

3.6.1 Scree Plot

The scree plot shows how much variance each principal component explains. A steep drop after the first few components suggests strong structure in the data.

# Fit PCA on both datasets
pca_forearm = perform_pca(X_forearm, n_components=10)
pca_forehead = perform_pca(X_forehead, n_components=10)
# First plot - Forearm
ax = plot_explained_variance(pca_forearm)
ax.set_title("Forearm — Scree Plot")
plt.show()

# Second plot - Forehead
ax = plot_explained_variance(pca_forehead)
ax.set_title("Forehead — Scree Plot")
plt.show()
(a) forearm
(b) forehead
Figure 3.2: Scree plots showing individual (bars) and cumulative (line) explained variance (a) forearm and (b) forehead.

3.6.2 Score Plot — Sample Type

Coloring by sample type reveals whether QC samples cluster (good analytical precision) and whether study samples separate from QCs (expected biological variation).

from plotly.subplots import make_subplots

fig_forearm = plot_pca_scores_interactive(
  pca_forearm, meta_forearm, color_by="sample_type"
)
fig_forearm.update_layout(title="Forearm — by Sample Type")
fig_forearm.show()

fig_forehead = plot_pca_scores_interactive(
  pca_forehead, meta_forehead, color_by="sample_type"
)
fig_forehead.update_layout(title="Forehead — by Sample Type")
fig_forehead.show()
(a) forearm
(b) forehead
Figure 3.3: PCA score plots (PC1 vs PC2) colored by sample type (a) forearm and (b) forehead.

3.6.3 Score Plot — Subject

Subject-level coloring reveals inter-individual variation, which is a dominant source of variance in pharmacokinetics studies.

fig_forearm = plot_pca_scores_interactive(
    pca_forearm, meta_forearm, color_by="subject"
)
fig_forearm.update_layout(title="Forearm — by Subject")
fig_forearm.show()

fig_forehead = plot_pca_scores_interactive(
    pca_forehead, meta_forehead, color_by="subject"
)
fig_forehead.update_layout(title="Forehead — by Subject")
fig_forehead.show()
(a) forearm
(b) forehead
Figure 3.4: PCA score plots (PC1 vs PC2) colored by subject (a) forearm and (b) forehead.

3.6.4 Score Plot — Timepoint

Timepoint coloring shows whether the drug treatment creates a temporal trajectory in the metabolic profile. Timepoints are converted to strings for categorical coloring.

fig_forearm = plot_pca_scores_interactive(
    pca_forearm, meta_forearm, color_by="timepoint"
)
fig_forearm.update_layout(title="Forearm — by Timepoint")
fig_forearm.show()

fig_forehead = plot_pca_scores_interactive(
    pca_forehead, meta_forehead, color_by="timepoint"
)
fig_forehead.update_layout(title="Forehead — by Timepoint")
fig_forehead.show()
(a) forearm
(b) forehead
Figure 3.5: PCA score plots (PC1 vs PC2) colored by timepoint (a) forearm and (b) forehead.
TipReading PCA score plots
  • Tight QC cluster: Good analytical reproducibility — instrument drift is minimal
  • Group separation: Samples separate by biological factor (subject, timepoint) — the factor drives metabolic variation
  • Outliers far from center: Samples with unusual metabolic profiles — investigate with Hotelling’s T2 below
  • Percentage on axes: Higher % means the component captures more variation. PC1 > PC2 by definition
  • Reference lines at 0: Help identify which quadrant samples fall in, useful for interpreting loadings

3.7 Outlier Detection

Hotelling’s T2 is a multivariate distance metric based on the PCA model. Samples with T2 values exceeding the threshold are potential outliers that warrant investigation.

# Compute T2 using the first 5 components
t2_forearm = compute_hotelling_t2(pca_forearm["scores"], n_components=5)
t2_forehead = compute_hotelling_t2(pca_forehead["scores"], n_components=5)
# First plot - Forearm
ax = plot_hotelling_t2(
    t2_forearm["t2_scores"],
    t2_forearm["t2_limit"],
    sample_names=result["forearm"]["sample_names"],
)
ax.set_title("Forearm — Hotelling's T$^2$")
plt.show()

# Second plot - Forehead
ax = plot_hotelling_t2(
    t2_forehead["t2_scores"],
    t2_forehead["t2_limit"],
    sample_names=result["forehead"]["sample_names"],
)
ax.set_title("Forehead — Hotelling's T$^2$")
plt.show()
(a) forearm
(b) forehead
Figure 3.6: Hotelling’s T2 outlier detection (a) forearm and (b) forehead. Red bars exceed the 2x threshold.

Forearm: 10 outlier(s) detected. T² limit: 5.25, threshold (2×): 10.51.

Table 3.2
Loading ITables v2.7.0 from the internet... (need help?)

Forehead: 11 outlier(s) detected. T² limit: 5.25, threshold (2×): 10.51.

Table 3.3
Loading ITables v2.7.0 from the internet... (need help?)
WarningHandling outliers

Outliers detected by Hotelling’s T2 are not automatically removed. Before excluding any sample, consider:

  1. Is it a QC sample? QC outliers indicate instrument issues at that injection point
  2. Is it a study sample? The outlier may reflect genuine biological variation (e.g., a strong drug responder)
  3. Does it appear in both forearm and forehead? Consistent outliers across sites suggest a subject-level issue, not a site-specific artifact
  4. Check the original chromatograms if possible to rule out injection failures

For this tutorial, we report outliers but do not remove them. Removal decisions should be made with domain knowledge and documented with justification.

3.8 Summary

Table 3.4: Comparison of forearm and forehead datasets after EDA.
Loading ITables v2.7.0 from the internet... (need help?)
NoteKey findings
  • PCA captures the dominant structure in the data with the first few components explaining a substantial fraction of total variance
  • QC samples provide a benchmark for analytical reproducibility — their clustering in PCA score plots confirms consistent instrument performance
  • Subject and timepoint effects are visible in the PCA, supporting the pharmacokinetics study design where metabolic profiles change over time after drug application
  • Hotelling’s T2 identifies samples that deviate multivariate from the population — these warrant investigation but are not automatically excluded
TipNext steps

With EDA complete, the data is ready for:

  1. Supervised modeling (PLS-DA, OPLS-DA) to identify features that discriminate between timepoints or treatment conditions
  2. Biomarker discovery using variable importance in projection (VIP) scores
  3. Pathway analysis to connect discriminating features to biological pathways