3 Exploratory Data Analysis for Metabolomics

PCA, outlier detection, and quality control for LC-MS diphenhydramine pharmacokinetics data

Author

Alban Ott (Claude Code CLI)

Published

February 9, 2026

3.1 Learning Objectives

What you will learn

Summarize preprocessed metabolomics datasets with key quality metrics
Visualize intensity distributions to detect batch effects or outliers
Perform PCA for quality control and pattern discovery
Interpret PCA score plots colored by experimental factors (sample type, subject, timepoint)
Use Hotelling’s T2 statistic for multivariate outlier detection
Compare forearm and forehead sampling sites systematically

3.2 Background

Before building any predictive or classification model, exploratory data analysis (EDA) is essential to:

Summarize the dataset dimensions, intensity ranges, and experimental design
Detect issues such as batch effects, outliers, or systematic biases
Discover patterns in the data that inform downstream modeling choices

For metabolomics data, PCA is the workhorse of EDA. It reduces hundreds of features to a few principal components that capture the dominant sources of variation. By coloring PCA score plots with experimental metadata, we can assess whether biological factors (drug treatment, subject, timepoint) drive the variation or whether technical artifacts dominate.

Hotelling’s T2 provides a formal multivariate outlier detection criterion based on the PCA model, complementing visual inspection of score plots.

3.3 Setup

import logging
import pickle
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Add project root to path for imports
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.eda import (
    compute_hotelling_t2,
    dataset_summary,
    perform_pca,
    plot_explained_variance,
    plot_hotelling_t2,
    plot_intensity_distribution,
    plot_pca_scores,
    plot_pca_scores_interactive,
)

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(name)s — %(message)s")

# Plot style
sns.set_theme(style="whitegrid", palette="colorblind")
plt.rcParams["figure.dpi"] = 120

# Load preprocessed data from Phase 2
pickle_path = project_root / "data" / "preprocessing_result.pkl"
with open(pickle_path, "rb") as f:
    result = pickle.load(f)

# Extract datasets
X_forearm = result["forearm"]["X_processed"]
X_forehead = result["forehead"]["X_processed"]
meta_forearm = result["forearm"]["metadata"]
meta_forehead = result["forehead"]["metadata"]
feat_forearm = result["forearm"]["feature_metadata"]
feat_forehead = result["forehead"]["feature_metadata"]

print(f"Forearm:  {X_forearm.shape[0]} samples x {X_forearm.shape[1]} features")
print(f"Forehead: {X_forehead.shape[0]} samples x {X_forehead.shape[1]} features")

Forearm:  84 samples x 4195 features
Forehead: 84 samples x 2924 features

3.4 Dataset Summary

A quick overview of both datasets helps verify that preprocessing completed correctly and highlights key characteristics.

summary_forearm = dataset_summary(X_forearm, meta_forearm, feat_forearm)
summary_forehead = dataset_summary(X_forehead, meta_forehead, feat_forehead)

Table 3.1: Dataset summary for forearm and forehead sampling sites.

Loading ITables v2.7.0 from the internet... (need help?)

3.5 Intensity Distributions

Total absolute intensity per sample reveals whether any samples have unusually low or high overall signal, which could indicate injection failures, dilution errors, or contamination.

fig, axes = plt.subplots(2, 1, figsize=(7, 12))

plot_intensity_distribution(X_forearm, meta_forearm, group_col="sample_type", ax=axes[0])
axes[0].set_title("Forearm — Intensity by Sample Type")

plot_intensity_distribution(X_forehead, meta_forehead, group_col="sample_type", ax=axes[1])
axes[1].set_title("Forehead — Intensity by Sample Type")

plt.tight_layout()
plt.show()

Total absolute intensity per sample by sample type (a) forearm and (b) forehead.

Reading intensity box plots

QC samples should cluster tightly (low variance) — they are repeated injections of the same pooled sample and serve as a measure of analytical reproducibility
Study samples naturally show more variation due to biological differences
Large differences between QC and study medians may indicate normalization issues

3.6 PCA Quality Control

3.6.1 Scree Plot

The scree plot shows how much variance each principal component explains. A steep drop after the first few components suggests strong structure in the data.

# Fit PCA on both datasets
pca_forearm = perform_pca(X_forearm, n_components=10)
pca_forehead = perform_pca(X_forehead, n_components=10)

# First plot - Forearm
ax = plot_explained_variance(pca_forearm)
ax.set_title("Forearm — Scree Plot")
plt.show()

# Second plot - Forehead
ax = plot_explained_variance(pca_forehead)
ax.set_title("Forehead — Scree Plot")
plt.show()

3.6.2 Score Plot — Sample Type

Coloring by sample type reveals whether QC samples cluster (good analytical precision) and whether study samples separate from QCs (expected biological variation).

from plotly.subplots import make_subplots

fig_forearm = plot_pca_scores_interactive(
  pca_forearm, meta_forearm, color_by="sample_type"
)
fig_forearm.update_layout(title="Forearm — by Sample Type")
fig_forearm.show()

fig_forehead = plot_pca_scores_interactive(
  pca_forehead, meta_forehead, color_by="sample_type"
)
fig_forehead.update_layout(title="Forehead — by Sample Type")
fig_forehead.show()

(a) forearm

(b) forehead

Figure 3.3: PCA score plots (PC1 vs PC2) colored by sample type (a) forearm and (b) forehead.

3.6.3 Score Plot — Subject

Subject-level coloring reveals inter-individual variation, which is a dominant source of variance in pharmacokinetics studies.

fig_forearm = plot_pca_scores_interactive(
    pca_forearm, meta_forearm, color_by="subject"
)
fig_forearm.update_layout(title="Forearm — by Subject")
fig_forearm.show()

fig_forehead = plot_pca_scores_interactive(
    pca_forehead, meta_forehead, color_by="subject"
)
fig_forehead.update_layout(title="Forehead — by Subject")
fig_forehead.show()

(a) forearm

(b) forehead

Figure 3.4: PCA score plots (PC1 vs PC2) colored by subject (a) forearm and (b) forehead.

3.6.4 Score Plot — Timepoint

Timepoint coloring shows whether the drug treatment creates a temporal trajectory in the metabolic profile. Timepoints are converted to strings for categorical coloring.

fig_forearm = plot_pca_scores_interactive(
    pca_forearm, meta_forearm, color_by="timepoint"
)
fig_forearm.update_layout(title="Forearm — by Timepoint")
fig_forearm.show()

fig_forehead = plot_pca_scores_interactive(
    pca_forehead, meta_forehead, color_by="timepoint"
)
fig_forehead.update_layout(title="Forehead — by Timepoint")
fig_forehead.show()

(a) forearm

(b) forehead

Figure 3.5: PCA score plots (PC1 vs PC2) colored by timepoint (a) forearm and (b) forehead.

Reading PCA score plots

Tight QC cluster: Good analytical reproducibility — instrument drift is minimal
Group separation: Samples separate by biological factor (subject, timepoint) — the factor drives metabolic variation
Outliers far from center: Samples with unusual metabolic profiles — investigate with Hotelling’s T2 below
Percentage on axes: Higher % means the component captures more variation. PC1 > PC2 by definition
Reference lines at 0: Help identify which quadrant samples fall in, useful for interpreting loadings

3.7 Outlier Detection

Hotelling’s T2 is a multivariate distance metric based on the PCA model. Samples with T2 values exceeding the threshold are potential outliers that warrant investigation.

# Compute T2 using the first 5 components
t2_forearm = compute_hotelling_t2(pca_forearm["scores"], n_components=5)
t2_forehead = compute_hotelling_t2(pca_forehead["scores"], n_components=5)

# First plot - Forearm
ax = plot_hotelling_t2(
    t2_forearm["t2_scores"],
    t2_forearm["t2_limit"],
    sample_names=result["forearm"]["sample_names"],
)
ax.set_title("Forearm — Hotelling's T$^2$")
plt.show()

# Second plot - Forehead
ax = plot_hotelling_t2(
    t2_forehead["t2_scores"],
    t2_forehead["t2_limit"],
    sample_names=result["forehead"]["sample_names"],
)
ax.set_title("Forehead — Hotelling's T$^2$")
plt.show()

Forearm: 10 outlier(s) detected. T² limit: 5.25, threshold (2×): 10.51.

Table 3.2

Loading ITables v2.7.0 from the internet... (need help?)

Forehead: 11 outlier(s) detected. T² limit: 5.25, threshold (2×): 10.51.

Table 3.3

Loading ITables v2.7.0 from the internet... (need help?)

Handling outliers

Outliers detected by Hotelling’s T2 are not automatically removed. Before excluding any sample, consider:

Is it a QC sample? QC outliers indicate instrument issues at that injection point
Is it a study sample? The outlier may reflect genuine biological variation (e.g., a strong drug responder)
Does it appear in both forearm and forehead? Consistent outliers across sites suggest a subject-level issue, not a site-specific artifact
Check the original chromatograms if possible to rule out injection failures

For this tutorial, we report outliers but do not remove them. Removal decisions should be made with domain knowledge and documented with justification.

3.8 Summary

Table 3.4: Comparison of forearm and forehead datasets after EDA.

Loading ITables v2.7.0 from the internet... (need help?)

Key findings

PCA captures the dominant structure in the data with the first few components explaining a substantial fraction of total variance
QC samples provide a benchmark for analytical reproducibility — their clustering in PCA score plots confirms consistent instrument performance
Subject and timepoint effects are visible in the PCA, supporting the pharmacokinetics study design where metabolic profiles change over time after drug application
Hotelling’s T2 identifies samples that deviate multivariate from the population — these warrant investigation but are not automatically excluded

Next steps

With EDA complete, the data is ready for:

Supervised modeling (PLS-DA, OPLS-DA) to identify features that discriminate between timepoints or treatment conditions
Biomarker discovery using variable importance in projection (VIP) scores
Pathway analysis to connect discriminating features to biological pathways