2 Metabolomics Data Preprocessing

Step-by-step tutorial for LC-MS diphenhydramine pharmacokinetics data

Author

Alban Ott (Claude Code CLI)

Published

February 8, 2026

2.1 Learning Objectives

What you will learn

Load and inspect GNPS quantification tables from LC-MS metabolomics experiments
Understand the structure of untargeted metabolomics data (features x samples)
Apply a complete preprocessing pipeline: blank filtering, zero handling, PQN normalization, Pareto scaling
Evaluate the effect of each preprocessing step through diagnostic visualizations
Run the full automated pipeline on multiple datasets

2.2 Background

This tutorial uses data from a diphenhydramine pharmacokinetics study where LC-MS metabolomics was used to monitor drug absorption through both plasma and skin sampling. Samples were collected from forearm and forehead skin sites at multiple timepoints after topical application.

The raw LC-MS data was processed through GNPS (Global Natural Products Social Molecular Networking) to produce feature quantification tables. Each row is a feature (defined by m/z and retention time), and each column is a sample with its measured peak area.

Before any statistical analysis or modeling, we need to preprocess the data to:

Remove contaminant features that appear primarily in blank samples
Handle missing values (zeros) from features below detection limits
Normalize for unwanted variation in sample concentration (dilution effects)
Scale features so that high-abundance metabolites don’t dominate the analysis

2.3 Setup

import logging
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Add project root to path for imports
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.preprocessing import (
    build_preprocessing_pipeline,
    filter_blanks,
    handle_zeros,
    load_gnps_quant_table,
    load_metadata,
    pareto_scale,
    pqn_normalize,
)

# Configure logging so we can see pipeline messages
logging.basicConfig(level=logging.INFO, format="%(name)s — %(message)s")

# Plot style
sns.set_theme(style="whitegrid", palette="colorblind")
plt.rcParams["figure.dpi"] = 120

DATA_DIR = project_root / "data" / "CD-MRG-metabolomics_summerschool_2023"

forearm_path = DATA_DIR / "forearm_iimn_gnps_quant.csv"
forehead_path = DATA_DIR / "forehead_iimn_gnps_quant.csv"
metadata_path = DATA_DIR / "metadata.txt"

2.4 Step 1: Load Data

We start by loading the GNPS quantification table. The function separates feature metadata (m/z, retention time, ion mobility, etc.) from the peak area matrix.

feature_meta, peak_areas = load_gnps_quant_table(forearm_path)

print(f"Feature metadata: {feature_meta.shape[0]} features x {feature_meta.shape[1]} columns")
feature_meta.head()

Feature metadata: 7171 features x 12 columns

	row m/z	row retention time	row ion mobility	row ion mobility unit	row CCS	correlation group ID	annotation network number	best ion	auto MS2 verify	identified by n=	partners	neutral M mass
row ID
9936	150.062931	4.965177	NaN	NaN	NaN	36.0	132.0	[M+H]+	NaN	2.0	9936;9951	149.0556
9699	150.595875	4.821931	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4684	151.039027	3.304741	NaN	NaN	NaN	429.0	NaN	NaN	NaN	NaN	NaN	NaN
3095	151.039049	2.789110	NaN	NaN	NaN	389.0	NaN	NaN	NaN	NaN	NaN	NaN
6318	151.075361	3.791733	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

print(f"Peak area matrix: {peak_areas.shape[0]} features x {peak_areas.shape[1]} samples")
print(f"\nPeak area summary statistics:")
peak_areas.iloc[:5, :5]

Peak area matrix: 7171 features x 88 samples

Peak area summary statistics:

	bld_plt2_blk_04.mzML	bld_plt2_05_720_1.mzML	bld_plt1_blk_02.mzML	bld_plt2_04_120_1.mzML	bld_plt2_03_240_1.mzML
row ID
9936	260.577760	42.259052	130.08562	0.00000	156.640750
9699	0.000000	0.000000	0.00000	0.00000	0.000000
4684	0.000000	0.000000	0.00000	0.00000	19.682175
3095	0.000000	0.000000	0.00000	0.00000	0.000000
6318	14.450171	3384.944600	0.00000	730.18884	1938.330300

Data structure

The GNPS quant table contains 13 metadata columns (m/z, retention time, ion mobility, correlation group, etc.) and the remaining columns are peak areas for each sample. Column names follow the pattern <filename> Peak area — the loading function strips the suffix automatically.

2.5 Step 2: Understand the Experimental Design

metadata = load_metadata(metadata_path)
metadata.head()

	filename	sample_type	subject	timepoint	location	analysis_order	is_blank
0	bld_plt1_01_0_1.mzML	plasma	Subject_1	0.0	not applicable	19	False
1	bld_plt1_01_120_1.mzML	plasma	Subject_1	120.0	not applicable	32	False
2	bld_plt1_01_600_1.mzML	plasma	Subject_1	600.0	not applicable	51	False
3	bld_plt1_01_720_1.mzML	plasma	Subject_1	720.0	not applicable	50	False
4	bld_plt1_03_0_1.mzML	plasma	Subject_3	0.0	not applicable	20	False

print("Sample types:")
print(metadata["sample_type"].value_counts().to_string())
print(f"\nSubjects: {metadata['subject'].nunique()}")
print(f"Timepoints: {sorted(metadata['timepoint'].dropna().unique())}")
print(f"Locations: {metadata['location'].unique().tolist()}")

Sample types:
sample_type
skin            126
plasma           42
plasma_blank      2
skin_blank        2

Subjects: 8
Timepoints: [np.float64(0.0), np.float64(120.0), np.float64(240.0), np.float64(480.0), np.float64(600.0), np.float64(720.0)]
Locations: ['not applicable', 'forehead', 'forearm', 'upper back']

# Crosstab of subjects x timepoints (excluding blanks)
samples = metadata[~metadata["is_blank"]].copy()
ct = pd.crosstab(samples["subject"], samples["timepoint"])
ct

timepoint	0.0	120.0	240.0	480.0	600.0	720.0
subject
Subject_1	4	4	4	4	4	4
Subject_10	4	4	4	4	4	4
Subject_3	4	4	4	4	4	4
Subject_4	4	4	4	4	4	4
Subject_5	4	4	4	4	4	4
Subject_7	4	4	4	4	4	4
Subject_8	4	4	4	4	4	4

2.6 Step 3: Blank Filtering

Why filter blanks first?

Blank samples (solvent or extraction blanks) capture background contamination from solvents, plasticware, and the instrument itself. Features that are primarily present in blanks rather than real samples are contaminants and will add noise to downstream analysis. Always filter blanks before other preprocessing steps.

We keep only features where the mean intensity in real samples is at least 10x the mean intensity in blanks.

Human note:

The Claude Code CLI original value was 3x. It seems consensual but the original notebook used 10, so I manually adapted.

The painpoint was: it was hard-coded multiple time. I manually switched to a single variable definition blank_ratio = 10. ➡ We may insist more on the best practices.

blank_ratio = 10
n_features_before = peak_areas.shape[0]
peak_areas_filtered, meta_no_blanks = filter_blanks(peak_areas, metadata, blank_ratio=blank_ratio)
n_features_after = peak_areas_filtered.shape[0]

print(f"Features before: {n_features_before}")
print(f"Features after:  {n_features_after}")
print(f"Removed:         {n_features_before - n_features_after}")
print(f"Samples (blanks removed): {peak_areas_filtered.shape[1]}")

Features before: 7171
Features after:  6555
Removed:         616
Samples (blanks removed): 84

# Recalculate ratios for visualization
blank_files = set(metadata.loc[metadata["is_blank"], "filename"])
all_files = set(peak_areas.columns)
blank_cols = sorted(blank_files & all_files)
sample_cols = sorted(all_files - blank_files)

blank_mean = peak_areas[blank_cols].mean(axis=1)
sample_mean = peak_areas[sample_cols].mean(axis=1)
ratio = sample_mean / blank_mean.replace(0, np.nan)
ratio_finite = ratio.dropna()

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(np.log10(ratio_finite.clip(lower=0.01)), bins=60, edgecolor="white", alpha=0.8)
ax.axvline(np.log10(blank_ratio), color="red", linestyle="--", linewidth=2, label=f"{blank_ratio}x threshold")
ax.set_xlabel("log₁₀(sample mean / blank mean)")
ax.set_ylabel("Number of features")
ax.set_title("Sample-to-Blank Intensity Ratio")
ax.legend()
plt.tight_layout()
plt.show()

Figure 2.1: Distribution of sample-to-blank intensity ratios across features. Features below the 3x threshold (red dashed line) are removed as likely contaminants.

2.7 Step 4: Zero Handling

Why handle zeros?

In LC-MS metabolomics, a zero value typically means the feature was below the detection limit, not truly absent. Features with too many zeros across samples are unreliable and should be removed. For the remaining zeros, we impute with min/2 (half the minimum observed nonzero value for that feature) — a common conservative approach that avoids log-transform issues while acknowledging the value was small.

Human note:

It’s quite a bad idea to impute with a constant value because it artificially lowers the variability of these features. The original notebooks used a random value between 0 and the minimum.

I keep it as is for educational reasons: in Boccard & Rudaz, 2018, imputation is not mentioned. Claude here seems to have improvised with a very common mispractice. This emphasizes how important it is to read, understand what is happening, and update skills based on your expertise.

zero_counts = (peak_areas_filtered == 0).sum(axis=1)
n_samples = peak_areas_filtered.shape[1]

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(zero_counts, bins=40, edgecolor="white", alpha=0.8)
ax.axvline(n_samples * 0.5, color="red", linestyle="--", linewidth=2,
           label=f"50% threshold ({n_samples * 0.5:.0f} samples)")
ax.set_xlabel("Number of zeros per feature")
ax.set_ylabel("Number of features")
ax.set_title("Zero Distribution Across Features")
ax.legend()
plt.tight_layout()
plt.show()

Figure 2.2: Distribution of zero counts per feature before filtering. Features with more than 50% zeros (red dashed line) are removed.

n_before_zero = peak_areas_filtered.shape[0]
peak_areas_imputed = handle_zeros(peak_areas_filtered, max_zero_ratio=0.5)
n_after_zero = peak_areas_imputed.shape[0]

print(f"Features before: {n_before_zero}")
print(f"Features after:  {n_after_zero}")
print(f"Removed:         {n_before_zero - n_after_zero}")
print(f"Remaining zeros: {(peak_areas_imputed == 0).sum().sum()}")

Features before: 6555
Features after:  4195
Removed:         2360
Remaining zeros: 0

2.8 Step 5: PQN Normalization

Probabilistic Quotient Normalization (PQN)

PQN is preferred over simpler methods like Total Sum Normalization (TSN) because it is robust to the presence of large biological changes. The algorithm:

Calculate a reference spectrum (median across all samples)
For each sample, compute the quotient of each feature to the reference
The median quotient becomes the normalization factor

This captures the overall dilution effect without being influenced by features that change due to the biological response.

Human note:

Here, Claude’s code uses PQN as suggested by Boccard & Rudaz, 2018 (along with TSN and MSTUS).
In the original notebook, they consider PQN but, in fact, they end up using only autoscale.
This choice is quite surprising to me and goes against Boccard’s recommendations, so I stick with PQN.
Anyway, Pareto scaling comes afterward, and it is quite similar to autoscaling (division by the square root of the standard deviation).

# Transpose to samples x features (PQN expects this orientation)
X = peak_areas_imputed.values.T

# Total intensity per sample before normalization
total_before = X.sum(axis=1)

X_norm = pqn_normalize(X)

total_after = X_norm.sum(axis=1)

# Before PQN
fig, ax = plt.subplots(figsize=(7, 5))
ax.boxplot(total_before, vert=True)
ax.set_title("Before PQN")
ax.set_ylabel("Total intensity per sample")
ax.set_xticks([])
plt.tight_layout()
plt.show()

# After PQN
fig, ax = plt.subplots(figsize=(7, 5))
ax.boxplot(total_after, vert=True)
ax.set_title("After PQN")
ax.set_ylabel("Total intensity per sample")
ax.set_xticks([])
plt.tight_layout()
plt.show()

Interpreting the normalization

After PQN, sample total intensities should be more uniform. However, PQN does not make all samples equal in total intensity (that would be TSN). The remaining variation reflects genuine biological differences.

2.9 Step 6: Pareto Scaling

Pareto scaling: a balanced approach

Pareto scaling divides each mean-centered feature by the square root of its standard deviation. This is a middle ground between:

No scaling: high-abundance features dominate multivariate analysis
Unit variance (UV) scaling: all features weighted equally, amplifying noise in low-abundance features

Pareto scaling reduces the impact of large peaks while preserving more of the original data structure than UV scaling.

variance_before = X_norm.var(axis=0)
X_scaled = pareto_scale(X_norm)
variance_after = X_scaled.var(axis=0)

# Before Pareto scaling
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(np.log10(variance_before + 1e-10), bins=50, edgecolor="white", alpha=0.8)
ax.set_xlabel("log₁₀(variance)")
ax.set_ylabel("Number of features")
ax.set_title("Before Pareto Scaling")
plt.tight_layout()
plt.show()

# After Pareto scaling
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(np.log10(variance_after + 1e-10), bins=50, edgecolor="white", alpha=0.8, color="C1")
ax.set_xlabel("log₁₀(variance)")
ax.set_ylabel("Number of features")
ax.set_title("After Pareto Scaling")
plt.tight_layout()
plt.show()

2.10 Step 7: Full Pipeline

Now let’s run the complete pipeline using build_preprocessing_pipeline(), which applies all the steps above to both forearm and forehead datasets in one call.

result = build_preprocessing_pipeline(
    forearm_path=forearm_path,
    forehead_path=forehead_path,
    metadata_path=metadata_path,
    blank_ratio=blank_ratio,
    max_zero_ratio=0.5,
)

for name in ["forearm", "forehead"]:
    X = result[name]["X_processed"]
    print(f"{name.capitalize()}: {X.shape[0]} samples x {X.shape[1]} features")

Forearm: 84 samples x 4195 features
Forehead: 84 samples x 2924 features

# Reconstruct feature counts at each step by re-running steps individually
pipeline_counts = {}
for name, path in [("Forearm", forearm_path), ("Forehead", forehead_path)]:
    meta_full = load_metadata(metadata_path)
    fm, pa = load_gnps_quant_table(path)
    n_raw = pa.shape[0]

    pa_filt, meta_nb = filter_blanks(pa, meta_full, blank_ratio=blank_ratio)
    n_blank_filtered = pa_filt.shape[0]

    pa_imp = handle_zeros(pa_filt, max_zero_ratio=0.5)
    n_zero_handled = pa_imp.shape[0]

    pipeline_counts[name] = {
        "Raw": n_raw,
        "After blank filter": n_blank_filtered,
        "After zero handling": n_zero_handled,
    }

# Plot
steps = ["Raw", "After blank filter", "After zero handling"]
x = np.arange(len(steps))
width = 0.35

fig, ax = plt.subplots(figsize=(9, 5))
forearm_counts = [pipeline_counts["Forearm"][s] for s in steps]
forehead_counts = [pipeline_counts["Forehead"][s] for s in steps]

bars1 = ax.bar(x - width / 2, forearm_counts, width, label="Forearm")
bars2 = ax.bar(x + width / 2, forehead_counts, width, label="Forehead")

ax.set_ylabel("Number of features")
ax.set_title("Feature Retention Through Preprocessing Pipeline")
ax.set_xticks(x)
ax.set_xticklabels(steps)
ax.legend()

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f"{int(height)}",
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3), textcoords="offset points",
                    ha="center", va="bottom", fontsize=9)

plt.tight_layout()
plt.show()

Figure 2.5: Number of features retained at each preprocessing step for both datasets.

2.11 Summary

Pipeline recap

We applied the following preprocessing steps to untargeted LC-MS metabolomics data:

Step	Purpose	Method
Blank filtering	Remove contaminant features	Keep features with sample/blank ratio ≥ 3
Zero handling	Remove unreliable features, impute missing values	Remove features with >50% zeros, impute with min/2
PQN normalization	Correct for dilution effects	Divide by median quotient to reference spectrum
Pareto scaling	Balance feature influence	Mean-center, divide by √(std)

Key takeaways

Order matters: blank filtering should come first, scaling last
PQN > TSN: PQN is more robust to large biological changes
Pareto scaling is a good default for metabolomics — it balances between no scaling and unit variance
Always visualize the effect of each step to catch issues early
The pipeline reduced features from thousands to hundreds of reliable, well-behaved signals ready for multivariate analysis

2.12 Save Processed Data for Downstream Analysis

Save the preprocessed data using pickle (Python’s equivalent to R’s .rds format) for use in subsequent analyses (PCA, PLS-DA, biomarker discovery, etc.).

import pickle

# Create output directory
output_dir = project_root / "data"
output_dir.mkdir(exist_ok=True)

# Save complete result dictionary with all datasets
pickle_path = output_dir / "preprocessing_result.pkl"
with open(pickle_path, "wb") as f:
    pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)

print(f"✓ Saved preprocessing results to: {pickle_path}")
print(f"\nSaved objects:")
print(f"  - forearm: {result['forearm']['X_processed'].shape[0]} samples × {result['forearm']['X_processed'].shape[1]} features")
print(f"  - forehead: {result['forehead']['X_processed'].shape[0]} samples × {result['forehead']['X_processed'].shape[1]} features")
print(f"  - metadata for both datasets")
print(f"  - feature metadata for both datasets")

✓ Saved preprocessing results to: /home/runner/work/chemometrics_vibe_coding_public/chemometrics_vibe_coding_public/data/preprocessing_result.pkl

Saved objects:
  - forearm: 84 samples × 4195 features
  - forehead: 84 samples × 2924 features
  - metadata for both datasets
  - feature metadata for both datasets

Loading preprocessed data for downstream analysis

To use this file in your next analysis script:

import pickle

# Load the complete results
with open("data/preprocessing_result.pkl", "rb") as f:
    result = pickle.load(f)

# Access preprocessed data
X_forearm = result["forearm"]["X_processed"]        # samples × features array
X_forehead = result["forehead"]["X_processed"]

# Access metadata
metadata_forearm = result["forearm"]["metadata"]    # sample metadata DataFrame
feature_meta_forearm = result["forearm"]["feature_metadata"]  # feature annotations

# Access identifiers
sample_names = result["forearm"]["sample_names"]    # list of sample IDs
feature_ids = result["forearm"]["feature_ids"]      # list of feature IDs

What’s included in the pickle file:

For each dataset (forearm and forehead): - X_processed: Preprocessed data matrix (numpy array, samples × features) - metadata: Sample metadata (DataFrame: subject, timepoint, location, etc.) - feature_metadata: Feature annotations (DataFrame: m/z, RT, ion mobility, etc.) - sample_names: Sample identifiers (list) - feature_ids: Feature identifiers (list) - peak_areas_raw: Raw peak areas after blank/zero filtering (DataFrame)

Plus metadata_full: Complete metadata for all samples including blanks.