Analyzing Results

Objectives

Understand the output files and metrics
Interpret training curves: healthy convergence, warning signs, early stopping
Interpret OOD evaluation results across all 6 models
Understand why validation MAE can be misleading (Val-OOD gap)
Compare teacher vs student vs baselines performance
Interpret ablation study results (architecture + hyperparameter)
Analyze learned physics parameters
Reproduce the model from its 12 parameters

Output Files

After training, the results/ directory contains:

File	Description
`results.json`	Full results with metrics and learned parameters
`best_teacher.pt`	Best teacher model checkpoint
`best_student_alpha0.1.pt`	Best student model checkpoint
`ablation/`	Per-experiment ablation results (if run)

Key Metrics

MAE (Mean Absolute Error) in mV

The primary metric is MAE in millivolts:

Val MAE: Performance on validation data (keep-out region: 76-80C)
Test2 MAE: OOD performance on current sweep data
Test3 MAE: OOD performance on pressure swap data
OOD Average: (Test2 + Test3) / 2 – the main comparison metric

Expected Results (All 6 Models)

Rank	Model	Params	Val MAE	Test2	Test3	OOD Avg
1	Distilled Student	12	~14 mV	~15 mV	~16 mV	~15 mV
2	Teacher	~9K	~14 mV	~43 mV	~14 mV	~28 mV
3	Pure Physics	12	~25 mV	~18 mV	~23 mV	~20 mV
4	PureMLP	~2K	~13 mV	~45 mV	~38 mV	~42 mV
5	BigMLP	~50K	~12 mV	~52 mV	~42 mV	~47 mV
6	Transformer	~50K	~10 mV	~68 mV	~55 mV	~62 mV

OOD performance bar chart and parameters vs performance — Left: Out-of-distribution MAE comparison (lower is better). The 12-parameter student (17 mV) beats all neural networks including the 100K-parameter Transformer (45 mV). Right: Parameters vs OOD performance — the student occupies the ideal lower-left corner.

Complete model comparison summary with all metrics. The 12-parameter student wins on OOD generalization (16.7 mV), outperforming the 8,961-parameter Pure MLP by 2.6x.

Key insights:

More parameters does NOT mean better OOD: The Transformer (50K params) has the best validation MAE but the worst OOD performance
Physics constraints beat raw capacity: The 12-param student beats all 50K-param baselines on OOD
Distillation helps: The distilled student (alpha=0.1) outperforms the pure physics model (alpha=1.0) by 5-10 mV on OOD
Validation MAE is misleading: Low validation MAE does not predict OOD performance; OOD Average is the correct metric

Interpreting Training Curves

Training loss and validation MAE curves — Teacher training curves: (Left) MSE loss decreasing smoothly. (Right) Validation MAE converging to 13.85 mV with the best epoch marked by the dashed red line.

Healthy Teacher Training

A well-trained teacher shows:

Rapid initial improvement: Loss drops sharply in the first 10-20 epochs
Plateau and convergence: Loss stabilizes, validation MAE reaches ~13-15 mV
Early stopping: Typically triggers at epoch ~17-30 (training runs 30 more epochs after best, then stops)
Learning rate decay: CosineAnnealingLR smoothly reduces LR from 0.01 to 0.0001

Warning signs:

Validation MAE increases while training loss decreases → overfitting (check weight decay, consider more dropout)
Loss does not decrease at all → learning rate too low or data loading issue
Loss oscillates wildly → learning rate too high or gradient clipping not working

Healthy Student Training

Student convergence looks different from the teacher:

Slower start: The 12-param model needs more epochs to find good physics parameter values
Dual loss tracking: Both label loss (L) and distillation loss (D) should decrease together
Plateau-triggered LR drops: ReduceLROnPlateau halves the learning rate when validation stalls — you may see sudden improvement after an LR reduction
Early stopping: Typically around epoch 100-160 (patience=40)

If the distillation loss (D) is much larger than the label loss (L), the student is struggling to match the teacher — this is normal early in training and should improve.

Validation MAE vs OOD MAE

A critical insight from the 6-model comparison:

Model	Val MAE	OOD Avg	Gap
Transformer	~10 mV	~62 mV	52 mV
BigMLP	~12 mV	~47 mV	35 mV
Distilled Student	~14 mV	~15 mV	1 mV

The gap between validation and OOD MAE reveals overfitting risk. Models with small gaps (like the student) generalize well. Large gaps indicate the model has memorized training patterns rather than learning physics.

Prediction scatter plots across validation and OOD datasets — Predicted vs actual voltage scatter plots: validation (left), Test2 OOD (center), Test3 OOD (right). Points close to the diagonal indicate accurate predictions. OOD scatter reveals where models fail.

Physics Parameters

The student model learns 6 physics parameters:

Parameter	Description	Expected Range	Typical Value
i_lim_ref	Limiting current density	0.1-10 A/cm2	~1.3
i0_a_ref	Anode exchange current	1e-8 - 1e-2 A/cm2	~8e-5
i0_c_ref	Cathode exchange current	1e-6 - 1.0 A/cm2	~3.5e-3
R_ohm_ref	Ohmic resistance	0.01-1.0 Ohm*cm2	~0.99
alpha_a	Anode transfer coefficient	0.3-1.0	~0.64
alpha_c	Cathode transfer coefficient	0.3-1.0	~0.67

And 6 hybrid correction parameters:

Parameter	Description	Typical Value
corr_a	Logistic amplitude	~0.035
corr_b	Logistic steepness	~0.60
corr_c	Logistic midpoint (current)	~0.63
corr_d	Logistic offset	~0.020
corr_p	Pressure modulation	~1.85
corr_t	Temperature modulation	~0.32

Physical Interpretation of Learned Values

The paper validates each learned parameter against literature and domain knowledge:

Parameter	Learned	Literature Range	Interpretation
i_lim_ref	~1.15 A/cm2	0.8-1.5 A/cm2	Consistent with PEM diffusion limits
i0_a_ref	~0.61 mA/cm2	0.1-2 mA/cm2	Ir-based anode catalyst (4-electron mechanism, slower)
i0_c_ref	~8.3 mA/cm2	1-50 mA/cm2	Pt-based cathode catalyst (2-electron mechanism, faster)
R_ohm_ref	~1.0 Ohm*cm2	0.05-0.20 (standard PEM)	Higher than typical — due to SWVF technology
alpha_a	~0.46	~0.5 (theoretical)	Near theoretical value for symmetric barrier
alpha_c	~0.53	~0.5 (theoretical)	Near theoretical value for symmetric barrier

The anode exchange current is ~14x smaller than the cathode because the oxygen evolution reaction (4-electron process) has a much higher activation barrier than hydrogen evolution (2-electron process).

The elevated R_ohm (~1.0 vs typical 0.05-0.20 Ohm*cm2) is explained by the SWVF (Static Water Vapour Feed) technology used in this electrolyzer. The water feed barrier membrane adds ionic resistance not present in conventional liquid-fed PEM systems. This is not a model error — it correctly captures the technology-specific physics.

These are “effective” cell-level parameters (lumped values) rather than intrinsic catalyst-interface properties, because the model is fitted to averaged cell voltage across 4 cells in series.

Interpreting Ablation Results

Ablation results are saved as JSON files in results/ablation/, one per experiment-seed combination. Each file contains:

Model name, seed, and configuration
Validation MAE, Test2 MAE, Test3 MAE, OOD Average

Reading the Results Table

Results are reported as mean +/- std across 3 seeds:

Experiment          Val MAE (mV)    OOD Avg (mV)
baseline (SGD+KO)   13.9 ± 0.3     15.3 ± 0.8
adam                 12.1 ± 0.2     28.5 ± 2.1
no_keepout           13.5 ± 0.4     35.2 ± 3.5

Lower OOD Average is better. High standard deviation indicates sensitivity to random initialization.

Key Conclusions from Ablation

Architecture comparison (CLI --mode ablation):

The 12-param distilled student achieves the lowest OOD Average despite having ~4,000x fewer parameters than BigMLP
Pure-ML baselines achieve the best validation MAE but the worst OOD — they memorize rather than generalize
The gap between teacher (~28 mV OOD) and student (~15 mV OOD) shows that distillation into a physics-constrained model removes the teacher’s MLP-driven overfitting

Hyperparameter comparison (notebook Part 9):

Keep-out validation is the most impactful choice: removing it degrades OOD by 15-20+ mV
SGD vs Adam: SGD improves teacher OOD by ~10-15 mV (Adam overfits the MLP)
Weight decay: Modest effect (~2-3 mV improvement)
Learning rate: 0.01 is near-optimal; 0.05 risks divergence; 0.001 converges slowly but safely

The ablation studies validate that both architectural choices (physics constraints) and training choices (SGD, keep-out) contribute to OOD generalization.

Loading Results

import json

with open('results/results.json') as f:
    results = json.load(f)

print(f"Teacher OOD: {results['teacher']['ood_avg_mV']:.2f} mV")
print(f"Student OOD: {results['student']['ood_avg_mV']:.2f} mV")

# Access the 12 learned parameters
physics = results['student']['physics_params']
hybrid = results['student']['hybrid_params']
for k, v in physics.items():
    print(f"  {k}: {v}")
for k, v in hybrid.items():
    print(f"  {k}: {v}")

Reproducing the Model from 12 Numbers

The trained student can be reproduced as a pure NumPy function without PyTorch. The notebook (Part 11) demonstrates this, but here is the core idea:

import numpy as np

def predict_voltage(I, P_H2, P_O2, T_C, params):
    """Predict cell voltage from 12 parameters (no PyTorch needed)."""
    F, R, n = 96485.0, 8.314, 2
    T_ref, P_ref = 353.15, 20.0
    i = I / 50.0           # A -> A/cm2
    T = T_C + 273.15       # C -> K
    P_avg = (P_H2 + P_O2) / 2.0

    # Nernst potential
    E = 1.229 - 0.9e-3*(T-298.15) + R*T/(n*F)*np.log(P_H2*np.sqrt(P_O2)/0.05)

    # Activation (anode + cathode)
    inv_dT = 1/T - 1/T_ref
    i0_a = params['i0_a_ref'] * np.exp(-50000/R * inv_dT) * (T/T_ref)**0.2
    i0_c = params['i0_c_ref'] * np.exp(-60000/R * inv_dT) * (T/T_ref)**0.2
    eta_a = R*T/(params['alpha_a']*n*F) * np.arcsinh(i/(2*i0_a))
    eta_c = R*T/(params['alpha_c']*n*F) * np.arcsinh(i/(2*i0_c))

    # Ohmic
    R_ohm = params['R_ohm_ref'] * np.exp(-15000/R * inv_dT)
    eta_ohm = i * R_ohm

    # Concentration
    i_lim = np.clip(params['i_lim_ref'] * (P_avg/P_ref)**0.7 * (1+0.3*(T-T_ref)/T_ref), 0.1, 10)
    eta_conc = -R*T/(n*F) * np.log(1 - np.clip(i/i_lim, None, 0.99))

    # Hybrid correction
    sig = 1/(1+np.exp(-params['corr_b']*(i-params['corr_c'])))
    corr = np.clip((params['corr_a']*sig + params['corr_d']) *
                   (1 + params['corr_p']*(P_avg-P_ref)/P_ref +
                    params['corr_t']*(T-T_ref)/T_ref), -0.1, 0.1)

    return E + eta_a + eta_c + eta_ohm + eta_conc + corr

This function takes 4 operating conditions and 12 learned constants. It can be implemented in any language (MATLAB, Excel, C++) and produces identical results to the PyTorch model.

Keypoints

OOD Average (Test2+Test3)/2 is the main comparison metric
6 models are compared: distilled student wins on OOD despite having only 12 parameters
Low validation MAE does NOT predict OOD performance – the Val-OOD gap reveals overfitting
Healthy training: rapid initial improvement, plateau, early stopping before max epochs
Student shows dual losses (L=label, D=distillation) that both decrease during training
Ablation proves: keep-out validation is the most impactful choice, SGD beats Adam for teacher
Architecture matters more than hyperparameters: 12-param student beats 50K-param baselines
Physics parameters should be in physically reasonable ranges
The 12 learned constants fully define the model – no neural network needed
The model can be reproduced in any language from the 12 numbers