FAQ and Troubleshooting
Objectives
Find solutions to common installation, training, and evaluation issues
Understand the design decisions behind key project choices
Know where to look when results are unexpected
Get guidance on extending the project for new experiments
How to use this page
This episode is organized into four sections: Installation and Setup, Data and PDE, Training and Models, and Results and Evaluation. Use Ctrl+F to search for specific error messages or keywords.
Installation and Setup
Q: TensorFlow fails to install or import
Symptoms: pip install tensorflow fails, or import tensorflow raises errors about missing shared libraries.
Solutions:
Ensure Python 3.8+ is installed: python3 --version
On NAIC VMs, run ./setup.sh, which handles version detection
If you see libcudart.so errors, these are GPU-related and can be ignored for CPU-only use
For ARM-based machines (e.g., Apple Silicon), install tensorflow-macos instead
Q: pypardiso will not install
Symptoms: pip install pypardiso fails with compiler errors or missing MKL libraries.
Solutions:
Install Intel MKL first: pip install mkl
On Ubuntu/Debian: sudo apt install -y libmkl-dev
On systems without MKL (ARM, some HPC nodes), pypardiso cannot be used; modify the solver to use scipy.sparse.linalg.spsolve as a fallback
If using conda: conda install -c conda-forge pypardiso
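One portable way to handle the missing-MKL case is an import-time fallback that picks pypardiso when available and SciPy otherwise. This is a sketch, not the project's actual solver code; the wrapper name sparse_solve is illustrative.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

try:
    import pypardiso

    def sparse_solve(A, b):
        # Fast path: Intel MKL Pardiso (x86 systems with MKL available)
        return pypardiso.spsolve(A, b)
except ImportError:

    def sparse_solve(A, b):
        # Portable fallback: SciPy's SuperLU-based direct solver
        return spla.spsolve(A, b)

# Quick check on a small symmetric system
A = sp.csr_matrix([[4.0, 1.0], [1.0, 3.0]])
b = [1.0, 2.0]
x = sparse_solve(A, b)
```

Either branch returns the same solution, so downstream code does not need to know which solver was used.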
Q: GPU is not detected by TensorFlow
Symptoms: tf.config.list_physical_devices('GPU') returns an empty list.
Solutions:
Verify the GPU is visible to the OS: nvidia-smi
Check the CUDA toolkit: nvcc --version
Run ./setup.sh, which creates CUDA library symlinks
Ensure tensorflow[and-cuda] is installed (not plain tensorflow)
GPU is recommended but not required – all experiments run on CPU
Q: Jupyter Lab is not accessible from my browser
Symptoms: Opening http://localhost:8888 shows “connection refused” or times out.
Solutions:
Verify Jupyter is running on the VM: ps aux | grep jupyter
Check that the SSH tunnel is active on your local machine
Ensure port 8888 is not already in use (try port 9999 instead)
If using a VPN, the tunnel may not route correctly – try disabling it
Start Jupyter with explicit IP binding:
jupyter lab --ip=127.0.0.1
Q: SSH connection is refused or times out
Solutions:
Verify the VM is running at orchestrator.naic.no
Check that your current IP is whitelisted in the Orchestrator
Set correct key permissions: chmod 600 /path/to/key.pem
If the VM was reprovisioned, remove the stale host key: ssh-keygen -R <old_ip>
Data and PDE
Q: What PDE is being solved?
The project solves for a scalar field $u$ (like temperature or concentration) in the steady-state equation on $[-1, 1]^2$:
$$\nabla \cdot (\mathbf{v}u) - \nabla \cdot (D \nabla u) = 0$$
Diffusion ($D$): Kept constant at 1.0.
Advection ($\mathbf{v}$): The velocity field is derived from a streamfunction $\psi$ via $\mathbf{v} = (\partial_y \psi, -\partial_x \psi)$, which makes it divergence-free by construction.
Dirichlet boundary conditions are applied. The streamfunction is parameterized by a small number of coefficients (typically 10-20), making it a natural low-dimensional representation of the PDE family.
Q: Why divergence-free velocity fields?
Divergence-free fields ($\nabla \cdot \mathbf{v} = 0$) satisfy mass conservation. The streamfunction parameterization guarantees this property by construction: if $\mathbf{v} = (\partial_y \psi, -\partial_x \psi)$ (the 2D curl of $\psi$), then $\nabla \cdot \mathbf{v} = \partial_x \partial_y \psi - \partial_y \partial_x \psi = 0$. This ensures physically meaningful velocity fields without additional constraints.
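The identity can be checked numerically. The sketch below (illustrative, not project code) builds a sample streamfunction on a uniform grid, derives the velocity with central differences, and confirms that the discrete divergence vanishes to rounding error:

```python
import numpy as np

# Sample streamfunction psi = sin(pi x) sin(pi y) on [-1, 1]^2
n = 64
x = np.linspace(-1.0, 1.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")
psi = np.sin(np.pi * X) * np.sin(np.pi * Y)

# v = (d psi / dy, -d psi / dx), the 2D curl of psi
dpsi_dx = np.gradient(psi, x, axis=0)
dpsi_dy = np.gradient(psi, x, axis=1)
v1, v2 = dpsi_dy, -dpsi_dx

# Discrete divergence: the mixed partials cancel exactly, so this is
# zero up to floating-point error
div_v = np.gradient(v1, x, axis=0) + np.gradient(v2, x, axis=1)
max_div = np.abs(div_v).max()
```

Because the two difference operators act along different axes, they commute exactly, so the cancellation holds on the grid, not just in the continuum.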
Q: Why multiple grid resolutions?
Multiple resolutions serve two purposes:
Multi-resolution consistency testing: Does the latent representation capture the physics regardless of grid fineness?
Cross-resolution transfer: Can we encode a coarse solution and decode a fine one (a form of learned super-resolution)?
The resolutions 16x16 through 256x256 span a 256x range in degrees of freedom, from very coarse to moderately fine.
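The 256x figure follows directly from the grid sizes, since an n x n grid has n^2 degrees of freedom:

```python
# Degrees of freedom per resolution: an n x n grid has n**2 unknowns
resolutions = [16, 32, 64, 128, 256]
dofs = {n: n * n for n in resolutions}

# 256*256 / (16*16) = 65536 / 256 = 256
ratio = dofs[256] // dofs[16]
```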
Q: How are the PDE solutions computed?
Solutions are computed using the finite element method (FEM):
Assemble stiffness and convection matrices for the given grid
Apply boundary conditions
Solve the resulting sparse linear system using pypardiso (Intel MKL Pardiso solver)
Store the solution as a flat NumPy array, reshaped to the grid dimensions
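The last two steps can be sketched with a toy stand-in. The project assembles true FEM matrices; here a 5-point finite-difference Laplacian (pure diffusion, homogeneous Dirichlet boundary values already eliminated) plays the role of the assembled system, and SciPy stands in for pypardiso:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy system: 5-point Laplacian on an n x n interior grid of [-1, 1]^2
# with homogeneous Dirichlet BCs folded into the matrix
n = 16
h = 2.0 / (n + 1)                            # grid spacing
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)) / h**2   # sparse system matrix
f = np.ones(n * n)                           # constant source term

u_flat = spla.spsolve(A.tocsr(), f)          # sparse direct solve
u_grid = u_flat.reshape(n, n)                # flat array -> grid (step 4)
```

With a positive source and zero boundary values, the solution is positive everywhere in the interior, which is a handy sanity check on the assembly.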
Q: I see data/*.npy files missing
Run the data generation step first:
python src/cd2d_streamfunc.py
Data files are not committed to the repository because they are large and easily regenerated.
Training and Models
Q: Why autoencoders instead of neural operators?
This project targets representation learning, not surrogate modeling:
Neural operators (DeepONet, FNO) learn the map from parameters to solutions
Autoencoders learn a compressed representation of the solution space itself
The autoencoder approach enables cross-modal transfer, interpolation in latent space, and resolution-invariant embeddings – capabilities that neural operators do not directly provide.
Q: What is the typical latent dimension?
The default latent dimension is 32 across all modalities. This was chosen as a balance:
Large enough to capture the dominant variation in the solution manifold
Small enough to enable efficient alignment and meaningful geometric analysis
Same dimension for all modalities (required for direct latent comparison)
Q: Why Relative Energy Error (REE) instead of MSE?
PDE solutions can vary over many orders of magnitude depending on the parameters. Standard MSE would be dominated by high-amplitude solutions and insensitive to errors in low-amplitude ones. REE normalizes by the solution energy, making the metric scale-invariant:
$$\text{REE} = \frac{\| x - \hat{x} \|^2}{\| x \|^2}$$
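A few lines of NumPy make the scale-invariance concrete: the same relative perturbation applied to solutions whose amplitudes differ by 1000x yields identical REE, whereas MSE would differ by a factor of 10^6.

```python
import numpy as np

def ree(x, x_hat):
    """Relative energy error: scale-invariant reconstruction metric."""
    return np.sum((x - x_hat) ** 2) / np.sum(x ** 2)

rng = np.random.default_rng(0)
x_small = rng.normal(size=256)
x_big = 1000.0 * x_small           # same solution, 1000x the amplitude

err = 0.01                          # 1% relative perturbation
ree_small = ree(x_small, x_small * (1 + err))
ree_big = ree(x_big, x_big * (1 + err))
# Both equal err**2 = 1e-4, regardless of amplitude
```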
Q: Training loss is not decreasing
Possible causes and solutions:
Learning rate too high: Reduce by a factor of 10
Latent dimension too small: Try increasing from 32 to 64
Data issue: Verify that data files loaded correctly (check shapes, NaN values)
Architecture mismatch: For high-resolution grids, the convolutional autoencoder may need more layers
Q: Training loss decreases but validation loss increases
This indicates overfitting. Solutions:
Reduce model capacity (fewer layers, fewer filters)
Add dropout or regularization
Increase the training set size (generate more PDE samples)
Use early stopping based on validation REE
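The early-stopping logic can be stated framework-free; in Keras the equivalent is the EarlyStopping callback with restore_best_weights=True. This sketch only tracks the stopping decision, not the actual training:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping once
    the loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch

# Validation REE drops, then rises (overfitting): stop at the minimum
stop = early_stop_epoch([0.5, 0.3, 0.2, 0.15, 0.18, 0.22, 0.3])
```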
Q: What is latent whitening?
The streamfunction autoencoder applies latent whitening regularization: a penalty term that encourages the latent distribution to have zero mean and identity covariance. This prevents the latent space from collapsing to a low-dimensional subspace and makes alignment easier.
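The idea behind the regularizer can be sketched in NumPy (this is a schematic of the penalty's structure, not the project's TensorFlow implementation):

```python
import numpy as np

def whitening_penalty(z):
    """Zero iff the latent batch has zero mean and identity covariance."""
    mu = z.mean(axis=0)
    zc = z - mu
    cov = zc.T @ zc / (len(z) - 1)
    d = z.shape[1]
    # Penalize nonzero mean and deviation of covariance from identity
    return np.sum(mu ** 2) + np.sum((cov - np.eye(d)) ** 2)

rng = np.random.default_rng(0)
z_white = rng.normal(size=(10000, 4))     # already-white latents
z_collapsed = np.zeros((100, 4))          # fully collapsed latents
# The collapsed batch is penalized; the white batch is (near) free
```

A collapsed latent space has zero covariance, so the penalty pushes directly against the failure mode described above.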
Results and Evaluation
Q: Cross-modal REE is much higher than same-modal REE
This usually means the alignment stage did not converge properly. Check:
Were the Level 1 autoencoders well-trained? (same-modal REE should be low)
Did the alignment loss decrease during training?
Try running alignment for more epochs
Check that all modality latent vectors have similar magnitude distributions (L2 norms)
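The last check is a one-liner per modality. In this diagnostic sketch the latent arrays are synthetic placeholders for whatever your pipeline loads; a large ratio between mean norms flags a scale mismatch that will hinder alignment:

```python
import numpy as np

def norm_stats(z):
    """Mean and std of per-sample L2 norms of a latent batch."""
    norms = np.linalg.norm(z, axis=1)
    return norms.mean(), norms.std()

rng = np.random.default_rng(1)
z_mod_a = rng.normal(size=(500, 32))         # placeholder latents
z_mod_b = 5.0 * rng.normal(size=(500, 32))   # suspiciously larger scale

mean_a, _ = norm_stats(z_mod_a)
mean_b, _ = norm_stats(z_mod_b)
scale_ratio = mean_b / mean_a                # >> 1 flags a mismatch
```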
Q: One modality has dramatically higher error than others
Common causes:
Resolution too high: The 256x256 autoencoder may need a deeper architecture
Insufficient training: That modality’s autoencoder did not converge
Data issue: Check for NaN or inf values in that modality’s data files
Alignment outlier: The modality’s latent space may have a fundamentally different structure
Q: Test error is much higher than training error
This indicates overfitting. The most likely culprit:
Small dataset: Generate more PDE samples
Model too large: Reduce the number of parameters
No early stopping: Add validation-based early stopping
Q: Reconstructed solutions show checkerboard artifacts
This is a known issue with transposed convolutions in the decoder. Solutions:
Replace transposed convolutions with upsampling + convolution (resize-convolution)
Add a smoothing layer after the decoder output
Reduce the decoder’s stride to 1 and use explicit upsampling
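The resize-convolution idea can be shown framework-free: upsample first, then convolve, so every output pixel is produced by the same number of kernel applications (the uneven overlap of strided transposed convolutions is what causes the checkerboard). Here a fixed averaging kernel stands in for the learned weights:

```python
import numpy as np
from scipy.signal import convolve2d

def resize_conv(x, kernel):
    """Nearest-neighbor 2x upsampling followed by a convolution."""
    up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
    return convolve2d(up, kernel, mode="same")

x = np.arange(16.0).reshape(4, 4)
k = np.full((3, 3), 1.0 / 9.0)     # stand-in for learned weights
y = resize_conv(x, k)              # (8, 8) output, uniform coverage
```

In Keras terms this corresponds to an UpSampling2D layer followed by a stride-1 Conv2D in place of Conv2DTranspose.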
Q: Reconstructed solutions are too smooth
The autoencoder may be losing high-frequency information. Solutions:
Increase the latent dimension (try 64 or 128)
Add skip connections (U-Net style) to the autoencoder
Use a perceptual or gradient-based loss in addition to REE
Q: How do I visualize the latent space?
The latent space is 32-dimensional, so direct visualization requires projection:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# lat_2_u32 and lat_2_u128 are (N, 32) arrays of aligned latent vectors
N = lat_2_u32.shape[0]
pca = PCA(n_components=2)
z_all = np.vstack([lat_2_u32, lat_2_u128])
z_proj = pca.fit_transform(z_all)
plt.scatter(z_proj[:N, 0], z_proj[:N, 1], label='u_32', alpha=0.5)
plt.scatter(z_proj[N:, 0], z_proj[N:, 1], label='u_128', alpha=0.5)
plt.legend()
plt.title('PCA of aligned latent vectors')
plt.show()
If the modalities form separate clusters, alignment has failed. If they intermingle, alignment is working.
Extending the Project
Q: Can I add a new grid resolution?
Yes. Modify cd2d_streamfunc.py to include the new resolution, then retrain all downstream stages. The pipeline is designed to be modular – adding a resolution means adding a new modality.
Q: Can I use a different PDE?
The autoencoder and alignment pipeline is PDE-agnostic. To use a different PDE:
Replace the data generation script with one that solves your PDE
Ensure output files follow the same format (.npy, indexed consistently)
Retrain all autoencoders and alignment from scratch
Q: Can I use PyTorch instead of TensorFlow?
The project currently uses TensorFlow/Keras. Porting to PyTorch would require rewriting the model definitions and training loops, but the data pipeline, evaluation metrics, and analysis scripts are framework-independent.
Keypoints
Most installation issues stem from TensorFlow or pypardiso dependencies – check Python version and Intel MKL availability
Data files are not committed to the repository; generate them with cd2d_streamfunc.py
High cross-modal REE usually indicates alignment issues, not autoencoder issues
Overfitting manifests as train-test gap; address with more data, less capacity, or early stopping
Visual artifacts (smoothing, checkerboards) point to specific architectural changes
The pipeline is modular: new resolutions, PDEs, or frameworks can be substituted at individual stages