FAQ and Troubleshooting

Objectives

Find solutions to common installation, training, and evaluation issues
Understand the design decisions behind key project choices
Know where to look when results are unexpected
Get guidance on extending the project for new experiments

How to use this page

This episode is organized into five sections: Installation and Setup, Data and PDE, Training and Models, Results and Evaluation, and Extending the Project. Use Ctrl+F (Cmd+F on macOS) to search for specific error messages or keywords.

Installation and Setup

Q: TensorFlow fails to install or import

Symptoms: pip install tensorflow fails, or import tensorflow raises errors about missing shared libraries.

Solutions:

Ensure Python 3.8+ is installed: python3 --version
On NAIC VMs, run ./setup.sh which handles version detection
If you see libcudart.so errors, these are GPU-related and can be ignored for CPU-only use
On Apple Silicon Macs, TensorFlow releases before 2.13 require the tensorflow-macos package; newer releases install with plain pip install tensorflow

Q: pypardiso will not install

Symptoms: pip install pypardiso fails with compiler errors or missing MKL libraries.

Solutions:

Install Intel MKL first: pip install mkl
On Ubuntu/Debian: sudo apt install -y libmkl-dev
On systems without MKL (ARM, some HPC nodes), pypardiso cannot be used. Modify the solver to use scipy.sparse.linalg.spsolve as a fallback
If using conda: conda install -c conda-forge pypardiso
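
The fallback mentioned above can be sketched as a small wrapper. This is a minimal sketch, not the project's solver code: the name solve_sparse is hypothetical, and it assumes the solver only needs the spsolve(A, b) call, which both pypardiso and scipy.sparse.linalg provide.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve as scipy_spsolve

try:
    # pypardiso offers spsolve(A, b) as a drop-in for the scipy function
    from pypardiso import spsolve as pardiso_spsolve
    HAVE_PARDISO = True
except (ImportError, OSError):
    # Raised when pypardiso or the MKL shared library is unavailable
    pardiso_spsolve = None
    HAVE_PARDISO = False


def solve_sparse(A, b):
    """Solve A x = b with Pardiso when available, otherwise with scipy (hypothetical helper)."""
    A = sp.csr_matrix(A)  # both backends accept CSR input
    if HAVE_PARDISO:
        return pardiso_spsolve(A, b)
    return scipy_spsolve(A, b)


# Tiny usage example on a 2x2 system
A_demo = sp.csr_matrix(np.array([[4.0, 1.0], [1.0, 3.0]]))
b_demo = np.array([1.0, 2.0])
x_demo = solve_sparse(A_demo, b_demo)
```

The scipy path is usually slower on large systems, so this mainly helps on machines where MKL cannot be installed at all.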

Q: GPU is not detected by TensorFlow

Symptoms: tf.config.list_physical_devices('GPU') returns an empty list.

Solutions:

Verify the GPU is visible to the OS: nvidia-smi
Check CUDA toolkit: nvcc --version
Run ./setup.sh which creates CUDA library symlinks
Ensure tensorflow[and-cuda] is installed (not plain tensorflow)
GPU is recommended but not required – all experiments run on CPU

Q: Jupyter Lab is not accessible from my browser

Symptoms: Opening http://localhost:8888 shows “connection refused” or times out.

Solutions:

Verify Jupyter is running on the VM: ps aux | grep jupyter
Check the SSH tunnel is active on your local machine
Ensure port 8888 is not already in use (try port 9999 instead)
If using a VPN, the tunnel may not route correctly – try disabling it
Start Jupyter with explicit IP binding: --ip=127.0.0.1

Q: SSH connection is refused or times out

Solutions:

Verify the VM is running at orchestrator.naic.no
Check that your current IP is whitelisted in the Orchestrator
Set correct key permissions: chmod 600 /path/to/key.pem
If the VM was reprovisioned: ssh-keygen -R <old_ip>

Data and PDE

Q: What PDE is being solved?

The project solves the steady-state advection-diffusion equation for a scalar field $u$ (such as temperature or concentration) on the domain $[-1, 1]^2$:

$$\nabla \cdot (\mathbf{v}u) - \nabla \cdot (D \nabla u) = 0$$

Diffusion ($D$): Kept constant at 1.0.
Advection ($\mathbf{v}$): The velocity field is derived from a streamfunction $\psi$, which makes it divergence-free by construction.

Dirichlet boundary conditions are applied. The streamfunction is parameterized by a small number of coefficients (typically 10-20), making it a natural low-dimensional representation of the PDE family.

Q: Why divergence-free velocity fields?

Divergence-free fields ($\nabla \cdot \mathbf{v} = 0$) satisfy mass conservation. Using a streamfunction parameterization guarantees this property by construction: if $\mathbf{v} = \nabla \times \psi$, then $\nabla \cdot \mathbf{v} = \nabla \cdot (\nabla \times \psi) = 0$. This ensures physically meaningful velocity fields without additional constraints.
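
The identity can be checked numerically. The sketch below makes several assumptions that are not taken from the project: the 2-D sign convention $\mathbf{v} = (\partial\psi/\partial y, -\partial\psi/\partial x)$, an arbitrary example streamfunction, and finite differences via np.gradient rather than the project's own discretization.

```python
import numpy as np

n = 64
x = np.linspace(-1.0, 1.0, n)
h = x[1] - x[0]  # uniform grid spacing in both directions
X, Y = np.meshgrid(x, x, indexing="ij")  # axis 0 is x, axis 1 is y

# Arbitrary smooth streamfunction, chosen only for illustration
psi = np.sin(np.pi * X) * np.cos(0.5 * np.pi * Y)

# Velocity from the streamfunction (assumed convention)
dpsi_dx, dpsi_dy = np.gradient(psi, h, h)
vx, vy = dpsi_dy, -dpsi_dx

# Discrete divergence: d(vx)/dx + d(vy)/dy
div_v = np.gradient(vx, h, axis=0) + np.gradient(vy, h, axis=1)
max_div = np.max(np.abs(div_v))
print(f"max |div v| = {max_div:.2e}, max |v| = {np.max(np.hypot(vx, vy)):.2f}")
```

The difference operators along x and y commute, so the discrete divergence vanishes up to floating-point rounding even though the velocity itself is far from zero.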

Q: Why multiple grid resolutions?

Multiple resolutions serve two purposes:

Multi-resolution consistency testing: Does the latent representation capture the physics regardless of grid fineness?
Cross-resolution transfer: Can we encode a coarse solution and decode a fine one (a form of learned super-resolution)?

The resolutions 16x16 through 256x256 span a 256x range in degrees of freedom, from very coarse to moderately fine.

Q: How are the PDE solutions computed?

Solutions are computed using the finite element method (FEM):

Assemble stiffness and convection matrices for the given grid
Apply boundary conditions
Solve the resulting sparse linear system using pypardiso (Intel MKL Pardiso solver)
Store the solution as a flat NumPy array, reshaped to the grid dimensions
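
The assemble, apply boundary conditions, solve pattern above can be illustrated on a much smaller problem. The sketch below is not the project's FEM code: it swaps in a 1-D central finite-difference discretization of the same steady equation with constant $D$ and $v$, and uses scipy's spsolve instead of pypardiso. The function name, boundary values, and defaults are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve


def solve_advection_diffusion_1d(n=201, D=1.0, v=1.0, u_left=0.0, u_right=1.0):
    """Solve v u' - D u'' = 0 on [-1, 1] with Dirichlet values at both ends."""
    x = np.linspace(-1.0, 1.0, n)
    h = x[1] - x[0]

    # Step 1: assemble the diffusion and advection stencils (central differences)
    main = np.full(n, 2.0 * D / h**2)
    lower = np.full(n - 1, -D / h**2 - v / (2.0 * h))
    upper = np.full(n - 1, -D / h**2 + v / (2.0 * h))

    # Step 2: apply Dirichlet conditions by turning the end rows into identity rows
    main[0] = main[-1] = 1.0
    upper[0] = 0.0   # entry A[0, 1]
    lower[-1] = 0.0  # entry A[n-1, n-2]
    rhs = np.zeros(n)
    rhs[0], rhs[-1] = u_left, u_right

    # Step 3: solve the sparse linear system
    A = diags([lower, main, upper], offsets=[-1, 0, 1], format="csr")
    return x, spsolve(A, rhs)


x_fd, u_fd = solve_advection_diffusion_1d()
# Closed-form solution for D = v = 1 with u(-1) = 0 and u(1) = 1
u_exact = (np.exp(x_fd + 1.0) - 1.0) / (np.exp(2.0) - 1.0)
max_err = np.max(np.abs(u_fd - u_exact))
```

Comparing against the closed-form solution is a cheap way to confirm that the signs of the advection and diffusion terms are assembled consistently.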

Q: I see `data/*.npy` files missing

Run the data generation step first:

python src/cd2d_streamfunc.py

Data files are not committed to the repository because they are large and easily regenerated.

Training and Models

Q: Why autoencoders instead of neural operators?

This project targets representation learning, not surrogate modeling:

Neural operators (DeepONet, FNO) learn the map from parameters to solutions
Autoencoders learn a compressed representation of the solution space itself

The autoencoder approach enables cross-modal transfer, interpolation in latent space, and resolution-invariant embeddings – capabilities that neural operators do not directly provide.

Q: What is the typical latent dimension?

The default latent dimension is 32 across all modalities. This was chosen as a balance:

Large enough to capture the dominant variation in the solution manifold
Small enough to enable efficient alignment and meaningful geometric analysis
Same dimension for all modalities (required for direct latent comparison)

Q: Why Relative Energy Error (REE) instead of MSE?

PDE solutions can vary over many orders of magnitude depending on the parameters. Standard MSE would be dominated by high-amplitude solutions and insensitive to errors in low-amplitude ones. REE normalizes by the solution energy, making the metric scale-invariant:

$$\mathrm{REE} = \frac{\| x - \hat{x} \|^2}{\| x \|^2}$$
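
A minimal NumPy sketch of this metric, compared with MSE on the same relative error at two amplitudes. The function names are illustrative and not taken from the project's code, and the sketch assumes the reference field is not identically zero.

```python
import numpy as np


def relative_energy_error(x, x_hat):
    """REE = ||x - x_hat||^2 / ||x||^2 (assumes x is not identically zero)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.sum((x - x_hat) ** 2) / np.sum(x ** 2)


def mean_squared_error(x, x_hat):
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2)


rng = np.random.default_rng(0)
u_small = rng.standard_normal((32, 32))  # low-amplitude solution
u_large = 1000.0 * u_small               # same shape, 1000x the amplitude

# Both reconstructions are off by the same 10% relative error
for name, u in [("small", u_small), ("large", u_large)]:
    u_hat = 0.9 * u
    print(name, relative_energy_error(u, u_hat), mean_squared_error(u, u_hat))
```

REE is the same for both fields, while MSE grows with the square of the amplitude.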

Q: Training loss is not decreasing

Possible causes and solutions:

Learning rate too high: Reduce by a factor of 10
Latent dimension too small: Try increasing from 32 to 64
Data issue: Verify that data files loaded correctly (check shapes, NaN values)
Architecture mismatch: For high-resolution grids, the convolutional autoencoder may need more layers
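
The data check in the list above can be sketched as a small helper. The function name and the example shape are hypothetical; in practice the array would come from np.load on one of the generated files.

```python
import numpy as np


def find_data_problems(arr, expected_shape=None):
    """Return a list of human-readable problems found in a loaded array."""
    arr = np.asarray(arr, dtype=float)
    problems = []
    if expected_shape is not None and arr.shape != tuple(expected_shape):
        problems.append(f"shape {arr.shape} does not match expected {tuple(expected_shape)}")
    n_nan = int(np.isnan(arr).sum())
    n_inf = int(np.isinf(arr).sum())
    if n_nan:
        problems.append(f"{n_nan} NaN values")
    if n_inf:
        problems.append(f"{n_inf} infinite values")
    return problems


# Synthetic stand-in for a batch of 100 solutions on a 16x16 grid
clean = np.zeros((100, 16, 16))
broken = clean.copy()
broken[3, 0, 0] = np.nan
print(find_data_problems(clean, expected_shape=(100, 16, 16)))
print(find_data_problems(broken, expected_shape=(100, 16, 16)))
```

An empty list means the array passed all checks; anything else is worth fixing before touching the learning rate or architecture.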

Q: Training loss decreases but validation loss increases

This indicates overfitting. Solutions:

Reduce model capacity (fewer layers, fewer filters)
Add dropout or regularization
Increase the training set size (generate more PDE samples)
Use early stopping based on validation REE
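
Early stopping can be sketched without any framework. The class below is a hypothetical helper, not project code, and the validation values are made up for illustration. In a Keras training loop, the built-in tf.keras.callbacks.EarlyStopping callback (with its monitor and patience arguments) plays the same role.

```python
class EarlyStopper:
    """Stop when the validation REE has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, val_ree):
        """Record one epoch; return True when training should stop."""
        if val_ree < self.best - self.min_delta:
            self.best = val_ree
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation curve: improves, then starts to overfit
val_history = [0.50, 0.30, 0.20, 0.21, 0.22, 0.25, 0.30]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_ree in enumerate(val_history):
    if stopper.update(epoch, val_ree):
        stopped_at = epoch
        break
```

After the loop, stopper.best_epoch identifies the checkpoint to restore, which is usually the point of early stopping rather than the stop itself.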

Q: What is latent whitening?

The streamfunction autoencoder applies latent whitening regularization: a penalty term that encourages the latent distribution to have zero mean and identity covariance. This prevents the latent space from collapsing to a low-dimensional subspace and makes alignment easier.
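
One possible form of such a penalty is sketched below in NumPy. The exact expression used in the project is not given here, so the squared-mean plus squared Frobenius distance to the identity is an assumption, as is the function name. In training, the same quantity would be computed on each batch of latent codes with the framework's tensor operations.

```python
import numpy as np


def whitening_penalty(z):
    """Penalty that is zero when z has zero mean and identity covariance (assumed form)."""
    z = np.asarray(z, dtype=float)
    mean = z.mean(axis=0)
    centered = z - mean
    cov = centered.T @ centered / z.shape[0]
    mean_term = np.sum(mean ** 2)
    cov_term = np.sum((cov - np.eye(z.shape[1])) ** 2)
    return mean_term + cov_term


rng = np.random.default_rng(0)
z_white = rng.standard_normal((5000, 4))  # approximately whitened latent codes
z_collapsed = z_white.copy()
z_collapsed[:, 1:] = 0.0                  # latent space collapsed onto one axis

penalty_white = whitening_penalty(z_white)
penalty_collapsed = whitening_penalty(z_collapsed)
print(penalty_white, penalty_collapsed)
```

The collapsed codes are penalized heavily because three of the four variances are zero instead of one, which is the failure mode the regularizer is meant to discourage.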

Results and Evaluation

Q: One modality has dramatically higher error than others

Common causes:

Resolution too high: The 256x256 autoencoder may need a deeper architecture
Insufficient training: That modality’s autoencoder did not converge
Data issue: Check for NaN or inf values in that modality’s data files
Alignment outlier: The modality’s latent space may have a fundamentally different structure

Q: Test error is much higher than training error

This indicates overfitting. Likely causes and fixes:

Small dataset: Generate more PDE samples
Model too large: Reduce the number of parameters
No early stopping: Add validation-based early stopping

Q: Reconstructed solutions show checkerboard artifacts

This is a known issue with transposed convolutions in the decoder. Solutions:

Replace transposed convolutions with upsampling + convolution (resize-convolution)
Add a smoothing layer after the decoder output
Reduce the decoder’s stride to 1 and use explicit upsampling

Q: Reconstructed solutions are too smooth

The autoencoder may be losing high-frequency information. Solutions:

Increase the latent dimension (try 64 or 128)
Add skip connections (U-Net style) to the autoencoder
Use a perceptual or gradient-based loss in addition to REE

Q: How do I visualize the latent space?

The latent space is 32-dimensional, so direct visualization requires projection:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# lat_2_u32 and lat_2_u128 are (N, 32) arrays of aligned latent vectors
N = lat_2_u32.shape[0]

# Fit a single PCA on both modalities so they share one 2-D projection
pca = PCA(n_components=2)
z_all = np.vstack([lat_2_u32, lat_2_u128])
z_proj = pca.fit_transform(z_all)

# The first N rows belong to u_32, the remaining rows to u_128
plt.scatter(z_proj[:N, 0], z_proj[:N, 1], label='u_32', alpha=0.5)
plt.scatter(z_proj[N:, 0], z_proj[N:, 1], label='u_128', alpha=0.5)
plt.legend()
plt.title('PCA of aligned latent vectors')
plt.show()

If the modalities form separate clusters, alignment has likely failed. If they intermingle, alignment is probably working, although a 2-D projection shows only the two directions of largest variance.

Extending the Project

Q: Can I add a new grid resolution?

Yes. Modify cd2d_streamfunc.py to include the new resolution, then retrain all downstream stages. The pipeline is designed to be modular – adding a resolution means adding a new modality.

Q: Can I use a different PDE?

The autoencoder and alignment pipeline is PDE-agnostic. To use a different PDE:

Replace the data generation script with one that solves your PDE
Ensure output files follow the same format (.npy, indexed consistently)
Retrain all autoencoders and alignment from scratch

Q: Can I use PyTorch instead of TensorFlow?

The project currently uses TensorFlow/Keras. Porting to PyTorch would require rewriting the model definitions and training loops, but the data pipeline, evaluation metrics, and analysis scripts are framework-independent.

Keypoints

Most installation issues stem from TensorFlow or pypardiso dependencies – check Python version and Intel MKL availability
Data files are not committed to the repository; generate them with cd2d_streamfunc.py
High cross-modal REE usually indicates alignment issues, not autoencoder issues
Overfitting manifests as train-test gap; address with more data, less capacity, or early stopping
Visual artifacts (smoothing, checkerboards) point to specific architectural changes
The pipeline is modular: new resolutions, PDEs, or frameworks can be substituted at individual stages
