FAQ and Troubleshooting
Objectives
Find answers to common questions about the GNN demonstrator
Diagnose and resolve common installation and runtime errors
Understand model behavior: early stopping, region force model, attention performance
Know how to extend the demonstrator with new features or custom data
Frequently Asked Questions
What Python version do I need?
Python 3.10 or higher is required. The DGL library requires specific PyTorch/Python version combinations. You can check your version with:

```shell
python3 --version
```
Can I train without a GPU?
Yes. The training script automatically detects GPU availability. CPU training is slower but fully functional. Set `--gpu -1` to force CPU mode even if a GPU is available:

```shell
python src/graph_classification/train_graph_classification_ais.py \
    --data_folder data/ --gpu -1 --epochs 50
```
How long does training take?
Training time depends heavily on hardware:
| Hardware | Time per Epoch | 100 Epochs |
|---|---|---|
| CPU (4 cores) | ~2 minutes | ~3 hours |
| GPU (NVIDIA T4) | ~1 second | ~2 minutes |
| GPU (NVIDIA A100) | <0.5 second | <1 minute |
Early stopping typically triggers well before the maximum epoch count, so actual training time is often shorter.
How does early stopping work?
Early stopping monitors the validation accuracy after each epoch. If the validation accuracy does not improve for patience consecutive epochs, training stops and the best model checkpoint is retained.
For example, with `--patience 20`:
- Epoch 30: validation accuracy reaches 94.2% (new best)
- Epochs 31-50: no improvement
- Epoch 50: training stops; the model from epoch 30 is used
The best model is saved based on the highest validation accuracy, not the final epoch. This prevents overfitting and ensures the saved model generalizes well.
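The logic above can be sketched in a few lines. This is a minimal stand-in, not the actual training script: `run_epoch` is a hypothetical callback that trains one epoch and returns the validation accuracy and model weights.

```python
def train_with_early_stopping(run_epoch, max_epochs=200, patience=20):
    """Run `run_epoch(epoch)` until validation accuracy stops improving.

    `run_epoch` is assumed to train one epoch and return
    (validation_accuracy, model_state).
    """
    best_acc, best_epoch, best_state = 0.0, -1, None
    for epoch in range(max_epochs):
        val_acc, state = run_epoch(epoch)
        if val_acc > best_acc:
            # New best: keep this checkpoint and reset the patience window.
            best_acc, best_epoch, best_state = val_acc, epoch, state
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop early.
            break
    return best_acc, best_epoch, best_state
```

Note that the returned checkpoint is the one from the best epoch, not the last epoch executed.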
What is the region force model?
The region force model is a key architectural component that goes beyond standard GNN classification. Instead of directly using GNN output for classification, the pipeline uses an iterative refinement process:
1. An initial GCN produces a first classification estimate $u_0$
2. The region force model (GCN, GraphSAGE, or GAT) produces a correction signal $f$
3. These are combined iteratively: $u_{t+1} = \tanh(u_t + f \cdot \Delta t)$ for $T=3$ steps
This is inspired by neural ODE methods where the classification is refined over multiple “time steps” using a learned force field. The tanh activation keeps the output bounded while allowing the region force to push the classification toward the correct class over multiple iterations.
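In scalar form the update rule is easy to see. The sketch below uses plain floats as a toy stand-in for the per-node tensors in the real model; `u0` and `force` are hypothetical values.

```python
import math

def refine(u0, force, dt=1.0, steps=3):
    """Iterate u_{t+1} = tanh(u_t + force * dt) for a fixed number of steps."""
    u = u0
    for _ in range(steps):
        # tanh keeps the estimate bounded in (-1, 1) while the force
        # pushes it toward the class indicated by its sign.
        u = math.tanh(u + force * dt)
    return u
```

A positive force drives the output toward +1 and a negative force toward -1, regardless of the initial estimate, while the tanh keeps each intermediate value bounded.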
Why does GAT underperform compared to GCN and GraphSAGE?
GAT achieves 93.1% accuracy at lr=0.01, compared to 94.4% for GCN and GraphSAGE. Several factors contribute:
Chain graph regularity: In a chain graph, every non-endpoint node has exactly 2 neighbors (plus self-loop). The attention mechanism is most beneficial when neighborhood sizes and structures vary, which is not the case here.
Dropout: GAT uses 0.5 dropout on both features and attention coefficients. On a small graph with only 12 nodes, this aggressive regularization can discard useful information.
Learning rate sensitivity: At lr=0.025, GAT drops to 86.3% while GCN and GraphSAGE remain above 94%. The attention weight computation introduces additional parameters that are harder to optimize at higher learning rates.
Hidden dimension: GAT uses 32 hidden dimensions (vs. 64 for GCN/GraphSAGE) because multi-head attention multiplies the effective dimension by the number of heads.
What is the bootstrap index (bidx)?
The dataset includes 50 different train/val/test splits stored in `bidx_ts12.npy` with shape `(50, N)`. Each row defines a different partition:
Value 1 = training sample
Value 2 = validation sample
Value 3 = test sample
Use `--bootstrap_index N` (0-49) to select a specific split. When not specified, a combined split is used. Running across multiple indices provides confidence intervals for the reported accuracy.
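Turning one row of the bootstrap index into index sets looks like this (a toy `bidx` array over 8 samples; the real one is loaded from `bidx_ts12.npy`):

```python
import numpy as np

# Toy bootstrap index with a single split over 8 samples.
bidx = np.array([[1, 1, 2, 3, 1, 2, 3, 1]])

split = bidx[0]                      # row selected by --bootstrap_index 0
train_idx = np.where(split == 1)[0]  # value 1 = training
val_idx = np.where(split == 2)[0]    # value 2 = validation
test_idx = np.where(split == 3)[0]   # value 3 = test
```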
Can I add new features?
Yes, but it requires modifying the data pipeline. The current input format is `(N, 3, 12)` – 3 features per time step. To add features:
1. Prepare the data: Create a new numpy array with shape `(N, K, 12)`, where K is the new feature count
2. Update the data files: Save as `X_ts12.npy` (same filename, new shape)
3. No model changes needed: The model automatically reads `dim_nfeats` from the dataset, so it adapts to any feature dimension
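For example, a derived acceleration channel could be appended like this (a sketch with random placeholder data; it assumes channel 0 is velocity and uses `np.gradient` as a simple finite-difference over the time axis):

```python
import numpy as np

# Sketch: grow the feature axis from 3 to 4 channels.
X = np.random.rand(100, 3, 12).astype(np.float32)  # existing (N, 3, 12) features
velocity = X[:, 0:1, :]                            # assumed velocity channel
acceleration = np.gradient(velocity, axis=2)       # finite-difference over the 12 time steps
X_new = np.concatenate([X, acceleration.astype(np.float32)], axis=1)
# X_new has shape (100, 4, 12); save it as X_ts12.npy and dim_nfeats adapts.
```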
Candidate additional features:
Acceleration: Rate of speed change, useful for detecting speed-up/slow-down patterns
Heading change rate: Angular velocity, complementary to curvature
Time of day: Encoded as sin/cos components, captures diurnal fishing patterns
Depth at position: Bathymetric data, since fishing often occurs at specific depth ranges
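The time-of-day encoding mentioned above is straightforward (a sketch; `hour` is a hypothetical fractional hour in [0, 24)):

```python
import math

def encode_hour(hour):
    """Encode hour of day cyclically so 23:00 and 01:00 end up close."""
    angle = 2.0 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)
```

Midnight maps to (0, 1) and noon to (0, -1); using both components makes the representation continuous across the day boundary, which a raw hour value is not.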
How can I use this with my own AIS data?
To use the demonstrator with your own AIS data:
1. Pre-process your AIS data into fixed-length segments of 12 time steps
2. Extract the three features (velocity, distance to shore, curvature) for each time step
3. Create numpy arrays:
   - `X_ts12.npy`: shape `(N, 3, 12)`, float32
   - `y_ts12.npy`: shape `(N,)`, float32, values 0 or 1
   - `bidx_ts12.npy`: shape `(K, N)`, split indices (1=train, 2=val, 3=test)
4. Place the files in the `data/` directory
5. Run training as usual
For the bootstrap indices, you can create a simple random split:
```python
import numpy as np

N = len(your_labels)                           # number of segments
bidx = np.zeros((1, N), dtype=int)
indices = np.random.permutation(N)
n_train = int(0.6 * N)
n_val = int(0.2 * N)
bidx[0, indices[:n_train]] = 1                 # training
bidx[0, indices[n_train:n_train + n_val]] = 2  # validation
bidx[0, indices[n_train + n_val:]] = 3         # test
np.save('data/bidx_ts12.npy', bidx)
```
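The same idea extends to several independent splits, which is what the bootstrap index is for. A sketch using a seeded generator (the 60/20/20 proportions match the snippet above; `make_splits` is a hypothetical helper, not part of the repository):

```python
import numpy as np

def make_splits(N, K, seed=0):
    """Create K independent random 60/20/20 train/val/test splits."""
    rng = np.random.default_rng(seed)
    bidx = np.zeros((K, N), dtype=int)
    n_train, n_val = int(0.6 * N), int(0.2 * N)
    for k in range(K):
        idx = rng.permutation(N)
        bidx[k, idx[:n_train]] = 1                 # training
        bidx[k, idx[n_train:n_train + n_val]] = 2  # validation
        bidx[k, idx[n_train + n_val:]] = 3         # test
    return bidx
```

Fixing the seed makes the splits reproducible across runs, so accuracy differences between models are not confounded by different partitions.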
Common Issues
| Problem | Solution |
|---|---|
| `ModuleNotFoundError` for a required package | Run `pip install -r requirements.txt` inside the activated virtual environment |
| Project modules not found on import | Set `PYTHONPATH` to the repository root before running the scripts |
| Data files not found | Create the `data/` directory and place the `.npy` files in it |
| CUDA out of memory | Reduce batch size, e.g. `--batch_size 100` |
| DGL version mismatch | Ensure PyTorch and DGL CUDA versions match; delete and reinstall the mismatched package |
| Jupyter kernel not found | Run `python -m ipykernel install --user` from the virtual environment |
| Zero-in-degree node errors | Self-loops should be added automatically; check that graph construction runs before training |
| Array shape errors | Ensure the `.npy` files match the expected shapes (see the custom-data question above) |
| Training loss not decreasing | Try a lower learning rate (e.g. `1e-3`) |
| GPU detected but not used | Set the `--gpu` flag to the device index, e.g. `--gpu 0` |
Performance Tips
Speed Up Training
Use GPU: Training is 100x+ faster on GPU. Even a T4 GPU reduces 3-hour CPU runs to 2 minutes.
Increase batch size: Larger batches (1000-4000) reduce the number of gradient updates per epoch and improve GPU utilization. Adjust upward until GPU memory is fully utilized.
Reduce patience: Setting `--patience 20` instead of 200 stops training sooner once the model has converged.
Pin memory: Keep `--pin_memory True` (the default) for faster CPU-to-GPU data transfer.
Improve Accuracy
Try multiple learning rates: The default script iterates over `5e-2, 3e-2, 1e-2`. Adding smaller rates such as `5e-3` or `1e-3` may help.
Increase hidden dimensions: Use `--hidden 64` or `--hidden 128` for more model capacity.
Run multiple bootstrap splits: Average results across 5-10 splits for more reliable comparisons.
Increase depth: The models use 3 GNN layers by default. More layers capture longer-range dependencies but risk over-smoothing.
Reduce Memory Usage
Lower batch size: `--batch_size 100` uses less GPU memory at the cost of slower training.
Use CPU: Set `--gpu -1` to avoid GPU memory constraints entirely.
Reduce hidden size: `--hidden 16` uses less memory but may reduce accuracy.
Getting Help
NAIC project: naic.no
DGL documentation: docs.dgl.ai
PyTorch documentation: pytorch.org/docs
NORCE: norceresearch.no
Repository issues: GitHub Issues
Keypoints
Python 3.10+ is required; GPU is optional but recommended (100x speedup)
Early stopping saves the best model checkpoint based on validation accuracy
The region force model iteratively refines classification using a neural ODE-inspired approach
GAT underperforms due to chain graph regularity, aggressive dropout, and learning rate sensitivity
New features can be added by changing the input array shape – no model code changes needed
Custom AIS data requires pre-processing into fixed-length segments with matching numpy formats
Most runtime errors are resolved by activating the virtual environment and setting PYTHONPATH
Batch size is the primary lever for trading memory usage against training speed