AIS Data and Graph Construction

Objectives

  • Understand the Automatic Identification System (AIS) and its role in maritime monitoring

  • Know why velocity, distance to shore, and curvature are discriminative features

  • Understand the graph construction code that converts numpy arrays to DGL graphs

  • Know the dataset statistics and bootstrap split methodology

AIS Data Availability

AIS data is one of the most accessible sources of maritime intelligence. Under the SOLAS convention, the International Maritime Organization (IMO) requires AIS transponders on ships of 300 gross tonnage and upwards on international voyages, on cargo ships of 500 gross tonnage and upwards on domestic voyages, and on all passenger ships regardless of size. In Norwegian waters, the Norwegian Coastal Administration provides historical AIS data through kystverket.no, and global AIS feeds are available from commercial providers such as MarineTraffic and Spire. This demonstrator uses pre-processed AIS data from Norwegian coastal waters, where fishing activity is a significant component of vessel traffic.

Automatic Identification System (AIS)

AIS is a maritime tracking system originally designed for collision avoidance. Vessels broadcast their position and status at regular intervals: typically every 2-30 seconds while underway, depending on transponder class and speed, and less often when anchored or moored. Each AIS message contains:

| Field | Description | Update Rate |
|---|---|---|
| MMSI | Unique vessel identifier | Static |
| Position (lat, lon) | GPS coordinates | 2-30 seconds |
| Speed over ground (SOG) | Vessel speed in knots | 2-30 seconds |
| Course over ground (COG) | Direction of travel | 2-30 seconds |
| Heading | Direction the bow points | 2-30 seconds |
| Vessel type | Ship category code | Static |
| Navigation status | Underway, anchored, fishing, etc. | Variable |

For classification purposes, raw AIS messages are aggregated into fixed-length time-series segments representing vessel behavior over a defined time window.
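
The aggregation step can be illustrated with a small sketch. The source does not say how the segments are built, so everything here is an assumption: the one-hour window, the linear interpolation, and the helper name resample_track are all hypothetical. The sketch resamples irregularly timed SOG readings onto 12 evenly spaced time steps.

```python
import numpy as np

def resample_track(timestamps, values, window_start, window_seconds, num_steps=12):
    """Linearly interpolate an irregular series onto num_steps evenly spaced times.

    Hypothetical helper: the real preprocessing may aggregate differently.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Evenly spaced sample times covering the window, end point included.
    grid = np.linspace(window_start, window_start + window_seconds, num_steps)
    # np.interp needs increasing x values; sort in case messages arrived out of order.
    order = np.argsort(timestamps)
    return np.interp(grid, timestamps[order], values[order])

# Made-up one-hour window: message times in seconds and SOG in knots.
t = [0, 400, 900, 1500, 2100, 2800, 3600]
sog = [10.0, 9.0, 4.0, 3.0, 3.5, 4.0, 11.0]
segment = resample_track(t, sog, window_start=0, window_seconds=3600)
print(segment.shape)  # (12,)
```

Linear interpolation is reasonable for speed, but angular fields such as COG wrap around at 360 degrees and would need separate handling.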

Feature Engineering

From raw AIS data, three features are extracted for each of the 12 time steps. These features were chosen because they capture distinct aspects of fishing behavior:

| Feature | Description | Why It Is Discriminative |
|---|---|---|
| Velocity | Speed of the vessel (derived from SOG) | Fishing vessels typically operate at lower and more variable speeds than transit vessels. Trawlers maintain 2-5 knots while dragging nets, compared to 10-15 knots for transit. Speed variability is also higher during fishing as vessels adjust to catch conditions. |
| Distance to shore | Proximity to the coastline | Fishing often occurs in specific zones – continental shelves, banks, and areas with known fish aggregations. Coastal fishing vessels operate within 12-50 nautical miles, while transit vessels often take more direct offshore routes. |
| Curvature | Rate of course change (derived from COG differences) | Fishing involves more frequent and sharper turns than transit. Vessels circling fish schools, setting longlines, or trawling in patterns produce high curvature values. Transit vessels maintain nearly straight courses with curvature close to zero. |

Figure: Feature distributions comparing fishing and non-fishing vessels. Distribution of the three features for fishing (orange) and non-fishing (blue) vessel trajectories. Fishing vessels show lower, more variable speeds; closer proximity to shore; and higher trajectory curvature.
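
To make "derived from COG differences" concrete, here is a minimal sketch of one way to compute a curvature-like series. The exact definition used for the dataset is not given in this section, so the function names and the choice of absolute per-step course change are assumptions. The key detail is wrapping angle differences so that a turn across north (350 to 10 degrees) counts as 20 degrees, not 340.

```python
def heading_change(cog_deg):
    """Signed course change between consecutive COG samples, in degrees.

    Differences are wrapped into [-180, 180) so that crossing 0/360
    does not produce a spurious near-full-circle turn.
    """
    return [((b - a + 180.0) % 360.0) - 180.0
            for a, b in zip(cog_deg, cog_deg[1:])]

def curvature_series(cog_deg):
    """Absolute course change per step; the first step is padded with 0.0
    (an assumption, so the output has the same length as the input)."""
    return [0.0] + [abs(d) for d in heading_change(cog_deg)]

# Made-up examples: a nearly straight transit and a turning fishing track.
transit = [90.0, 91.0, 90.0, 89.0, 90.0]
fishing = [350.0, 10.0, 60.0, 120.0, 200.0]
print(curvature_series(transit))  # [0.0, 1.0, 1.0, 1.0, 1.0]
print(curvature_series(fishing))  # [0.0, 20.0, 50.0, 60.0, 80.0]
```

The transit track stays close to zero while the fishing track shows large values, which matches the qualitative pattern described in the feature table.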

Graph Construction Code

The AISTimeseriesDataset class in src/graph_classification/ais_timeseries_dataset.py converts numpy arrays to DGL graphs. Here is the core graph construction logic:

import dgl
import torch
import numpy as np

def build_graph_from_features(features, num_timesteps=12):
    """Convert a single AIS trajectory to a DGL graph.

    Args:
        features: numpy array of shape (3, 12) -- 3 features x 12 time steps
        num_timesteps: number of time steps (nodes in the graph)

    Returns:
        DGL graph with node features and self-loops
    """
    # Create sequential edges: 0->1, 1->2, ..., 10->11
    edge_list = [(i, i + 1) for i in range(num_timesteps - 1)]

    # Transpose features to (12, 3) -- one row per node
    node_features = torch.tensor(features.T, dtype=torch.float32)

    # Create the DGL graph
    graph = dgl.graph(edge_list)

    # Assign node features
    graph.ndata['attr'] = node_features

    # Add self-loops to avoid 0-in-degree errors during message passing
    graph = dgl.add_self_loop(graph)

    return graph

Each constructed graph has the following structure:

| Property | Value |
|---|---|
| Nodes | 12 (one per time step) |
| Sequential edges | 11 (chain: 0-1, 1-2, …, 10-11) |
| Self-loop edges | 12 (one per node) |
| Total edges | 23 |
| Node feature dimension | 3 (velocity, distance to shore, curvature) |
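
The edge counts can be checked without DGL. The sketch below is plain Python; the helper chain_with_self_loops is hypothetical and only mirrors the edge structure that build_graph_from_features produces. It also shows why the self-loops matter: the chain edges are directed forward, so node 0 receives no message from any other node.

```python
def chain_with_self_loops(num_timesteps=12):
    """Return (src, dst) lists for a forward chain plus one self-loop per node."""
    chain = [(i, i + 1) for i in range(num_timesteps - 1)]  # 0->1, ..., 10->11
    loops = [(i, i) for i in range(num_timesteps)]          # 0->0, ..., 11->11
    edges = chain + loops
    src = [u for u, _ in edges]
    dst = [v for _, v in edges]
    return src, dst

src, dst = chain_with_self_loops()
in_degree = [dst.count(node) for node in range(12)]

print(len(src))      # 23 edges in total: 11 chain + 12 self-loops
print(in_degree[0])  # 1: node 0 only receives its own self-loop
print(in_degree[1])  # 2: one chain edge from node 0 plus the self-loop
```

Without the self-loops, node 0 would have in-degree 0, which is the case the comment in the graph construction code guards against.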

The full dataset is loaded and converted in batch by the AISTimeseriesDataset class, which extends DGLDataset and handles caching, saving, and loading of the graph objects.

Dataset Statistics

The dataset contains approximately 23,500 AIS trajectory samples collected from Norwegian coastal waters:

| Split | Samples | Percentage | Purpose |
|---|---|---|---|
| Training | 14,100 | 60% | Model training |
| Validation | 4,700 | 20% | Hyperparameter tuning, early stopping |
| Test | 4,700 | 20% | Final evaluation |
| Total | ~23,500 | 100% | |

Class Distribution

The dataset is approximately balanced between the two classes:

| Class | Label | Description |
|---|---|---|
| Non-fishing | 0 | Transit, anchored, maneuvering, or other non-fishing activities |
| Fishing | 1 | Active fishing operations (trawling, longlining, purse seining, etc.) |

Data Format

The raw data is stored as three numpy files:

import numpy as np

# Features: (N, 3, 12) -- N samples, 3 features, 12 time steps
X = np.load('data/X_ts12.npy')
print(f'Features shape: {X.shape}')  # (23500, 3, 12)

# Labels: (N,) -- binary classification
y = np.load('data/y_ts12.npy')
print(f'Labels shape: {y.shape}')    # (23500,)
print(f'Class distribution: {np.bincount(y.astype(int))}')

# Bootstrap indices: (50, N) -- 50 different splits
bidx = np.load('data/bidx_ts12.npy')
print(f'Bootstrap shape: {bidx.shape}')  # (50, 23500)
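
The (N, 3, 12) feature layout relates to the graph's node features through a transpose. The sketch below illustrates this with a small random array standing in for X_ts12.npy. The feature order (velocity, distance to shore, curvature) is assumed from the order in which the features are listed in this section, not confirmed from the data files.

```python
import numpy as np

# Synthetic stand-in for X_ts12.npy: 4 samples, 3 features, 12 time steps.
rng = np.random.default_rng(0)
X_demo = rng.random((4, 3, 12))

sample = X_demo[0]        # shape (3, 12): one row per feature
node_features = sample.T  # shape (12, 3): one row per time step (graph node)

# Row t of node_features holds the three feature values at time step t.
assert np.array_equal(node_features[5], sample[:, 5])
print(sample.shape, node_features.shape)  # (3, 12) (12, 3)
```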

Bootstrap Split Methodology

The dataset includes 50 pre-computed train/val/test splits (bootstrap indices) to enable robust evaluation. Each split assigns every sample one of three roles:

| Index Value | Role | Description |
|---|---|---|
| 1 | Training | Used for model weight updates |
| 2 | Validation | Used for early stopping and hyperparameter selection |
| 3 | Test | Used for final accuracy reporting |
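
As a sketch of how the role codes can be turned into index arrays, the following uses a tiny hand-written array in place of bidx_ts12.npy. The helper name split_indices is hypothetical; the actual dataset class may do this differently.

```python
import numpy as np

TRAIN, VAL, TEST = 1, 2, 3  # role codes from the table above

def split_indices(bidx, bootstrap_index):
    """Return sample indices per role for one row of the bootstrap array."""
    row = bidx[bootstrap_index]
    return {
        "train": np.flatnonzero(row == TRAIN),
        "val": np.flatnonzero(row == VAL),
        "test": np.flatnonzero(row == TEST),
    }

# Synthetic stand-in for bidx_ts12.npy: 2 splits over 10 samples.
bidx_demo = np.array([
    [1, 1, 1, 1, 1, 1, 2, 2, 3, 3],
    [3, 1, 2, 1, 1, 3, 1, 2, 1, 1],
])
splits = split_indices(bidx_demo, bootstrap_index=1)
print(splits["train"])  # [1 3 4 6 8 9]
print(splits["val"])    # [2 7]
print(splits["test"])   # [0 5]
```

The resulting index arrays can be used directly to select rows of the feature and label arrays, for example X[splits["train"]].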

Passing the --bootstrap_index N flag (N from 0 to 49) selects a specific split. When no bootstrap index is specified (the default), a combined split is used. Repeating an experiment across several bootstrap indices yields a distribution of test accuracies, so a mean and confidence interval can be reported rather than a single-split figure.
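
One simple way to summarize results across splits is a mean with a normal-approximation interval. This is a sketch only: the accuracy values below are made up, the helper mean_and_ci is hypothetical, and because the splits are drawn from the same samples the runs are not independent, so the interval should be read as a rough indication.

```python
import math
import statistics

def mean_and_ci(accuracies, z=1.96):
    """Mean and approximate 95% interval (normal approximation) for a list of accuracies."""
    mean = statistics.fmean(accuracies)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, (mean - z * sem, mean + z * sem)

# Made-up test accuracies from five hypothetical bootstrap runs.
accs = [0.91, 0.93, 0.92, 0.90, 0.94]
mean, (low, high) = mean_and_ci(accs)
print(f"{mean:.3f} [{low:.3f}, {high:.3f}]")
```

With only a handful of runs, a Student t quantile would be a more careful choice than z = 1.96; with all 50 splits the two are close.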

Keypoints

  • AIS is a maritime tracking system, mandatory for most larger vessels, that broadcasts position, speed, and course data

  • Three features are extracted: velocity, distance to shore, and curvature – each captures a different aspect of fishing behavior

  • Each trajectory is converted to a chain graph with 12 nodes, 11 sequential edges, and 12 self-loops

  • The dataset contains ~23,500 samples with approximately balanced classes

  • 50 bootstrap splits enable robust evaluation with confidence intervals

  • The AISTimeseriesDataset class handles conversion from numpy arrays to DGL graphs with caching