Machine Learning Methodology
Objectives
Understand the data preparation process
Learn about the available ML models
Know how results are analyzed
Methodology
Data Preparation
Data Collection: Data on 65 climate indices from NORCE Climate and Environment datasets (1845-2005) is used. Additional datasets can be added to the dataset directory.
Preprocessing: Data is normalized to a uniform scale (0-100) to ensure consistency across indices.
Dataset Splitting: The data is split into training (60%) and testing (40%) sets, ensuring the model is trained only on historical data with no information leakage from the test period.
Feature Engineering: Lagged features are introduced to capture temporal dependencies, and median values are used to handle noise and errors in the data.
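The preparation steps above can be sketched in a few lines of Python. The index names, lag count, and synthetic data below are illustrative assumptions, not the project's actual inputs:

```python
# Sketch of the data-preparation pipeline: 0-100 normalization,
# lagged features, and a chronological 60/40 split.
import numpy as np
import pandas as pd

def normalize_0_100(df):
    """Rescale every column to a uniform 0-100 range (min-max)."""
    return 100 * (df - df.min()) / (df.max() - df.min())

def add_lagged_features(df, target, n_lags=3):
    """Add lagged copies of each column to capture temporal dependencies."""
    lagged = {f"{col}_lag{k}": df[col].shift(k)
              for col in df.columns for k in range(1, n_lags + 1)}
    return pd.concat([df[[target]], pd.DataFrame(lagged)], axis=1).dropna()

def chronological_split(df, train_frac=0.6):
    """Time-ordered 60/40 split, so no future information leaks into training."""
    cut = int(len(df) * train_frac)
    return df.iloc[:cut], df.iloc[cut:]

# Toy example: two synthetic indices over the 1845-2005 range
years = list(range(1845, 2006))
rng = np.random.default_rng(0)
raw = pd.DataFrame({"nao": rng.standard_normal(len(years)).cumsum(),
                    "amo": rng.standard_normal(len(years)).cumsum()},
                   index=years)
data = add_lagged_features(normalize_0_100(raw), target="nao")
train, test = chronological_split(data)
```

Splitting by position rather than at random is what prevents leakage here: every training year precedes every test year.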
Model Selection and Training
Several machine learning models are employed to predict climate indices:
Linear Regression (LR): A simple baseline model for linear relationships.
Random Forest Regressor: An ensemble model that improves accuracy and reduces overfitting by combining multiple decision trees.
Multi-Layer Perceptron (MLP) Regressor: A neural network model capable of capturing complex, non-linear relationships.
XGBoost Regressor: A gradient boosting algorithm optimized for both speed and performance, with GPU support to accelerate training.
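A minimal sketch of assembling these four candidates with scikit-learn and (optionally) XGBoost; the hyperparameters are illustrative defaults, not the project's actual settings:

```python
# Candidate models keyed by short name; hyperparameters are illustrative.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

models = {
    "LR": LinearRegression(),  # simple linear baseline
    "RF": RandomForestRegressor(n_estimators=300, random_state=0),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                        random_state=0),
}

try:
    # xgboost is optional here; with xgboost >= 2.0, pass device="cuda"
    # to XGBRegressor to use GPU acceleration when a GPU is available.
    from xgboost import XGBRegressor
    models["XGB"] = XGBRegressor(n_estimators=300, tree_method="hist",
                                 random_state=0)
except ImportError:
    pass

def fit_all(models, X_train, y_train):
    """Fit every candidate model on the same training split."""
    for model in models.values():
        model.fit(X_train, y_train)
    return models

# Tiny synthetic fit to show the interface
import numpy as np
rng = np.random.default_rng(0)
X = rng.random((40, 3))
fit_all(models, X, X @ np.array([1.0, 2.0, 3.0]))
```

Keeping the models in a dict makes it easy to loop over configurations and log results by name.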
VM Deployment
Handling large datasets and multiple models benefits from the computational resources available on NAIC Orchestrator VMs:
GPU Acceleration: XGBoost and MLP models can leverage GPU acceleration when available on the VM.
Background Processing: Long-running experiments can be executed in tmux sessions to persist beyond SSH disconnection.
Interactive Analysis: Jupyter Lab provides an interactive environment for exploring results and running experiments.
Results Analysis
Performance Metrics: Models are evaluated using correlation coefficients to assess linear agreement and mean absolute error (MAE) to assess accuracy.
Visualization: Accuracy landscapes and feature-importance plots are generated to interpret model performance and identify key predictors.
Collaboration with Climate Scientists: Domain knowledge is incorporated to refine models and interpret results meaningfully.
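The two headline metrics can be computed directly; this small example assumes nothing beyond NumPy and uses invented values:

```python
# Pearson correlation measures linear agreement between predicted and
# observed series; MAE measures average absolute prediction error.
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two series."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def mae(y_true, y_pred):
    """Mean absolute error between two series."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(pearson_r(y_true, y_pred), mae(y_true, y_pred))
```

The two metrics are complementary: a model can track the shape of a series well (high correlation) while still being biased (high MAE), so both are reported.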
Practical Implementation
Search Methodology
The main methodology involves systematically testing various ML models on climate data. The process includes generating lagged features, splitting data, applying wavelet filtering (optional), and ranking features by importance. This is done iteratively across different model configurations to find the best-performing combination.
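A toy version of that iterative search, using a Random Forest over an assumed small grid of tree counts; the synthetic data and the grid are illustrations, not the project's actual configurations:

```python
# Loop over configurations, score each on the held-out split by MAE,
# keep the best, then rank features by importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 3 * X[:, 0] + 0.1 * rng.random(200)  # feature 0 dominates by design

cut = int(0.6 * len(X))  # chronological-style 60/40 split
best = None
for n_trees in (50, 100, 200):  # hypothetical configuration grid
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X[:cut], y[:cut])
    score = float(np.mean(np.abs(model.predict(X[cut:]) - y[cut:])))  # MAE
    if best is None or score < best[0]:
        best = (score, n_trees, model)

# Rank features from most to least important for the winning model
ranking = np.argsort(best[2].feature_importances_)[::-1]
print("best n_estimators:", best[1], "top feature index:", ranking[0])
```

The same loop structure extends to the other models and to optional steps such as wavelet filtering: each step just changes what happens between "fit" and "score".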
Flowchart
Here’s a simplified flowchart of the overall process:
graph TD
A[Start] --> B[Load Data]
B --> C[Filter & Preprocess Data]
C --> D[Generate Lagged Data]
D --> E[Train Model]
E --> F[Evaluate Model]
F --> G[Log Results]
G --> H[Generate Visualizations]
H --> I[End]
Data Analysis
The data analysis methodology involves loading evaluation results, calculating performance metrics, and aggregating results to identify the best-performing model configurations. Visualization tools are used to provide insights into model performance and key predictors.
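As a sketch, aggregating per-run results with pandas might look like the following; the column names and numbers are invented for illustration:

```python
# Aggregate evaluation results to rank model configurations by mean MAE.
import pandas as pd

results = pd.DataFrame({
    "model": ["LR", "LR", "RF", "RF", "XGB", "XGB"],
    "lag":   [1, 3, 1, 3, 1, 3],
    "mae":   [8.2, 7.9, 5.1, 4.6, 4.9, 4.4],
})

# Mean MAE per model, sorted best-first (lower is better)
summary = results.groupby("model")["mae"].mean().sort_values()

# Single best-performing configuration across all runs
best = results.loc[results["mae"].idxmin()]
print(summary)
print("best configuration:", best["model"], "with lag", best["lag"])
```

Aggregating before ranking matters: a single lucky run can look strong, while the per-model mean shows which configuration is consistently best.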
Keypoints
Data is normalized to 0-100 scale and split 60/40 for training/testing
Four ML models are available: Linear Regression, Random Forest, MLP, XGBoost
Lagged features capture temporal dependencies between indices
Models are evaluated using correlation coefficients and mean absolute error (MAE)