Parallelization

Objectives

  • Understand inner-loop vs function-level parallelization strategies

  • Use MultiModalMinimizerParallel for multi-core execution

  • Know the expected speedups on NAIC VMs

Overview

This document describes the parallelization implementation added to the Multi-Modal Optimization (MMO) project to utilize multi-core CPUs efficiently on NAIC VMs.

Implementation Strategy

Inner-Loop Parallelization (Selected Approach)

We implemented inner-loop parallelization within the Seed-Solve-Collect (SSC) algorithm:

  • Sequential outer loop: Benchmark functions are processed one at a time

  • Parallel inner loop: Within each function, seed processing is parallelized using joblib

  • Independent seed solves: Each seed’s local CMA-ES solve runs independently on separate CPU cores
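The strategy can be sketched as follows. This is an illustrative outline, not the project's actual code: the stdlib ThreadPoolExecutor stands in for joblib's process-based 'loky' backend to keep the sketch dependency-free, and solve_seed is a hypothetical placeholder for a local CMA-ES solve.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_seed(seed):
    # hypothetical stand-in for an independent local CMA-ES solve
    x = seed
    for _ in range(50):
        x *= 0.5          # toy refinement toward the optimum at 0
    return x

def run_benchmark(functions, n_workers=4):
    results = {}
    for name, seeds in functions.items():              # sequential outer loop
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # parallel inner loop: each seed is solved independently
            results[name] = list(pool.map(solve_seed, seeds))
    return results
```

Because the seed solves share no state, the inner loop needs no coordination beyond collecting results.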

Files Modified/Created

  1. mmo/ssc_parallel.py (NEW)

    • Parallelized version of Seed-Solve-Collect loop

    • Uses joblib.Parallel with the 'loky' backend for process-based parallelism

    • Processes multiple seeds simultaneously across CPU cores

  2. mmo/minimize_parallel.py (NEW)

    • Parallelized version of MultiModalMinimizer

    • Adds n_jobs parameter (default: -1 = all cores)

    • Drop-in replacement for sequential MultiModalMinimizer

  3. demonstrator.ipynb (MODIFIED)

    • Updated imports to include parallel versions

    • Modified run_shga_for_function to use MultiModalMinimizerParallel

    • Added parallelization explanation markdown cell

    • Changed benchmark loop to sequential outer / parallel inner

Performance Results

Test Configuration

  • System: NAIC VM with 16 CPU cores, NVIDIA L40S GPU

  • Test functions: F4 (Himmelblau), F5 (Six-Hump Camel), F10 (Modified Rastrigin)

  • Max iterations: 10 per function (for quick testing)

Comparison: Sequential vs Parallel (Inner-Loop)

Function                    Sequential   Parallel   Speedup
F4 (Himmelblau)                  30.2s      15.5s     1.95x
F5 (Six-Hump Camel)              13.6s      11.2s     1.21x
F10 (Modified Rastrigin)        195.6s      48.0s     4.08x
Total                           239.4s      74.7s     3.20x

Key findings:

  • Overall speedup: 3.20x on 16 cores

  • Efficiency: 20.0% of the ideal 16x speedup (limited by the serial GA phase and per-process overhead)

  • Best speedup on expensive functions (F10: 4.08x)

  • All tests achieved 100% peak ratio (correctness maintained)
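The headline numbers follow directly from the timings in the table:

```python
cores = 16
seq = {"F4": 30.2, "F5": 13.6, "F10": 195.6}   # sequential wall times (s)
par = {"F4": 15.5, "F5": 11.2, "F10": 48.0}    # parallel wall times (s)

total_seq = sum(seq.values())        # 239.4 s
total_par = sum(par.values())        # 74.7 s
speedup = total_seq / total_par      # ~3.20x
efficiency = speedup / cores         # ~0.20 -> 20.0%
```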

Why Inner-Loop Beats Function-Level Parallelization

We previously tried parallelizing across functions (running F4, F5, F10 simultaneously), achieving only 1.27x speedup due to workload imbalance:

  • F10 took 205s while F4/F5 took ~28s each

  • Most cores idle while F10 runs sequentially

  • Efficiency: only 7.9%
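The imbalance penalty is easy to quantify: with one function per core, the wall time is dominated by the slowest function. Using the approximate per-function times quoted above:

```python
cores = 16
times = {"F10": 205.0, "F4": 28.0, "F5": 28.0}  # approx. per-function times (s)

sequential = sum(times.values())        # ~261 s if run one after another
function_level = max(times.values())    # 205 s: everything waits for F10
speedup = sequential / function_level   # ~1.27x
efficiency = speedup / cores            # ~0.079 -> ~7.9%
```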

Inner-loop parallelization solves this:

  • All 16 cores work on F10’s seeds simultaneously

  • 4.08x speedup on F10 specifically

  • No idle cores waiting for expensive functions

Technical Details

joblib Configuration

Parallel(n_jobs=-1, backend='loky')(
    delayed(process_seed_parallel)(...) for seed in seeds
)
  • n_jobs=-1: Use all available CPU cores

  • backend='loky': Process-based parallelism (avoids GIL, proper isolation)

  • Each worker process gets a copy of localsolver and function evaluator
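A minimal, self-contained version of this call pattern (solve_seed here is a toy placeholder for the real per-seed solver):

```python
from joblib import Parallel, delayed

def solve_seed(seed):
    # toy placeholder: each call runs in its own worker process
    return seed * seed

seeds = [1, 2, 3, 4]
results = Parallel(n_jobs=2, backend='loky')(
    delayed(solve_seed)(s) for s in seeds
)
# results come back in the original seed order: [1, 4, 9, 16]
```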

Thread Safety Considerations

  1. Function evaluation counting: Each worker maintains its own n_fev counter; counts are aggregated in main process after parallel execution

  2. Solution collection: Solutions are added sequentially after parallel seed processing completes (avoids race conditions)

  3. Budget tracking: Each worker checks budget independently; main process handles final budget enforcement

  4. State isolation: the 'loky' backend creates separate processes, so no state is shared between workers
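Points 1 and 2 amount to a map-then-reduce pattern: each worker returns its solutions and its own evaluation count, and the main process merges them after the parallel section. A minimal sketch with made-up worker outputs:

```python
# pretend these came back from the parallel workers, one tuple per seed:
# (list of solutions found, number of function evaluations used)
worker_results = [
    ([(3.0, 2.0)],  120),
    ([],             95),
    ([(-2.8, 3.1)], 140),
]

solutions = []
total_n_fev = 0
for sols, n_fev in worker_results:   # sequential merge in the main process
    solutions.extend(sols)           # single writer: no race conditions
    total_n_fev += n_fev             # aggregate per-worker counters
```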

Usage

In Python Scripts

from mmo.minimize_parallel import MultiModalMinimizerParallel
from mmo.domain import Domain
from cec2013.cec2013 import CEC2013

f = CEC2013(4)  # Himmelblau
dim = f.get_dimension()
lb = [f.get_lbound(k) for k in range(dim)]
ub = [f.get_ubound(k) for k in range(dim)]
domain = Domain(boundary=[lb, ub])

# Use parallel version with all cores
for result in MultiModalMinimizerParallel(
    f=f, domain=domain, budget=50000,
    max_iter=50, verbose=1, n_jobs=-1  # n_jobs=-1 uses all cores
):
    print(f"Found {result.n_sol} solutions")

In Jupyter Notebook

The demonstrator notebook automatically uses the parallel version; run the cells normally and parallelization happens transparently.

Testing

Quick Test (3 functions)

cd /home/ubuntu/wp7-UC6-multimodal-optimization
source activate-mmo.sh
python test_parallel_comparison.py

Runs F4, F5, F10 with both sequential and parallel versions for comparison (~5 minutes).

Full Benchmark (F4-F14)

python test_full_benchmark_parallel.py

Runs all 2D+ CEC2013 functions (F4-F14) with parallel SHGA (~15-20 minutes).

Expected Performance

On a 16-core NAIC VM:

  • Simple 2D functions (F4, F5): 1.2-2.0x speedup

  • Complex functions (F10, F6, F7): 3.0-4.5x speedup

  • Overall benchmark: 3.0-3.5x speedup

  • Efficiency: 18-25% (accounting for serial portions and overhead)

Limitations

  1. Not GPU-accelerated: Current implementation uses only CPU cores; GPU is not utilized

  2. Serial GA phase: Initial population seeding with Genetic Algorithm is still sequential

  3. Diminishing returns: Speedup plateaus beyond ~16-32 cores due to Amdahl’s Law

  4. Memory overhead: Each worker process duplicates some state (acceptable for this workload)
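The measured numbers are consistent with Amdahl's Law, S(n) = 1 / ((1 - p) + p/n). Inverting it for the parallel fraction p from the observed 3.20x at 16 cores gives p of roughly 0.73, which caps the achievable speedup near 3.75x no matter how many cores are added:

```python
n = 16
S = 3.20                        # observed overall speedup

# invert Amdahl's Law: S = 1 / ((1 - p) + p / n)  ->  solve for p
p = (1 - 1 / S) / (1 - 1 / n)   # ~0.733 parallel fraction
max_speedup = 1 / (1 - p)       # ~3.75x ceiling as n -> infinity
```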

Future Improvements

Potential areas for further optimization:

  1. GPU-accelerated function evaluations for expensive benchmarks

  2. Parallelize the GA population evolution phase

  3. Adaptive n_jobs based on seed count (use fewer cores when seeds are scarce)

  4. Shared memory backend for reduced memory footprint on large problems
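Point 3 could be as simple as capping the worker count by the number of seeds, so no processes are spawned only to sit idle. A sketch, not project code (pick_n_jobs is a hypothetical helper):

```python
import os

def pick_n_jobs(n_seeds, n_jobs=-1):
    # never start more workers than there are seeds to solve
    available = os.cpu_count() or 1
    requested = available if n_jobs == -1 else n_jobs
    return max(1, min(requested, n_seeds))
```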

References

  • Original SHGA paper: Johannsen et al. (2022), “A scalable, hybrid genetic algorithm for continuous multimodal optimization in moderate dimensions”

  • joblib documentation: https://joblib.readthedocs.io/

  • CEC2013 benchmark suite: https://github.com/mikeagn/CEC2013

Keypoints

  • Inner-loop parallelization (parallel seed solves) outperforms function-level parallelization

  • Use MultiModalMinimizerParallel as a drop-in replacement with n_jobs=-1

  • Expected 3-4x speedup on 16-core NAIC VMs for complex functions

  • Correctness is preserved: all tests achieve the same peak ratios

  • The GA phase remains sequential; only CMA-ES seed solves are parallelized