Parallelization
Objectives
Understand inner-loop vs function-level parallelization strategies
Use MultiModalMinimizerParallel for multi-core execution
Know the expected speedups on NAIC VMs
Overview
This document describes the parallelization implementation added to the Multi-Modal Optimization (MMO) project to utilize multi-core CPUs efficiently on NAIC VMs.
Implementation Strategy
Inner-Loop Parallelization (Selected Approach)
We implemented inner-loop parallelization within the Seed-Solve-Collect (SSC) algorithm:
Sequential outer loop: Benchmark functions are processed one at a time
Parallel inner loop: Within each function, seed processing is parallelized using joblib
Independent seed solves: Each seed’s local CMA-ES solve runs independently on separate CPU cores
Files Modified/Created
mmo/ssc_parallel.py (NEW)
Parallelized version of Seed-Solve-Collect loop
Uses joblib.Parallel with ‘loky’ backend for process-based parallelism
Processes multiple seeds simultaneously across CPU cores
mmo/minimize_parallel.py (NEW)
Parallelized version of MultiModalMinimizer
Adds n_jobs parameter (default: -1 = all cores)
Drop-in replacement for sequential MultiModalMinimizer
demonstrator.ipynb (MODIFIED)
Updated imports to include parallel versions
Modified run_shga_for_function to use MultiModalMinimizerParallel
Added parallelization explanation markdown cell
Changed benchmark loop to sequential outer / parallel inner
Performance Results
Test Configuration
System: NAIC VM with 16 CPU cores, NVIDIA L40S GPU
Test functions: F4 (Himmelblau), F5 (Six-Hump Camel), F10 (Modified Rastrigin)
Max iterations: 10 per function (for quick testing)
Comparison: Sequential vs Parallel (Inner-Loop)
| Function | Sequential | Parallel | Speedup |
|---|---|---|---|
| F4 (Himmelblau) | 30.2s | 15.5s | 1.95x |
| F5 (Six-Hump Camel) | 13.6s | 11.2s | 1.21x |
| F10 (Modified Rastrigin) | 195.6s | 48.0s | 4.08x |
| Total | 239.4s | 74.7s | 3.20x |
Key findings:
Overall speedup: 3.20x on 16 cores
Efficiency: 20.0% (3.20x speedup / 16 cores; the remainder is lost to serial phases and worker process overhead)
Best speedup on expensive functions (F10: 4.08x)
All tests achieved 100% peak ratio (correctness maintained)
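The speedup and efficiency figures above follow directly from the table. A minimal sketch of that arithmetic, using only the timings reported in this document (the variable and function names are illustrative, not part of the project):

```python
# Timings in seconds, copied from the comparison table: (sequential, parallel).
timings = {
    "F4": (30.2, 15.5),
    "F5": (13.6, 11.2),
    "F10": (195.6, 48.0),
}
n_cores = 16


def speedup(sequential_s, parallel_s):
    """Speedup is sequential wall time divided by parallel wall time."""
    return sequential_s / parallel_s


per_function = {name: speedup(seq, par) for name, (seq, par) in timings.items()}

# The overall speedup uses summed wall times, not the mean of per-function speedups.
total_seq = sum(seq for seq, _ in timings.values())
total_par = sum(par for _, par in timings.values())
overall = speedup(total_seq, total_par)

# Parallel efficiency: fraction of the ideal n_cores-fold speedup actually achieved.
efficiency = overall / n_cores

print(f"overall speedup {overall:.2f}x, efficiency {efficiency:.1%}")
```

Because the total is dominated by F10, the overall speedup sits much closer to F10's 4.08x than to F5's 1.21x.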
Why Inner-Loop Beats Function-Level Parallelization
We previously tried parallelizing across functions (running F4, F5, F10 simultaneously), achieving only 1.27x speedup due to workload imbalance:
F10 took 205s while F4/F5 took ~28s each
Most cores sit idle while F10 runs sequentially on a single core
Efficiency: only 7.9%
Inner-loop parallelization solves this:
All 16 cores work on F10’s seeds simultaneously
4.08x speedup on F10 specifically
No idle cores waiting for expensive functions
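The workload imbalance can be checked with a simple model: if each function runs on its own core, the wall time equals the slowest function, so extra cores cannot help. The sketch below assumes that model and uses the approximate run times quoted above; the function name is illustrative.

```python
def function_level_speedup(times_s):
    """Speedup when every function gets its own core.

    Sequential time is the sum of all run times; parallel wall time
    is bounded below by the slowest single function.
    """
    return sum(times_s) / max(times_s)


# Approximate run times quoted above: F10 about 205 s, F4 and F5 about 28 s each.
times_s = [205.0, 28.0, 28.0]
s = function_level_speedup(times_s)
print(f"function-level speedup is about {s:.2f}x")
```

Under this model the result is about 1.27x regardless of core count, which matches the measured figure. Splitting F10's own seeds across cores is the only way to shorten the longest task.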
Technical Details
joblib Configuration
Parallel(n_jobs=-1, backend='loky')(
    delayed(process_seed_parallel)(...) for seed in seeds
)
n_jobs=-1: Use all available CPU cores
backend='loky': Process-based parallelism (avoids GIL, proper isolation)
Each worker process gets a copy of localsolver and function evaluator
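For readers who want to try the pattern in isolation, here is a minimal runnable sketch. The `solve_seed` function is a hypothetical stand-in for `process_seed_parallel`, whose real arguments are not shown in this document; it runs a toy gradient descent on f(x) = x² rather than CMA-ES.

```python
from joblib import Parallel, delayed


def solve_seed(seed):
    """Hypothetical stand-in for one local solve; returns (seed, f at the final point)."""
    x = float(seed)
    for _ in range(100):
        x -= 0.1 * 2.0 * x  # gradient descent step on f(x) = x**2
    return seed, x * x


seeds = [-3.0, -1.0, 2.0, 4.0]

# n_jobs=2 keeps the example light; n_jobs=-1 would use all available cores.
results = Parallel(n_jobs=2, backend="loky")(
    delayed(solve_seed)(seed) for seed in seeds
)

# joblib returns results in the same order as the inputs,
# so each result can be matched back to its seed.
for seed, value in results:
    print(seed, value)
```

Each call to `solve_seed` depends only on its own argument, which is the property that lets the seed solves run in separate processes without coordination.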
Thread Safety Considerations
Function evaluation counting: Each worker maintains its own n_fev counter; counts are aggregated in main process after parallel execution
Solution collection: Solutions are added sequentially after parallel seed processing completes (avoids race conditions)
Budget tracking: Each worker checks budget independently; main process handles final budget enforcement
State isolation: ‘loky’ backend creates separate processes, so no shared state between workers
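The counting and collection pattern described above might look roughly like the following sketch. Everything here is an assumption made for illustration: the `SeedResult` type, the function names, and the exact budget rule are not taken from the project code, and a plain list comprehension stands in for the parallel dispatch.

```python
from dataclasses import dataclass


@dataclass
class SeedResult:
    """Hypothetical per-seed return value; field names are illustrative."""
    solution: float
    n_fev: int


def process_seed(seed, max_fev_per_seed):
    """Stand-in worker: tracks its own evaluation count in a local variable."""
    n_fev = 0
    x = seed
    while n_fev < max_fev_per_seed and abs(x) > 1e-3:
        x *= 0.5    # pretend this is one solver step ...
        n_fev += 1  # ... that costs one function evaluation
    return SeedResult(solution=x, n_fev=n_fev)


def collect(results, budget):
    """Main-process step: sum the counters and add solutions one at a time."""
    solutions, total_fev = [], 0
    for r in results:
        if total_fev + r.n_fev > budget:
            break  # final budget enforcement happens here, not in the workers
        total_fev += r.n_fev
        solutions.append(r.solution)
    return solutions, total_fev


# A list comprehension stands in for the parallel seed processing.
results = [process_seed(s, max_fev_per_seed=20) for s in [1.0, 8.0, 64.0]]
solutions, total_fev = collect(results, budget=25)
print(len(solutions), total_fev)
```

Since workers only return values and never write to shared structures, the merge in the main process is the single place where global state changes.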
Usage
In Python Scripts
from mmo.minimize_parallel import MultiModalMinimizerParallel
from mmo.domain import Domain
from cec2013.cec2013 import CEC2013
f = CEC2013(4) # Himmelblau
dim = f.get_dimension()
lb = [f.get_lbound(k) for k in range(dim)]
ub = [f.get_ubound(k) for k in range(dim)]
domain = Domain(boundary=[lb, ub])
# Use parallel version with all cores
for result in MultiModalMinimizerParallel(
    f=f, domain=domain, budget=50000,
    max_iter=50, verbose=1, n_jobs=-1  # n_jobs=-1 uses all cores
):
    print(f"Found {result.n_sol} solutions")
In Jupyter Notebook
The demonstrator notebook automatically uses the parallel version. Run the cells as usual; parallelization happens transparently.
Testing
Quick Test (3 functions)
cd /home/ubuntu/wp7-UC6-multimodal-optimization
source activate-mmo.sh
python test_parallel_comparison.py
Runs F4, F5, F10 with both sequential and parallel versions for comparison (~5 minutes).
Full Benchmark (F4-F14)
python test_full_benchmark_parallel.py
Runs all 2D+ CEC2013 functions (F4-F14) with parallel SHGA (~15-20 minutes).
Expected Performance
On a 16-core NAIC VM:
Simple 2D functions (F4, F5): 1.2-2.0x speedup
Complex functions (F10, F6, F7): 3.0-4.5x speedup
Overall benchmark: 3.0-3.5x speedup
Efficiency: 18-25% (accounting for serial portions and overhead)
Limitations
Not GPU-accelerated: Current implementation uses only CPU cores; GPU is not utilized
Serial GA phase: Initial population seeding with Genetic Algorithm is still sequential
Diminishing returns: Speedup plateaus beyond ~16-32 cores due to Amdahl’s Law
Memory overhead: Each worker process duplicates some state (acceptable for this workload)
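The diminishing-returns point can be made concrete with Amdahl's Law, S(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction of the run time. The sketch below is a rough estimate, not a measurement: it assumes the observed 3.20x on 16 cores is explained entirely by a fixed serial fraction, folding process overhead into that fraction.

```python
def amdahl_speedup(p, n_cores):
    """Amdahl's Law: speedup for parallel fraction p on n_cores."""
    return 1.0 / ((1.0 - p) + p / n_cores)


def parallel_fraction(speedup, n_cores):
    """Invert Amdahl's Law to estimate p from one measured speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores)


# Measured overall result from this document: 3.20x on 16 cores.
p = parallel_fraction(3.20, 16)
print(f"estimated parallel fraction: {p:.3f}")

for n in (16, 32, 64):
    print(f"{n} cores: {amdahl_speedup(p, n):.2f}x")

# With unlimited cores the speedup approaches 1 / (1 - p).
print(f"upper bound: {1.0 / (1.0 - p):.2f}x")
```

Under these assumptions the parallel fraction is about 0.73, doubling to 32 cores raises the speedup only to about 3.45x, and the ceiling is 3.75x. Reducing the serial portion, such as the sequential GA phase, would therefore matter more than adding cores.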
Future Improvements
Potential areas for further optimization:
GPU-accelerated function evaluations for expensive benchmarks
Parallelize the GA population evolution phase
Adaptive n_jobs based on seed count (use fewer cores when seeds are scarce)
Shared memory backend for reduced memory footprint on large problems
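As one possible shape for the adaptive n_jobs idea, a helper could cap the worker count at the number of seeds. This is a hypothetical sketch of a proposed improvement, not existing project code; `adaptive_n_jobs` and its parameters are invented for illustration.

```python
import os


def adaptive_n_jobs(n_seeds, max_cores=None):
    """Hypothetical helper: never start more workers than there are seeds.

    max_cores defaults to the machine's core count; the result is always
    at least 1 so that it remains a valid positive worker count.
    """
    available = max_cores if max_cores is not None else (os.cpu_count() or 1)
    return max(1, min(n_seeds, available))


print(adaptive_n_jobs(3, max_cores=16))   # 3: only three seeds to solve
print(adaptive_n_jobs(40, max_cores=16))  # 16: capped by the core count
print(adaptive_n_jobs(0, max_cores=16))   # 1: fall back to a single worker
```

The returned value could then be passed as n_jobs in place of -1, which would avoid starting worker processes that receive no seeds.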
References
Original SHGA paper: Johannsen et al. (2022) - “A scalable, hybrid genetic algorithm for continuous multimodal optimization in moderate dimensions”
joblib documentation: https://joblib.readthedocs.io/
CEC2013 benchmark suite: https://github.com/mikeagn/CEC2013
Keypoints
Inner-loop parallelization (parallel seed solves) outperforms function-level parallelization
Use MultiModalMinimizerParallel as a drop-in replacement with n_jobs=-1
Expected 3-4x speedup on 16-core NAIC VMs for complex functions
Correctness is preserved: all tests achieve the same peak ratios
The GA phase remains sequential; only CMA-ES seed solves are parallelized