Parallelization

Objectives

  • Understand inner-loop vs function-level parallelization strategies

  • Use MultiModalMinimizerParallel for multi-core execution

  • Know the expected speedups on NAIC VMs

Overview

This document describes the parallelization implementation added to the Multi-Modal Optimization (MMO) project to utilize multi-core CPUs efficiently on NAIC VMs.

Implementation Strategy

Inner-Loop Parallelization (Selected Approach)

We implemented inner-loop parallelization within the Seed-Solve-Collect (SSC) algorithm:

  • Sequential outer loop: Benchmark functions are processed one at a time

  • Parallel inner loop: Within each function, seed processing is parallelized using joblib

  • Independent seed solves: Each seed’s local CMA-ES solve runs independently on separate CPU cores
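The strategy can be sketched as follows. This is an illustrative outline, not the project's actual code: the stdlib ThreadPoolExecutor stands in for joblib's process-based 'loky' backend to keep the sketch dependency-free, and solve_seed is a hypothetical placeholder for a local CMA-ES solve.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_seed(seed):
    # hypothetical stand-in for an independent local CMA-ES solve
    x = seed
    for _ in range(50):
        x *= 0.5          # toy refinement toward the optimum at 0
    return x

def run_benchmark(functions, n_workers=4):
    results = {}
    for name, seeds in functions.items():              # sequential outer loop
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # parallel inner loop: each seed is solved independently
            results[name] = list(pool.map(solve_seed, seeds))
    return results
```

Because the seed solves share no state, the inner loop needs no coordination beyond collecting results.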

Files Modified/Created

  1. mmo/ssc_parallel.py (NEW)

    • Parallelized version of Seed-Solve-Collect loop

    • Uses joblib.Parallel with the 'loky' backend for process-based parallelism

    • Processes multiple seeds simultaneously across CPU cores

  2. mmo/minimize_parallel.py (NEW)

    • Parallelized version of MultiModalMinimizer

    • Adds n_jobs parameter (default: -1 = all cores)

    • Drop-in replacement for sequential MultiModalMinimizer

  3. demonstrator.ipynb (MODIFIED)

    • Updated imports to include parallel versions

    • Modified run_shga_for_function to use MultiModalMinimizerParallel

    • Added parallelization explanation markdown cell

    • Changed benchmark loop to sequential outer / parallel inner

Performance Results

Test Configuration

  • System: NAIC VM with 16 CPU cores, NVIDIA L40S GPU

  • Test functions: F4 (Himmelblau), F5 (Six-Hump Camel), F10 (Modified Rastrigin)

  • Max iterations: 10 per function (for quick testing)

Comparison: Sequential vs Parallel (Inner-Loop)

Function                    Sequential   Parallel   Speedup
F4 (Himmelblau)                  30.2s      15.5s     1.95x
F5 (Six-Hump Camel)              13.6s      11.2s     1.21x
F10 (Modified Rastrigin)        195.6s      48.0s     4.08x
Total                           239.4s      74.7s     3.20x

Key findings:

  • Overall speedup: 3.20x on 16 cores

  • Efficiency: 20.0% of the ideal 16x speedup (limited by the serial GA phase and per-process overhead)

  • Best speedup on expensive functions (F10: 4.08x)

  • All tests achieved 100% peak ratio (correctness maintained)
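The headline numbers follow directly from the timings in the table:

```python
cores = 16
seq = {"F4": 30.2, "F5": 13.6, "F10": 195.6}   # sequential wall times (s)
par = {"F4": 15.5, "F5": 11.2, "F10": 48.0}    # parallel wall times (s)

total_seq = sum(seq.values())        # 239.4 s
total_par = sum(par.values())        # 74.7 s
speedup = total_seq / total_par      # ~3.20x
efficiency = speedup / cores         # ~0.20 -> 20.0%
```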

Why Inner-Loop Beats Function-Level Parallelization

We previously tried parallelizing across functions (running F4, F5, F10 simultaneously), achieving only 1.27x speedup due to workload imbalance:

  • F10 took 205s while F4/F5 took ~28s each

  • Most cores idle while F10 runs sequentially

  • Efficiency: only 7.9%
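The imbalance penalty is easy to quantify: with one function per core, the wall time is dominated by the slowest function. Using the approximate per-function times quoted above:

```python
cores = 16
times = {"F10": 205.0, "F4": 28.0, "F5": 28.0}  # approx. per-function times (s)

sequential = sum(times.values())        # ~261 s if run one after another
function_level = max(times.values())    # 205 s: everything waits for F10
speedup = sequential / function_level   # ~1.27x
efficiency = speedup / cores            # ~0.079 -> ~7.9%
```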

Inner-loop parallelization solves this:

  • All 16 cores work on F10’s seeds simultaneously

  • 4.08x speedup on F10 specifically

  • No idle cores waiting for expensive functions

Technical Details

joblib Configuration

Parallel(n_jobs=-1, backend='loky')(
    delayed(process_seed_parallel)(...) for seed in seeds
)
  • n_jobs=-1: Use all available CPU cores

  • backend='loky': Process-based parallelism (avoids GIL, proper isolation)

  • Each worker process gets a copy of localsolver and function evaluator
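A minimal, self-contained version of this call pattern (solve_seed here is a toy placeholder for the real per-seed solver):

```python
from joblib import Parallel, delayed

def solve_seed(seed):
    # toy placeholder: each call runs in its own worker process
    return seed * seed

seeds = [1, 2, 3, 4]
results = Parallel(n_jobs=2, backend='loky')(
    delayed(solve_seed)(s) for s in seeds
)
# results come back in the original seed order: [1, 4, 9, 16]
```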

Thread Safety Considerations

  1. Function evaluation counting: Each worker maintains its own n_fev counter; counts are aggregated in main process after parallel execution

  2. Solution collection: Solutions are added sequentially after parallel seed processing completes (avoids race conditions)

  3. Budget tracking: Each worker checks budget independently; main process handles final budget enforcement

  4. State isolation: the 'loky' backend creates separate processes, so no state is shared between workers
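Points 1 and 2 amount to a map-then-reduce pattern: each worker returns its solutions and its own evaluation count, and the main process merges them after the parallel section. A minimal sketch with made-up worker outputs:

```python
# pretend these came back from the parallel workers, one tuple per seed:
# (list of solutions found, number of function evaluations used)
worker_results = [
    ([(3.0, 2.0)],  120),
    ([],             95),
    ([(-2.8, 3.1)], 140),
]

solutions = []
total_n_fev = 0
for sols, n_fev in worker_results:   # sequential merge in the main process
    solutions.extend(sols)           # single writer: no race conditions
    total_n_fev += n_fev             # aggregate per-worker counters
```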

Usage

In Python Scripts

from mmo.minimize_parallel import MultiModalMinimizerParallel
from mmo.domain import Domain
from cec2013.cec2013 import CEC2013

f = CEC2013(4)  # Himmelblau
dim = f.get_dimension()
lb = [f.get_lbound(k) for k in range(dim)]
ub = [f.get_ubound(k) for k in range(dim)]
domain = Domain(boundary=[lb, ub])

# Use parallel version with all cores
for result in MultiModalMinimizerParallel(
    f=f, domain=domain, budget=50000,
    max_iter=50, verbose=1, n_jobs=-1  # n_jobs=-1 uses all cores
):
    print(f"Found {result.n_sol} solutions")

In Jupyter Notebook

The demonstrator notebook automatically uses the parallel version; run the cells normally and parallelization happens transparently.

Testing

Quick Test (3 functions)

cd /home/ubuntu/wp7-UC6-multimodal-optimization
source activate-mmo.sh
python test_parallel_comparison.py

Runs F4, F5, F10 with both sequential and parallel versions for comparison (~5 minutes).

Full Benchmark (F4-F14)

python test_full_benchmark_parallel.py

Runs all 2D+ CEC2013 functions (F4-F14) with parallel SHGA (~15-20 minutes).

Expected Performance

On a 16-core NAIC VM:

  • Simple 2D functions (F4, F5): 1.2-2.0x speedup

  • Complex functions (F10, F6, F7): 3.0-4.5x speedup

  • Overall benchmark: 3.0-3.5x speedup

  • Efficiency: 18-25% (accounting for serial portions and overhead)

Limitations

  1. Not GPU-accelerated: Current implementation uses only CPU cores; GPU is not utilized

  2. Serial GA phase: Initial population seeding with Genetic Algorithm is still sequential

  3. Diminishing returns: Speedup plateaus beyond ~16-32 cores due to Amdahl’s Law

  4. Memory overhead: Each worker process duplicates some state (acceptable for this workload)
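The measured numbers are consistent with Amdahl's Law, S(n) = 1 / ((1 - p) + p/n). Inverting it for the parallel fraction p from the observed 3.20x at 16 cores gives p of roughly 0.73, which caps the achievable speedup near 3.75x no matter how many cores are added:

```python
n = 16
S = 3.20                        # observed overall speedup

# invert Amdahl's Law: S = 1 / ((1 - p) + p / n)  ->  solve for p
p = (1 - 1 / S) / (1 - 1 / n)   # ~0.733 parallel fraction
max_speedup = 1 / (1 - p)       # ~3.75x ceiling as n -> infinity
```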

Future Improvements

Potential areas for further optimization:

  1. GPU-accelerated function evaluations for expensive benchmarks

  2. Parallelize the GA population evolution phase

  3. Adaptive n_jobs based on seed count (use fewer cores when seeds are scarce)

  4. Shared memory backend for reduced memory footprint on large problems
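Point 3 could be as simple as capping the worker count by the number of seeds, so no processes are spawned only to sit idle. A sketch, not project code (pick_n_jobs is a hypothetical helper):

```python
import os

def pick_n_jobs(n_seeds, n_jobs=-1):
    # never start more workers than there are seeds to solve
    available = os.cpu_count() or 1
    requested = available if n_jobs == -1 else n_jobs
    return max(1, min(requested, n_seeds))
```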

References

  • Original SHGA paper: Johannsen et al. (2022), “A scalable, hybrid genetic algorithm for continuous multimodal optimization in moderate dimensions”

  • joblib documentation: https://joblib.readthedocs.io/

  • CEC2013 benchmark suite: https://github.com/mikeagn/CEC2013

Keypoints

  • Inner-loop parallelization (parallel seed solves) outperforms function-level parallelization

  • Use MultiModalMinimizerParallel as a drop-in replacement with n_jobs=-1

  • Expected 3-4x speedup on 16-core NAIC VMs for complex functions

  • Correctness is preserved: all tests achieve the same peak ratios

  • The GA phase remains sequential; only CMA-ES seed solves are parallelized