# Parallelization

```{objectives}
- Understand inner-loop vs function-level parallelization strategies
- Use MultiModalMinimizerParallel for multi-core execution
- Know the expected speedups on NAIC VMs
```

## Overview

This document describes the parallelization implementation added to the Multi-Modal Optimization (MMO) project to utilize multi-core CPUs efficiently on NAIC VMs.

## Implementation Strategy

### Inner-Loop Parallelization (Selected Approach)

We implemented **inner-loop parallelization** within the Seed-Solve-Collect (SSC) algorithm:

- **Sequential outer loop**: Benchmark functions are processed one at a time
- **Parallel inner loop**: Within each function, seed processing is parallelized using `joblib`
- **Independent seed solves**: Each seed's local CMA-ES solve runs independently on separate CPU cores

### Files Modified/Created

1. **mmo/ssc_parallel.py** (NEW)
   - Parallelized version of Seed-Solve-Collect loop
   - Uses `joblib.Parallel` with 'loky' backend for process-based parallelism
   - Processes multiple seeds simultaneously across CPU cores

2. **mmo/minimize_parallel.py** (NEW)
   - Parallelized version of MultiModalMinimizer
   - Adds `n_jobs` parameter (default: -1 = all cores)
   - Drop-in replacement for sequential MultiModalMinimizer

3. **demonstrator.ipynb** (MODIFIED)
   - Updated imports to include parallel versions
   - Modified `run_shga_for_function` to use MultiModalMinimizerParallel
   - Added parallelization explanation markdown cell
   - Changed benchmark loop to sequential outer / parallel inner

## Performance Results

### Test Configuration
- **System**: NAIC VM with 16 CPU cores, NVIDIA L40S GPU
- **Test functions**: F4 (Himmelblau), F5 (Six-Hump Camel), F10 (Modified Rastrigin)
- **Max iterations**: 10 per function (for quick testing)

### Comparison: Sequential vs Parallel (Inner-Loop)

| Function | Sequential | Parallel | Speedup |
|----------|------------|----------|---------|
| F4 (Himmelblau) | 30.2s | 15.5s | 1.95x |
| F5 (Six-Hump Camel) | 13.6s | 11.2s | 1.21x |
| F10 (Modified Rastrigin) | 195.6s | 48.0s | **4.08x** |
| **Total** | **239.4s** | **74.7s** | **3.20x** |

**Key findings**:
- Overall speedup: **3.20x** on 16 cores
- Efficiency: **20.0%** (good for embarrassingly parallel workload with overhead)
- Best speedup on expensive functions (F10: 4.08x)
- All tests achieved **100% peak ratio** (correctness maintained)

### Why Inner-Loop Beats Function-Level Parallelization

We previously tried parallelizing across functions (running F4, F5, F10 simultaneously), achieving only **1.27x speedup** due to workload imbalance:
- F10 took 205s while F4/F5 took ~28s each
- Most cores idle while F10 runs sequentially
- Efficiency: only 7.9%

Inner-loop parallelization solves this:
- All 16 cores work on F10's seeds simultaneously
- 4.08x speedup on F10 specifically
- No idle cores waiting for expensive functions

## Technical Details

### joblib Configuration
```python
Parallel(n_jobs=-1, backend='loky')(
    delayed(process_seed_parallel)(...) for seed in seeds
)
```

- `n_jobs=-1`: Use all available CPU cores
- `backend='loky'`: Process-based parallelism (avoids GIL, proper isolation)
- Each worker process gets a copy of localsolver and function evaluator

### Thread Safety Considerations

1. **Function evaluation counting**: Each worker maintains its own `n_fev` counter; counts are aggregated in main process after parallel execution

2. **Solution collection**: Solutions are added sequentially after parallel seed processing completes (avoids race conditions)

3. **Budget tracking**: Each worker checks budget independently; main process handles final budget enforcement

4. **State isolation**: 'loky' backend creates separate processes, so no shared state between workers

## Usage

### In Python Scripts
```python
from mmo.minimize_parallel import MultiModalMinimizerParallel
from mmo.domain import Domain
from cec2013.cec2013 import CEC2013

f = CEC2013(4)  # Himmelblau
dim = f.get_dimension()
lb = [f.get_lbound(k) for k in range(dim)]
ub = [f.get_ubound(k) for k in range(dim)]
domain = Domain(boundary=[lb, ub])

# Use parallel version with all cores
for result in MultiModalMinimizerParallel(
    f=f, domain=domain, budget=50000,
    max_iter=50, verbose=1, n_jobs=-1  # n_jobs=-1 uses all cores
):
    print(f"Found {result.n_sol} solutions")
```

### In Jupyter Notebook
The demonstrator notebook automatically uses the parallel version. Just run cells normally - parallelization happens transparently.

## Testing

### Quick Test (3 functions)
```bash
cd /home/ubuntu/wp7-UC6-multimodal-optimization
source activate-mmo.sh
python test_parallel_comparison.py
```

Runs F4, F5, F10 with both sequential and parallel versions for comparison (~5 minutes).

### Full Benchmark (F4-F14)
```bash
python test_full_benchmark_parallel.py
```

Runs all 2D+ CEC2013 functions (F4-F14) with parallel SHGA (~15-20 minutes).

## Expected Performance

On a 16-core NAIC VM:
- **Simple 2D functions** (F4, F5): 1.2-2.0x speedup
- **Complex functions** (F10, F6, F7): 3.0-4.5x speedup
- **Overall benchmark**: 3.0-3.5x speedup
- **Efficiency**: 18-25% (accounting for serial portions and overhead)

## Limitations

1. **Not GPU-accelerated**: Current implementation uses only CPU cores; GPU is not utilized
2. **Serial GA phase**: Initial population seeding with Genetic Algorithm is still sequential
3. **Diminishing returns**: Speedup plateaus beyond ~16-32 cores due to Amdahl's Law
4. **Memory overhead**: Each worker process duplicates some state (acceptable for this workload)

## Future Improvements

Potential areas for further optimization:
1. GPU-accelerated function evaluations for expensive benchmarks
2. Parallelize the GA population evolution phase
3. Adaptive n_jobs based on seed count (use fewer cores when seeds are scarce)
4. Shared memory backend for reduced memory footprint on large problems

## References

- Original SHGA paper: Johannsen et al. (2022) - "A scalable, hybrid genetic algorithm for continuous multimodal optimization in moderate dimensions"
- joblib documentation: https://joblib.readthedocs.io/
- CEC2013 benchmark suite: https://github.com/mikeagn/CEC2013

```{keypoints}
- Inner-loop parallelization (parallel seed solves) outperforms function-level parallelization
- Use `MultiModalMinimizerParallel` as a drop-in replacement with `n_jobs=-1`
- Expected 3-4x speedup on 16-core NAIC VMs for complex functions
- Correctness is preserved: all tests achieve the same peak ratios
- The GA phase remains sequential; only CMA-ES seed solves are parallelized
```