3. Multiprocessing
Starting in v0.4.12, PyAFV provides
pyafv.ParallelFiniteVoronoiSimulator for domain-decomposed AFV
simulations using Python multiprocessing (CPU parallelism). The
simulator splits the full point set into rectangular owned domains, adds halo
points around each domain, builds a local finite Voronoi diagram for each
subdomain, and merges the owned-cell diagnostics back into the global point
ordering.
This feature is intended for large systems (\(N \gtrsim 10^4\)) where the
cost of local Voronoi builds is high enough to offset the overhead of domain
decomposition, inter-process data transfer, and duplicated halo work.
The crossover depends on hardware, point density, and the
chosen domain grid. For small systems, pyafv.FiniteVoronoiSimulator
may still be faster.
Note
See Benchmarking parallel build for a build-time benchmark comparing
pyafv.FiniteVoronoiSimulator with
pyafv.ParallelFiniteVoronoiSimulator. The benchmark shows that
multiprocessing is generally faster once the system is not too small,
especially for large systems.
3.1. Basic usage
The interface is similar to pyafv.FiniteVoronoiSimulator, but the
domain grid shape and number of worker processes are supplied when the
simulator is created:
import numpy as np
import pyafv
points = np.random.default_rng(42).random((10_000, 2)) * 100.0
phys = pyafv.PhysicalParams(r=1.0)
sim = pyafv.ParallelFiniteVoronoiSimulator(
points,
phys,
grid_shape=(4, 4),
n_workers=16,
)
diag = sim.build()
PyAFV decomposes the domain into an a-by-b grid of subdomains set by
grid_shape=(a, b). The number of subdomains is therefore \(ab\).
n_workers is the number of worker processes to use.
In practice, set n_workers to the number of CPU cores available to the
Python job, but no larger than the number of subdomains; additional workers
would remain idle anyway.
By default, pyafv.ParallelFiniteVoronoiSimulator.build() uses
connect=False. This differs from
pyafv.FiniteVoronoiSimulator.build(), where connect=True by
default. This default avoids connectivity work during runs where only forces
are needed.
Tip
Decomposing the whole system into smaller domains can also improve the
accuracy of scipy.spatial.Voronoi for large systems, since
Qhull’s floating-point tolerance scales with the system span; see
issue #38.
3.2. Repeated build steps
For repeated calls with n_workers > 1, put the time-stepping loop inside
the context manager. This creates the worker processes once and reuses them
across build steps:
dt = 0.01
n_steps = 100
with sim:
for step in range(n_steps):
diag = sim.build()
points += diag["forces"] * dt
sim.update_positions(points)
If the context manager is not used, each call to build creates and shuts
down a new process pool. That is usually slower in a loop.
Important
When using multiprocessing in a Python script, put the executable code behind the standard Python guard:
def main():
# Initialize points, phys, and n_steps here.
sim = pyafv.ParallelFiniteVoronoiSimulator(points, phys, (4, 4), 16)
with sim:
for step in range(n_steps):
diag = sim.build()
# followed by time-stepping code...
if __name__ == "__main__":
main()
This guard is required when Python uses the spawn multiprocessing start
method. This includes Windows and modern macOS by default; Linux
usually defaults to fork, but the guard is still recommended for
portable scripts.
In Jupyter notebooks, the parallel simulator may still work without this guard, but long production runs are usually more robust when launched from a script.
3.3. Halo width
Each owned domain is expanded by halo_width in every direction before the
local Voronoi calculation is built. If halo_width is not specified, PyAFV
uses 4.01 * phys.r (\(>4\ell\)). This should be large enough that the
geometry and force for an owned cell are not affected by missing neighboring
cells outside the local domain.
3.4. Decomposition method
The low-level helper pyafv.decompose_points() and the parallel
simulator both support three halo-collection methods:
sim = pyafv.ParallelFiniteVoronoiSimulator(
points,
phys,
grid_shape=(4, 4),
n_workers=16,
decomposition_method="dense",
)
"dense"is the default. It builds a dense domain-by-point mask and is often faster for moderate systems."binned"reuses the owned-domain bins to check fewer candidate halo points. It can be faster for large systems with many domains."sorted_x"avoids the dense temporary mask by sorting points along thex-axis and querying candidate halo ranges. It uses less temporary memory, but can be slower for typical moderate-sized systems.
3.5. Visualization
Parallel plotting diagnostics are local to each domain and should be requested explicitly:
import matplotlib.pyplot as plt
diag = sim.build(plot_mode=True)
fig, ax = plt.subplots()
pyafv.visualize_2d_parallel(points, diag, r=phys.r, ax=ax)
plt.show()
Use pyafv.visualize_2d_parallel() for diagnostics from
pyafv.ParallelFiniteVoronoiSimulator.build(); plot_mode must be
set to True. Use pyafv.visualize_2d() for diagnostics from
pyafv.FiniteVoronoiSimulator.build().
3.6. A complete example
The following code provides a complete example that simulates 10,000 cells
using a 3 x 3 domain decomposition and 9 worker processes:
import numpy as np
import pyafv as afv
from tqdm import tqdm
import matplotlib.pyplot as plt
def main():
radius = 1.0
points = np.random.default_rng(42).random((10_000, 2)) * 100.0
phys = afv.PhysicalParams(r=radius)
sim = afv.ParallelFiniteVoronoiSimulator(points, phys, (3, 3), n_workers=9)
dt = 0.01
with sim:
for _ in tqdm(range(1000)):
diag = sim.build()
points += diag["forces"] * dt
sim.update_positions(points)
diag = sim.build(plot_mode=True)
fig, ax = plt.subplots()
afv.visualize_2d_parallel(points, diag, r=radius, ax=ax)
plt.show()
if __name__ == "__main__":
main()
Run the code as you would a standard Python script. If your system has 9 or more CPU cores available, the parallel implementation should run substantially faster than the single-process version.
3.7. Running on clusters
3.7.1. Running on a single node
Python multiprocessing runs worker processes on the same node as the main Python process. It does not distribute work across multiple nodes. On a Slurm cluster, use one task with multiple CPUs, for example:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
Then use the same number of workers in Python:
sim = pyafv.ParallelFiniteVoronoiSimulator(points, phys, (4, 4), 16)
3.7.2. Multi-node parallelism: MPI
For multi-node domain decomposition, use an MPI-based
(Message Passing Interface) implementation instead of
Python multiprocessing. PyAFV does not currently provide an MPI
implementation. However, it exposes the point-based domain-decomposition
function pyafv.decompose_points() as a low-level helper:
- decompose_points(points, grid_shape=(2, 2), halo_width=0.0, *, domain_bounds=None, method='dense')[source]
Decompose points into owned grid domains plus halo/local points.
Points on internal owned-domain boundaries are assigned to exactly one domain by using right-sided binning and clipping at the outermost boundary. Halo/local domains include points on all halo box edges.
Tip
This function is independent of the finite Voronoi model. It only computes global/local point-index bookkeeping.
- Parameters:
points (ndarray) – (N,2) array of point coordinates.
grid_shape (tuple[int, int]) – Number of owned domains in the x and y directions.
halo_width (float) – Width added to each side of every owned domain to collect local halo points.
domain_bounds (tuple[tuple[float, float], tuple[float, float]] | None) – Optional domain bounds as
((xmin, xmax), (ymin, ymax)). If None, bounds are inferred from points.method (Literal['dense', 'binned', 'sorted_x']) – Method used to collect halo points.
"dense"builds a dense domain-by-point mask and is usually faster for moderate systems."binned"reuses the owned-domain bins to reduce the number of candidate points checked for each halo and is usually faster for many domains."sorted_x"uses less temporary memory.
- Returns:
One list of
DomainDecompositionobjects in row-major order, where each element stores the local and global point information for a single domain.- Return type:
- Raises:
ValueError – If points does not have shape (N,2).
ValueError – If points contains non-finite values.
ValueError – If grid_shape, halo_width, domain_bounds, or method is invalid.
Using this function, users can build an MPI wrapper around the parallel
simulator with mpi4py. The
following code shows a minimal example with two MPI ranks. First, the full
system is decomposed into a 2 x 1 grid across MPI ranks. Then, each rank
further decomposes its local domain into 2 x 2 subdomains for
multiprocessing, using four worker processes per rank
(\(2 \times 4 = 8\) workers in total).
# MPI_wrap.py
import numpy as np
import pyafv as afv
def main():
# ========================================================
# Put MPI setup inside main() so spawned multiprocessing
# workers do not import/initialize MPI at top level.
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size() # should be 2 for this test
# ========================================================
radius = 1.0
points = np.random.default_rng(42).random((10_000, 2)) * 100.0
phys = afv.PhysicalParams(r=radius)
# 2x1 domain decomposition for the two MPI ranks
if rank == 0:
domains = afv.decompose_points(points, (2, 1), halo_width=4.01*radius)
domains = comm.bcast(domains if rank == 0 else None, root=0)
# Each rank processes its own domain
domain = domains[rank]
local_points = domain.local_pts
# Each rank creates its own simulator with 4 workers
sim = afv.ParallelFiniteVoronoiSimulator(local_points, phys, (2, 2), n_workers=4)
dt = 0.01
with sim:
diag = sim.build()
# gather diagnostics from all ranks
diag_all = comm.gather(diag, root=0)
# combine diagnostics on rank 0
if rank == 0:
forces = np.zeros_like(points, dtype=float)
areas = np.zeros(points.shape[0], dtype=float)
perimeters = np.zeros(points.shape[0], dtype=float)
# ... add more diagnostics as needed ...
for mpi_domain, diag in zip(domains, diag_all):
owned_local_ids = mpi_domain.owned_local_ids
owned_global_ids = mpi_domain.local_global_ids[owned_local_ids]
forces[owned_global_ids] = diag["forces"][owned_local_ids]
areas[owned_global_ids] = diag["areas"][owned_local_ids]
perimeters[owned_global_ids] = diag["perimeters"][owned_local_ids]
diag_combined = {
"forces": forces,
"areas": areas,
"perimeters": perimeters,
}
# Update points
points += forces * dt
# upate domain decomposition with new points
domains = afv.decompose_points(points, (2, 1), halo_width=4.01*radius)
domains = comm.bcast(domains, root=0)
domain = domains[rank]
local_points = domain.local_pts
sim.update_positions(local_points)
print(f"Rank {rank} finished simulation.")
if __name__ == "__main__":
main()
On Linux clusters, it may be necessary to explicitly set the multiprocessing
start method to spawn for better compatibility with MPI:
import multiprocessing as mp
# ... define main() here ...
if __name__ == "__main__":
mp.set_start_method("spawn", force=True)
main()
To run the above code:
(.venv) $ mpiexec -n 2 python MPI_wrap.py
On a Slurm cluster, request 2 MPI tasks and 4 CPUs per task:
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
mpiexec -n "$SLURM_NTASKS" \
--map-by "slot:PE=$SLURM_CPUS_PER_TASK" \
--bind-to core \
python MPI_wrap.py
See Benchmarking hybrid parallel build: MPI + Python multiprocessing for a benchmark of this approach on the Rockfish HPC cluster at Johns Hopkins University.
Caution
MPI ranks can be placed on different nodes, while the worker processes
created by each rank run on that rank’s node. Therefore,
--cpus-per-task must not exceed the number of CPU cores available
on a single node.
Visualization of MPI-based parallel simulations is also possible. Simply loop over each
rank’s domain and call pyafv.visualize_2d_parallel() using the diagnostics from
each rank (plot_mode=True must be passed to the build method to enable plotting).
if rank == 0: # use only one rank to plot
fig, ax = plt.subplots()
for idx in range(size):
diag = diag_all[idx]
domain = domains[idx]
local_points = domain.local_pts
afv.visualize_2d_parallel(local_points, diag, r=radius, ax=ax,
selected=domain.owned_local_ids) # remove halo points for each rank's domain
ax.set_xlim(-10, 110)
ax.set_ylim(-10, 110)
plt.show()
However, if the number of cells is large enough to require multiple compute nodes (\(N \gtrsim 10^6\)), visualizing the full system is usually not very informative anyway. In such large-scale simulations, the finite Voronoi structures are often difficult to distinguish visually, and it is generally more effective to visualize cells simply as points.