A glimpse into a world of 2D particles (Part 2)
Welcome to part 2! In the first part of this series, we modeled a 2D particle system from scratch and built a basic simulation, squeezing out performance wins with loop-level optimizations targeting the physics hot loop. By the end we could simulate around 2,000 particles at 60fps on a modern machine. Our force computation, however, was still $O(N^2)$: every particle checked every other particle, every frame.
In this post we go further. We will break the $O(N^2)$ barrier with spatial partitioning, move the physics off the main thread with a Web Worker, and finally run the entire simulation on the GPU with WebGPU compute shaders.
Here are links for quick navigation to each section, but I recommend reading this post in order, like a short story, to see the code and explanations for each optimization, including the intent behind each change.
- The Scaling Problem
- Spatial Partitioning
- Web Workers
- Component Architecture
- GPU Compute with WebGPU
- Results & Interactive Demo
The Scaling Problem
Let's quickly recap where we left off. In part 1 we applied several loop-level optimizations, including a pair-once loop (shared distance computation), avoiding object allocations, and inlining math, and reached around 2,000 particles at 60fps. Assuming pairwise interactions between all particles, the force computation scales as $O(N^2)$: doubling the particle count quadruples the work, so constant-factor tweaks alone cannot take us much further.
Particles only interact when they are within a cutoff distance
Spatial partitioning
The trick to breaking the $O(N^2)$ barrier is that most pairs of particles are too far apart to interact, so we should never examine them in the first place.
Since particles only interact within the cutoff distance, we can divide the simulation space into square cells whose side length equals that cutoff. A particle can then only interact with particles in its own cell or one of the 8 adjacent cells.
This reduces the force computation from $O(N^2)$ to roughly $O(N)$: with particles spread evenly, each one checks only a small, constant number of neighbours on average.
The uniform grid concept
The simplest spatial data structure for this problem is a uniform grid (also called a hash grid). Every frame, we:
- Build the grid by placing each particle into the cell that contains it
- Query the grid to find all particles in a 3x3 neighbourhood when computing forces
A first attempt might use a Map where keys are cell coordinates and values are arrays of particle indices:
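A minimal sketch of that naive approach might look like the following. The names and exact layout are illustrative assumptions, not the production code:

```typescript
// Naive uniform grid: a Map from "cx,cy" string keys to arrays of particle indices.
type CellKey = string;

function buildNaiveGrid(
  px: Float32Array, py: Float32Array, n: number, cellSize: number
): Map<CellKey, number[]> {
  const grid = new Map<CellKey, number[]>();
  for (let i = 0; i < n; i++) {
    const cx = Math.floor(px[i] / cellSize);
    const cy = Math.floor(py[i] / cellSize);
    const key = `${cx},${cy}`;           // string key: hashed on every lookup
    const cell = grid.get(key);
    if (cell) cell.push(i);              // push reallocates as the array grows
    else grid.set(key, [i]);             // a fresh array per cell, every frame
  }
  return grid;
}

// Query: gather candidate indices from the 3x3 neighbourhood around (cx, cy).
function neighbours(grid: Map<CellKey, number[]>, cx: number, cy: number): number[] {
  const out: number[] = [];
  for (let dy = -1; dy <= 1; dy++)
    for (let dx = -1; dx <= 1; dx++) {
      const cell = grid.get(`${cx + dx},${cy + dy}`);
      if (cell) out.push(...cell);       // spread copies, creating garbage
    }
  return out;
}
```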
This works and is easy to understand, but has a few problems for a hot loop running 60 times per second:
- Map lookup overhead, where string key hashing is expensive compared to array indexing
- Array allocations, as push and the spread operator (...cell) create garbage every frame
- GC pressure, since all those short-lived arrays trigger garbage collection pauses causing frame stutters
Building the grid: counting sort
For the production version, we use a technique from sorting algorithms: counting sort. Instead of a Map, we use three flat arrays, allocated once and reused every frame:
- gridCellCount[numCells], containing how many particles are in each cell
- gridCellStart[numCells + 1], a prefix sum giving the start index for each cell in the particle list
- gridCellParticles[numParticles], a flat list of particle indices grouped by cell
The algorithm runs in two passes over all particles (plus a prefix-sum pass over the cells), making it $O(N)$ in the particle count:
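A sketch of the counting-sort build, with assumed names for the class and its members:

```typescript
// Counting-sort grid build: three flat arrays, allocated once and reused.
class UniformGrid {
  cellCount = new Uint32Array(0);
  cellStart = new Uint32Array(0);
  cellParticles = new Uint32Array(0);

  build(px: Float32Array, py: Float32Array, n: number,
        cellSize: number, gridW: number, gridH: number): void {
    const numCells = gridW * gridH;
    // Grow-only allocation with a 1.5x over-allocation factor; never shrink.
    if (this.cellCount.length < numCells) {
      const cap = Math.ceil(numCells * 1.5);
      this.cellCount = new Uint32Array(cap);
      this.cellStart = new Uint32Array(cap + 1);
    }
    if (this.cellParticles.length < n) {
      this.cellParticles = new Uint32Array(Math.ceil(n * 1.5));
    }
    this.cellCount.fill(0, 0, numCells);

    // Pass 1: count particles per cell.
    for (let i = 0; i < n; i++) {
      this.cellCount[this.cellIndex(px[i], py[i], cellSize, gridW)]++;
    }
    // Prefix sum: cellStart[c] = index of cell c's first particle.
    let sum = 0;
    for (let c = 0; c < numCells; c++) {
      this.cellStart[c] = sum;
      sum += this.cellCount[c];
      this.cellCount[c] = 0;             // reset so pass 2 can reuse it as a cursor
    }
    this.cellStart[numCells] = sum;
    // Pass 2: scatter particle indices into their cells.
    for (let i = 0; i < n; i++) {
      const c = this.cellIndex(px[i], py[i], cellSize, gridW);
      this.cellParticles[this.cellStart[c] + this.cellCount[c]++] = i;
    }
  }

  private cellIndex(x: number, y: number, cellSize: number, gridW: number): number {
    return Math.floor(y / cellSize) * gridW + Math.floor(x / cellSize);
  }
}
```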
Note how arrays are only reallocated when the grid grows (with a 1.5x over-allocation factor). They never shrink. This is important because the cutoff distance, and with it the cell size and cell count, can change at runtime; reallocating typed arrays every frame would reintroduce exactly the GC pressure we set out to avoid.
Querying neighbours
With the grid built, the force loop changes from "for each particle, check all other particles" to "for each particle, check only particles in 9 neighbouring cells". For reflecting boundaries (where particles bounce off the edges):
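A sketch of that loop. The force-matrix lookup G[typeI][typeJ] follows the convention described in the text; the function signature and the forceMagnitude helper are assumptions:

```typescript
// Force pass over grid cells, reflecting boundaries: the 3x3 neighbourhood is
// clamped at the walls rather than wrapped.
function accumulateForces(
  px: Float32Array, py: Float32Array, fx: Float32Array, fy: Float32Array,
  type: Uint8Array, G: number[][], n: number,
  cellSize: number, gridW: number, gridH: number,
  cellStart: Uint32Array, cellParticles: Uint32Array, rMax: number
): void {
  for (let i = 0; i < n; i++) {
    const cxi = Math.floor(px[i] / cellSize);
    const cyi = Math.floor(py[i] / cellSize);
    const cx0 = Math.max(cxi - 1, 0), cx1 = Math.min(cxi + 1, gridW - 1);
    const cy0 = Math.max(cyi - 1, 0), cy1 = Math.min(cyi + 1, gridH - 1);
    for (let cy = cy0; cy <= cy1; cy++) {
      for (let cx = cx0; cx <= cx1; cx++) {
        const c = cy * gridW + cx;
        for (let k = cellStart[c]; k < cellStart[c + 1]; k++) {
          const j = cellParticles[k];
          if (j <= i) continue;                 // visit each pair only once
          const dx = px[j] - px[i], dy = py[j] - py[i];
          const d2 = dx * dx + dy * dy;
          if (d2 === 0 || d2 > rMax * rMax) continue;
          const d = Math.sqrt(d2);              // one sqrt shared by both directions
          const fij = forceMagnitude(G[type[i]][type[j]], d, rMax); // force on i
          const fji = forceMagnitude(G[type[j]][type[i]], d, rMax); // force on j
          fx[i] += (fij * dx) / d; fy[i] += (fij * dy) / d;
          fx[j] -= (fji * dx) / d; fy[j] -= (fji * dy) / d;
        }
      }
    }
  }
}

// Placeholder force profile; the real one comes from part 1.
function forceMagnitude(g: number, d: number, rMax: number): number {
  return g * (1 - d / rMax);
}
```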
The j <= i check ensures each pair is visited only once, the same optimization from part 1. The key saving is in the distance computation. One Math.sqrt call is shared across both directions rather than computed twice. We still look up two separate entries from the force matrix per pair, one for the force on i (using G[typeI][typeJ]) and one for the force on j (using G[typeJ][typeI]). As part 1 explains, the force matrix is not necessarily symmetric, so these magnitudes can differ. Forces point in opposite directions but are not in general equal in magnitude.
Toroidal boundaries
Periodic boundaries (also called toroidal boundaries) make the simulation space wrap around. A particle that exits the right edge re-enters from the left. This creates a seamless, infinite-feeling world with no walls. It adds two complications for the spatial grid:
- Wrapped cell indices, where cells at the edge of the grid must check cells on the opposite edge. We use modular arithmetic instead of clamping, i.e. (cxi + dcx + gridWidth) % gridWidth
- Shortest-path distance, where we must account for wrapping when computing the distance between two particles. The correction, where $W$ is the simulation width (and $H$ its height for the y-direction), is:

$$dx \leftarrow dx - W \cdot \operatorname{round}(dx / W), \qquad dy \leftarrow dy - H \cdot \operatorname{round}(dy / H)$$
This ensures we always use the shortest path around the torus, not the long way around.
There is also a subtle edge case. When the grid has fewer than 3 cells in a dimension, the wrapped cell indices can repeat. For example, if the grid is only 2 cells wide, cell index (0 - 1 + 2) % 2 = 1 and (0 + 1 + 2) % 2 = 1, the same cell. Without deduplication, we would apply forces for the same pair twice. We use a small fixed-size buffer to track which cells have already been visited:
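A sketch of that deduplication. With a 3x3 neighbourhood there are at most 9 candidate cells, so a 9-slot scratch array (allocated once) suffices; names are assumptions:

```typescript
// Collect the wrapped 3x3 neighbourhood, skipping duplicate cell indices.
const visitedCells = new Int32Array(9);

function gatherWrappedCells(
  cxi: number, cyi: number, gridW: number, gridH: number
): { cells: Int32Array; count: number } {
  let count = 0;
  for (let dcy = -1; dcy <= 1; dcy++) {
    for (let dcx = -1; dcx <= 1; dcx++) {
      const cx = (cxi + dcx + gridW) % gridW;   // wrap around the torus
      const cy = (cyi + dcy + gridH) % gridH;
      const c = cy * gridW + cx;
      let seen = false;                         // linear scan: at most 9 entries
      for (let k = 0; k < count; k++) {
        if (visitedCells[k] === c) { seen = true; break; }
      }
      if (!seen) visitedCells[count++] = c;
    }
  }
  return { cells: visitedCells, count };
}
```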
Web Workers
Why off-thread physics?
Even with spatial partitioning, 10,000+ particles can take several milliseconds of physics computation per frame. If that runs on the main thread, it competes with rendering and UI event handling (clicks, scrolls, hover effects). The result is stuttery animation and an unresponsive page.
A Web Worker runs JavaScript in a separate OS thread. We can move the entire physics update, including building the grid, computing forces, and integrating motion, into a worker, leaving the main thread free to render and handle events. The main thread draws the last known particle state while the worker computes the next one.
ArrayBuffer transfer (zero-copy)
The particle data for 10,000 particles is a Float32Array of 60,000 floats (6 floats per particle, including position, velocity, type, and damping):
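A sketch of that interleaved layout. The field order is my assumption; the text only specifies position, velocity, type, and damping:

```typescript
// Interleaved particle layout: 6 floats per particle in one Float32Array.
const FLOATS_PER_PARTICLE = 6;
const X = 0, Y = 1, VX = 2, VY = 3, TYPE = 4, DAMPING = 5;

function createParticleBuffer(numParticles: number): Float32Array {
  return new Float32Array(numParticles * FLOATS_PER_PARTICLE);
}

// For 10,000 particles: 60,000 floats, 4 bytes each = 240,000 bytes (~240KB).
const particles = createParticleBuffer(10_000);
const bytes = particles.byteLength;
```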
Copying 240KB of data between threads on every frame would be expensive. Instead, we use the Transferable API to transfer ownership of the underlying ArrayBuffer. This is a zero-copy operation, meaning the memory itself does not move, only the ownership changes. The sender's reference becomes detached (unusable) until the receiver sends it back.
This pattern creates a natural ping-pong. The main thread owns the buffer, sends it to the worker, waits for the worker to finish, gets it back, uploads to the GPU, then sends it to the worker again.
The worker message loop
Inside the worker, the message handler is straightforward. It receives data, runs physics, measures time, and sends results back:
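A sketch of that handler. The message shape and the stepPhysics placeholder are assumptions; the handler body is extracted into a function so it can run outside a real worker:

```typescript
type PhysicsMsg = { buffer: ArrayBuffer; dt: number };

// Placeholder for the real pipeline: build grid, compute forces, integrate.
function stepPhysics(particles: Float32Array, dt: number): void {
  for (let i = 0; i < particles.length; i += 6) {
    particles[i] += particles[i + 2] * dt;       // x += vx * dt
    particles[i + 1] += particles[i + 3] * dt;   // y += vy * dt
  }
}

function handleMessage(
  msg: PhysicsMsg,
  post: (result: { buffer: ArrayBuffer; physicsMs: number }, transfer: ArrayBuffer[]) => void
): void {
  const particles = new Float32Array(msg.buffer);
  const t0 = performance.now();
  stepPhysics(particles, msg.dt);
  const physicsMs = performance.now() - t0;
  // Listing the buffer as a Transferable moves ownership back, zero-copy.
  post({ buffer: msg.buffer, physicsMs }, [msg.buffer]);
}

// In the actual worker file:
// self.onmessage = (e) => handleMessage(e.data, (r, t) => self.postMessage(r, t));
```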
The main thread sets up the worker and handles responses:
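A sketch of the main-thread side of the ping-pong. To keep it self-contained and testable here, the worker sits behind a minimal injected interface; all names are assumptions:

```typescript
interface WorkerLike {
  postMessage(msg: { buffer: ArrayBuffer; dt: number }, transfer: ArrayBuffer[]): void;
  onmessage: ((e: { data: { buffer: ArrayBuffer; physicsMs: number } }) => void) | null;
}

class PhysicsBridge {
  private buffer: ArrayBuffer | null;   // null while the worker owns it
  private busy = false;
  lastPhysicsMs = 0;

  constructor(private worker: WorkerLike, initial: ArrayBuffer,
              private onState: (view: Float32Array) => void) {
    this.buffer = initial;
    worker.onmessage = (e) => {
      this.buffer = e.data.buffer;      // ownership is back on the main thread
      this.lastPhysicsMs = e.data.physicsMs;
      this.busy = false;
      this.onState(new Float32Array(this.buffer)); // snapshot for the render loop
    };
  }

  dispatch(dt: number): void {
    if (this.busy || this.buffer === null) return; // worker still owns the buffer
    this.busy = true;
    const buf = this.buffer;
    this.buffer = null;                 // our reference detaches after transfer
    this.worker.postMessage({ buffer: buf, dt }, [buf]);
  }
}
```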
One important detail. The rendering loop on the main thread runs independently of the worker. It keeps drawing the last known particle positions at 60fps regardless of how long the worker takes. If the worker takes longer than one frame (16.7ms), the simulation effectively runs at a lower tick rate, but the animation remains smooth. The user never sees a frozen screen.
Architecture
With spatial partitioning for performance and a Web Worker for responsiveness, we have the building blocks of a production simulation, but the code also needs to be maintainable. In this section we look at the architectural patterns that keep the system organized and predictable.
The component architecture
The simulation is organized in a three-layer architecture:
- Component layer, the Angular component that handles UI controls (play/pause, particle count slider, screenshot button) and binds to inputs
- Service layer, using Angular Signals for reactive state and wrapping the lifecycle in a finite state machine
- Renderer + Backend layer, covering the animation loop, performance tracking, and a pluggable backend interface that encapsulates all rendering and physics
The key abstraction is ISimulationBackend. Both the WebGL backend (which uses a Web Worker for CPU physics and WebGL2 for rendering) and the WebGPU backend (which runs everything on the GPU via compute shaders) implement the same interface:
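A sketch of that interface. Beyond submitFrame, the method names are my assumptions based on the lifecycle the text describes:

```typescript
interface ISimulationBackend {
  init(canvas: HTMLCanvasElement, numParticles: number): Promise<void>;
  submitFrame(dt: number): number;       // returns last measured physics time (ms)
  resize(width: number, height: number): void;
  setParticleCount(n: number): void;
  destroy(): void;
}

// Minimal stub showing the contract in use.
class NullBackend implements ISimulationBackend {
  private physicsMs = 0;
  async init(_canvas: HTMLCanvasElement, _numParticles: number): Promise<void> {}
  submitFrame(_dt: number): number { return this.physicsMs; }
  resize(_w: number, _h: number): void {}
  setParticleCount(_n: number): void {}
  destroy(): void {}
}
```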
The renderer picks the best available backend at startup (WebGPU if supported, WebGL2 as fallback) and delegates all work through this interface. No if/else branches, no feature flags. Adding a new backend in the future requires no changes to the renderer or component.
The state machine
A particle simulation has a complex lifecycle. It needs to initialize (possibly async), allocate buffers, idle, run, pause, handle color changes, and clean up on destroy. Without careful management, it is easy to end up with boolean flags like isRunning, isInitialized, isPaused that interact in surprising ways.
Instead, we use a generic finite state machine (FSM). The FSM is defined separately from the particle simulation. It is a reusable utility class that works with any set of states:
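A sketch of such a class. The exact API is an assumption; the design points below refer to the shape shown here:

```typescript
// Generic FSM: T is a closed union of state names; events are open strings.
type Transition<T extends string> = {
  from: T; event: string; to: T;
  guard?: () => boolean;               // transition proceeds only if this returns true
};

class FSM<T extends string> {
  constructor(
    private state: T,
    private transitions: Transition<T>[],
    private hooks: Partial<Record<T, { onEnter?: () => void | Promise<void>;
                                       onExit?: () => void | Promise<void> }>> = {},
  ) {}

  get current(): T { return this.state; }

  async send(event: string): Promise<boolean> {
    const t = this.transitions.find(
      (tr) => tr.from === this.state && tr.event === event && (!tr.guard || tr.guard()),
    );
    if (!t) return false;              // invalid from this state: do nothing
    await this.hooks[this.state]?.onExit?.();
    this.state = t.to;
    await this.hooks[t.to]?.onEnter?.();
    return true;
  }
}
```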
A few things to notice about this design:
- Closed state space, where the type parameter T extends string constrains states to a fixed union type. You cannot accidentally transition to an undefined state.
- Open event space, where events are plain strings, so you can add new events without modifying the FSM class itself.
- Guard conditions, where transitions can have guard functions that must return true for the transition to proceed. This prevents invalid transitions at runtime.
- Async callbacks, as onEnter and onExit can return promises, supporting async initialization.
For the particle simulation, we define 7 states with their transitions:
State diagram:
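In lieu of the diagram, the same information as a transition table. The state and event names here are my reconstruction from the lifecycle described above, not the actual source:

```typescript
// 7 states and their transitions, as plain data. Names are assumptions.
type SimState =
  | "uninitialized" | "initializing" | "idle"
  | "running" | "paused" | "updatingColors" | "destroyed";

const transitions: { from: SimState; event: string; to: SimState }[] = [
  { from: "uninitialized",  event: "INIT",    to: "initializing" },   // async setup
  { from: "initializing",   event: "READY",   to: "idle" },           // buffers allocated
  { from: "idle",           event: "PLAY",    to: "running" },
  { from: "running",        event: "PAUSE",   to: "paused" },
  { from: "paused",         event: "PLAY",    to: "running" },
  { from: "running",        event: "STOP",    to: "idle" },
  { from: "paused",         event: "STOP",    to: "idle" },
  { from: "running",        event: "COLORS",  to: "updatingColors" }, // color change
  { from: "updatingColors", event: "DONE",    to: "running" },
  { from: "idle",           event: "DESTROY", to: "destroyed" },      // cleanup
];
```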
Every lifecycle operation (play(), pause(), stop(), resetSimulationData()) is just a call to fsm.send(eventName). If the transition is not valid from the current state, nothing happens. This makes the system predictable. You never end up in an impossible state, and you never need to check multiple boolean flags.
Backend abstraction in practice
WebGLBackend.submitFrame() is called 60 times per second by the renderer.
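A sketch of it. The worker and draw call are injected so the sketch stays self-contained; member names are assumptions:

```typescript
class WebGLBackendSketch {
  private workerBusy = false;
  private buffer: ArrayBuffer | null;
  private drawBuffer: Float32Array;
  private lastPhysicsMs = 0;

  constructor(
    private worker: { postMessage(m: unknown, t: ArrayBuffer[]): void },
    private drawParticles: (data: Float32Array) => void,
    numParticles: number,
  ) {
    this.buffer = new ArrayBuffer(numParticles * 6 * 4);   // 6 floats per particle
    this.drawBuffer = new Float32Array(numParticles * 6);
  }

  submitFrame(dt: number): number {
    // 1. Dispatch physics (fire-and-forget), but only if the worker is free
    //    and the main thread currently owns the particle buffer.
    if (!this.workerBusy && this.buffer !== null) {
      this.workerBusy = true;
      const buf = this.buffer;
      this.buffer = null;                                  // detaches on transfer
      this.worker.postMessage({ buffer: buf, dt }, [buf]);
    }
    // 2. Always draw the last known state, regardless of worker progress.
    this.drawParticles(this.drawBuffer);
    return this.lastPhysicsMs;
  }

  // Called from the worker's onmessage handler when results arrive.
  onWorkerDone(buffer: ArrayBuffer, physicsMs: number): void {
    this.buffer = buffer;
    this.lastPhysicsMs = physicsMs;
    this.drawBuffer.set(new Float32Array(buffer));         // copy into draw buffer
    this.workerBusy = false;
  }
}
```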
Notice how this method does two independent things. It dispatches physics to the worker (if the worker is not already busy), and it draws the last known state to the screen. The worker is fire-and-forget, so when it finishes, its onmessage handler copies the updated positions into the draw buffer, ready for the next render pass.
The WebGPU backend has the same submitFrame signature but a completely different implementation. It dispatches compute shaders on the GPU and uses ping-pong buffers instead of a worker. The renderer does not care; it calls the same interface.
GPU compute with WebGPU
The CPU path is essentially saturated. Even with spatial partitioning and a Web Worker, force computation is still sequential on a single thread. A modern GPU has thousands of shader cores sitting idle while the worker runs through the physics. The WebGPU backend moves the entire physics pipeline onto the GPU. The spatial grid build, force accumulation, and motion integration all run as compute shaders with no Web Worker involved.
Particle data lives in GPU storage buffers and never leaves the device each frame. Instead of a shared Float32Array passed between threads, we use a Structure-of-Arrays layout: separate ping-pong buffers for positions (vec2f) and velocities (vec2f), plus immutable per-particle buffers for type (u32), alpha, and dampFactor (precomputed pow(damping, dt)). Add a force accumulation buffer and three grid buffers (cellCount, cellStart, cellParticles), and the entire simulation state lives in GPU memory. Each frame dispatches six compute passes in sequence:
- CLEAR, which zeros the cellCount buffer via clearBuffer
- COUNT, one thread per particle, atomically incrementing its cell counter
- PREFIX_SUM, a single thread converting counts to CSR start offsets and resetting cellCount for reuse
- SCATTER, one thread per particle, writing its index into cellParticles
- FORCE, one thread per particle, walking the 3×3 neighbourhood and accumulating forces
- INTEGRATE, one thread per particle, applying forces and advancing position
The first four passes rebuild the spatial grid. They are the same counting sort from the Web Worker, just written in WGSL (WebGPU Shading Language) and running on thousands of cores instead of one. The COUNT pass uses atomicAdd because multiple particles can land in the same cell simultaneously:
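A sketch of the COUNT shader. The binding layout and names are assumptions:

```wgsl
// COUNT pass: one thread per particle. cellCount is an array of atomics because
// many threads may increment the same cell in the same dispatch.
@group(0) @binding(0) var<storage, read> positions: array<vec2f>;
@group(0) @binding(1) var<storage, read_write> cellCount: array<atomic<u32>>;

struct Params { cellSize: f32, gridW: u32, gridH: u32, numParticles: u32 }
@group(0) @binding(2) var<uniform> params: Params;

@compute @workgroup_size(64)
fn count(@builtin(global_invocation_id) gid: vec3u) {
  let i = gid.x;
  if (i >= params.numParticles) { return; }
  let p = positions[i];
  let cx = u32(p.x / params.cellSize);
  let cy = u32(p.y / params.cellSize);
  let cell = cy * params.gridW + cx;
  atomicAdd(&cellCount[cell], 1u);   // safe under concurrent increments
}
```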
The FORCE pass follows the same 3×3 neighbourhood walk and 9-slot dedup buffer as the CPU worker. The loop structure differs in one important way, though: each thread processes all j != i, not just j > i:
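A sketch of that inner loop in WGSL. Variable names and the forceMag helper are assumptions; world holds the simulation dimensions $(W, H)$:

```wgsl
// FORCE pass inner loop for one candidate cell: thread i writes only forces[i].
var f = vec2f(0.0);
for (var k = cellStart[cell]; k < cellStart[cell + 1u]; k++) {
  let j = cellParticles[k];
  if (j == i) { continue; }                 // all neighbours, not just j > i
  var d = positions[j] - positions[i];
  d -= world * round(d / world);            // shortest path across the torus
  let dist = length(d);
  if (dist == 0.0 || dist > rMax) { continue; }
  // Only this thread's own row of the force matrix is needed.
  let g = forceMatrix[typeOf[i] * numTypes + typeOf[j]];
  f += (d / dist) * forceMag(g, dist);      // accumulate force on particle i
}
forces[i] = f;                              // no other thread writes this slot
```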
This is a fundamental structural difference from the CPU worker, not just a translation to WGSL. Here is a side-by-side comparison of the two loop strategies:
| | CPU Worker | GPU FORCE Pass |
|---|---|---|
| Loop condition | j > i (pair-once) | j != i (all neighbours) |
| Matrix lookups per pair | 2, one per direction | 1, own row only |
| Force accumulated for | both particles i and j | particle i only |
| How j's force gets computed | same thread, same iteration | independently by thread j |
| Pairs visited | once | twice, once from each side, each result into its own slot (no double-counting) |
The CPU worker uses the pair-once trick because sequential compute is expensive. Skipping half the force evaluations matters on a single thread. Writing to both netForce[i] and netForce[j] in the same sequential loop is safe.
On the GPU, each thread i accumulates a local force (fx, fy) and writes only to forces[i], its own slot. No other thread touches forces[i]. This means thread i examining pair (i, j) and thread j examining the same pair are computing two different quantities. Thread i computes the force on i from j, and thread j computes the force on j from i. Each goes into a separate slot, so there is no double-counting and no division by two is needed. Writing to a neighbour's slot from multiple threads would require atomic floating-point additions, which are expensive and not natively supported for f32 in WGSL. Letting each thread own only its result avoids the problem entirely. Both threads independently compute the same distance, but since they run in parallel, this does not add to wall-clock latency.
The submitFrame method is simpler than the WebGL version. It encodes all six dispatches into a single command buffer, submits it to the GPU queue, and returns immediately. Timing is measured asynchronously via onSubmittedWorkDone:
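A sketch of that control flow. To keep it self-contained here, the GPU queue sits behind a minimal injected interface; in the real backend the six compute passes are encoded where the comment indicates. All names are assumptions:

```typescript
interface GpuQueueLike {
  submit(): void;
  onSubmittedWorkDone(): Promise<void>;
}

class WebGPUBackendSketch {
  private isComputePending = false;
  private lastPhysicsMs = 0;
  framesComputed = 0;

  constructor(private queue: GpuQueueLike, private draw: () => void) {}

  submitFrame(_dt: number): number {
    if (!this.isComputePending) {
      this.isComputePending = true;
      // Real backend: encode CLEAR, COUNT, PREFIX_SUM, SCATTER, FORCE, INTEGRATE
      // into one command buffer here, then submit it.
      this.queue.submit();
      const t0 = performance.now();
      this.queue.onSubmittedWorkDone().then(() => {   // async GPU fence, non-blocking
        this.lastPhysicsMs = performance.now() - t0;
        this.isComputePending = false;
        this.framesComputed++;
      });
    }
    this.draw();                   // always redraw the last known positions
    return this.lastPhysicsMs;     // previous frame's timing
  }
}
```

The isComputePending check is the back-pressure guard described below: while a compute pass is in flight, new frames only redraw.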
One subtlety is back-pressure. The GPU command queue is not infinite. If we encode a new frame before the previous compute pass has finished, the queue grows unbounded. The isComputePending flag guards against this. When the previous frame has not completed yet, we skip the physics update and only redraw with the last known particle positions, keeping the queue bounded.
The other subtlety is timing. submitFrame returns the previous frame's GPU time immediately, so the renderer's requestAnimationFrame loop is never blocked waiting for a GPU fence. The one-frame lag is invisible to the user but keeps the main thread free.
Because force computation is now spread across thousands of GPU threads instead of a single worker thread, the per-frame physics cost drops dramatically, and the particle ceiling rises accordingly.
Results & interactive demo
Below you can see both backends side by side at the same particle count (N = 5k, periodic boundaries). The left forces the WebGL + Web Worker path so physics runs on the CPU. The right auto-selects the best available backend, choosing WebGPU where your browser supports it. Look at the physics time reported in the stats overlay to compare the two. Try tapping each to pause or resume.
Full-scale demo with all controls available. Try pushing the particle count above N = 10k:
On an M4 Pro Mac Mini the WebGL + Worker backend handles 10,000–25,000 particles at 60 fps. WebGPU on the same machine holds a stable 25,000 particles, reaching 50,000 before the frame budget is exceeded. On an iPhone 13 Pro the WebGL path reaches a few thousand particles. The WebGPU path holds a stable 10,000. These are roughly the limits of the current architecture given the 16.7ms frame budget at 60fps.
The key takeaways:
- Spatial partitioning broke the $O(N^2)$ barrier, giving roughly a 10× improvement in the physics step by reducing it to $O(N)$
- Web Workers decoupled physics from rendering, keeping the UI smooth even when physics takes longer than one frame
- The state machine eliminated an entire class of bugs, including impossible states and race conditions during initialization, and made the lifecycle explicit
- Backend abstraction let us add a WebGPU path without touching the renderer or component
- WebGPU compute shaders parallelised the spatial grid and force computation across thousands of GPU cores, raising the stable ceiling to 25,000 on desktop and 10,000 on mobile
Wrapping up
We went from ~2,000 particles at 60 fps (Part 1) to ~25,000 stably with WebGPU on desktop, through a chain of independent improvements. Spatial partitioning broke the $O(N^2)$ barrier, a Web Worker kept the main thread responsive, a state machine and backend abstraction kept the code maintainable, and WebGPU compute shaders moved the whole pipeline onto thousands of GPU cores.
There are still directions left to explore, such as a true parallel GPU prefix-sum using workgroup shared memory, 3D extensions, or real-time force-matrix editing with visual feedback. The fundamentals are here. If you want to dig into the full source, the simulation runs directly on this page and in the Sketches section.