A glimpse into a world of 2D particles (Part 2)
Welcome to part 2! In the first part of this series, we modeled a 2D particle system from scratch and built a basic simulation, squeezing out performance wins with loop-level optimizations targeting the physics hot loop. By the end we could simulate around 2,000 particles at 60fps on a modern machine. Our force computation, however, was still $O(N^2)$: every particle checked every other particle, every frame.
In this post we go further. We will break the $O(N^2)$ barrier with spatial partitioning, move the physics off the main thread with a Web Worker, and finally run the entire simulation on the GPU with WebGPU compute shaders.
Here are links for quick navigation to each section, but I recommend reading this post in order, like a short story, to see the code and explanations for each optimization, including the intent behind each change.
- The Scaling Problem
- Spatial Partitioning
- Web Workers
- Component Architecture
- GPU Compute with WebGPU
- Results & Interactive Demo
The Scaling Problem
Let's quickly recap where we left off. In part 1 we applied several loop-level optimizations, including a pair-once loop (shared distance computation), avoiding object allocations, and inlining math, and reached around 2,000 particles at 60fps. Assuming pairwise interactions between all particles, the force computation scales as $O(N^2)$: doubling the particle count quadruples the work, so constant-factor tweaks alone cannot take us much further.
Particles only interact when they are within a cutoff distance
Spatial partitioning
The trick to breaking the $O(N^2)$ barrier is that most pairs of particles are too far apart to interact, so we should never examine them in the first place.
Since particles only interact within the cutoff distance, we can divide the simulation space into square cells whose side length equals that cutoff. A particle can then only interact with particles in its own cell or one of the 8 adjacent cells.
This reduces the force computation from $O(N^2)$ to roughly $O(N)$: with particles spread evenly, each one checks only a small, constant number of neighbours on average.
The uniform grid concept
The simplest spatial data structure for this problem is a uniform grid (also called a hash grid). Every frame, we:
- Build the grid by placing each particle into the cell that contains it
- Query the grid to find all particles in a 3x3 neighbourhood when computing forces
A first attempt might use a Map where keys are cell coordinates and values are arrays of particle indices:
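A minimal sketch of that naive approach might look like the following. The names and exact layout are illustrative assumptions, not the production code:

```typescript
// Naive uniform grid: a Map from "cx,cy" string keys to arrays of particle indices.
type CellKey = string;

function buildNaiveGrid(
  px: Float32Array, py: Float32Array, n: number, cellSize: number
): Map<CellKey, number[]> {
  const grid = new Map<CellKey, number[]>();
  for (let i = 0; i < n; i++) {
    const cx = Math.floor(px[i] / cellSize);
    const cy = Math.floor(py[i] / cellSize);
    const key = `${cx},${cy}`;           // string key: hashed on every lookup
    const cell = grid.get(key);
    if (cell) cell.push(i);              // push reallocates as the array grows
    else grid.set(key, [i]);             // a fresh array per cell, every frame
  }
  return grid;
}

// Query: gather candidate indices from the 3x3 neighbourhood around (cx, cy).
function neighbours(grid: Map<CellKey, number[]>, cx: number, cy: number): number[] {
  const out: number[] = [];
  for (let dy = -1; dy <= 1; dy++)
    for (let dx = -1; dx <= 1; dx++) {
      const cell = grid.get(`${cx + dx},${cy + dy}`);
      if (cell) out.push(...cell);       // spread copies, creating garbage
    }
  return out;
}
```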
This works and is easy to understand, but has a few problems for a hot loop running 60 times per second:
- Map lookup overhead, where string key hashing is expensive compared to array indexing
- Array allocations, as push and the spread operator (...cell) create garbage every frame
- GC pressure, since all those short-lived arrays trigger garbage collection pauses causing frame stutters
Building the grid: counting sort
For the production version, we use a technique from sorting algorithms: counting sort. Instead of a Map, we use three flat arrays, allocated once and reused every frame:
- gridCellCount[numCells], containing how many particles are in each cell
- gridCellStart[numCells + 1], a prefix sum giving the start index for each cell in the particle list
- gridCellParticles[numParticles], a flat list of particle indices grouped by cell
The algorithm runs in two passes over all particles (plus a prefix-sum pass over the cells), making it $O(N)$ in the particle count:
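A sketch of the counting-sort build, with assumed names for the class and its members:

```typescript
// Counting-sort grid build: three flat arrays, allocated once and reused.
class UniformGrid {
  cellCount = new Uint32Array(0);
  cellStart = new Uint32Array(0);
  cellParticles = new Uint32Array(0);

  build(px: Float32Array, py: Float32Array, n: number,
        cellSize: number, gridW: number, gridH: number): void {
    const numCells = gridW * gridH;
    // Grow-only allocation with a 1.5x over-allocation factor; never shrink.
    if (this.cellCount.length < numCells) {
      const cap = Math.ceil(numCells * 1.5);
      this.cellCount = new Uint32Array(cap);
      this.cellStart = new Uint32Array(cap + 1);
    }
    if (this.cellParticles.length < n) {
      this.cellParticles = new Uint32Array(Math.ceil(n * 1.5));
    }
    this.cellCount.fill(0, 0, numCells);

    // Pass 1: count particles per cell.
    for (let i = 0; i < n; i++) {
      this.cellCount[this.cellIndex(px[i], py[i], cellSize, gridW)]++;
    }
    // Prefix sum: cellStart[c] = index of cell c's first particle.
    let sum = 0;
    for (let c = 0; c < numCells; c++) {
      this.cellStart[c] = sum;
      sum += this.cellCount[c];
      this.cellCount[c] = 0;             // reset so pass 2 can reuse it as a cursor
    }
    this.cellStart[numCells] = sum;
    // Pass 2: scatter particle indices into their cells.
    for (let i = 0; i < n; i++) {
      const c = this.cellIndex(px[i], py[i], cellSize, gridW);
      this.cellParticles[this.cellStart[c] + this.cellCount[c]++] = i;
    }
  }

  private cellIndex(x: number, y: number, cellSize: number, gridW: number): number {
    return Math.floor(y / cellSize) * gridW + Math.floor(x / cellSize);
  }
}
```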
Note how arrays are only reallocated when the grid grows (with a 1.5x over-allocation factor). They never shrink. This is important because the cutoff distance, and with it the cell size and cell count, can change at runtime; reallocating typed arrays every frame would reintroduce exactly the GC pressure we set out to avoid.
Querying neighbours
With the grid built, the force loop changes from "for each particle, check all other particles" to "for each particle, check only particles in 9 neighbouring cells". For reflecting boundaries (where particles bounce off the edges):
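A sketch of that loop. The force-matrix lookup G[typeI][typeJ] follows the convention described in the text; the function signature and the forceMagnitude helper are assumptions:

```typescript
// Force pass over grid cells, reflecting boundaries: the 3x3 neighbourhood is
// clamped at the walls rather than wrapped.
function accumulateForces(
  px: Float32Array, py: Float32Array, fx: Float32Array, fy: Float32Array,
  type: Uint8Array, G: number[][], n: number,
  cellSize: number, gridW: number, gridH: number,
  cellStart: Uint32Array, cellParticles: Uint32Array, rMax: number
): void {
  for (let i = 0; i < n; i++) {
    const cxi = Math.floor(px[i] / cellSize);
    const cyi = Math.floor(py[i] / cellSize);
    const cx0 = Math.max(cxi - 1, 0), cx1 = Math.min(cxi + 1, gridW - 1);
    const cy0 = Math.max(cyi - 1, 0), cy1 = Math.min(cyi + 1, gridH - 1);
    for (let cy = cy0; cy <= cy1; cy++) {
      for (let cx = cx0; cx <= cx1; cx++) {
        const c = cy * gridW + cx;
        for (let k = cellStart[c]; k < cellStart[c + 1]; k++) {
          const j = cellParticles[k];
          if (j <= i) continue;                 // visit each pair only once
          const dx = px[j] - px[i], dy = py[j] - py[i];
          const d2 = dx * dx + dy * dy;
          if (d2 === 0 || d2 > rMax * rMax) continue;
          const d = Math.sqrt(d2);              // one sqrt shared by both directions
          const fij = forceMagnitude(G[type[i]][type[j]], d, rMax); // force on i
          const fji = forceMagnitude(G[type[j]][type[i]], d, rMax); // force on j
          fx[i] += (fij * dx) / d; fy[i] += (fij * dy) / d;
          fx[j] -= (fji * dx) / d; fy[j] -= (fji * dy) / d;
        }
      }
    }
  }
}

// Placeholder force profile; the real one comes from part 1.
function forceMagnitude(g: number, d: number, rMax: number): number {
  return g * (1 - d / rMax);
}
```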
The j <= i check ensures each pair is visited only once, the same optimization from part 1. The key saving is in the distance computation. One Math.sqrt call is shared across both directions rather than computed twice. We still look up two separate entries from the force matrix per pair, one for the force on i (using G[typeI][typeJ]) and one for the force on j (using G[typeJ][typeI]). As part 1 explains, the force matrix is not necessarily symmetric, so these magnitudes can differ. Forces point in opposite directions but are not in general equal in magnitude.
Toroidal boundaries
Periodic boundaries (also called toroidal boundaries) make the simulation space wrap around. A particle that exits the right edge re-enters from the left. This creates a seamless, infinite-feeling world with no walls. It adds two complications for the spatial grid:
- Wrapped cell indices, where cells at the edge of the grid must check cells on the opposite edge. We use modular arithmetic instead of clamping, i.e. (cxi + dcx + gridWidth) % gridWidth
- Shortest-path distance, where we must account for wrapping when computing the distance between two particles. The correction, where $W$ is the simulation width (and $H$ its height for the y-direction), is:

$$dx \leftarrow dx - W \cdot \operatorname{round}(dx / W), \qquad dy \leftarrow dy - H \cdot \operatorname{round}(dy / H)$$
This ensures we always use the shortest path around the torus, not the long way around.
There is also a subtle edge case. When the grid has fewer than 3 cells in a dimension, the wrapped cell indices can repeat. For example, if the grid is only 2 cells wide, cell index (0 - 1 + 2) % 2 = 1 and (0 + 1 + 2) % 2 = 1, the same cell. Without deduplication, we would apply forces for the same pair twice. We use a small fixed-size buffer to track which cells have already been visited:
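A sketch of that deduplication. With a 3x3 neighbourhood there are at most 9 candidate cells, so a 9-slot scratch array (allocated once) suffices; names are assumptions:

```typescript
// Collect the wrapped 3x3 neighbourhood, skipping duplicate cell indices.
const visitedCells = new Int32Array(9);

function gatherWrappedCells(
  cxi: number, cyi: number, gridW: number, gridH: number
): { cells: Int32Array; count: number } {
  let count = 0;
  for (let dcy = -1; dcy <= 1; dcy++) {
    for (let dcx = -1; dcx <= 1; dcx++) {
      const cx = (cxi + dcx + gridW) % gridW;   // wrap around the torus
      const cy = (cyi + dcy + gridH) % gridH;
      const c = cy * gridW + cx;
      let seen = false;                         // linear scan: at most 9 entries
      for (let k = 0; k < count; k++) {
        if (visitedCells[k] === c) { seen = true; break; }
      }
      if (!seen) visitedCells[count++] = c;
    }
  }
  return { cells: visitedCells, count };
}
```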
Web Workers
Why off-thread physics?
Even with spatial partitioning, 10,000+ particles can take several milliseconds of physics computation per frame. If that runs on the main thread, it competes with rendering and UI event handling (clicks, scrolls, hover effects). The result is stuttery animation and an unresponsive page.
A Web Worker runs JavaScript in a separate OS thread. We can move the entire physics update, including building the grid, computing forces, and integrating motion, into a worker, leaving the main thread free to render and handle events. The main thread draws the last known particle state while the worker computes the next one.
ArrayBuffer transfer (zero-copy)
The particle data for 10,000 particles is a Float32Array of 60,000 floats (6 floats per particle, including position, velocity, type, and damping):
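A sketch of that interleaved layout. The field order is my assumption; the text only specifies position, velocity, type, and damping:

```typescript
// Interleaved particle layout: 6 floats per particle in one Float32Array.
const FLOATS_PER_PARTICLE = 6;
const X = 0, Y = 1, VX = 2, VY = 3, TYPE = 4, DAMPING = 5;

function createParticleBuffer(numParticles: number): Float32Array {
  return new Float32Array(numParticles * FLOATS_PER_PARTICLE);
}

// For 10,000 particles: 60,000 floats, 4 bytes each = 240,000 bytes (~240KB).
const particles = createParticleBuffer(10_000);
const bytes = particles.byteLength;
```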
Copying 240KB of data between threads on every frame would be expensive. Instead, we use the Transferable API to transfer ownership of the underlying ArrayBuffer. This is a zero-copy operation, meaning the memory itself does not move, only the ownership changes. The sender's reference becomes detached (unusable) until the receiver sends it back.
This pattern creates a natural ping-pong. The main thread owns the buffer, sends it to the worker, waits for the worker to finish, gets it back, uploads to the GPU, then sends it to the worker again.
The worker message loop
Inside the worker, the message handler is straightforward. It receives data, runs physics, measures time, and sends results back:
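A sketch of that handler. The message shape and the stepPhysics placeholder are assumptions; the handler body is extracted into a function so it can run outside a real worker:

```typescript
type PhysicsMsg = { buffer: ArrayBuffer; dt: number };

// Placeholder for the real pipeline: build grid, compute forces, integrate.
function stepPhysics(particles: Float32Array, dt: number): void {
  for (let i = 0; i < particles.length; i += 6) {
    particles[i] += particles[i + 2] * dt;       // x += vx * dt
    particles[i + 1] += particles[i + 3] * dt;   // y += vy * dt
  }
}

function handleMessage(
  msg: PhysicsMsg,
  post: (result: { buffer: ArrayBuffer; physicsMs: number }, transfer: ArrayBuffer[]) => void
): void {
  const particles = new Float32Array(msg.buffer);
  const t0 = performance.now();
  stepPhysics(particles, msg.dt);
  const physicsMs = performance.now() - t0;
  // Listing the buffer as a Transferable moves ownership back, zero-copy.
  post({ buffer: msg.buffer, physicsMs }, [msg.buffer]);
}

// In the actual worker file:
// self.onmessage = (e) => handleMessage(e.data, (r, t) => self.postMessage(r, t));
```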
The main thread sets up the worker and handles responses:
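A sketch of the main-thread side of the ping-pong. To keep it self-contained and testable here, the worker sits behind a minimal injected interface; all names are assumptions:

```typescript
interface WorkerLike {
  postMessage(msg: { buffer: ArrayBuffer; dt: number }, transfer: ArrayBuffer[]): void;
  onmessage: ((e: { data: { buffer: ArrayBuffer; physicsMs: number } }) => void) | null;
}

class PhysicsBridge {
  private buffer: ArrayBuffer | null;   // null while the worker owns it
  private busy = false;
  lastPhysicsMs = 0;

  constructor(private worker: WorkerLike, initial: ArrayBuffer,
              private onState: (view: Float32Array) => void) {
    this.buffer = initial;
    worker.onmessage = (e) => {
      this.buffer = e.data.buffer;      // ownership is back on the main thread
      this.lastPhysicsMs = e.data.physicsMs;
      this.busy = false;
      this.onState(new Float32Array(this.buffer)); // snapshot for the render loop
    };
  }

  dispatch(dt: number): void {
    if (this.busy || this.buffer === null) return; // worker still owns the buffer
    this.busy = true;
    const buf = this.buffer;
    this.buffer = null;                 // our reference detaches after transfer
    this.worker.postMessage({ buffer: buf, dt }, [buf]);
  }
}
```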
One important detail. The rendering loop on the main thread runs independently of the worker. It keeps drawing the last known particle positions at 60fps regardless of how long the worker takes. If the worker takes longer than one frame (16.7ms), the simulation effectively runs at a lower tick rate, but the animation remains smooth. The user never sees a frozen screen.
Architecture
With spatial partitioning for performance and a Web Worker for responsiveness, we have the building blocks of a production simulation, but the code also needs to be maintainable. In this section we look at the architectural patterns that keep the system organized and predictable.
The component architecture
The simulation is organized in a three-layer architecture:
- Component layer, the Angular component that handles UI controls (play/pause, particle count slider, screenshot button) and binds to inputs
- Service layer, using Angular Signals for reactive state and wrapping the lifecycle in a finite state machine
- Renderer + Backend layer, covering the animation loop, performance tracking, and a pluggable backend interface that encapsulates all rendering and physics
The key abstraction is ISimulationBackend. Both the WebGL backend (which uses a Web Worker for CPU physics and WebGL2 for rendering) and the WebGPU backend (which runs everything on the GPU via compute shaders) implement the same interface:
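A sketch of that interface. Beyond submitFrame, the method names are my assumptions based on the lifecycle the text describes:

```typescript
interface ISimulationBackend {
  init(canvas: HTMLCanvasElement, numParticles: number): Promise<void>;
  submitFrame(dt: number): number;       // returns last measured physics time (ms)
  resize(width: number, height: number): void;
  setParticleCount(n: number): void;
  destroy(): void;
}

// Minimal stub showing the contract in use.
class NullBackend implements ISimulationBackend {
  private physicsMs = 0;
  async init(_canvas: HTMLCanvasElement, _numParticles: number): Promise<void> {}
  submitFrame(_dt: number): number { return this.physicsMs; }
  resize(_w: number, _h: number): void {}
  setParticleCount(_n: number): void {}
  destroy(): void {}
}
```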
The renderer picks the best available backend at startup (WebGPU if supported, WebGL2 as fallback) and delegates all work through this interface. No if/else branches, no feature flags. Adding a new backend in the future requires no changes to the renderer or component.
The state machine
A particle simulation has a complex lifecycle. It needs to initialize (possibly async), allocate buffers, idle, run, pause, handle color changes, and clean up on destroy. Without careful management, it is easy to end up with boolean flags like isRunning, isInitialized, isPaused that interact in surprising ways.
Instead, we use a generic finite state machine (FSM). The FSM is defined separately from the particle simulation. It is a reusable utility class that works with any set of states:
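A sketch of such a class. The exact API is an assumption; the design points below refer to the shape shown here:

```typescript
// Generic FSM: T is a closed union of state names; events are open strings.
type Transition<T extends string> = {
  from: T; event: string; to: T;
  guard?: () => boolean;               // transition proceeds only if this returns true
};

class FSM<T extends string> {
  constructor(
    private state: T,
    private transitions: Transition<T>[],
    private hooks: Partial<Record<T, { onEnter?: () => void | Promise<void>;
                                       onExit?: () => void | Promise<void> }>> = {},
  ) {}

  get current(): T { return this.state; }

  async send(event: string): Promise<boolean> {
    const t = this.transitions.find(
      (tr) => tr.from === this.state && tr.event === event && (!tr.guard || tr.guard()),
    );
    if (!t) return false;              // invalid from this state: do nothing
    await this.hooks[this.state]?.onExit?.();
    this.state = t.to;
    await this.hooks[t.to]?.onEnter?.();
    return true;
  }
}
```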
A few things to notice about this design:
- Closed state space, where the type parameter T extends string constrains states to a fixed union type. You cannot accidentally transition to an undefined state.
- Open event space, where events are plain strings, so you can add new events without modifying the FSM class itself.
- Guard conditions, where transitions can have guard functions that must return true for the transition to proceed. This prevents invalid transitions at runtime.
- Async callbacks, as onEnter and onExit can return promises, supporting async initialization.
For the particle simulation, we define 7 states with their transitions:
State diagram:
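In lieu of the diagram, the same information as a transition table. The state and event names here are my reconstruction from the lifecycle described above, not the actual source:

```typescript
// 7 states and their transitions, as plain data. Names are assumptions.
type SimState =
  | "uninitialized" | "initializing" | "idle"
  | "running" | "paused" | "updatingColors" | "destroyed";

const transitions: { from: SimState; event: string; to: SimState }[] = [
  { from: "uninitialized",  event: "INIT",    to: "initializing" },   // async setup
  { from: "initializing",   event: "READY",   to: "idle" },           // buffers allocated
  { from: "idle",           event: "PLAY",    to: "running" },
  { from: "running",        event: "PAUSE",   to: "paused" },
  { from: "paused",         event: "PLAY",    to: "running" },
  { from: "running",        event: "STOP",    to: "idle" },
  { from: "paused",         event: "STOP",    to: "idle" },
  { from: "running",        event: "COLORS",  to: "updatingColors" }, // color change
  { from: "updatingColors", event: "DONE",    to: "running" },
  { from: "idle",           event: "DESTROY", to: "destroyed" },      // cleanup
];
```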
Every lifecycle operation (play(), pause(), stop(), resetSimulationData()) is just a call to fsm.send(eventName). If the transition is not valid from the current state, nothing happens. This makes the system predictable. You never end up in an impossible state, and you never need to check multiple boolean flags.
Backend abstraction in practice
WebGLBackend.submitFrame() is called 60 times per second by the renderer.
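A sketch of it. The worker and draw call are injected so the sketch stays self-contained; member names are assumptions:

```typescript
class WebGLBackendSketch {
  private workerBusy = false;
  private buffer: ArrayBuffer | null;
  private drawBuffer: Float32Array;
  private lastPhysicsMs = 0;

  constructor(
    private worker: { postMessage(m: unknown, t: ArrayBuffer[]): void },
    private drawParticles: (data: Float32Array) => void,
    numParticles: number,
  ) {
    this.buffer = new ArrayBuffer(numParticles * 6 * 4);   // 6 floats per particle
    this.drawBuffer = new Float32Array(numParticles * 6);
  }

  submitFrame(dt: number): number {
    // 1. Dispatch physics (fire-and-forget), but only if the worker is free
    //    and the main thread currently owns the particle buffer.
    if (!this.workerBusy && this.buffer !== null) {
      this.workerBusy = true;
      const buf = this.buffer;
      this.buffer = null;                                  // detaches on transfer
      this.worker.postMessage({ buffer: buf, dt }, [buf]);
    }
    // 2. Always draw the last known state, regardless of worker progress.
    this.drawParticles(this.drawBuffer);
    return this.lastPhysicsMs;
  }

  // Called from the worker's onmessage handler when results arrive.
  onWorkerDone(buffer: ArrayBuffer, physicsMs: number): void {
    this.buffer = buffer;
    this.lastPhysicsMs = physicsMs;
    this.drawBuffer.set(new Float32Array(buffer));         // copy into draw buffer
    this.workerBusy = false;
  }
}
```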
Notice how this method does two independent things. It dispatches physics to the worker (if the worker is not already busy), and it draws the last known state to the screen. The worker is fire-and-forget, so when it finishes, its onmessage handler copies the updated positions into the draw buffer, ready for the next render pass.
The WebGPU backend has the same submitFrame signature but a completely different implementation. It dispatches compute shaders on the GPU and uses ping-pong buffers instead of a worker. The renderer does not care; it calls the same interface.
GPU compute with WebGPU
The CPU path is essentially saturated. Even with spatial partitioning and a Web Worker, force computation is still sequential on a single thread. A modern GPU has thousands of shader cores sitting idle while the worker runs through the physics. The WebGPU backend moves the entire physics pipeline onto the GPU. The spatial grid build, force accumulation, and motion integration all run as compute shaders with no Web Worker involved.
Particle data lives in GPU storage buffers and never leaves the device each frame. Instead of a shared Float32Array passed between threads, we use a Structure-of-Arrays layout: separate ping-pong buffers for positions (vec2f) and velocities (vec2f), plus immutable per-particle buffers for type (u32), alpha, and dampFactor (precomputed pow(damping, dt)). Add a force accumulation buffer and three grid buffers (cellCount, cellStart, cellParticles), and the entire simulation state lives in GPU memory. Each frame dispatches six compute passes in sequence:
- CLEAR, which zeros the cellCount buffer via clearBuffer
- COUNT, one thread per particle, atomically incrementing its cell counter
- PREFIX_SUM, a single thread converting counts to CSR start offsets and resetting cellCount for reuse
- SCATTER, one thread per particle, writing its index into cellParticles
- FORCE, one thread per particle, walking the 3×3 neighbourhood and accumulating forces
- INTEGRATE, one thread per particle, applying forces and advancing position
The first four passes rebuild the spatial grid. They are the same counting sort from the Web Worker, just written in WGSL (WebGPU Shading Language) and running on thousands of cores instead of one. The COUNT pass uses atomicAdd because multiple particles can land in the same cell simultaneously:
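A sketch of the COUNT shader. The binding layout and names are assumptions:

```wgsl
// COUNT pass: one thread per particle. cellCount is an array of atomics because
// many threads may increment the same cell in the same dispatch.
@group(0) @binding(0) var<storage, read> positions: array<vec2f>;
@group(0) @binding(1) var<storage, read_write> cellCount: array<atomic<u32>>;

struct Params { cellSize: f32, gridW: u32, gridH: u32, numParticles: u32 }
@group(0) @binding(2) var<uniform> params: Params;

@compute @workgroup_size(64)
fn count(@builtin(global_invocation_id) gid: vec3u) {
  let i = gid.x;
  if (i >= params.numParticles) { return; }
  let p = positions[i];
  let cx = u32(p.x / params.cellSize);
  let cy = u32(p.y / params.cellSize);
  let cell = cy * params.gridW + cx;
  atomicAdd(&cellCount[cell], 1u);   // safe under concurrent increments
}
```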
The FORCE pass follows the same 3×3 neighbourhood walk and 9-slot dedup buffer as the CPU worker. The loop structure differs in one important way, though: each thread processes all j != i, not just j > i:
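A sketch of that inner loop in WGSL. Variable names and the forceMag helper are assumptions; world holds the simulation dimensions $(W, H)$:

```wgsl
// FORCE pass inner loop for one candidate cell: thread i writes only forces[i].
var f = vec2f(0.0);
for (var k = cellStart[cell]; k < cellStart[cell + 1u]; k++) {
  let j = cellParticles[k];
  if (j == i) { continue; }                 // all neighbours, not just j > i
  var d = positions[j] - positions[i];
  d -= world * round(d / world);            // shortest path across the torus
  let dist = length(d);
  if (dist == 0.0 || dist > rMax) { continue; }
  // Only this thread's own row of the force matrix is needed.
  let g = forceMatrix[typeOf[i] * numTypes + typeOf[j]];
  f += (d / dist) * forceMag(g, dist);      // accumulate force on particle i
}
forces[i] = f;                              // no other thread writes this slot
```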
This is a fundamental structural difference from the CPU worker, not just a translation to WGSL. Here is a side-by-side comparison of the two loop strategies:
| | CPU Worker | GPU FORCE Pass |
|---|---|---|
| Loop condition | j > i (pair-once) | j != i (all neighbours) |
| Matrix lookups per pair | 2, one per direction | 1, own row only |
| Force accumulated for | both particles i and j | particle i only |
| How j's force gets computed | same thread, same iteration | independently by thread j |
| Pairs visited | once | twice, once from each side, each result into its own slot (no double-counting) |
The CPU worker uses the pair-once trick because sequential compute is expensive. Skipping half the force evaluations matters on a single thread. Writing to both netForce[i] and netForce[j] in the same sequential loop is safe.
On the GPU, each thread i accumulates a local force (fx, fy) and writes only to forces[i], its own slot. No other thread touches forces[i]. This means thread i examining pair (i, j) and thread j examining the same pair are computing two different quantities. Thread i computes the force on i from j, and thread j computes the force on j from i. Each goes into a separate slot, so there is no double-counting and no division by two is needed. Writing to a neighbour's slot from multiple threads would require atomic floating-point additions, which are expensive and not natively supported for f32 in WGSL. Letting each thread own only its result avoids the problem entirely. Both threads independently compute the same distance, but since they run in parallel, this does not add to wall-clock latency.
The submitFrame method is simpler than the WebGL version. It encodes all six dispatches into a single command buffer, submits it to the GPU queue, and returns immediately. Timing is measured asynchronously via onSubmittedWorkDone:
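A sketch of that control flow. To keep it self-contained here, the GPU queue sits behind a minimal injected interface; in the real backend the six compute passes are encoded where the comment indicates. All names are assumptions:

```typescript
interface GpuQueueLike {
  submit(): void;
  onSubmittedWorkDone(): Promise<void>;
}

class WebGPUBackendSketch {
  private isComputePending = false;
  private lastPhysicsMs = 0;
  framesComputed = 0;

  constructor(private queue: GpuQueueLike, private draw: () => void) {}

  submitFrame(_dt: number): number {
    if (!this.isComputePending) {
      this.isComputePending = true;
      // Real backend: encode CLEAR, COUNT, PREFIX_SUM, SCATTER, FORCE, INTEGRATE
      // into one command buffer here, then submit it.
      this.queue.submit();
      const t0 = performance.now();
      this.queue.onSubmittedWorkDone().then(() => {   // async GPU fence, non-blocking
        this.lastPhysicsMs = performance.now() - t0;
        this.isComputePending = false;
        this.framesComputed++;
      });
    }
    this.draw();                   // always redraw the last known positions
    return this.lastPhysicsMs;     // previous frame's timing
  }
}
```

The isComputePending check is the back-pressure guard described below: while a compute pass is in flight, new frames only redraw.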
One subtlety is back-pressure. The GPU command queue is not infinite. If we encode a new frame before the previous compute pass has finished, the queue grows unbounded. The isComputePending flag guards against this. When the previous frame has not completed yet, we skip the physics update and only redraw with the last known particle positions, keeping the queue bounded.
The other subtlety is timing. submitFrame returns the previous frame's GPU time immediately, so the renderer's requestAnimationFrame loop is never blocked waiting for a GPU fence. The one-frame lag is invisible to the user but keeps the main thread free.
Because force computation is now spread across thousands of GPU threads instead of a single worker thread, the per-frame physics cost drops dramatically, and the particle ceiling rises accordingly.
Results & interactive demo
Below you can see both backends side by side at the same particle count (N = 5k, periodic boundaries). The left forces the WebGL + Web Worker path so physics runs on the CPU. The right auto-selects the best available backend, choosing WebGPU where your browser supports it. Look at the physics time reported in the stats overlay to compare the two. Try tapping each to pause or resume.
Full-scale demo with all controls available. Try pushing the particle count above N = 10k:
On an M4 Pro Mac Mini the WebGL + Worker backend handles 10,000–25,000 particles at 60 fps. WebGPU on the same machine holds a stable 25,000 particles, reaching 50,000 before the frame budget is exceeded. On an iPhone 13 Pro the WebGL path reaches a few thousand particles. The WebGPU path holds a stable 10,000. These are roughly the limits of the current architecture given the 16.7ms frame budget at 60fps.
The key takeaways:
- Spatial partitioning broke the $O(N^2)$ barrier, giving roughly a 10× improvement in the physics step by reducing it to $O(N)$
- Web Workers decoupled physics from rendering, keeping the UI smooth even when physics takes longer than one frame
- The state machine eliminated an entire class of bugs, including impossible states and race conditions during initialization, and made the lifecycle explicit
- Backend abstraction let us add a WebGPU path without touching the renderer or component
- WebGPU compute shaders parallelised the spatial grid and force computation across thousands of GPU cores, raising the stable ceiling to 25,000 on desktop and 10,000 on mobile
Wrapping up
We went from ~2,000 particles at 60 fps (Part 1) to ~25,000 stably with WebGPU on desktop, through a chain of independent improvements. Spatial partitioning broke the $O(N^2)$ barrier, a Web Worker kept the main thread responsive, a state machine and backend abstraction kept the code maintainable, and WebGPU compute shaders moved the whole pipeline onto thousands of GPU cores.
There are still directions left to explore, such as a true parallel GPU prefix-sum using workgroup shared memory, 3D extensions, or real-time force-matrix editing with visual feedback. The fundamentals are here. If you want to dig into the full source, the simulation runs directly on this page and in the Sketches section.