ShiftingSands — Architecture

The app follows MVVM. The SwiftUI ContentView observes a TimerViewModel, which drives a SceneKit scene via a UIViewRepresentable bridge. All hourglass geometry is procedurally generated, with no imported 3D models.

There are three physics/rendering modes: CPU (O(N²) on one thread), GPU (Metal compute + SceneKit node readback), and Metal instanced (Metal compute + mesh expansion, zero CPU readback). Gravity scales inversely with duration as max(5.0/duration, 0.15): full gravity at 5s, reduced for longer durations. A random colors toggle enables per-particle colour variation via random-hue materials (CPU/GPU) or a GPU color buffer (Metal).

CLI launch arguments (-mode, -count, -size, -dumpspawn) enable automated testing. A ShiftingSandsTests target has 16 tests covering the CPU and GPU physics engines.

    graph TD
        A["ContentView
SwiftUI overlay + color toggle"] -->|"@StateObject"| B["TimerViewModel
@MainActor"]
        A --> C["HourglassSceneView
UIViewRepresentable"]
        C -->|"makeCoordinator()"| D["Coordinator
SCNSceneRendererDelegate"]
        D -->|"owns"| E["HourglassScene
SCNScene builder"]
        E -->|"glass from"| F["SandGeometry
dynamic profiles"]
        E -->|"CPU physics"| G["GranularSimulation
O(N squared) CPU"]
        E -->|"GPU/Metal physics"| H["MetalPhysicsEngine
Metal compute + mesh expansion
+ color buffer"]
        E -->|"thread safety"| LOCK["NSLock
particleLock"]
        B -->|"particleCount, duration,
randomColors, physicsMode
(UserDefaults)"| D
        CLI["CLI Args
-mode, -count,
-size, -dumpspawn"] -.->|"override on launch"| B
        TESTS["ShiftingSandsTests
16 tests: CPUPhysicsTests (10)
GPUPhysicsTests (6)"] -.->|"tests"| G
        TESTS -.->|"tests"| H

        style A fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style B fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style C fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style D fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style E fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style F fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style G fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style H fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style LOCK fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style CLI fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style TESTS fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0

2. Scene Node Hierarchy

All hourglass geometry lives under a container node so the entire hourglass can rotate during the flip animation. Camera and lights stay fixed on the root node. CPU/GPU modes use N individual sphere nodes (with optional per-node random-hue materials when random colors is enabled); Metal mode uses a single mesh node with per-vertex color.

    graph TD
        ROOT["scene.rootNode"] --> HC["hourglassContainer
rotates during flip"]
        ROOT --> CAM["cameraNode
FOV 40, z=3.0"]
        ROOT --> LIGHTS["5 light nodes
key=120, front=100, fill=60,
rim=60, bottom=8, env=0.08"]

        HC --> OG["outerGlassNode
Blinn material, visible"]
        HC --> IG["innerGlassNode
invisible, physics collider"]
        HC --> TC["topCapNode"]
        HC --> BC["bottomCapNode"]
        HC --> PC["particlesContainer
CPU/GPU modes"]
        PC --> S1["sphere node 1"]
        PC --> S2["sphere node 2"]
        PC --> SN["sphere node N
shared or per-node SCNSphere
+ sand Lambert (random hue optional)"]
        HC --> RN["rendererNode
Metal mode only"]
        RN --> MG["SCNGeometry from MTLBuffer
icosahedron mesh per particle (42 verts)
+ per-vertex color"]

        style ROOT fill:#241e17,stroke:#d4a853,color:#e8ddd0
        style HC fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style CAM fill:#241e17,stroke:#8b6e3a,color:#e8ddd0
        style LIGHTS fill:#241e17,stroke:#8b6e3a,color:#e8ddd0
        style OG fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style IG fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style TC fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style BC fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style PC fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style S1 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style S2 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style SN fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style RN fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style MG fill:#2e2620,stroke:#d4a853,color:#e8ddd0

3. Timer State Machine

The timer uses a two-phase start: tapping Start triggers a 1.2-second flip animation, and the actual countdown only begins when the animation completes. Particles tumble with real physics during the flip.

    stateDiagram-v2
        [*] --> Resting: App launch
        Resting --> Flipping: startFlip()
        Flipping --> Running: completeFlip()
        Running --> Complete: elapsed >= duration
        Complete --> Resting: reset() respawns particles
        Running --> Resting: reset() respawns particles
        Flipping --> Resting: reset()

        Resting: particles settled at bottom
        Resting: all controls visible and enabled
        Resting: presets — 5s, 10s, 30s
        Resting: duration 5-30s (default 5s)
        Resting: physics mode picker (CPU/GPU/Metal)
        Flipping: isFlipping=true
        Flipping: rebuilds particles if settings changed
        Flipping: 180-degree rotation, real physics tumble
        Flipping: top-right controls always enabled
        Running: isRunning=true
        Running: particles flow through neck
        Running: gravity scales with duration
        Running: mode change triggers restart
        Complete: isComplete=true
        Complete: particles at rest
        Complete: respawns particles on transition
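
The transitions in the diagram can be sketched as a small pure function. This is only an illustration under stated assumptions: the names TimerPhase, TimerEvent and nextPhase are hypothetical and not taken from the ShiftingSands source, where the state is held in the isFlipping / isRunning / isComplete flags of TimerViewModel.

```swift
// Hypothetical types for illustration; the app itself tracks state with
// isFlipping / isRunning / isComplete on TimerViewModel.
enum TimerPhase: Equatable {
    case resting, flipping, running, complete
}

enum TimerEvent {
    case startFlip
    case completeFlip
    case tick(elapsed: Double, duration: Double)
    case reset
}

/// Returns the next phase, or the current phase if the event does not apply.
func nextPhase(from phase: TimerPhase, on event: TimerEvent) -> TimerPhase {
    switch (phase, event) {
    case (.resting, .startFlip):
        return .flipping                 // tap Start: begin the 1.2s flip
    case (.flipping, .completeFlip):
        return .running                  // countdown starts only after the flip
    case (.running, .tick(elapsed: let elapsed, duration: let duration))
        where elapsed >= duration:
        return .complete
    case (_, .reset):
        return .resting                  // reset is allowed from any phase
    default:
        return phase
    }
}
```

Keeping the transition table in one function makes the two-phase start easy to check: a completeFlip event that arrives while resting is simply ignored.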

4. CPU Physics Engine

Each frame, the CPU physics engine runs multiple substeps (4 in the diagram below). Within each substep it applies gravity, updates positions, resolves collisions (wall, floor/ceiling, sphere-sphere), and then applies neck friction and velocity-dependent damping. The substep count adapts to particle count to protect frame rate.

    graph TD
        subgraph "Per Frame (120fps)"
            DT["Compute dt
clamped to 1/30s max"]
            EULER["Read container euler angle
presentation.eulerAngles.x"]
        end

        subgraph "Per Substep (4x)"
            GRAV["Apply gravity
gravity=max(5.0/duration, 0.15),
rotated by container angle"]
            POS["Update positions
p += v * subDt"]
            WALL["Wall collision
radial vs innerRadiusAt(y)"]
            FLOOR["Floor/ceiling bounds
Y clamp with restitution"]
            SPHERE["Sphere-sphere collision
O(N squared) sequential"]
            NECK["Neck friction zone
neckDamping=duration*0.02
where abs(y) < 0.10"]
            DAMP["Velocity-dependent damping
flow=0.92, settle=0.05
blend by speed/0.15"]
        end

        SYNC["Sync SCNNode positions"]

        DT --> EULER --> GRAV --> POS --> WALL --> FLOOR --> SPHERE --> NECK --> DAMP
        DAMP -->|"repeat substeps"| GRAV
        DAMP --> SYNC

        style DT fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style EULER fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style GRAV fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style POS fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style WALL fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style FLOOR fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style SPHERE fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style NECK fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style DAMP fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style SYNC fill:#2e2620,stroke:#d4a853,color:#e8ddd0

CPU limit: capped at 250 particles. The O(N²) collision pass runs smoothly at 120fps on A17 Pro. Use GPU mode for higher counts (up to 10,000) or Metal mode (up to 50,000).

Gravity scaling: gravity is set per frame by the Coordinator as max(5.0/duration, 0.15). At 5s: 1.0 (full). At 30s: ~0.17.

Damping: there are two layers, neck friction near the constriction and velocity-dependent damping (a blend between flow and settle based on speed). The CPU’s sequential collision converges naturally, so no sleep system is needed.

Spawn overflow: the hex lattice fills from the bottom up to y=-0.10 (below the neck friction zone, which spans abs(y) < 0.10); at low counts it stops early. Particles that cannot fit in the hex lattice are placed randomly in the lower chamber (y=-0.48 to -0.10), never near the neck or in the upper chamber.
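
The per-frame tuning values quoted above are simple enough to sketch directly. The gravity and neck-damping formulas come from the text and diagram; the exact shape of the damping blend is not given, so the linear blend in dampingFactor is an assumption, as are all three function names.

```swift
import Foundation

/// Gravity scale as quoted in the text: max(5.0 / duration, 0.15).
func gravityScale(duration: Double) -> Double {
    return max(5.0 / duration, 0.15)
}

/// Neck friction coefficient as quoted in the diagram: duration * 0.02.
func neckDamping(duration: Double) -> Double {
    return duration * 0.02
}

/// Velocity-dependent damping for one substep.
/// ASSUMPTION: a linear blend between the settle base (0.05) and the flow
/// base (0.92), driven by speed / 0.15, then raised to the power subDt.
func dampingFactor(speed: Double, subDt: Double) -> Double {
    let flow = 0.92
    let settle = 0.05
    let t = min(max(speed / 0.15, 0.0), 1.0)   // 0 = nearly at rest, 1 = flowing
    let base = settle + (flow - settle) * t
    return pow(base, subDt)
}
```

With these definitions, a 5s timer gets a gravity scale of 1.0 and a 30s timer gets 5/30, roughly 0.17. The 0.15 floor only takes effect for durations above about 33s, which is outside the 5-30s range the UI offers.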

5. GPU Physics Engine (Metal Compute)

The GPU engine parallelises the same physics across Metal compute threads. Each thread handles one particle: reads all others from buffer A, resolves all collisions, writes result to buffer B. Buffers swap after each substep. Double-buffering prevents race conditions in parallel collision resolution.

    graph TD
        subgraph "Per Frame (120fps)"
            DT2["Compute dt, euler angle"]
        end

        subgraph "Per Substep (adaptive: 4/3/2/1)"
            UNI["Fill PhysicsUniforms
gravity=max(5.0/duration, 0.15),
neckDamping=duration*0.02,
neckHalfHeight=0.10,
damping, subDt, euler"]
            ENC["Encode physicsStep kernel
bufferA read, bufferB write"]
            DISP["Dispatch N threads
256 per threadgroup"]
            WAIT["waitUntilCompleted"]
            SWAP["Swap bufferA / bufferB"]
        end

        subgraph "Per Thread (one particle)"
            TSLEEP_CHK["Two-tier sleep check
counter > 30: deep sleep
(staggered every 30 frames:
verify floor or nearby particle
at 1.05x dist, wake if none)
counter 16-30: light sleep
(O(N) scan: wake only if
non-sleeping neighbor approaching)"]
            TGRAV["Apply rotated gravity"]
            TPOS["Update position"]
            TWALL["Wall collision
256-entry profile lookup"]
            TFLOOR["Floor/ceiling clamp
+ resting contact: zero vel.y
if abs(vel.y) < gravity*subDt*2"]
            TSPHERE["Chamber-asymmetric collision
O(N squared): position correction
0.25 everywhere; velocity impulse
0.25 lower chamber only
(upper: skip impulse, gravity drains)
tracks contactCount for sleep"]
            TNECK["Neck friction
neckDamping where abs(y) < 0.10"]
            TDAMP["Chamber-dependent damping
upper: flow only pow(0.92, subDt)
lower: blend flow/settle by speed"]
            TCUTOFF["Velocity cutoff
lower chamber + hasSupport only:
snap to zero if speed < 0.01
free-falling particles never frozen
(upper must accumulate gravity)"]
            TSLEEP["Sleep counter update
upper chamber: always 0 (awake)
lower chamber + hasSupport:
speed < 0.08 increments counter
no contacts: counter stays 0"]
        end

        READ["readPositions() -> [GPUParticle]
safe copy from bufferA"]
        SYNC2["Sync SCNNode positions
under particleLock"]

        DT2 --> UNI --> ENC --> DISP --> WAIT --> SWAP
        SWAP -->|"repeat substeps"| UNI
        SWAP --> READ --> SYNC2

        DISP -.->|"each thread"| TSLEEP_CHK --> TGRAV --> TPOS --> TWALL --> TFLOOR --> TSPHERE --> TNECK --> TDAMP --> TCUTOFF --> TSLEEP

        style DT2 fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style UNI fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style ENC fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style DISP fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style WAIT fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style SWAP fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style TGRAV fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style TPOS fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style TWALL fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style TFLOOR fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style TSLEEP_CHK fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style TSPHERE fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style TNECK fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style TDAMP fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style TCUTOFF fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style TSLEEP fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style READ fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style SYNC2 fill:#2e2620,stroke:#d4a853,color:#e8ddd0
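
The double-buffering idea can be shown on the CPU with plain arrays. This is a deliberately reduced sketch (one dimension, position correction only), and the Particle type and substep function are hypothetical stand-ins for the Metal kernel and its MTLBuffers.

```swift
// Hypothetical 1D stand-in for GPUParticle, for illustration only.
struct Particle {
    var x: Double
    var radius: Double
}

/// One substep: each particle reads only from `read` and writes only its
/// own slot in `write`, which mirrors one GPU thread per particle.
func substep(read: [Particle], write: inout [Particle]) {
    for i in read.indices {
        let me = read[i]
        var correction = 0.0
        for j in read.indices where j != i {
            let d = me.x - read[j].x
            let overlap = (me.radius + read[j].radius) - abs(d)
            if overlap > 0 {
                // 0.25 position correction, as quoted for the GPU engine.
                correction += (d >= 0 ? 1.0 : -1.0) * overlap * 0.25
            }
        }
        write[i] = Particle(x: me.x + correction, radius: me.radius)
    }
}

var bufferA = [Particle(x: 0.0, radius: 0.5), Particle(x: 0.6, radius: 0.5)]
var bufferB = bufferA
substep(read: bufferA, write: &bufferB)
swap(&bufferA, &bufferB)   // the freshly written data is the next substep's input
```

Because no particle ever reads a slot that another particle is writing in the same substep, the result does not depend on the order in which the loop (or the GPU scheduler) visits particles.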

Performance: GPU mode supports up to 10,000 particles. Adaptive substeps: 4 for ≤1500, 3 for 1501-3000, 2 for 3001-10000, 1 for >10000 (a range only Metal mode, with its 50,000 cap, can reach). Default: GPU mode with 5,000 particles. Per-mode particle count is persisted via UserDefaults. Particles are spawned in a hex close-packed lattice (GranularSimulation.packedPositions).

Chamber-asymmetric collision: this is the critical fix for parallel GPU physics. In packed piles, parallel collision impulses cancel gravity, because each thread applies its impulse independently, unlike the sequential CPU engine where forces propagate naturally. The lower chamber (pos.y < 0) gets the full 0.25 position correction + 0.25 velocity impulse for stable resting piles. The upper chamber (pos.y > 0) gets the 0.25 position correction but NO velocity impulse, because gravity must dominate to drain the packed pile through the neck.

Chamber-dependent damping: the upper chamber uses flow-only damping (pow(0.92, subDt)) to preserve natural fall; the lower chamber blends between flow and aggressive settle damping (pow(0.05, subDt)) based on speed.

Gravity scaling: gravity is not constant; it scales inversely with duration as max(5.0/duration, 0.15). At 5s: full gravity (1.0). At 30s: reduced (~0.17). Both CPU and GPU engines receive the scaled gravity value per frame from the Coordinator.

Contact-based sleep: the GPU sleep/freeze system tracks contactCount during the O(N²) collision loop. Only particles with contacts (hasSupport = contactCount > 0) can have their velocity zeroed or enter sleep. Free-falling particles (zero contacts) never freeze, which prevents mid-air freezing artifacts.

Velocity cutoff (lower chamber + hasSupport only): velocity snaps to zero if speed < 0.01, but only when the particle is in the lower chamber AND has contacts. Upper-chamber particles must accumulate gravity to begin draining.

Sleep counter (lower chamber + hasSupport only): upper-chamber particles always have counter=0 (always awake, must drain). The lower chamber uses threshold 0.08 but only increments the counter when hasSupport is true.

Two-tier sleep system: counter 16–30 is “light sleep”, an O(N) wake-up check that wakes the particle only if a non-sleeping neighbor is approaching fast. sleepContactCount was removed from light sleep because it caused oscillation/blur through the glass; support-loss detection is deferred to deep sleep. Counter >30 is “deep sleep”, a staggered support check every 30 frames, offset by thread ID, so ~N/30 particles check per frame. Floor contact counts as support via a cheap check; otherwise the thread scans for nearby particles within 1.05× touching distance. If no support is found, the particle wakes and falls within ~0.5s. Staggering reduces deep sleep overhead from O(N²) to O(N²/30) per frame. snapAfterFlip also resets all counters to 0.

Floor/ceiling resting contact: vel.y is zeroed if abs(vel.y) < gravity * subDt * 2.0, preventing micro-bouncing on flat surfaces.
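
The thresholds above can be collected into a few small helpers. The substep bands and sleep counter ranges are taken from the text; the function names, the SleepTier enum and the exact staggering formula ((frame + threadID) % 30) are assumptions made for this sketch.

```swift
/// Adaptive substep count: 4 for <=1500, 3 for 1501-3000,
/// 2 for 3001-10000, 1 above that.
func substeps(forParticleCount n: Int) -> Int {
    switch n {
    case ...1500:      return 4
    case 1501...3000:  return 3
    case 3001...10000: return 2
    default:           return 1
    }
}

// Hypothetical enum naming the two sleep tiers described in the text.
enum SleepTier {
    case awake, light, deep
}

/// Counter 16-30 is light sleep, counter > 30 is deep sleep.
func sleepTier(counter: Int) -> SleepTier {
    if counter > 30 { return .deep }
    if counter >= 16 { return .light }
    return .awake
}

/// Staggered deep-sleep support check.
/// ASSUMPTION: the offset is (frame + threadID) % 30; the text only says
/// the check runs every 30 frames, offset by thread ID.
func shouldCheckSupport(frame: Int, threadID: Int) -> Bool {
    return (frame + threadID) % 30 == 0
}
```

For 3,000 deep-sleeping particles, this staggering lets exactly 100 of them run the support scan on any given frame, which is the N/30 figure quoted above.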

6. Metal Instanced Mode (Mesh Expansion)

Metal mode eliminates the CPU readback bottleneck. The same GPU physics runs, then a second compute kernel expands each particle into a subdivided icosahedron mesh (42 verts, 80 faces). SceneKit renders the expanded mesh geometry directly from the GPU buffer — zero CPU copies. Supports up to 50,000 particles. Uses Lambert material with UIColor(white: 0.78) diffuse to match CPU/GPU brightness. When random colors is enabled, a per-particle color buffer feeds into the mesh expansion kernel, producing per-vertex color via SCNGeometrySource(.color).

    graph TD
        subgraph "Per Frame"
            PHYS["Physics Compute
physicsStep kernel
(same as GPU mode)"]
            MESH["Mesh Expansion Compute
expandMeshes kernel
particle → 42 icosahedron verts"]
            GEOM["Rebuild SCNGeometry
SCNGeometrySource(buffer:)
wraps vertex MTLBuffer
+ SCNGeometrySource(.color)"]
        end

        subgraph "Mesh Expansion Per Thread"
            READ_P["Read particle position
from physics bufferA"]
            READ_C["Read particle color
from colorBuffer (uchar4)"]
            WRITE_V["Write MeshVertex
packed_float3 pos + normal
+ uchar4 color = 28 bytes"]
        end

        subgraph "Rendering"
            NODE["rendererNode
child of hourglassContainer"]
            SCNR["SceneKit Standard Pipeline
Lambert material (white: 0.78),
depth, lighting
per-vertex color override"]
        end

        PHYS --> MESH --> GEOM --> NODE --> SCNR
        MESH -.->|"N * 42 threads"| READ_P --> READ_C --> WRITE_V

        style PHYS fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style MESH fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style GEOM fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style READ_P fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style READ_C fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style WRITE_V fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style NODE fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style SCNR fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0

Key difference from GPU mode: GPU mode reads positions back to the CPU (readPositions()) and updates N individual SCNNodes per frame. Metal mode keeps everything on the GPU: the mesh expansion kernel writes vertices that SceneKit reads directly via SCNGeometrySource(buffer:). One geometry node replaces N sphere nodes. The index buffer (icosahedron topology × N particles) is pre-computed and static; only the vertex buffer changes each frame. The material is Lambert with UIColor(white: 0.78) diffuse to match CPU/GPU brightness.

Per-vertex color: MeshVertex is 28 bytes (packed_float3 position + packed_float3 normal + uchar4 color). When random colors is enabled, MetalPhysicsEngine.setupColors(random:) fills a colorBuffer with per-particle uchar4 RGBA values. The expandMeshes kernel reads each particle’s color and writes it into every vertex of that particle’s icosahedron. SceneKit picks up the color via an SCNGeometrySource(.color) semantic on the same vertex buffer.
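
The 28-byte vertex and the static index buffer can be mirrored on the CPU side as follows. This is a sketch under assumptions: the field names and buildIndexBuffer are hypothetical, and Swift does not formally guarantee struct layout, so a real project would more likely share a C header between Swift and Metal.

```swift
/// CPU-side mirror of the 28-byte MeshVertex described above.
/// Six Floats stand in for the two packed_float3 fields (12 bytes each)
/// and four UInt8 values stand in for the uchar4 color (4 bytes).
struct MeshVertex {
    var px: Float, py: Float, pz: Float          // position
    var nx: Float, ny: Float, nz: Float          // normal
    var r: UInt8, g: UInt8, b: UInt8, a: UInt8   // RGBA color
}

let vertsPerParticle = 42      // subdivided icosahedron
let indicesPerParticle = 240   // 80 triangles * 3 indices

/// Replicates one icosphere's triangle indices for every particle,
/// offsetting each copy by that particle's first vertex.
func buildIndexBuffer(baseIndices: [UInt32], particleCount: Int) -> [UInt32] {
    var out: [UInt32] = []
    out.reserveCapacity(baseIndices.count * particleCount)
    for p in 0..<particleCount {
        let offset = UInt32(p * vertsPerParticle)
        for index in baseIndices {
            out.append(index + offset)
        }
    }
    return out
}
```

At the 50,000-particle cap these figures give a vertex buffer of 50,000 × 42 × 28 = 58,800,000 bytes and an index buffer of 50,000 × 240 × 4 = 48,000,000 bytes, which is why only the vertex buffer is rewritten each frame.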

7. Dynamic Glass Geometry

The hourglass glass shape is defined by control points, interpolated into a smooth profile, then revolved around the Y axis. The neck width adjusts dynamically based on particle size so exactly one ball fits through at a time.

    graph TD
        PC["Particle Count + Size
CPU: 50-250, GPU: 50-10k,
Metal: 50-50k, size: 0.5x-1.5x"] -->|"radiusForCount(count, sizeMultiplier)"| PR["Particle Radius
0.030 * (100/N)^(1/3)
* packing correction * sizeMultiplier"]
        PR -->|"r + wall + clearance"| NR["Neck Radius
setNeckRadius()"]
        NR --> CP["11 Control Points
dynamic neck point"]
        CP -->|"Catmull-Rom spline"| SP["~80 Smooth Points"]
        SP -->|"rotate around Y"| SR["Surface of Revolution
64 angular segments"]
        SR --> OG["Outer Glass
Blinn, transparent"]
        SR --> IG["Inner Glass
invisible collider"]
        SP -->|"activeInnerProfile"| IR["innerRadiusAt(y)
wall collision lookup"]
        IR -->|"sample 256 points"| LUT["GPU Profile Table
256-entry MTLBuffer"]

        style PC fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style PR fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style NR fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style CP fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style SP fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style SR fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style OG fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style IG fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style IR fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style LUT fill:#2e2620,stroke:#d4a853,color:#e8ddd0
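
The first two steps of the diagram reduce to two formulas, sketched below. The signature of radiusForCount is guessed from the diagram label; the packing correction and wall thickness values are not given in this document, so they appear as hypothetical parameters. The 0.002 clearance is the value quoted in the Physics Primer.

```swift
import Foundation

/// Particle radius: 0.030 * (100/N)^(1/3) * packing correction * sizeMultiplier.
/// `packingCorrection` is a hypothetical parameter; its real value is not
/// stated in this document, so it defaults to 1.0 here.
func radiusForCount(_ count: Int,
                    sizeMultiplier: Double,
                    packingCorrection: Double = 1.0) -> Double {
    let base = 0.030 * pow(100.0 / Double(count), 1.0 / 3.0)
    return base * packingCorrection * sizeMultiplier
}

/// Neck radius: particle radius + wall thickness + clearance.
/// `wallThickness` has no documented value and must be supplied by the caller.
func neckRadius(particleRadius: Double,
                wallThickness: Double,
                clearance: Double = 0.002) -> Double {
    return particleRadius + wallThickness + clearance
}
```

The cube root keeps the total sand volume roughly constant: multiplying the particle count by 8 halves the radius, so 800 particles come out at 0.015 against 0.030 for 100.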

8. Flip Animation Sequence

The flip exploits the glass's Y-axis symmetry: rotating 180 degrees produces an identical shape. Flip start rebuilds particles only if count/size/color settings changed. Particles tumble with real physics as gravity rotates with the container. Snap-back runs on CPU or GPU depending on active physics mode.

    sequenceDiagram
        participant U as User
        participant VM as TimerViewModel
        participant SV as SceneView
        participant HS as HourglassScene
        participant PHY as Physics Engine

        U->>VM: tap Start
        VM->>VM: startFlip() isFlipping=true
        SV->>HS: flipAndStart(completion)
        HS->>HS: SCNAction.rotateBy(x: pi, 1.2s)
        Note over HS,PHY: Each frame during rotation:
        PHY->>PHY: read presentation.eulerAngles.x
        PHY->>PHY: compute rotated gravity
        PHY->>PHY: step physics (balls tumble)
        HS->>HS: sync node positions (CPU/GPU) or rebuild mesh geometry (Metal)
        Note over HS: rotation completes
        HS->>HS: snap eulerAngles to (0,0,0)
        HS->>PHY: snapAfterFlip() (CPU or GPU kernel)
        HS->>HS: sync positions / rebuild geometry
        HS->>VM: completion -> completeFlip()
        VM->>VM: isRunning=true, start timer task

9. Data Flow

Published properties flow from the ViewModel through the SwiftUI bridge into SceneKit. The render delegate dispatches to CPU, GPU, or Metal instanced physics and syncs positions every frame. Per-mode particle counts, duration, and random colors preference are persisted independently in UserDefaults. The Coordinator computes gravity scaling (max(5.0/duration, 0.15)) each frame and sets it on both CPU and GPU engines. CLI launch arguments (-mode, -count, -size, -dumpspawn) override persisted settings on launch for automated testing. Manual Reset respawns particles; natural completion leaves them at rest. Flip start rebuilds only if settings changed.

    graph TD
        UD["UserDefaults
physicsMode, duration,
randomColors,
particleCount_CPU/GPU/Metal,
particleSize_CPU/GPU/Metal"]
        CLI2["CLI Args
-mode, -count, -size,
-dumpspawn"]
        VM["TimerViewModel
@Published: duration (5-30s),
elapsed, isRunning, isFlipping,
isComplete, particleCount,
particleSizeMultiplier,
physicsMode, randomColors"]

        UD -->|"load on init"| VM
        CLI2 -.->|"override on launch"| VM
        VM -->|"save on change"| UD
        VM -->|"SwiftUI binding"| CV["ContentView
top-left: color toggle
top-right: mode picker + particle slider
+ size slider
bottom-left: readout
bottom-right: duration slider (5-30s)
+ presets (5s/10s/30s) + start/reset"]
        VM -->|"updateUIView()"| COORD["Coordinator
tracks physicsMode,
isFlipping, particleCount,
sizeMultiplier, randomColors,
currentDuration,
sets neckDamping + gravityScale
(max(5.0/duration, 0.15)) each frame"]
        COORD -->|"renderer(updateAtTime:)"| HS["HourglassScene"]
        HS -->|"CPU mode"| CPU["GranularSimulation
O(N squared) single thread
+ SCNNode sync
+ Lambert, 24-color palette"]
        HS -->|"GPU mode"| GPU["MetalPhysicsEngine
Metal compute
+ readPositions() + SCNNode sync
+ Lambert, 24-color palette"]
        HS -->|"Metal mode"| MTL["MetalPhysicsEngine
Metal compute
+ setupColors(random:)
+ expandMeshes() with colorBuffer
+ SCNGeometry with per-vertex color"]

        style UD fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style CLI2 fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style VM fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style CV fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style COORD fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style HS fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style CPU fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style GPU fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style MTL fill:#2e2620,stroke:#d4a853,color:#e8ddd0

10. Metal Compute Architecture

The MetalPhysicsEngine manages double-buffered particle data, a pre-computed profile lookup table, a per-particle color buffer, and dispatches compute kernels via MTLCommandQueue. Three pipelines: physics step, snap-after-flip, and mesh expansion for Metal instanced mode. The color buffer is populated by setupColors(random:) and read by the expandMeshes kernel to produce per-vertex color in each MeshVertex (28 bytes).

    graph TD
        ME["MetalPhysicsEngine"] -->|"owns"| DEV["MTLDevice"]
        ME -->|"owns"| CQ["MTLCommandQueue"]
        ME -->|"owns"| PP["physicsPipeline
MTLComputePipelineState"]
        ME -->|"owns"| SP2["snapPipeline
MTLComputePipelineState"]
        ME -->|"owns"| MP["meshExpansionPipeline
MTLComputePipelineState"]

        ME -->|"double buffer"| BA["bufferA
MTLBuffer shared"]
        ME -->|"double buffer"| BB["bufferB
MTLBuffer shared"]
        ME -->|"wall profile"| PB["profileBuffer
256 floats"]
        ME -->|"per-particle color"| CB["colorBuffer
N * uchar4 RGBA"]
        ME -->|"mesh output"| VB["meshVertexBuffer
N * 42 MeshVertex (28B each)"]
        ME -->|"static indices"| IB["meshIndexBuffer
N * 240 UInt32"]

        BA -->|"read"| K["physicsStep kernel"]
        BB -->|"write"| K
        PB -->|"lookup"| K

        BA -->|"read"| MK["expandMeshes kernel"]
        CB -->|"read"| MK
        VB -->|"write"| MK

        BA -->|"GPU mode only"| COPY["readPositions()
safe copy to CPU"]
        K -->|"per thread"| T["GPUParticle
float4 pos+r
float4 vel+pad"]
        CB -.->|"setupColors(random:)"| SC["Per-particle RGBA
random hue or golden sand"]

        style ME fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style DEV fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style CQ fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style PP fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style SP2 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style MP fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style BA fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style BB fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style PB fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style CB fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style VB fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style IB fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0
        style K fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style MK fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style T fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style COPY fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style SC fill:#2e2620,stroke:#c47d2e,color:#e8ddd0

11. Three-Mode Comparison

The three modes share the same physics (gravity, collision, neck friction) but differ in where computation happens and how particles are rendered.

    graph TD
        subgraph "CPU Mode (up to 250)"
            C1["CPU Physics
O(N squared) single thread"]
            C2["N SCNNodes
shared or per-node SCNSphere
Lambert material
24-color palette (random colors)"]
            C1 -->|"position sync"| C2
        end

        subgraph "GPU Mode (up to 10k)"
            G1["GPU Physics
Metal compute kernel"]
            G2["readPositions()
CPU copy from bufferA"]
            G3["N SCNNodes
shared or per-node SCNSphere
Lambert material
24-color palette (random colors)"]
            G1 --> G2 -->|"position sync"| G3
        end

        subgraph "Metal Mode (up to 50k)"
            M1["GPU Physics
Metal compute kernel"]
            M2["expandMeshes
Metal compute kernel
+ colorBuffer lookup"]
            M3["1 SCNNode
SCNGeometry from MTLBuffer
Lambert material (white: 0.78)
per-vertex color (28B MeshVertex)"]
            M1 --> M2 -->|"zero copy"| M3
        end

        style C1 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style C2 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style G1 fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style G2 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style G3 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style M1 fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style M2 fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style M3 fill:#2e2620,stroke:#d4a853,color:#e8ddd0

12. Test Mode & Test Target

Launch arguments enable automated testing without manual interaction:

-test: sets a 10-second duration, auto-starts the timer, and on completion dumps all particle positions and velocities to Documents/test_results.txt
-autostart: auto-starts the timer without the data dump
-multicolor: enables the random colors mode on launch
-mode CPU|GPU|Metal: overrides the physics mode
-count N: overrides the particle count
-size X: overrides the particle size multiplier
-dumpspawn: enables spawn position logging
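
One way to read these flags is a single pass over the argument list. The sketch below is not the app's parser: LaunchOverrides and parseLaunchArguments are hypothetical names, and only the flag spellings are taken from the text above.

```swift
// Hypothetical container for the overrides; not taken from the app source.
struct LaunchOverrides {
    var mode: String?
    var count: Int?
    var size: Double?
    var dumpSpawn = false
    var test = false
    var autostart = false
    var multicolor = false
}

/// Parses flags of the form "-name" or "-name value".
/// Unknown arguments are ignored; unparsable values are left as nil.
func parseLaunchArguments(_ args: [String]) -> LaunchOverrides {
    var result = LaunchOverrides()
    var i = 0
    while i < args.count {
        let value: String? = (i + 1 < args.count) ? args[i + 1] : nil
        switch args[i] {
        case "-mode":
            result.mode = value
            i += 1                       // skip the consumed value
        case "-count":
            result.count = value.flatMap { Int($0) }
            i += 1
        case "-size":
            result.size = value.flatMap { Double($0) }
            i += 1
        case "-dumpspawn":  result.dumpSpawn = true
        case "-test":       result.test = true
        case "-autostart":  result.autostart = true
        case "-multicolor": result.multicolor = true
        default:
            break
        }
        i += 1
    }
    return result
}
```

In an app this would typically be fed with CommandLine.arguments, and any non-nil field would replace the value loaded from UserDefaults.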

The ShiftingSandsTests target contains 16 unit tests for the physics engines: 10 in CPUPhysicsTests and 6 in GPUPhysicsTests.

13. Physics Primer

This section explains the mathematical foundations behind ShiftingSands — how shapes are built, how particles move, how collisions work, and how all of this maps onto GPU parallel execution. You only need basic algebra to follow along.

Building the Hourglass Shape

The hourglass isn’t loaded from a 3D model file — it’s built entirely from maths. The process has two stages: define a 2D profile curve, then spin it around an axis to make a 3D surface.

Stage 1: The profile curve. We place 11 control points on a 2D plane that trace the outline of half the hourglass — wide at the top, narrow at the neck, wide at the bottom. But 11 points would give a jagged polygon. To make it smooth, we use Catmull-Rom spline interpolation: a formula that draws a smooth curve passing exactly through each control point. Between any two points, the curve considers the neighbouring points on either side to calculate a gentle, natural-looking path. The result is ~80 smooth points that trace a flowing hourglass silhouette.
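
A minimal sketch of this interpolation, assuming the common uniform Catmull-Rom form and clamped neighbours at the two ends of the profile (the document does not say which variant or end handling the app uses). Point2, catmullRom and smoothProfile are hypothetical names.

```swift
// Hypothetical profile point: r = distance from the axis, y = height.
struct Point2 {
    var r: Double
    var y: Double
}

/// Uniform Catmull-Rom interpolation between p1 and p2 for t in 0...1,
/// with p0 and p3 as the neighbouring points on either side.
func catmullRom(_ p0: Double, _ p1: Double, _ p2: Double, _ p3: Double,
                t: Double) -> Double {
    let t2 = t * t
    let t3 = t2 * t
    let a = 2.0 * p1
    let b = (p2 - p0) * t
    let c = (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t2
    let d = (3.0 * p1 - p0 - 3.0 * p2 + p3) * t3
    return 0.5 * (a + b + c + d)
}

/// Expands control points into a smooth profile that passes through each one.
func smoothProfile(_ control: [Point2], samplesPerSegment: Int) -> [Point2] {
    guard control.count >= 2, samplesPerSegment >= 1 else { return control }
    var out: [Point2] = []
    let last = control.count - 1
    for i in 0..<last {
        // ASSUMPTION: repeat the end points where a neighbour is missing.
        let p0 = control[max(i - 1, 0)]
        let p1 = control[i]
        let p2 = control[i + 1]
        let p3 = control[min(i + 2, last)]
        for s in 0..<samplesPerSegment {
            let t = Double(s) / Double(samplesPerSegment)
            out.append(Point2(r: catmullRom(p0.r, p1.r, p2.r, p3.r, t: t),
                              y: catmullRom(p0.y, p1.y, p2.y, p3.y, t: t)))
        }
    }
    out.append(control[last])
    return out
}
```

With 11 control points and 8 samples per segment this yields 10 × 8 + 1 = 81 points, in line with the ~80 smooth points mentioned above.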

Stage 2: Surface of revolution. Imagine holding that 2D profile vertically and spinning it 360° around a central axis, like a potter’s wheel. Every point on the profile traces a circle. We sample 64 positions around each circle, creating a mesh of triangles that forms the 3D glass surface. For a profile point at height y and distance r from the axis, each of the 64 positions is:

Surface point at angle θ:
x = r × cos(θ)
y = y (unchanged)
z = r × sin(θ)

where θ goes from 0 to 360° in 64 equal steps. This produces ~80 × 64 = 5,120 vertices forming the glass surface.
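
The revolution step translates almost line for line into code. The sketch below only produces vertex positions (no triangles or normals), and Vertex3 and revolve are hypothetical names.

```swift
import Foundation

// Hypothetical vertex type for illustration.
struct Vertex3 {
    var x: Double
    var y: Double
    var z: Double
}

/// Spins each profile point (r, y) around the Y axis in `segments` equal steps.
func revolve(profile: [(r: Double, y: Double)], segments: Int = 64) -> [Vertex3] {
    var vertices: [Vertex3] = []
    vertices.reserveCapacity(profile.count * segments)
    for point in profile {
        for s in 0..<segments {
            let theta = 2.0 * Double.pi * Double(s) / Double(segments)
            vertices.append(Vertex3(x: point.r * cos(theta),
                                    y: point.y,            // height is unchanged
                                    z: point.r * sin(theta)))
        }
    }
    return vertices
}
```

Every vertex generated from a given profile point sits at the same distance r from the axis, which is the property the wall collision check relies on later.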

The dynamic neck adjusts automatically based on particle size. When particle count changes, we recalculate the radius and set the neck just wide enough for exactly one ball to fit through: neckRadius = particleRadius + wallThickness + 0.002. This creates natural single-file flow, just like a real hourglass.

Representing Particles

Each particle (“grain of sand”) is a perfect sphere stored as just three properties:

position = (x, y, z) — where the centre of the sphere is in 3D space
velocity = (vx, vy, vz) — how fast and in which direction it’s moving
radius = r — the size of the sphere

That’s it. Every frame, we update position and velocity using simple equations, and all the complex behaviour — tumbling, piling, flowing — emerges from these updates.

Gravity: Making Things Fall

Gravity pulls particles downward. Newton tells us F = m × g, so the acceleration a = F / m = g is the same for every particle and mass drops out of the update. Each time step (a tiny fraction of a second, called dt), we update velocity then position:

Step 1 — Accelerate: velocity.y = velocity.y − gravity × dt
Step 2 — Move: position = position + velocity × dt

This is Euler integration (strictly, the semi-implicit variant, because the position update uses the velocity that was just changed), one of the simplest ways to simulate motion. Each tick, gravity adds a small downward nudge to velocity, and the particle moves according to its updated velocity. Over many tiny steps, you get a smooth parabolic arc, just like a thrown ball in real life.
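
The two steps can be sketched for a single falling body as below. Body, step and advance are hypothetical names; the 1/30s clamp is the one shown in the CPU engine diagram, and the substep count is passed in rather than assumed.

```swift
// Hypothetical 1D body: height and vertical velocity only.
struct Body {
    var y: Double
    var vy: Double
}

/// One integration step: accelerate first, then move with the new velocity.
func step(_ body: inout Body, gravity: Double, dt: Double) {
    body.vy -= gravity * dt      // Step 1: accelerate
    body.y += body.vy * dt       // Step 2: move
}

/// Advances one frame by splitting it into equal substeps.
func advance(_ body: inout Body, gravity: Double, frameDt: Double, substeps: Int) {
    let dt = min(frameDt, 1.0 / 30.0)        // clamp long frames
    let subDt = dt / Double(substeps)
    for _ in 0..<substeps {
        step(&body, gravity: gravity, dt: subDt)
    }
}
```

Smaller substeps do not change the velocity reached at the end of a frame, but they shorten the distance moved per step, which keeps fast particles from tunnelling through each other or the wall between collision checks.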

During the flip animation, the hourglass container rotates. Gravity always points straight down in the real world, but the particles live inside the rotating container. We transform gravity into the container’s local frame using trigonometry:

Rotated gravity (container tilted by angle θ around the X axis):
gravity_y = −g × cos(θ)
gravity_z = g × sin(θ)

At θ=0 (upright): gravity is purely downward (−g). At θ=90° (sideways): gravity is purely sideways. At θ=180° (flipped): gravity is upward (+g) in the container’s frame, which makes particles “fall” toward what was the top.
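
These two formulas are small enough to check numerically. The function name rotatedGravity is hypothetical, and the sign of the z component follows the formula above (the opposite sign would simply correspond to rotating in the other direction).

```swift
import Foundation

/// Gravity expressed in the container's local frame when the container is
/// tilted by `theta` radians around the X axis.
func rotatedGravity(g: Double, theta: Double) -> (y: Double, z: Double) {
    return (y: -g * cos(theta), z: g * sin(theta))
}
```

In the app, theta would come from the container's presentation.eulerAngles.x, which is read every frame during the flip. The magnitude of the vector stays equal to g at every angle, since cos² + sin² = 1; only its direction changes.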

Collision Detection: Did Two Balls Touch?

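Two spheres overlap when the distance between their centres is less than the sum of their radii. This is the fundamental test behind all the physics:

distance = √((xA − xB)² + (yA − yB)² + (zA − zB)²)
overlap = (radiusA + radiusB) − distance
The spheres are in contact when overlap > 0.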

O(N²) brute force: With N particles, we check every possible pair. That’s N × (N−1) / 2 distance calculations per step. At 250 particles: ~31,000 checks. At 10,000 particles: ~50 million checks. At 50,000: ~1.25 billion. This is where the GPU becomes essential — it can perform millions of checks in parallel.
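
Both the overlap test and the pair count are easy to verify in code. Sphere, overlap and pairCount are hypothetical names used only for this sketch.

```swift
import Foundation

// Hypothetical sphere type: centre plus radius.
struct Sphere {
    var x, y, z, radius: Double
}

/// Positive when the spheres interpenetrate, zero when they just touch,
/// negative when they are apart.
func overlap(_ a: Sphere, _ b: Sphere) -> Double {
    let dx = a.x - b.x
    let dy = a.y - b.y
    let dz = a.z - b.z
    let distance = sqrt(dx * dx + dy * dy + dz * dz)
    return (a.radius + b.radius) - distance
}

/// Number of unique pairs among n particles: n * (n - 1) / 2.
func pairCount(_ n: Int) -> Int {
    return n * (n - 1) / 2
}
```

The exact counts behind the rounded figures above are 31,125 pairs at 250 particles, 49,995,000 at 10,000 and 1,249,975,000 at 50,000.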

Collision Response: What Happens When Balls Collide

When two spheres overlap, we need to do two things: push them apart so they stop overlapping, and exchange velocity so they bounce realistically.

1. Position correction (push apart):
The collision normal is the unit direction from B's centre to A's: normal = (posA − posB) / distance. The overlap is how far the spheres interpenetrate: overlap = (radiusA + radiusB) − distance. We push each particle away along the normal by half the overlap:
posA = posA + normal × (overlap × 0.5)
posB = posB − normal × (overlap × 0.5)

2. Velocity impulse (bounce):
We calculate how fast the particles are approaching each other along the collision normal:
approachSpeed = dot(velA − velB, normal)

If approachSpeed < 0, they’re moving toward each other. We apply an impulse that reverses this relative motion, scaled by the restitution (bounciness, set very low at 0.02 — sand doesn’t bounce much):
impulse = −(1 + restitution) × approachSpeed × 0.5
velA = velA + normal × impulse
velB = velB − normal × impulse

The 0.5 factors ensure each particle gets half the correction. The result: particles that collide gently come to rest against each other; particles that collide fast bounce apart.
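Both steps can be combined into one function. The sketch below follows the formulas above; the Vec3 and Particle types and the resolve function are illustrative names, and equal masses are assumed as in the text.

```swift
/// Tiny 3D vector with just the operations this example needs.
struct Vec3 {
    var x, y, z: Double
    static func + (a: Vec3, b: Vec3) -> Vec3 { Vec3(x: a.x + b.x, y: a.y + b.y, z: a.z + b.z) }
    static func - (a: Vec3, b: Vec3) -> Vec3 { Vec3(x: a.x - b.x, y: a.y - b.y, z: a.z - b.z) }
    static func * (a: Vec3, s: Double) -> Vec3 { Vec3(x: a.x * s, y: a.y * s, z: a.z * s) }
    func dot(_ o: Vec3) -> Double { x * o.x + y * o.y + z * o.z }
}

struct Particle {
    var pos: Vec3
    var vel: Vec3
    var radius: Double
}

/// Resolve one overlapping pair: push apart, then exchange velocity.
func resolve(_ a: inout Particle, _ b: inout Particle, restitution: Double = 0.02) {
    let delta = a.pos - b.pos
    let distance = delta.dot(delta).squareRoot()
    let overlap = a.radius + b.radius - distance
    guard overlap > 0, distance > 0 else { return }   // not touching, or coincident centres

    // 1. Position correction: half the overlap each, along the normal.
    let normal = delta * (1.0 / distance)
    a.pos = a.pos + normal * (overlap * 0.5)
    b.pos = b.pos - normal * (overlap * 0.5)

    // 2. Velocity impulse: only when the particles are approaching.
    let approachSpeed = (a.vel - b.vel).dot(normal)
    guard approachSpeed < 0 else { return }
    let impulse = -(1 + restitution) * approachSpeed * 0.5
    a.vel = a.vel + normal * impulse
    b.vel = b.vel - normal * impulse
}

var a = Particle(pos: Vec3(x: 0.9, y: 0, z: 0), vel: Vec3(x: -1, y: 0, z: 0), radius: 0.5)
var b = Particle(pos: Vec3(x: 0.0, y: 0, z: 0), vel: Vec3(x: 1, y: 0, z: 0), radius: 0.5)
resolve(&a, &b)
print(a.pos.x, b.pos.x, a.vel.x, b.vel.x)
```

In this head-on example the closing speed of 2 becomes a separating speed of 0.04, which is the restitution of 0.02 applied to the original approach speed.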

Wall Collision: The Symmetry Trick

The hourglass glass is a surface of revolution — perfectly symmetric around the Y axis. This lets us reduce the 3D wall collision to a simple 2D check:

1. Compute radial distance: radial = √(x² + z²)
This is how far the particle is from the central axis, ignoring height.

2. Look up glass radius: glassR = innerRadiusAt(y)
For the particle’s current height, what is the glass inner radius? (On CPU: search the ~80 profile points. On GPU: a 256-entry lookup table for instant access.)

3. Compare: If radial > glassR − particleRadius, the particle has hit the wall. Push it inward, reflect its outward velocity, and apply friction to its tangential (sliding) velocity.

Because of the Y-axis symmetry, we never need to check individual triangles of the glass mesh — just one distance comparison per particle per step.
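The three steps might look like this in Swift. Several things here are assumptions for illustration: the WallParticle type, passing innerRadiusAt as a closure, the wall restitution and friction values, and treating the wall normal as purely radial (which ignores the slope of the glass profile).

```swift
struct WallParticle {
    var x, y, z, vx, vy, vz, radius: Double
}

/// Keep a particle inside a surface of revolution around the Y axis.
func collideWithWall(_ p: inout WallParticle,
                     innerRadiusAt: (Double) -> Double,
                     restitution: Double = 0.02,   // assumed value
                     friction: Double = 0.5) {     // assumed value
    // 1. Radial distance from the central axis.
    let radial = (p.x * p.x + p.z * p.z).squareRoot()
    // 2. Glass radius at this height, shrunk by the particle's own radius.
    let limit = innerRadiusAt(p.y) - p.radius
    // 3. Compare.
    guard radial > limit, radial > 0 else { return }

    // Unit vector pointing away from the axis in the XZ plane.
    let nx = p.x / radial, nz = p.z / radial
    // Push the particle back inside the glass.
    p.x = nx * limit
    p.z = nz * limit

    // Split horizontal velocity into outward and tangential parts.
    let outward = p.vx * nx + p.vz * nz
    let tx = (p.vx - outward * nx) * (1 - friction)
    let tz = (p.vz - outward * nz) * (1 - friction)
    // Reflect only motion that heads into the wall.
    let reflected = outward > 0 ? -outward * restitution : outward
    p.vx = tx + reflected * nx
    p.vz = tz + reflected * nz
}

var grain = WallParticle(x: 1.0, y: 0, z: 0, vx: 2, vy: 0, vz: 1, radius: 0.1)
collideWithWall(&grain, innerRadiusAt: { _ in 1.0 })   // constant radius: a plain cylinder
print(grain.x, grain.vx, grain.vz)
```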

Damping: Bringing Things to Rest

Without damping, particles would bounce forever. Real sand loses energy through internal friction and deformation. We simulate this with velocity damping — each step, we multiply velocity by a factor slightly less than 1:

Flow damping: velocity = velocity × 0.92^dt — gentle, preserves natural arc of falling particles
Settle damping: velocity = velocity × 0.05^dt — aggressive, quickly kills jitter when particles are nearly still

The simulation blends between these based on speed. Fast-moving particles (falling, flowing) get gentle damping. Slow-moving particles (settling into a pile) get aggressive damping. The blend formula:
t = min(speed / 0.15, 1.0)
dampFactor = settleDamp + (flowDamp − settleDamp) × t

At full speed (t=1) the blend is pure flow damping. At rest (t=0) it is pure settle damping. In between, the two are mixed linearly.
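Here is the blend written out as a Swift function, using the constants quoted above. The function name dampFactor(speed:dt:) is an assumption for the example.

```swift
import Foundation

/// Per-step velocity multiplier, blended between settle and flow damping by speed.
func dampFactor(speed: Double, dt: Double) -> Double {
    let flowDamp = pow(0.92, dt)     // gentle
    let settleDamp = pow(0.05, dt)   // aggressive
    let t = min(speed / 0.15, 1.0)   // 0 at rest, 1 at or above the speed threshold
    return settleDamp + (flowDamp - settleDamp) * t
}

let dt = 1.0 / 480.0
print(dampFactor(speed: 0.0, dt: dt), dampFactor(speed: 0.075, dt: dt), dampFactor(speed: 1.0, dt: dt))
```

Because the bases are raised to the power dt, the damping per second of simulated time stays the same whatever the substep length.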

Neck friction adds extra damping near the hourglass constriction (where |y| < 0.10). The closer to the centre, the stronger the damping. This controls flow rate — at short timer durations the friction is mild and particles stream through; at longer durations it’s stronger and they trickle.

Substeps: Accuracy vs Speed

The screen refreshes 120 times per second (on iPhone’s ProMotion display). Each frame, we don’t just run physics once — we subdivide the frame into multiple substeps. If the frame time is 1/120th of a second and we use 4 substeps, each substep simulates 1/480th of a second. Smaller steps mean:

More accurate collisions — fast particles can’t “tunnel” through each other between checks
More stable piles — small position corrections add up smoothly
But more computation — 4 substeps means 4× the collision checks

The simulation adapts: 4 substeps at low counts, down to 1 substep at 50,000 particles. Fewer substeps at high counts trades accuracy for keeping the frame rate smooth.
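Subdividing a frame is just a loop with a smaller dt. The sketch below assumes a hypothetical advanceFrame helper that takes the physics update as a closure; how the app picks the substep count between 4 and 1 is not shown here.

```swift
/// Run `substeps` physics updates for one frame, each covering an equal slice of the frame time.
func advanceFrame(frameTime: Double, substeps: Int, update: (Double) -> Void) {
    let dt = frameTime / Double(substeps)
    for _ in 0..<substeps {
        update(dt)
    }
}

var simulated = 0.0
var calls = 0
advanceFrame(frameTime: 1.0 / 120.0, substeps: 4) { dt in
    simulated += dt   // a real update would run gravity, collisions and damping here
    calls += 1
}
print(calls, simulated)
```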

Mapping Physics onto the GPU

The CPU runs physics one particle at a time, sequentially. A GPU has thousands of small processing cores that can all work at once. The key insight: each particle’s physics is mostly independent, so we can assign one GPU thread per particle and run them all simultaneously.

    graph TD
        subgraph "CPU: Sequential"
            CP1["Particle 1
check all others"]
            CP2["Particle 2
check all others"]
            CP3["Particle 3
check all others"]
            CPN["Particle N
check all others"]
            CP1 -->|"then"| CP2 -->|"then"| CP3 -->|"..."| CPN
        end

        subgraph "GPU: Parallel"
            GP1["Thread 1
Particle 1 vs all"]
            GP2["Thread 2
Particle 2 vs all"]
            GP3["Thread 3
Particle 3 vs all"]
            GPN["Thread N
Particle N vs all"]
        end

        T["Total time"] -.->|"CPU: N × work"| CP1
        T -.->|"GPU: 1 × work
(all threads parallel)"| GP1

        style CP1 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style CP2 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style CP3 fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style CPN fill:#2e2620,stroke:#c47d2e,color:#e8ddd0
        style GP1 fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style GP2 fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style GP3 fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style GPN fill:#2e2620,stroke:#d4a853,color:#e8ddd0
        style T fill:#2e2620,stroke:#8b6e3a,color:#e8ddd0

Threadgroups. GPU threads are organised into groups of 256. If we have 10,000 particles, we dispatch ceil(10,000 / 256) = 40 threadgroups. The GPU hardware schedules these across its cores automatically. Each thread runs the exact same program (the compute kernel) but on a different particle.
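The threadgroup count is an integer ceiling division. A small sketch, with threadgroupCount as a hypothetical helper name:

```swift
/// Number of threadgroups needed so that every particle gets a thread.
func threadgroupCount(particles: Int, threadsPerGroup: Int = 256) -> Int {
    return (particles + threadsPerGroup - 1) / threadsPerGroup   // integer ceil
}

let groups = threadgroupCount(particles: 10_000)
let surplusThreads = groups * 256 - 10_000
print(groups, surplusThreads)   // prints "40 240"
```

Rounding up means the last threadgroup contains 240 threads with no particle to work on. A kernel dispatched this way would normally return early when its thread ID is not below the particle count; whether the app's kernel does exactly that is an assumption.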

The Parallel Collision Problem

There’s a fundamental challenge when running collisions in parallel. On the CPU, collisions are resolved sequentially: when particle A pushes particle B, B’s new position is immediately visible to the next collision check. Forces propagate naturally through the pile.

On the GPU, every thread reads the same snapshot of all particle positions (from the start of the substep). Thread 1 doesn’t see Thread 2’s corrections. This causes two problems:

Problem 1: Double-counting. When particles A and B collide, Thread A pushes A away from B, and Thread B independently pushes B away from A, so every contact is handled twice, once from each side. The fix: each thread applies only ¼ of the position correction and ¼ of the velocity impulse to its own particle (not ½ like the CPU). The two quarters together close ½ of the overlap per substep, where the CPU's two halves close all of it at once; any remaining overlap is picked up again in the next substep.

Problem 2: Pile jamming. In a packed pile in the upper chamber, every particle is touching several neighbours. On the CPU, gravity pulls the bottom particle down, which creates space for the next one, and the pile drains naturally. On the GPU, all particles simultaneously push each other apart, and these parallel impulses cancel out gravity — the pile freezes in place.

The fix: chamber-asymmetric collision. In the upper chamber (waiting to drain), we apply position correction but skip velocity impulses. Gravity is the only force that moves particles, so the pile drains naturally one particle at a time. In the lower chamber (catching particles), we apply full correction + impulse for stable settling.
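To make the quarter-share idea concrete, here is a CPU-side Swift sketch that mimics what each GPU thread does: read a fixed snapshot, and compute a correction for its own particle only. It is one-dimensional for brevity, covers position correction only, and the Ball type and correctedPosition function are hypothetical names.

```swift
/// One-dimensional ball, enough to show the idea.
struct Ball {
    var x: Double
    var r: Double
}

/// What the "thread" for particle `i` computes: its own new position,
/// using only the snapshot and applying a quarter of each overlap.
func correctedPosition(of i: Int, in snapshot: [Ball], share: Double = 0.25) -> Double {
    var x = snapshot[i].x
    for j in snapshot.indices where j != i {
        let delta = snapshot[i].x - snapshot[j].x
        let distance = abs(delta)
        let overlap = snapshot[i].r + snapshot[j].r - distance
        if overlap > 0 && distance > 0 {
            x += (delta / distance) * overlap * share   // move away from j
        }
    }
    return x
}

let snapshot = [Ball(x: 0.0, r: 0.5), Ball(x: 0.8, r: 0.5)]   // overlap of 0.2
// Every "thread" reads the same snapshot; none sees another's result.
let next = snapshot.indices.map { correctedPosition(of: $0, in: snapshot) }
print(next)
```

Each ball moves 0.05 away from the other, so the pair ends up 0.1 further apart after this pass.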

Double Buffering: Avoiding Race Conditions

If every thread read and wrote to the same memory, chaos would ensue — Thread 5 might read Particle 3’s position while Thread 3 is halfway through writing a new value. The solution is double buffering:

Every thread reads from one buffer and writes to the other. After all threads finish, the buffers swap roles for the next substep. This guarantees every thread sees a consistent, complete snapshot of particle positions: no half-written data, no race conditions. The CPU blocks on the command buffer's waitUntilCompleted before swapping, so all threads are guaranteed to have finished.
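The same pattern can be shown on the CPU with two plain arrays. This is a sketch of the idea only: the PingPong type is hypothetical, and in the app the two buffers would be GPU buffers rather than Swift arrays.

```swift
/// Two buffers that swap roles after every step.
struct PingPong<T> {
    private var buffers: [[T]]
    private var readIndex = 0

    init(initial: [T]) {
        buffers = [initial, initial]
    }

    /// The buffer holding the latest complete results.
    var current: [T] { buffers[readIndex] }

    /// Run `kernel` once per element. It reads only the old buffer
    /// and its result is written only to the other one.
    mutating func step(_ kernel: (Int, [T]) -> T) {
        let source = buffers[readIndex]
        let writeIndex = 1 - readIndex
        for i in source.indices {
            buffers[writeIndex][i] = kernel(i, source)
        }
        readIndex = writeIndex   // swap roles for the next step
    }
}

var values = PingPong(initial: [1.0, 2.0, 3.0])
// Each element becomes the average of itself and its right-hand neighbour (wrapping around).
values.step { i, old in (old[i] + old[(i + 1) % old.count]) / 2 }
print(values.current)   // prints "[1.5, 2.5, 2.0]"
```

With a single shared array, the last element would have averaged 3.0 with the already-updated 1.5 and produced 2.25 instead of 2.0. That order-dependence is exactly what the second buffer removes.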

The Complete GPU Physics Pipeline

Putting it all together, here’s what happens for each particle in a single GPU substep. Every particle runs this exact same sequence simultaneously on its own thread:

1. Sleep check: a particle that has been still for 30+ frames is in deep sleep and performs a staggered support check (every 30 frames, offset by thread ID, so ~N/30 particles check per frame). Floor contact counts as support (a cheap check); otherwise the thread scans for nearby particles within 1.05× touching distance, and if no support is found the particle wakes and falls. A particle that has been still for 15–30 frames is in light sleep: it scans its neighbours and wakes only if a non-sleeping neighbour is approaching fast (relVelNormal < -0.05). Support-loss detection is deferred to the deep-sleep staggered check. This saves GPU work when particles are at rest while ensuring they wake when support is removed (within ~0.5s).

2. Apply gravity — vel += gravity × dt (rotated during flip)
3. Update position — pos += vel × dt
4. Wall collision — look up glass radius from 256-entry table, push inward if outside
5. Floor/ceiling — clamp Y, detect resting contact
6. Sphere-sphere — loop over ALL other particles (the O(N²) core), apply ¼ corrections, track contact count
7. Neck friction — extra damping near the constriction
8. Velocity damping — flow/settle blend based on speed and chamber
9. Velocity cutoff — snap to zero if very slow AND in contact with something (prevents mid-air freezing)
10. Sleep counter — increment if slow and supported, reset if in upper chamber
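The sleep thresholds and the staggered schedule from step 1 can be sketched as below. The names are hypothetical, and the exact form of the stagger, (frame + threadID) % 30, is an assumption that merely satisfies "every 30 frames, offset by thread ID".

```swift
enum SleepState {
    case awake, light, deep
}

/// Classify a particle by how many consecutive frames it has been still.
func sleepState(stillFrames: Int) -> SleepState {
    if stillFrames >= 30 { return .deep }
    if stillFrames >= 15 { return .light }
    return .awake
}

/// True on the one frame in every `interval` when this thread runs its support check.
func runsSupportCheck(frame: Int, threadID: Int, interval: Int = 30) -> Bool {
    return (frame + threadID) % interval == 0
}

// With 3,000 sleeping particles, about N/30 of them check on any given frame.
let checking = (0..<3000).filter { runsSupportCheck(frame: 7, threadID: $0) }.count
print(checking)   // prints "100"
```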

Mesh Expansion: From Numbers to Pixels

Physics produces a list of positions. But to see particles, the GPU needs triangle meshes. In CPU/GPU mode, SceneKit renders N individual sphere nodes. In Metal mode, a second compute kernel (expandMeshes) converts each position into a small sphere mesh on the GPU, so no particle data is read back to the CPU.

Each particle is expanded into a subdivided icosahedron — a 42-vertex, 80-triangle shape that closely approximates a sphere. For each of the 42 template vertices (which lie on a unit sphere), the kernel computes:

vertexPosition = particleCentre + templateVertex × radius
vertexNormal = templateVertex (points outward — enables lighting)

At 10,000 particles: 420,000 vertices and 800,000 triangles, all generated in a single GPU pass. At 50,000 particles: 2.1 million vertices and 4 million triangles. SceneKit wraps the GPU buffer directly with SCNGeometrySource(buffer:) — zero memory copies from GPU to CPU.
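The per-vertex arithmetic is simple enough to show on the CPU. The Swift sketch below mirrors the two formulas above for a single particle; the Vec3 type, the expand function and the two-vertex template are illustrative stand-ins for the real 42-vertex template and the expandMeshes kernel.

```swift
struct Vec3 {
    var x, y, z: Double
}

/// Expand one particle into world-space vertices and normals
/// from template vertices that lie on a unit sphere.
func expand(centre: Vec3, radius: Double, template: [Vec3]) -> (positions: [Vec3], normals: [Vec3]) {
    let positions = template.map { t in
        Vec3(x: centre.x + t.x * radius,
             y: centre.y + t.y * radius,
             z: centre.z + t.z * radius)
    }
    // The template vertices already point outward, so they double as normals.
    return (positions, template)
}

/// Total mesh size for a given particle count (42 vertices, 80 triangles each).
func meshSize(particles: Int) -> (vertices: Int, triangles: Int) {
    return (particles * 42, particles * 80)
}

let poles = [Vec3(x: 0, y: 1, z: 0), Vec3(x: 0, y: -1, z: 0)]   // stand-in template
let mesh = expand(centre: Vec3(x: 1, y: 2, z: 3), radius: 0.5, template: poles)
print(mesh.positions[0].y, mesh.positions[1].y)   // prints "2.5 1.5"
print(meshSize(particles: 10_000))
```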

⏳ ShiftingSands

1. High-Level Overview