Kernel fusion is a known technique — TVM, XLA, and Burn already fuse operator patterns. What I shipped: full transformer decode and full evolutionary fitness loops as a single WebGPU dispatch, measured across 92 real-world devices and 7 GPU vendors. Median 71× on Apple Silicon, 56× on NVIDIA, 20× on phones; peaks 226× / 402× / 103×. Zero installation. Any browser.
How eager dispatch works
dispatch step 1 → wait → dispatch step 2 → wait … × 1,500 steps = 22,500 round-trips
92%+ of time = waiting, not computing

Kernel fusion
dispatch once → GPU loops internally: 1,500 steps in 1 round-trip
100% of time = computing
Watch 50 neural networks learn to play Flappy Bird in real-time. GPU evaluates 4,096 birds per dispatch via kernel fusion. Open in multiple tabs to connect via WebRTC and evolve together.
4,096-population evolutionary optimization on a 2,000-dimensional multimodal landscape. Measures raw GPU throughput of the fused evolutionary kernel.
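As a CPU reference for what each GPU thread evaluates in this benchmark, here is the standard Rastrigin function and a per-genome fitness loop. The function definition and search bounds (±5.12, A = 10) follow the usual convention; the code is a sketch, not the project's actual kernel:

```typescript
// Standard Rastrigin: f(x) = A·n + Σ (xᵢ² − A·cos(2π·xᵢ)).
// Global minimum f(0) = 0; the cosine term makes it highly multimodal,
// which is why it is a common stress test for evolutionary optimizers.
const A = 10;
function rastrigin(x: Float32Array): number {
  let sum = A * x.length;
  for (let i = 0; i < x.length; i++) {
    sum += x[i] * x[i] - A * Math.cos(2 * Math.PI * x[i]);
  }
  return sum;
}

// A fused evolutionary kernel assigns one genome per GPU thread; this map
// is the sequential CPU equivalent for a 4,096 × 2,000-dimensional population.
const population = Array.from({ length: 4096 }, () =>
  Float32Array.from({ length: 2000 }, () => Math.random() * 10.24 - 5.12));
const fitness = population.map(rastrigin);
console.log(`best fitness: ${Math.min(...fitness).toFixed(2)}`);
```

On the GPU, the same `fitness` array is produced by a single dispatch of 4,096 threads, and the selection/mutation steps loop inside the kernel instead of returning to the CPU each generation.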
Fused attention + FFN + LayerNorm in a single GPU dispatch. Benchmarks unfused, fused, parallel, and f16 variants across model dimensions.
WebRTC P2P Genome Exchange for Island-Model Optimization
Browser tabs form evolutionary islands, exchanging elite genomes directly via WebRTC data channels. A 113-line signaling relay brokers the handshake; all genome data flows peer-to-peer. Private rooms by default. Validated on Rastrigin (N=30), across Apple Metal and NVIDIA Vulkan.
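The migration policy behind an island model is small enough to sketch. The function and type names below are hypothetical; in the real app the elite payload is serialized over an `RTCDataChannel` rather than passed between local arrays:

```typescript
// Island-model migration sketch (hypothetical names; transport omitted).
interface Genome {
  genes: Float32Array;
  fitness: number; // lower is better, e.g. a Rastrigin value
}

// Pick the k best genomes to broadcast to peer islands.
function selectElites(pop: Genome[], k: number): Genome[] {
  return [...pop].sort((a, b) => a.fitness - b.fitness).slice(0, k);
}

// Replace this island's worst genomes with immigrants from a peer island,
// keeping the population size constant.
function receiveImmigrants(pop: Genome[], immigrants: Genome[]): Genome[] {
  const keep = [...pop]
    .sort((a, b) => a.fitness - b.fitness)
    .slice(0, pop.length - immigrants.length);
  return [...keep, ...immigrants];
}
```

Because each immigrant displaces the local worst rather than growing the population, islands stay the same size while good genomes diffuse across tabs; the signaling relay only ever sees the handshake, never the genome data.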
via WebGPU Compute Shaders
Fusing sequential fitness evaluations into single GPU dispatches eliminates per-step kernel launch overhead. Validated across CUDA, WebGPU, JAX/XLA, and Triton on two hardware platforms.
via WebGPU Compute Shaders
Browser LLM engines dispatch 1,024 separate GPU kernels per generated token. We fuse everything into one dispatch. Single-threaded: 6.6-13.5×. Parallel kernel (64 threads + shared memory): 66-458×. Beats PyTorch MPS by 7.5-161× at all tested sizes up to D=256. 16,410 tok/s at D=32.
Since publishing, 92 unique devices across 7 GPU vendors have run the benchmarks. Medians shown.
Mobile transformer runs: 36 confirmed across iOS Safari and Android Chrome. Peak: 213,000 tokens/sec on a phone. Average: 15,000 tokens/sec.
Browser coverage: Chrome (347), Firefox (69), Safari (62). macOS, Windows, Linux, Android, iOS. No installation on any of them.
30 seconds. No installation. Your result joins the live dataset above.
We don't cherry-pick results. Every benchmark run from every device is published in a searchable, sortable, downloadable dataset. GPU name, score, browser, OS, timestamp — all of it. No data is hidden. Verify any claim yourself.
Browse all 92 results →

One import. One dispatch. All tokens, all layers, all operations fused into a single GPU kernel.
npm install @webgpu-fusion/core
// 3 lines to benchmark your GPU
import { FusedTransformer } from '@webgpu-fusion/core'
const model = await FusedTransformer.create({ dModel: 128, nHeads: 2, nLayers: 4 })
const stats = await model.benchmark({ runs: 10 })
TypeScript. f32 and f16 precision. Int4 quantization. Single-thread and parallel (64-thread shared memory) modes. Works in Chrome, Firefox, Safari — any WebGPU-capable browser.
The single-kernel fusion pattern generalises beyond synthetic benchmarks. The flagship application ports a production scientific toolkit — Geant4-DNA (CNRS/IN2P3) — to the browser. Three adjacent projects apply the same pattern to LLM inference, LLM visualisation, and open GPU benchmarking.
Electron track-structure simulation ported from the CNRS/IN2P3 Geant4-DNA toolkit to WebGPU. One thread per primary, full 10 keV history in a single for-loop. Radiolysis chemistry and DNA damage scoring run in a browser tab. The “one dispatch, full history” shape is the same kernel-fusion pattern that delivers two to three orders of magnitude of speedup on launch-bound workloads; here it's what makes real Monte Carlo radiobiology cheap enough to run live in a browser tab.
Full tabulated cross sections from G4EMLOW 8.8: Born ionisation, Emfietzoglou excitation, Champion elastic CDF, Sanche vibrational. Karamitros 2011 9-reaction IRT radiolysis. Direct + indirect SSB scoring against a 21×21 parallel B-DNA fiber grid. Validated at 8 energies (100 eV – 20 keV).
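The transport loop above rests on standard Monte Carlo machinery: between discrete interactions, an electron flies a step length drawn from an exponential distribution with mean free path λ = 1 / (n·σ_total). A minimal sketch, using invented placeholder cross-section values rather than G4EMLOW 8.8 tables:

```typescript
// Standard Monte Carlo free-flight sampling. The cross section below is a
// made-up placeholder for illustration, NOT a G4EMLOW value.
function sampleStepCm(sigmaTotalCm2: number, densityPerCm3: number): number {
  const lambdaCm = 1 / (densityPerCm3 * sigmaTotalCm2);
  // Inverse-CDF sampling of Exp(1/lambda).
  return -lambdaCm * Math.log(1 - Math.random());
}

// Liquid-water number density (~3.34e22 molecules/cm³) with an assumed
// 1e-16 cm² total cross section gives a mean free path of about 3 nm.
const samples = Array.from({ length: 100_000 },
  () => sampleStepCm(1e-16, 3.34e22));
const meanCm = samples.reduce((a, b) => a + b, 0) / samples.length;
console.log(`mean free path ≈ ${(meanCm * 1e7).toFixed(2)} nm`);
```

In the fused kernel, this sampling sits inside the per-thread for-loop, alternating with energy-dependent lookups into the tabulated cross sections to pick which interaction (ionisation, excitation, elastic, vibrational) happens at each step.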
See the simulation →

Open WebGPU compute benchmarks across 92 unique devices — Rastrigin, N-body, Monte Carlo Pi, RL environments, transformer fusion. Every result published, no cherry-picking, medians not means.
Benchmark your GPU →

LLM inference
Phi-3-mini (3.8B) running end-to-end via 10 kernel roles across 27 WGSL files, replacing the 85 TVM-autotuned shaders WebLLM needs. ~40 tok/s on M2 Pro, 22% behind WebLLM.
Run it live →

Visualization
A real forward pass of Phi-3-mini visualised tensor-by-tensor. 3.8 billion parameters, your GPU, your browser — every glow is a live activation read back from WebGPU. Zero server, zero API key.
Watch it think →

Quantum
Statevector + MPS quantum simulator running on commodity hardware via WebGPU compute. Six-level research ladder from bandwidth-bound statevector through MPS, kernel fusion, WebRTC swarm, IBM hardware cross-verify, to chemistry/VQE. No CUDA, no install.
Open the simulator →

Personal
Personal site and project hub.
About →

Independent Researcher