GPU frameworks waste 92% of their time.
I fixed it.

Kernel fusion eliminates per-dispatch overhead by packing an entire computation into a single GPU kernel launch. Proven across evolutionary algorithms and transformer inference, with up to 720× speedup. Zero installation. Any browser.

The insight

How frameworks work

dispatch step 1 → wait
dispatch step 2 → wait

... × 1,500 steps = 22,500 round-trips

92% of time = waiting, not computing

Kernel fusion

dispatch once → GPU loops internally

1,500 steps in 1 round-trip

~100% of time = computing
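The arithmetic behind the two diagrams can be sketched with a toy cost model. The per-dispatch overhead and per-step compute times below are illustrative assumptions chosen to reproduce the headline figures, not measured values:

```python
# Toy cost model: unfused (one round-trip per kernel) vs fused (one dispatch).
STEPS = 1_500                           # sequential steps in the workload
DISPATCHES_PER_STEP = 15                # kernels per step -> 22,500 round-trips total
OVERHEAD_US = 60.0                      # assumed CPU<->GPU launch + sync cost per dispatch
COMPUTE_US = 5.0 * DISPATCHES_PER_STEP  # assumed GPU compute per step

# Unfused: every kernel pays the round-trip. Fused: one launch, GPU loops internally.
unfused_us = STEPS * (DISPATCHES_PER_STEP * OVERHEAD_US + COMPUTE_US)
fused_us = OVERHEAD_US + STEPS * COMPUTE_US

waiting_share = (STEPS * DISPATCHES_PER_STEP * OVERHEAD_US) / unfused_us
print(f"round-trips unfused: {STEPS * DISPATCHES_PER_STEP}")   # 22,500
print(f"time spent waiting:  {waiting_share:.0%}")             # 92%
print(f"fused speedup:       {unfused_us / fused_us:.1f}x")
```

With these assumed constants the model recovers the 22,500 round-trips and the ~92% waiting share from the diagram; the real speedups depend on the actual launch overhead of each API and platform.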

Research
Published

Single-Kernel Fusion for Sequential Fitness Evaluation

via WebGPU Compute Shaders

720×
CUDA over PyTorch (same T4)
159×
WebGPU over PyTorch (same M2)
4
GPU APIs confirmed

Fusing sequential fitness evaluations into single GPU dispatches eliminates per-step kernel launch overhead. Proven across CUDA, WebGPU, JAX/XLA, and Triton on two hardware platforms.

Published

Single-Kernel Fusion for Autoregressive Transformer Decoding

via WebGPU Compute Shaders

458×
parallel kernel vs unfused (D=256)
161×
over PyTorch MPS (D=32)
16K
tokens/sec in the browser

Browser LLM engines dispatch 1,024 separate GPU kernels per generation. We fuse everything into one dispatch. Single-threaded: 6.6-13.5×. Parallel kernel (64 threads + shared memory): 66-458×. Beats PyTorch MPS by 7.5-161× at all tested sizes up to D=256. 16,410 tok/s at D=32.
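The same model applies to decoding: one kernel launch per generated token versus one launch for the whole generation. The overhead and compute constants below are illustrative assumptions, not measurements from the paper:

```python
# Toy dispatch-count and throughput model for autoregressive decoding.
N_TOKENS = 1_024          # kernels per generation in unfused browser engines
OVERHEAD_US = 200.0       # assumed JS<->GPU launch overhead per dispatch
STEP_US = 40.0            # assumed GPU compute per decode step

unfused_dispatches = N_TOKENS   # one kernel launch per generated token
fused_dispatches = 1            # the entire token loop runs inside one kernel

unfused_us = unfused_dispatches * (OVERHEAD_US + STEP_US)
fused_us = fused_dispatches * OVERHEAD_US + N_TOKENS * STEP_US

print(f"dispatches: {unfused_dispatches} -> {fused_dispatches}")
print(f"tokens/sec unfused: {N_TOKENS / unfused_us * 1e6:,.0f}")
print(f"tokens/sec fused:   {N_TOKENS / fused_us * 1e6:,.0f}")
```

Because fusion amortizes the launch overhead over the entire generation rather than paying it per token, throughput converges toward the pure-compute limit as overhead grows; the measured 16K tok/s figure reflects this on real hardware.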

Ahmet Baris Gunaydin

Independent Researcher