webgpu-dawn 0.1.1.0

Haskell bindings to WebGPU Dawn for GPU computing and graphics

High-level, type-safe Haskell bindings to Google's Dawn WebGPU implementation.

This library enables portable GPU computing through a production-ready DSL designed for high-throughput inference (e.g., LLMs), with a performance target of 300 tokens per second (TPS).


⚡ Core Design Principles

To achieve high performance and type safety, this library adheres to the following strict patterns:

  1. Type-Safe Monadic DSL: No raw strings. We use ShaderM for composability and type safety.
  2. Natural Math & HOAS: Standard operators (+, *) and Higher-Order Abstract Syntax (HOAS) for loops (loop ... $ \i -> ...).
  3. Profile-Driven: Performance tuning is based on Roofline Analysis.
  4. Async Execution: Prefer AsyncPipeline to hide CPU latency and maximize GPU occupancy.
  5. Hardware Acceleration: Mandatory use of Subgroup Operations and F16 precision for heavy compute (MatMul/Reduction).

🏎️ Performance & Profiling

We use a Profile-Driven Development (PDD) workflow to maximize throughput.

1. Standard Benchmarks & Roofline Analysis

Run the optimized benchmark to measure the total FLOPs and memory traffic and check the Roofline classification (compute-bound vs. memory-bound).

# Run 2D Block-Tiling MatMul Benchmark (FP32)
cabal run bench-optimized-matmul -- --size 4096 --iters 50

Output Example:

[Compute]  137.4 GFLOPs
[Memory]   201.3 MB
[Status]   COMPUTE BOUND (limited by GPU FLOPs)
[Hint]     Use F16 and Subgroup Operations to break the roofline.
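
These figures follow directly from the problem size and can be checked by hand. A quick sketch, assuming [Compute] reports the total FLOP count and [Memory] the total global-memory traffic:

-- Roofline arithmetic for the 4096 x 4096 matmul above (FP32).
n, flops, bytes, intensity :: Double
n = 4096
flops = 2 * n ^ (3 :: Int)       -- one multiply + one add per inner-product term;
                                 -- 2 * 4096^3 = 137.4e9, matching [Compute]
bytes = 3 * n ^ (2 :: Int) * 4   -- read A and B, write C, 4 bytes per f32;
                                 -- 3 * 4096^2 * 4 = 201.3e6, matching [Memory]
intensity = flops / bytes        -- about 683 FLOP/byte, far above the ridge
                                 -- point of any current GPU: COMPUTE BOUND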

2. Visual Profiling (Chrome Tracing)

Generate a trace file to visualize CPU/GPU overlap and kernel duration.

cabal run bench-optimized-matmul -- --size 4096 --trace

  • Load: Open chrome://tracing or ui.perfetto.dev
  • Analyze: Import trace.json to identify gaps between kernel executions (CPU overhead).

3. Debugging

Use the GPU printf-style debug buffer to inspect values inside kernels.

-- In DSL:
debugPrintF "intermediate_val" val
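
For example, tagging a value inside a loop (a minimal sketch reusing the buffer and loop primitives from the Quick Start below):

import WGSL.DSL

debugShader :: ShaderM ()
debugShader = do
  input <- declareInputBuffer "in" (TArray 1024 TF16)
  loop 0 1024 1 $ \i -> do
    val <- readBuffer input i
    -- Tag the value so it can be located in the host-side debug dump
    debugPrintF "intermediate_val" val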


🚀 Quick Start

1. High-Level API (Data Parallelism)

Zero boilerplate. Ideal for simple map/reduce tasks.

import WGSL.API
import qualified Data.Vector.Storable as V

main :: IO ()
main = withContext $ \ctx -> do
  input  <- toGPU ctx (V.fromList [1..100] :: V.Vector Float)  -- upload to the GPU
  result <- gpuMap (\x -> x * 2.0 + 1.0) input                 -- runs on the GPU
  out    <- fromGPU' result                                    -- read back to the host
  print out

2. Core DSL (Explicit Control)

Required for tuning Shared Memory, Subgroups, and F16.

import WGSL.DSL

shader :: ShaderM ()
shader = do
  input  <- declareInputBuffer "in" (TArray 1024 TF16)
  output <- declareOutputBuffer "out" (TArray 1024 TF16)
   
  -- HOAS Loop: Use lambda argument 'i', NOT string "i"
  loop 0 1024 1 $ \i -> do
    val <- readBuffer input i
    -- f16 literals for 2x throughput
    let res = val * litF16 2.0 + litF16 1.0
    writeBuffer output i res


📚 DSL Syntax Cheatsheet

Types & Literals

Haskell Type   WGSL Type   Literal Constructor   Note
Exp F32        f32         litF32 1.0 or 1.0     Standard float
Exp F16        f16         litF16 1.0            Half precision (fast!)
Exp I32        i32         litI32 1 or 1         Signed int
Exp U32        u32         litU32 1              Unsigned int
Exp Bool_      bool        litBool True          Boolean

Casting Helpers: i32(e), u32(e), f32(e), f16(e)
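
In the Haskell DSL these casts are plain functions. A small sketch (assuming the loop index is an integer expression):

import WGSL.DSL

ramp :: ShaderM ()
ramp = do
  output <- declareOutputBuffer "out" (TArray 256 TF16)
  loop 0 256 1 $ \i ->
    -- f16 casts the integer index before the float multiply
    writeBuffer output i (f16 i * litF16 0.5)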

Control Flow (HOAS)

-- For Loop
loop start end step $ \i -> do ...

-- If Statement
if_ (val > 10.0) 
    (do ... {- then block -} ...) 
    (do ... {- else block -} ...)

-- Barrier
barrier  -- workgroupBarrier()
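
Putting these together, a clamp kernel might look like this (a sketch; TF32 is assumed as the f32 analogue of the TF16 element type shown above):

import WGSL.DSL

clampShader :: ShaderM ()
clampShader = do
  input  <- declareInputBuffer  "in"  (TArray 256 TF32)
  output <- declareOutputBuffer "out" (TArray 256 TF32)
  loop 0 256 1 $ \i -> do
    val <- readBuffer input i
    if_ (val > 10.0)
        (writeBuffer output i 10.0)   -- then: clamp to the ceiling
        (writeBuffer output i val)    -- else: pass through unchanged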


🧩 Kernel Fusion

For maximum performance, fuse multiple operations (Load -> Calc -> Store) into a single kernel to reduce global memory traffic.

import WGSL.Kernel

-- Fuse: Load -> Process -> Store
let pipeline = loadK inBuf >>> mapK (* 2.0) >>> mapK relu >>> storeK outBuf

-- Execute inside shader
unKernel pipeline i
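
Embedded in a full shader, the fused pipeline reads and writes global memory once per element (a sketch; relu is assumed to be in scope, e.g. user-defined):

import WGSL.DSL
import WGSL.Kernel

fusedShader :: ShaderM ()
fusedShader = do
  inBuf  <- declareInputBuffer  "in"  (TArray 1024 TF16)
  outBuf <- declareOutputBuffer "out" (TArray 1024 TF16)
  loop 0 1024 1 $ \i ->
    -- Single kernel: load, scale, activate, store, with no intermediate buffers
    unKernel (loadK inBuf >>> mapK (* 2.0) >>> mapK relu >>> storeK outBuf) i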


📚 Architecture & Modules

Execution Model (Latency Hiding)

To maximize GPU occupancy, encoding is separated from submission.

  • WGSL.Async.Pipeline: Use for main loops. Allows CPU to encode Token N+1 while GPU processes Token N.
  • WGSL.Execute: Low-level synchronous execution (primarily for debugging).
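
The overlap pattern looks roughly like the following. All names are illustrative placeholders, not the actual WGSL.Async.Pipeline API; the point is that encoding token N+1 on the CPU overlaps the GPU executing token N:

-- Hypothetical skeleton of a pipelined decode loop.
decodeLoop
  :: (Int -> IO batch)     -- encode: build commands on the CPU
  -> (batch -> IO fence)   -- submit: hand off to the GPU, returns immediately
  -> (fence -> IO ())      -- wait: block until the GPU finishes
  -> Int                   -- number of tokens to generate
  -> IO ()
decodeLoop encode submit wait steps = do
    fence0 <- encode 0 >>= submit
    go 1 fence0
  where
    go n inFlight
      | n >= steps = wait inFlight   -- drain the final token
      | otherwise  = do
          batch <- encode n          -- CPU encodes token n while the GPU
          wait inFlight              --   is still executing token n - 1
          submit batch >>= go (n + 1)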

Module Guide

Feature        Module         Description
Subgroup Ops   WGSL.DSL       subgroupMatrixLoad, mma, subgroupMatrixStore
F16 Math       WGSL.DSL       litF16, vec4<f16> for 2x throughput
Structs        WGSL.Struct    Generic derivation for std430 layout compliance
Analysis       WGSL.Analyze   Roofline analysis logic

📦 Installation

Pre-built Dawn binaries are downloaded automatically during installation.

cabal install webgpu-dawn


License

MIT License - see LICENSE file for details.

Acknowledgments

  • Dawn (Google): Core WebGPU runtime.
  • gpu.cpp (Answer.AI): High-level C++ API wrapper inspiration.
  • GLFW: Window management.

Contact

Maintainer: Junji Hashimoto <junji.hashimoto@gmail.com>