webgpu-dawn 0.1.1.0

Haskell bindings to WebGPU Dawn for GPU computing and graphics

High-level, type-safe Haskell bindings to Google's Dawn WebGPU implementation.

This library enables portable GPU computing through a production-ready DSL designed for high-throughput inference (e.g., LLMs), with a performance target of 300 tokens per second (TPS).


⚡ Core Design Principles

To achieve high performance and type safety, this library adheres to the following strict patterns:

  1. Type-Safe Monadic DSL: No raw strings. We use ShaderM for composability and type safety.
  2. Natural Math & HOAS: Standard operators (+, *) and Higher-Order Abstract Syntax (HOAS) for loops (loop ... $ \i -> ...).
  3. Profile-Driven: Performance tuning is based on Roofline Analysis.
  4. Async Execution: Prefer AsyncPipeline to hide CPU latency and maximize GPU occupancy.
  5. Hardware Acceleration: Mandatory use of Subgroup Operations and F16 precision for heavy compute (MatMul/Reduction).

🏎️ Performance & Profiling

We use a Profile-Driven Development (PDD) workflow to maximize throughput.

1. Standard Benchmarks & Roofline Analysis

Run the optimized benchmark to measure the total FLOPs and memory traffic and check the Roofline classification (compute-bound vs. memory-bound).

# Run 2D Block-Tiling MatMul Benchmark (FP32)
cabal run bench-optimized-matmul -- --size 4096 --iters 50

Output Example:

[Compute]  137.4 GFLOPs
[Memory]   201.3 MB
[Status]   COMPUTE BOUND (limited by GPU FLOPs)
[Hint]     Use F16 and Subgroup Operations to break the roofline.
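
These figures follow directly from the problem size and can be checked by hand. A quick sketch, assuming [Compute] reports the total FLOP count and [Memory] the total global-memory traffic:

-- Roofline arithmetic for the 4096 x 4096 matmul above (FP32).
n, flops, bytes, intensity :: Double
n = 4096
flops = 2 * n ^ (3 :: Int)       -- one multiply + one add per inner-product term;
                                 -- 2 * 4096^3 = 137.4e9, matching [Compute]
bytes = 3 * n ^ (2 :: Int) * 4   -- read A and B, write C, 4 bytes per f32;
                                 -- 3 * 4096^2 * 4 = 201.3e6, matching [Memory]
intensity = flops / bytes        -- about 683 FLOP/byte, far above the ridge
                                 -- point of any current GPU: COMPUTE BOUND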

2. Visual Profiling (Chrome Tracing)

Generate a trace file to visualize CPU/GPU overlap and kernel duration.

cabal run bench-optimized-matmul -- --size 4096 --trace

  • Load: Open chrome://tracing or ui.perfetto.dev
  • Analyze: Import trace.json to identify gaps between kernel executions (CPU overhead).

3. Debugging

Use the GPU printf-style debug buffer to inspect values inside kernels.

-- In DSL:
debugPrintF "intermediate_val" val
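
For example, tagging a value inside a loop (a minimal sketch reusing the buffer and loop primitives from the Quick Start below):

import WGSL.DSL

debugShader :: ShaderM ()
debugShader = do
  input <- declareInputBuffer "in" (TArray 1024 TF16)
  loop 0 1024 1 $ \i -> do
    val <- readBuffer input i
    -- Tag the value so it can be located in the host-side debug dump
    debugPrintF "intermediate_val" val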


🚀 Quick Start

1. High-Level API (Data Parallelism)

Zero boilerplate. Ideal for simple map/reduce tasks.

import WGSL.API
import qualified Data.Vector.Storable as V

main :: IO ()
main = withContext $ \ctx -> do
  input  <- toGPU ctx (V.fromList [1..100] :: V.Vector Float)  -- upload to the GPU
  result <- gpuMap (\x -> x * 2.0 + 1.0) input                 -- runs on the GPU
  out    <- fromGPU' result                                    -- read back to the host
  print out

2. Core DSL (Explicit Control)

Required for tuning Shared Memory, Subgroups, and F16.

import WGSL.DSL

shader :: ShaderM ()
shader = do
  input  <- declareInputBuffer "in" (TArray 1024 TF16)
  output <- declareOutputBuffer "out" (TArray 1024 TF16)
   
  -- HOAS Loop: Use lambda argument 'i', NOT string "i"
  loop 0 1024 1 $ \i -> do
    val <- readBuffer input i
    -- f16 literals for 2x throughput
    let res = val * litF16 2.0 + litF16 1.0
    writeBuffer output i res


📚 DSL Syntax Cheatsheet

Types & Literals

Haskell Type   WGSL Type   Literal Constructor   Note
Exp F32        f32         litF32 1.0 or 1.0     Standard float
Exp F16        f16         litF16 1.0            Half precision (fast!)
Exp I32        i32         litI32 1 or 1         Signed int
Exp U32        u32         litU32 1              Unsigned int
Exp Bool_      bool        litBool True          Boolean

Casting Helpers: i32(e), u32(e), f32(e), f16(e)
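
In the Haskell DSL these casts are plain functions. A small sketch (assuming the loop index is an integer expression):

import WGSL.DSL

ramp :: ShaderM ()
ramp = do
  output <- declareOutputBuffer "out" (TArray 256 TF16)
  loop 0 256 1 $ \i ->
    -- f16 casts the integer index before the float multiply
    writeBuffer output i (f16 i * litF16 0.5)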

Control Flow (HOAS)

-- For Loop
loop start end step $ \i -> do ...

-- If Statement
if_ (val > 10.0) 
    (do ... {- then block -} ...) 
    (do ... {- else block -} ...)

-- Barrier
barrier  -- workgroupBarrier()
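
Putting these together, a clamp kernel might look like this (a sketch; TF32 is assumed as the f32 analogue of the TF16 element type shown above):

import WGSL.DSL

clampShader :: ShaderM ()
clampShader = do
  input  <- declareInputBuffer  "in"  (TArray 256 TF32)
  output <- declareOutputBuffer "out" (TArray 256 TF32)
  loop 0 256 1 $ \i -> do
    val <- readBuffer input i
    if_ (val > 10.0)
        (writeBuffer output i 10.0)   -- then: clamp to the ceiling
        (writeBuffer output i val)    -- else: pass through unchanged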


🧩 Kernel Fusion

For maximum performance, fuse multiple operations (Load -> Calc -> Store) into a single kernel to reduce global memory traffic.

import WGSL.Kernel

-- Fuse: Load -> Process -> Store
let pipeline = loadK inBuf >>> mapK (* 2.0) >>> mapK relu >>> storeK outBuf

-- Execute inside shader
unKernel pipeline i
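
Embedded in a full shader, the fused pipeline reads and writes global memory once per element (a sketch; relu is assumed to be in scope, e.g. user-defined):

import WGSL.DSL
import WGSL.Kernel

fusedShader :: ShaderM ()
fusedShader = do
  inBuf  <- declareInputBuffer  "in"  (TArray 1024 TF16)
  outBuf <- declareOutputBuffer "out" (TArray 1024 TF16)
  loop 0 1024 1 $ \i ->
    -- Single kernel: load, scale, activate, store, with no intermediate buffers
    unKernel (loadK inBuf >>> mapK (* 2.0) >>> mapK relu >>> storeK outBuf) i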


📚 Architecture & Modules

Execution Model (Latency Hiding)

To maximize GPU occupancy, encoding is separated from submission.

  • WGSL.Async.Pipeline: Use for main loops. Allows CPU to encode Token N+1 while GPU processes Token N.
  • WGSL.Execute: Low-level synchronous execution (primarily for debugging).
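
The overlap pattern looks roughly like the following. All names are illustrative placeholders, not the actual WGSL.Async.Pipeline API; the point is that encoding token N+1 on the CPU overlaps the GPU executing token N:

-- Hypothetical skeleton of a pipelined decode loop.
decodeLoop
  :: (Int -> IO batch)     -- encode: build commands on the CPU
  -> (batch -> IO fence)   -- submit: hand off to the GPU, returns immediately
  -> (fence -> IO ())      -- wait: block until the GPU finishes
  -> Int                   -- number of tokens to generate
  -> IO ()
decodeLoop encode submit wait steps = do
    fence0 <- encode 0 >>= submit
    go 1 fence0
  where
    go n inFlight
      | n >= steps = wait inFlight   -- drain the final token
      | otherwise  = do
          batch <- encode n          -- CPU encodes token n while the GPU
          wait inFlight              --   is still executing token n - 1
          submit batch >>= go (n + 1)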

Module Guide

Feature        Module         Description
Subgroup Ops   WGSL.DSL       subgroupMatrixLoad, mma, subgroupMatrixStore
F16 Math       WGSL.DSL       litF16, vec4<f16> for 2x throughput
Structs        WGSL.Struct    Generic derivation for std430 layout compliance
Analysis       WGSL.Analyze   Roofline analysis logic

📦 Installation

Pre-built Dawn binaries are downloaded automatically during installation.

cabal install webgpu-dawn


License

MIT License - see LICENSE file for details.

Acknowledgments

  • Dawn (Google): Core WebGPU runtime.
  • gpu.cpp (Answer.AI): High-level C++ API wrapper inspiration.
  • GLFW: Window management.

Contact

Maintainer: Junji Hashimoto <junji.hashimoto@gmail.com>