Description
TL;DR: WebGPU should preserve the order of all side effects (transfer read/writes, pixel read/writes) except for Unordered Access Views, which are only synchronized at the render/compute pass boundaries.
Introduction
Graphics hardware provides certain guarantees about the order of operations. We don't observe the actual execution order of the shaders, since they are largely executed in parallel, but we can observe their side effects, such as output colors written to the texture targets.
In a graphics pass, the only side effects that are not ordered are writes through Unordered Access Views (UAV for short - the term comes from DirectX, while GL land calls them shader storage buffers, or SSBO). All the other side effects are ordered according to the primitive submission order (draw call order -> instance order -> primitive order):
- RTV and DSV during rasterization (pixel, depth, and stencil writes)
- note: AMD_rasterization_order allows removing the ordering guarantees from raster operations in order to get a 5% performance gain. We should be able to figure out internally whether doing so introduces any (additional) data races, without exposing it to the user.
- stream output (i.e. transform feedback)
- Ordered Access Views (OAV)
In a compute pass, UAV is the only way to get something out, so there is nothing left to be ordered. In transfer operations, read-write and write-write hazards are possible, and pipeline barriers are required to serialize those.
UAV
In D3D12 and Vulkan, the user is expected to place memory barriers if they want to serialize UAV side effects. In Metal, UAV accesses are forcibly serialized at dispatch granularity in compute passes, and at render pass granularity otherwise.
Data races are inherent to UAVs, and that is what allows the hardware to read and write them efficiently (no need to synchronize or serialize access). Therefore, for performance reasons, we don't believe it's worth trying to enforce synchronization at a finer level than draw calls. The cost of draw-call-level synchronization is also expected to be unacceptably high for render operations, since a tiling GPU would have to flush the whole tile before proceeding past such a barrier. For this reason, Vulkan supports only a very limited set of pipeline barriers inside render passes.
Proposal
Document UAVs as a special kind of resource view that has a wide synchronization scope - the render/compute pass boundary. Any dependent operations within this scope are then considered non-portable, although it would be hard (if possible at all) for an implementation to detect those and warn appropriately.
Note that the proposal is based on the constraint that, during a pass, each resource is either in a single writable state or in a combination of readable states. This automatically prevents a situation where the user would want to write to a UAV and then re-bind it as an SRV/CBV within the same pass.
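To make the constraint concrete, here is a minimal sketch of how an implementation might validate per-pass usage. The `Usage` flags, their split into readable and writable sets, and the `PassUsageTracker` class are hypothetical names made up for illustration; they are not part of any proposed API.

```python
from enum import Flag, auto


class Usage(Flag):
    # Hypothetical usage flags; the names are illustrative only.
    VERTEX = auto()         # readable
    INDEX = auto()          # readable
    UNIFORM = auto()        # readable (CBV)
    SAMPLED = auto()        # readable (SRV)
    STORAGE = auto()        # writable (UAV)
    RENDER_TARGET = auto()  # writable (RTV/DSV)


READ_ONLY = Usage.VERTEX | Usage.INDEX | Usage.UNIFORM | Usage.SAMPLED
WRITABLE = (Usage.STORAGE, Usage.RENDER_TARGET)


def is_valid_pass_usage(usage: Usage) -> bool:
    """True if `usage` is a single writable state or any mix of readable ones."""
    if (usage & READ_ONLY) == usage:
        return True  # only readable bits are set (or no bits at all)
    return usage in WRITABLE  # exactly one writable state, nothing else


class PassUsageTracker:
    """Accumulates the usage of each resource over one render/compute pass."""

    def __init__(self) -> None:
        self._usage = {}

    def use(self, resource: str, usage: Usage) -> None:
        combined = self._usage.get(resource, Usage(0)) | usage
        if not is_valid_pass_usage(combined):
            raise ValueError(f"{resource!r}: conflicting usage within a pass")
        self._usage[resource] = combined
```

With this sketch, binding a buffer as `SAMPLED` and then `UNIFORM` in one pass is accepted, while additionally binding it as `STORAGE` raises an error, so the user would have to end the pass and start a new one.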
For transfer operations, the API knows the precise resources affected and their ranges, since those are explicitly provided by the user for copy/blit calls. Therefore, an implementation can figure out the possible hazards and insert appropriate barriers automatically. It doesn't have to be smart at first: it can insert barriers conservatively and optimize later by removing the ones it considers unnecessary. For this reason, grouping operations into a "transfer/copy pass" does not appear to have much value, and we think the group should reconsider having those passes in the API.
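As a rough illustration of the automatic hazard tracking described above, the sketch below records the byte ranges touched by each copy and emits a barrier whenever a new copy overlaps an earlier one with at least one write involved. The `Access` type, the `copy` helper, and the string-based command list are all assumptions made for the example, and the strategy is deliberately conservative (a barrier clears all pending accesses).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    resource: str
    start: int   # first byte touched
    end: int     # one past the last byte touched
    write: bool


def conflicts(a: Access, b: Access) -> bool:
    """Overlapping ranges on the same resource, with at least one write."""
    same_range = a.resource == b.resource and a.start < b.end and b.start < a.end
    return same_range and (a.write or b.write)


def copy(src: str, src_offset: int, dst: str, dst_offset: int, size: int):
    """Describe a buffer-to-buffer copy as a (read access, write access) pair."""
    return (
        Access(src, src_offset, src_offset + size, write=False),
        Access(dst, dst_offset, dst_offset + size, write=True),
    )


def schedule(copies):
    """Return the command list with barriers inserted where hazards occur."""
    commands = []
    pending = []  # accesses issued since the last barrier
    for index, accesses in enumerate(copies):
        if any(conflicts(new, old) for new in accesses for old in pending):
            commands.append("barrier")
            pending.clear()  # conservative: the barrier covers everything so far
        commands.append(f"copy {index}")
        pending.extend(accesses)
    return commands
```

For example, `schedule([copy("A", 0, "B", 0, 64), copy("B", 0, "C", 0, 64), copy("A", 64, "D", 0, 64)])` places a single barrier before the second copy, because it reads the range of `B` that the first copy wrote; the third copy touches unrelated ranges and needs none.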
Issues
Why not insert automatic barriers between compute dispatches like in Metal?
Mainly because it's not consistent with render passes. If compute UAV side effects are synchronized at the dispatch boundary, then users will seek ways to avoid hitting that synchronization point in cases where their use of a UAV is guaranteed to be safe by application-level logic that is not visible to the WebGPU implementation. These ways could include building mega-shaders that do many operations at once, which is counter to what they'd do in Vulkan/DX12 and not productive (working around the API instead of benefiting from it). If UAVs are synchronized at the pass boundaries, users always have the option to break a pass (and start a new one) if they need to depend on previous writes.
In Metal, automatic barriers make more sense because there is no constraint on a resource's usage being static across a pass. If we did this in WebGPU, we'd need to reconsider the static usage constraints, and it would hurt optimal performance on the Vulkan and D3D12 backends.