
Add device.simulateLoss(), and prevent mappedAtCreation on destroyed devices #5115

Draft
kainino0x wants to merge 1 commit into gpuweb:main from kainino0x:simulateloss

Conversation

kainino0x (Contributor) commented Mar 21, 2025

EDIT: I propose having both destroy() and simulateLoss() because they are useful for different things: destroy() to clean up resources easily during shutdown, and simulateLoss() to test application behavior on device loss.

Issue: fixes #5102 (see there for discussion and past minutes), fixes #4177
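To make the intended division of labor concrete, here is a minimal TypeScript sketch (my own illustration, assuming simulateLoss() ships roughly as proposed here; it is not an existing API, and the recovery callback is hypothetical):

```ts
// destroy() for teardown: release all of the device's resources now.
function shutdownRenderer(device: GPUDevice) {
  device.destroy(); // frees buffers/textures/etc.; device.lost still resolves
}

// simulateLoss() for testing: exercise the app's recovery path as if the
// device had been lost by the system (driver reset, GPU process crash, ...).
async function testLossRecovery(
  device: GPUDevice,
  recreate: () => Promise<GPUDevice>, // hypothetical app-provided recovery hook
) {
  const lossPromise = device.lost;
  (device as any).simulateLoss?.(); // proposed method; cast because it isn't in shipped types
  const info = await lossPromise;
  console.log('device lost:', info.reason, info.message);
  return recreate();
}
```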

mwyrzykowski left a comment

I think this attempts to work around limitations or inconsistencies in the API by introducing another API function on the GPUDevice. Instead we should address the limitations or inconsistencies without adding new API.

Specifically for the getMappedRange case on a lost device, should we just align the behavior to mapAsync when it is known the device is lost on the content timeline?

kainino0x (Contributor, Author) replied:

> I think this attempts to work around limitations or inconsistencies in the API by introducing another API function on the GPUDevice. Instead we should address the limitations or inconsistencies without adding new API.

No, my claim is that destroy() and simulateLoss() are both useful, for different things: destroy() to clean up resources easily during shutdown, and simulateLoss() to test application behavior on device loss.

> Specifically for the getMappedRange case on a lost device, should we just align the behavior to mapAsync when it is known the device is lost on the content timeline?

This would go against #1629. Of course, mapAsync already does, but I think that's more OK because it's async. That said, we did leave open the possibility of making mapAsync work on lost devices, too: #1629 (comment)
I think we mainly avoided it to simplify implementations, though I don't think the implementation is actually complicated.
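To spell out the contrast being discussed, a sketch (my own illustration; the error behavior on a lost device reflects my reading of the current spec and issue #5102, not text from this PR):

```ts
async function demo(device: GPUDevice, data: Float32Array) {
  // mappedAtCreation + getMappedRange(): content-timeline operations which,
  // per the linked issue, the current spec requires to keep working even after
  // the device is destroyed (this PR changes that for destroyed devices).
  const upload = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
  new Float32Array(upload.getMappedRange()).set(data);
  upload.unmap();

  // mapAsync(): involves the device timeline, so the promise rejects once the
  // device is lost; loss becomes observable here, though only asynchronously.
  const readback = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  try {
    await readback.mapAsync(GPUMapMode.READ);
  } catch {
    // Rejected; the app can treat this the same as any other loss signal.
  }
}
```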

mwyrzykowski left a comment

I may not be able to join the meeting today, but API calls which exist only for testing application behavior don't seem appropriate for inclusion in the specification. destroy() already results in device loss, so that seems sufficient for testing device loss if needed.

kainino0x (Contributor, Author):

I don't think it's a problem to provide things that are mainly for testing. Examples:

  • The Web platform requires all platform exception types to be user-constructible. AFAIK the only reason this is globally required is for testing purposes.
  • We probably wouldn't need the ability to catch validation errors using error scopes if not for testing. (See this doc)
  • WebGL has WEBGL_lose_context.

Applications need to be able to test their code in the standard web platform. I don't think it would be reasonable if any or all of these capabilities were hidden behind some special browser flags: code couldn't be tested under the same platform that runs in production, plus everyone who writes software for the web would need to know about this.
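For illustration, here is a hedged sketch of those existing precedents (my own example code, not from the discussion):

```ts
// WebGL: WEBGL_lose_context lets a page force and then restore context loss.
const gl = document.createElement('canvas').getContext('webgl')!;
const loseExt = gl.getExtension('WEBGL_lose_context');
loseExt?.loseContext();    // fires 'webglcontextlost' on the canvas
loseExt?.restoreContext();

// Platform exception types are user-constructible, so tests can synthesize them.
const fakeAbort = new DOMException('simulated failure', 'AbortError');

// WebGPU: error scopes let a test assert that invalid usage is actually reported.
async function expectValidationError(device: GPUDevice, fn: () => void) {
  device.pushErrorScope('validation');
  fn();
  const error = await device.popErrorScope();
  if (!error) throw new Error('expected a validation error, got none');
}
```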

kainino0x (Contributor, Author):

Re: @kdashg's proposal that we change device.destroy()'s behavior so that it doesn't unmap buffers (instead of adding a new thing).

This is possible, and it shouldn't have much direct impact on application behavior. However, a lot of WebGPU applications push the resource limits of the system, so they may be implicitly relying on device.destroy() to clean up memory used by mappings.

I don't think this will be a common problem, because mappings generally shouldn't live that long anyway. But patterns like the "queue of mapped buffers" used for data upload (sketched below) could keep several large mappings alive that would no longer be cleaned up promptly until applications update their code to release them explicitly.
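For concreteness, a rough sketch of that upload pattern (my own illustration of a common approach, not code from any particular application; the class and method names are made up):

```ts
// A pool of staging buffers that are kept mapped between uses, so writes can
// start immediately without waiting on mapAsync().
class StagingPool {
  private ready: GPUBuffer[] = [];
  constructor(private device: GPUDevice, private size: number) {}

  // Returns a buffer that is already mapped and ready to be written.
  acquire(): GPUBuffer {
    return (
      this.ready.pop() ??
      this.device.createBuffer({
        size: this.size,
        usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
        mappedAtCreation: true, // starts out mapped
      })
    );
  }

  // After the copy using this buffer is submitted, re-map it and return it to
  // the pool once the mapping resolves.
  recycle(buf: GPUBuffer) {
    buf.mapAsync(GPUMapMode.WRITE).then(() => this.ready.push(buf));
  }
}
// Today, device.destroy() unmaps these buffers and frees their memory. If
// destroy() stopped unmapping, such pools would keep their mappings alive
// until the application unmaps or drops them itself.
```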

kainino0x (Contributor, Author):

Another side note: I was wondering whether there's any conflict with triple-mapping, i.e. whether keeping the mapping alive could have costs beyond the raw memory allocation. I think there is no direct problem, since it always has to be safe for the entire GPU process to crash anyway. But triple-mapped buffers could be allocated in physical memory spaces that are more constrained than regular mappings, which would make prompt cleanup a bit more important.

Kangz (Contributor) commented Mar 27, 2025

GPU Web WG 2025-03-25/26 Pacific-time
  • KN: addresses last week's discussion.
  • GT: think simulateLoss might be good, but not sure I know enough about how loss works to know how to use it. In WebGL loss can happen at any command. Adding mappedAtCreation failing will throw everything off.
  • KN: no. Only prevents it on devices that've been destroyed. simulateLoss does the same thing as a natural device loss. Wouldn't stop you from creating buffers mappedAtCreation. Both stop you from mapping stuff asynchronously. Bit unfortunate. Talked about faking the mapping if the device is lost; didn't do that because it'd be more work for impls to fake mappings.
  • KN: don't think it's very hard to do so might want to consider doing it at some point.
  • KG: my concerns are half-similar to Mike's on the PR. Would rather have simulateLoss and not destroy() - that's what we have in WebGL. Think destroy is less important than device loss.
  • KN: only difference is in buffer mapping, which doesn't exist in WebGL.
  • KG: if you wanted that to not happen, try harder in the impl, I'd say. More important for impls to figure out whether device loss will cause problems with the app than to make it slightly simpler for them to make a mapped buffer when the other side's destroyed the context. Bunch of ways you can monkey-patch and implement destroy yourself. Most important to give you the things you can't do yourself.
  • KN: would you then propose we change the current behavior of destroy() so it doesn't unmap buffers?
  • KG: I think that'd be great.
  • KR: need to talk with partners, make sure that they aren't surprised that we aren't cleaning up their memory. Google Meet has already raised this issue at the Wasm level.
  • KN: I'll write something up.

kainino0x added the needs-cts-issue label on Mar 28, 2025
Kangz modified the milestones: Milestone 1, Milestone 2 on Oct 2, 2025

Labels

  • api (WebGPU API)
  • needs-cts-issue (This change requires tests (or would need tests if accepted), but may not have a CTS issue filed yet)
  • proposal

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Spec requires mappedAtCreation to still work after device is destroyed
  • device.destroy() cannot simulate real device loss

3 participants