fix(maniskill): avoid stale offload video counter state#1168
Open
kunni918 wants to merge 1 commit into
Open
Conversation
Signed-off-by: Kun Ni <kunni918@hotmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This removes stale
video_cntstate from_ManiskillEnvCoreoffload serialization/deserialization.video_cntis now owned by the genericRecordVideowrapper, not by the ManiSkill core env. The ManiSkill offload core still tried to save and restore it, so capturing offload state could crash before the state buffer was created.The regression tests load
maniskill_offload_env.pywith lightweight fakes so the core offload-state contract can be checked without requiring ManiSkill/Sapien.Motivation and Context
Found while investigating #114 and the closed unmerged PR #127. This PR is related context only and does not close #114, because the original issue was about video file overwrite behavior and this PR fixes an adjacent stale offload-state bug.
Minimal reproduction
I reproduced the failure on the current base commit
3a7c83eby applying only the new regression test file from this PR toupstream/mainand running the smallest test that exercises the stale state access:Result on
upstream/main: the test fails with:The failing line is:
This happens before a state buffer can be produced, so an offloaded ManiSkill core can fail during
get_state()even when video recording itself is handled outside the core env.The fakes in the test only replace ManiSkill/Sapien dependencies; the test still executes the real
_ManiskillEnvCore.get_state()method from the production file.Root-cause trace
The ownership mismatch comes from two earlier changes:
a500e7f feat: remove EnvManager and add RecordVideo wrapper (#659)moved video recording out of environment classes and intorlinf/envs/wrappers/record_video.py.ManiskillEnv.__init__stopped definingself.video_cnt.ManiskillEnvlost its localrender_images,add_new_frames(), andflush_video()video state/methods.RecordVideo.__init__now definesself.video_cnt = 0.RecordVideo.flush_video()writes{self.video_cnt}.mp4and increments the wrapper-owned counter.EnvWorkerwraps envs withRecordVideo(env, video_cfg)whenvideo_cfg.save_videois enabled.89f71e5 feat(env_offload): support ManiSkillEnv and OpenSoraEnv offload (#727)later reintroducedrlinf/envs/maniskill/maniskill_offload_env.py, but its_ManiskillEnvCorestate schema still included the old env-owned video counter:get_state()serialized"video_cnt": self.video_cnt.load_state()restoredself.video_cnt = state["video_cnt"].On current
main,_ManiskillEnvCoresubclassesManiskillEnv, andManiskillEnvno longer ownsvideo_cnt; the wrapper does. The offload core should therefore serialize simulator/core state only, not wrapper video state.How has this been tested?
Targeted local checks on this branch:
Result:
5 passed.Results: ruff check passed, format check passed, diff whitespace check passed.
I did not run a real ManiSkill/Sapien embodied e2e locally. The added regression tests intentionally avoid those optional heavy dependencies and cover the
_ManiskillEnvCoreoffload state schema directly.Additional information (optional, e.g., figures and logs):
The tests cover three state-schema cases:
get_state()no longer emitsvideo_cnt.load_state()accepts a state buffer withoutvideo_cnt.load_state()ignores a legacy buffer that still contains an extravideo_cntfield.Types of changes
Checklist: