Skip to content

fix(maniskill): avoid stale offload video counter state#1168

Open
kunni918 wants to merge 1 commit into
RLinf:mainfrom
kunni918:bugfix/maniskill-offload-state
Open

fix(maniskill): avoid stale offload video counter state#1168
kunni918 wants to merge 1 commit into
RLinf:mainfrom
kunni918:bugfix/maniskill-offload-state

Conversation

@kunni918
Copy link
Copy Markdown
Contributor

@kunni918 kunni918 commented May 17, 2026

Description

This removes stale video_cnt state from _ManiskillEnvCore offload serialization/deserialization.

video_cnt is now owned by the generic RecordVideo wrapper, not by the ManiSkill core env. The ManiSkill offload core still tried to save and restore it, so capturing offload state could crash before the state buffer was created.

The regression tests load maniskill_offload_env.py with lightweight fakes so the core offload-state contract can be checked without requiring ManiSkill/Sapien.

Motivation and Context

Found while investigating #114 and the closed unmerged PR #127. This PR is related context only and does not close #114, because the original issue was about video file overwrite behavior and this PR fixes an adjacent stale offload-state bug.

Minimal reproduction

I reproduced the failure on the current base commit 3a7c83e by applying only the new regression test file from this PR to upstream/main and running the smallest test that exercises the stale state access:

python -m pytest tests/unit_tests/test_maniskill_offload_env.py::test_maniskill_offload_state_does_not_require_record_video_counter -q

Result on upstream/main: the test fails with:

E       AttributeError: '_ManiskillEnvCore' object has no attribute 'video_cnt'
rlinf/envs/maniskill/maniskill_offload_env.py:89

The failing line is:

"video_cnt": self.video_cnt,

This happens before a state buffer can be produced, so an offloaded ManiSkill core can fail during get_state() even when video recording itself is handled outside the core env.

The fakes in the test only replace ManiSkill/Sapien dependencies; the test still executes the real _ManiskillEnvCore.get_state() method from the production file.

Root-cause trace

The ownership mismatch comes from two earlier changes:

  1. a500e7f feat: remove EnvManager and add RecordVideo wrapper (#659) moved video recording out of environment classes and into rlinf/envs/wrappers/record_video.py.

    • ManiskillEnv.__init__ stopped defining self.video_cnt.
    • ManiskillEnv lost its local render_images, add_new_frames(), and flush_video() video state/methods.
    • RecordVideo.__init__ now defines self.video_cnt = 0.
    • RecordVideo.flush_video() writes {self.video_cnt}.mp4 and increments the wrapper-owned counter.
    • EnvWorker wraps envs with RecordVideo(env, video_cfg) when video_cfg.save_video is enabled.
  2. 89f71e5 feat(env_offload): support ManiSkillEnv and OpenSoraEnv offload (#727) later reintroduced rlinf/envs/maniskill/maniskill_offload_env.py, but its _ManiskillEnvCore state schema still included the old env-owned video counter:

    • get_state() serialized "video_cnt": self.video_cnt.
    • load_state() restored self.video_cnt = state["video_cnt"].

On current main, _ManiskillEnvCore subclasses ManiskillEnv, and ManiskillEnv no longer owns video_cnt; the wrapper does. The offload core should therefore serialize simulator/core state only, not wrapper video state.

How has this been tested?

Targeted local checks on this branch:

python -m pytest tests/unit_tests/test_maniskill_offload_env.py tests/unit_tests/test_overlap_env_bootstrap.py -q

Result: 5 passed.

python -m ruff check rlinf/envs/maniskill/maniskill_offload_env.py tests/unit_tests/test_maniskill_offload_env.py
python -m ruff format --check rlinf/envs/maniskill/maniskill_offload_env.py tests/unit_tests/test_maniskill_offload_env.py
git diff --check HEAD~1..HEAD

Results: ruff check passed, format check passed, diff whitespace check passed.

I did not run a real ManiSkill/Sapien embodied e2e locally. The added regression tests intentionally avoid those optional heavy dependencies and cover the _ManiskillEnvCore offload state schema directly.

Additional information (optional, e.g., figures and logs):

The tests cover three state-schema cases:

  • get_state() no longer emits video_cnt.
  • load_state() accepts a state buffer without video_cnt.
  • load_state() ignores a legacy buffer that still contains an extra video_cnt field.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Documentation update (Document-only update)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Signed-off-by: Kun Ni <kunni918@hotmail.com>
@kunni918 kunni918 marked this pull request as ready for review May 17, 2026 07:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Video log was overwritten when set env.enable_offload=True

1 participant