http2: fix zombie session crash on socket close by suuuuuuminnnnnn · Pull Request #61702 · nodejs/node

suuuuuuminnnnnn · 2026-02-06T05:03:58Z

Summary

This PR fixes a crash caused by HTTP/2 zombie sessions when the underlying socket is dead but the HTTP/2 session remains “alive” in Node.js.

The underlying socket can become closed at the OS level without Node.js receiving a close event (e.g. packet drop “black hole”).
Subsequent writes can hit internal invariants and crash the process:
- CHECK(is_write_in_progress()) in Http2Session::OnStreamAfterWrite
- CHECK(!current_write_) in TLSWrap::DoWrite
Fix: Close the HTTP/2 session when a read error occurs (nread < 0) in Http2Session::OnStreamRead.

Fixes: #61304

What is the current behavior?

When the network enters a “black hole” state (packets dropped without RST/FIN), the OS may consider the TCP socket dead/closed, but Node.js can fail to observe the close.

In this case:

The HTTP/2 session stays in a zombie state:
- session.closed === false
- session.destroyed === false
- outbound queue grows (session.state.outboundQueueSize increases continuously)
Later write attempts can break invariants and crash the process via assertions.

Relevant code path (before):

// Only pass data on if nread > 0
if (nread <= 0) {
  if (nread < 0) {
    PassReadErrorToPreviousListener(nread);
  }
  return;  // session not closed -> zombie possible
}

What is the new behavior?

On read error (nread < 0), we still notify the previous listener, then close the HTTP/2 session to prevent zombie state and subsequent assertion crashes.

if (nread <= 0) {
  if (nread < 0) {
    PassReadErrorToPreviousListener(nread);
    // Close the session to prevent zombie state when the underlying socket is dead.
    Close(NGHTTP2_NO_ERROR, true);
  }
  return;
}

Key points:

Minimal change: only affects the read-error path.
Close() is idempotent / safe to call redundantly.
Preserves callback ordering (previous listener notified first).

Why is this correct?

A negative nread indicates an error/EOF condition at the stream/socket read layer. Keeping the HTTP/2 session alive after a read error allows the session to continue queueing outbound frames while the transport is effectively dead, which eventually leads to inconsistent internal write state and assertions.

Closing the session on read error restores the expected lifecycle invariant: dead transport ⇒ closed session.

How was this tested?

New test (CI-friendly)

Adds a new parallel test that reproduces the “zombie” behavior without OS firewall rules by destroying the underlying socket directly:

client[kSocket].destroy() to kill the transport.
Attempt follow-up requests.
Verify: no process crash, session cleans up, and the close path is taken.

New file:

test/parallel/test-http2-zombie-session-61304.js

Commands

Built locally (macOS arm64):

./configure
make -j1

Ran test:

out/Release/node test/parallel/test-http2-zombie-session-61304.js

Spot-checked a related HTTP/2 test:

out/Release/node test/parallel/test-http2-client-socket-destroy.js

Additional context / reproduction (original issue)

The original issue can be reproduced reliably on macOS using pf firewall rules to drop packets (reported 100% reproduction), creating a network “black hole” where the socket becomes dead without Node observing a clean close.

Issue: #61304

Files changed

src/node_http2.cc (small change: close session on nread < 0)
test/parallel/test-http2-zombie-session-61304.js (new)

nodejs-github-bot · 2026-02-06T05:04:04Z

Review requested:

@nodejs/http2
@nodejs/net

codecov · 2026-02-06T11:31:59Z

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.72%. Comparing base (ae2ffce) to head (56c6281).
⚠️ Report is 24 commits behind head on main.

Files with missing lines	Patch %	Lines
src/node_http2.cc	0.00%	5 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #61702      +/-   ##
==========================================
- Coverage   89.75%   89.72%   -0.03%     
==========================================
  Files         674      675       +1     
  Lines      204416   204530     +114     
  Branches    39285    39313      +28     
==========================================
+ Hits       183472   183517      +45     
- Misses      13227    13287      +60     
- Partials     7717     7726       +9

Files with missing lines	Coverage Δ
src/node_http2.cc	`81.84% <0.00%> (-0.01%)`	⬇️

... and 47 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

mcollina

lgtm

suuuuuuminnnnnn · 2026-02-06T13:27:44Z

Hi @mcollina, I pushed a lint fix commit but CI didn’t re-run—could you please re-trigger the checks for this PR?

github-actions · 2026-02-06T13:41:15Z

Failed to start CI

   ⚠  Commits were pushed since the last approving review:
   ⚠  - http2: fix zombie session crash on socket close
   ⚠  - http2: prevent assertion failure in OnStreamAfterWrite
   ⚠  - http2: handle write callback gracefully in zombie sessions
   ✘  Refusing to run CI on potentially unsafe PR

https://github.com/nodejs/node/actions/runs/21752594207

pimterry

Thanks for the contribution @suuuuuuminnnnnn!

It looks like you've removed the fix that was described in the README here as part of your changes (in 6da4fe9) and now in the complete changeset it's just disabling an assertion. Is that intentional?

In the test, I think there's a little more work required to properly reproduce the issue described. I've just checked, and it passes on my machine with current versions of Node, without your src changes. The test should fail without your fix, and then pass once the fix is applied.

It would also be good to avoid the 1 second setTimeout here - timeouts like that are generally a fragile solution to race conditions and collectively they make our test suite much slower. It's better to use events & callbacks to cleanup at the correct time instead of guessing a duration. Once you have a failing test & working fix for it, let me know if you need help finding a good approach to avoid that.

suuuuuuminnnnnn · 2026-02-07T05:36:50Z

Thanks! @pimterry

The original README fix via OnStreamRead was the wrong path for this issue (black-hole/zombie may not produce a read error and it also affects normal shutdown).
I moved the fix to where the crash happens: Http2Session::OnStreamAfterWrite now detects the inconsistent state (!is_write_in_progress()) and terminates the session via Close() instead of just returning / disabling assertions.
The test is now fail-first and avoids setTimeout(1000) by using event-based cleanup (100ms fallback only).

Let me know if you want the fallback timer removed entirely — I can tighten it further.

mcollina

lgtm

nodejs-github-bot · 2026-02-08T21:55:17Z

CI: https://ci.nodejs.org/job/node-test-pull-request/71261/

pimterry

@suuuuuuminnnnnn The new test still passes for me locally, without your changes. We need a test that fails on current Node, and then a change that fixes it. I'm testing with Node v25.6.0 (latest release) and v24.13.0, but I suspect it works on all recent versions. If it does fail for you, can you confirm exactly what version you're using and how you're running the test?

This is also a bit unusual that there's multiple 'success' scenarios in the test code here: async error event, sync thrown error, or no error at all. We would usually want just one behaviour, unless there's a very good reason for unpredictable results. In my case I don't see the error happening at all in any cases, even if I run the test in a loop a few hundred times - it always just exits cleanly with no response.

I think it really should error in this case somehow, because the request really has failed (the socket is unexpectedly dead) and we should make that clear. From the description, it sounds like you're saying that we can reliably detect this failure case after a write completes, so we should be able to enforce that: make the request write some data, use the correct after-write checks to detect the issue, and raise an 'error' event if those checks fail.

If that's correct, that approach would mean we always emit an error async, in which case we can remove the other cases from the test: use mustCall on the 'error' handler, and remove the catch. We could then also run the cleanup code at that point (e.g. inside the 'error' handler) so we could remove the setTimeout completely.

Of course, I haven't dug into this as much as you, so I may be missing some details that make this impossible. If so can you explain more about what's happening here, and the process of what you've tried for detecting this issue, and what's working or not?

suuuuuuminnnnnn · 2026-02-09T11:41:47Z

@pimterry Thanks for the review.

I tried hard to turn this into a fail-first regression test, but I can’t reproduce any failure on current Node (v25.6.0 / v24.13.0), and the test keeps passing even without the src changes. That means I don’t have a valid regression test here, so I’m going to drop this PR for now.

If someone has a known reproduction on current releases (exact script/steps), I’m happy to re-open with a deterministic failing test + fix.

http2: fix zombie session crash on socket close

020578f

nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. http2 Issues or PRs related to the http2 subsystem. needs-ci PRs that need a full CI run. labels Feb 6, 2026

http2: prevent assertion failure in OnStreamAfterWrite

6da4fe9

mcollina approved these changes Feb 6, 2026

View reviewed changes

mcollina added the request-ci Add this label to start a Jenkins CI on a PR. label Feb 6, 2026

http2: handle write callback gracefully in zombie sessions

780ec8a

github-actions bot added request-ci-failed An error occurred while starting CI via request-ci label, and manual interventon is needed. and removed request-ci Add this label to start a Jenkins CI on a PR. labels Feb 6, 2026

http2: handle write callback gracefully in zombie sessions

695b5fe

pimterry requested changes Feb 6, 2026

View reviewed changes

http2: fix zombie session crash on socket close

9706c30

http2: fix zombie session crash on socket close

56c6281

mcollina approved these changes Feb 8, 2026

View reviewed changes

mcollina added request-ci Add this label to start a Jenkins CI on a PR. and removed request-ci-failed An error occurred while starting CI via request-ci label, and manual interventon is needed. labels Feb 8, 2026

github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Feb 8, 2026

pimterry requested changes Feb 9, 2026

View reviewed changes

http2: defer zombie session cleanup to avoid reentrancy

8d9dacc

suuuuuuminnnnnn closed this Feb 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

http2: fix zombie session crash on socket close#61702

http2: fix zombie session crash on socket close#61702
suuuuuuminnnnnn wants to merge 7 commits intonodejs:mainfrom
suuuuuuminnnnnn:fix-http2-zombie-session

suuuuuuminnnnnn commented Feb 6, 2026

Uh oh!

nodejs-github-bot commented Feb 6, 2026

Uh oh!

codecov bot commented Feb 6, 2026 •

edited

Loading

Uh oh!

mcollina left a comment

Uh oh!

suuuuuuminnnnnn commented Feb 6, 2026

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

pimterry left a comment

Uh oh!

suuuuuuminnnnnn commented Feb 7, 2026

Uh oh!

mcollina left a comment

Uh oh!

nodejs-github-bot commented Feb 8, 2026

Uh oh!

pimterry left a comment

Uh oh!

suuuuuuminnnnnn commented Feb 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

suuuuuuminnnnnn commented Feb 6, 2026

Summary

What is the current behavior?

What is the new behavior?

Why is this correct?

How was this tested?

New test (CI-friendly)

Commands

Additional context / reproduction (original issue)

Files changed

Uh oh!

nodejs-github-bot commented Feb 6, 2026

Uh oh!

codecov bot commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mcollina left a comment

Choose a reason for hiding this comment

Uh oh!

suuuuuuminnnnnn commented Feb 6, 2026

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

pimterry left a comment

Choose a reason for hiding this comment

Uh oh!

suuuuuuminnnnnn commented Feb 7, 2026

Uh oh!

mcollina left a comment

Choose a reason for hiding this comment

Uh oh!

nodejs-github-bot commented Feb 8, 2026

Uh oh!

pimterry left a comment

Choose a reason for hiding this comment

Uh oh!

suuuuuuminnnnnn commented Feb 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Feb 6, 2026 •

edited

Loading