fix: handle nested ExecutorRun in fake_aminsertcleanup (#4843)#4924
fix: handle nested ExecutorRun in fake_aminsertcleanup (#4843)#4924Nihal-Pandey-2302 wants to merge 11 commits into
Conversation
philippemnoel
left a comment
There was a problem hiding this comment.
Can you write a test for this please?
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4924 +/- ##
==========================================
+ Coverage 76.83% 77.06% +0.23%
==========================================
Files 200 199 -1
Lines 54468 54293 -175
==========================================
- Hits 41852 41843 -9
+ Misses 12616 12450 -166
🚀 New features to boost your workflow:
|
b03aeaa to
4fbce68
Compare
|
@philippemnoel I added the doc comment on the stack invariant and a regression guard in I tried to write a dedicated test to trigger the nested executor panic but ran into tooling walls. Standard PL/pgSQL or I also tried writing a Rust-level FFI test to mock Since we can't reliably mock the FFI pointers in the test harness, and standard SQL won't nest the executor, the regression guard is the safest bet. @Pandaaaa906 has the exact |
|
That does not sound ideal. Without a real test, how can we guarantee that this feature won't regress? |
|
@philippemnoel I completely agree it's not ideal. It is never comfortable flying blind on a regression. To put the long-term risk into perspective: this bug specifically lives in If we must have an automated guarantee for this PG15-17 shim, the only reliable path forward is to build a custom It is absolutely doable, but it requires wiring up some heavy, test-specific FFI scaffolding just to validate a legacy polyfill. Given this only affects the pre-PG18 shim, do you feel that test scaffolding is worth the maintenance overhead? Or are we okay relying on the happy-path regression guard + the user's manual validation for this specific edge case? I am fully happy to build out the SPI test if you prefer the strict guarantee! |
3d03468 to
6463970
Compare
|
Done , went ahead and built it out. The @philippemnoel Added the regression test in the latest commit. A Without the depth-counter fix this panics. With the fix: |
|
@philippemnoel Retracting my previous comment , the To summarize everything I tried:
The only real path to an automated test is a custom C extension that calls What the branch does have instead: the doc comment on If you want the C extension route I'll build it, just say the word. Otherwise I'd argue the documented invariant + reporter validation is the right tradeoff for a legacy shim with a known deprecation timeline. |
83c178c to
c757570
Compare
|
That sounds reasonable but then we would be a confirmation that a manual test was run and that this is correct |
@philippemnoel Agreed. I've tagged @Pandaaaa906 directly on the issue asking them to validate the fix against their bingo setup. Once they confirm it resolves the panic on their end, that serves as the manual test confirmation. |
It would be preferrable if you could come up with a script that tests with bingo.cansmile, even if we don't commit it as a test. |
|
@philippemnoel Makes total sense. As requested, I won't commit this to the repo to avoid polluting the CI environments that don't have -- Validation script for issue #4843
-- Requirements: ParadeDB (with pg_search), bingo extension, PostgreSQL 16
-- Run via: psql -d <db_name> -f bingo_validation.sql
CREATE EXTENSION IF NOT EXISTS bingo;
CREATE TABLE bingo_nested_exec_test (
id SERIAL PRIMARY KEY,
smiles TEXT NOT NULL
);
-- The bingo.cansmiles expression evaluation triggers the internal SPI call
-- which forces the nested ExecutorRun_hook
CREATE INDEX bingo_nested_exec_test_bm25_idx
ON bingo_nested_exec_test
USING bm25 (id, (bingo.cansmiles(smiles)::pdb.simple))
WITH (key_field = 'id');
-- Expected Before Fix: PANIC "entered unreachable code: get_insert_state"
-- Expected After Fix: INSERT 0 5
INSERT INTO bingo_nested_exec_test (smiles)
SELECT repeat('C', n)
FROM generate_series(1, 5) AS n;
-- Expected: 5
SELECT COUNT(*) AS row_count FROM bingo_nested_exec_test;
DROP TABLE bingo_nested_exec_test;I have tagged the issue reporter so they can run this exact snippet against their local |
That's a pretty small test. I think we should install Bingo in CI on PG16 and then have this as a test as I suggested initially. |
@philippemnoel I can definitely wire this up, but I want to flag the infrastructure cost before I modify the workflow files. The If you are comfortable adding a manually downloaded, third-party binary step to the CI pipeline just to test this PG15-17 polyfill, I will go ahead and build it! Let me know if you are good with that repo-hygiene tradeoff, and I'll push the workflow changes. |
That's probably just a few seconds, no? Let's only run it on PG16 or whatever is the last version that requires this fix. |
@philippemnoel Sounds like a plan. I will scope the bingo installation and the regression script strictly to the PG16 matrix run so we don't bloat the other test environments. I'll push the workflow updates shortly! |
|
@philippemnoel Hit an architecture blocker on the CI integration. Bingo only ships prebuilt Linux binaries for The problem: all PG16 system matrix runs use That said, I did validate the full fix end-to-end locally on x86_64 PG16 (system Postgres, not pgrx). Here's exactly what I ran:
Output: No panic. The fix is confirmed working against real bingo on real PG16 x86_64. To run this in CI we'd need to add a dedicated PG16 amd64 matrix entry (same pattern as the existing PG18 amd64 include). That adds one extra runner to the PR matrix. Is that acceptable, or would you prefer to keep the regression script as a manual validation artifact and merge as-is? |
f92e3f1 to
04efbc7
Compare
## Summary - SchemaBot was failing with phantom "missing migration" errors on PRs from forks (e.g. #4924). The workflow checks out the PR head's repo and then runs `git checkout ${{ pull_request.base.ref }}` for the "old" schema — which on a fork PR resolves to the fork's own (often stale) `main`. The diff then surfaces every entity added to upstream `main` since the fork was last synced, and additionally downgrades cargo-pgrx if the fork still pins an older version. Fetch the base ref from the PR's base repo and check that out instead, so the diff is always against the real upstream main. - Adjacent fix: `FIRST_PGRX_VERSION` was being set from `${PGRX_VERSION}`, an undefined env var, so it was always empty and the version-mismatch reinstall path in the base-checkout step always fired. Source the value from `steps.pgrx.outputs.version` (already extracted at the top of the job) instead. Without this the "did pgrx change between base and head?" guard never short-circuits. ## Test plan - [x] Re-run SchemaBot on #4924 (fork PR) and confirm it now passes without the contributor needing to rebase. - [x] Confirm SchemaBot still passes on an in-org PR (sanity check that the `upstream` remote dance doesn't break the same-repo case). - [x] Confirm the cargo-pgrx reinstall step is now skipped when base and head pin the same pgrx version (look for absence of "Replaced package cargo-pgrx" in the log).
- Adds depth counter to ExecutorRunEntry to prevent shadowing during nested runs. - Adds stack nesting invariant doc comment. - Adds SQL regression guard for multi-row expression-index inserts. - Suppresses clippy needless_update warning in joinscan.
Add a #[pg_test] that reproduces the nesting pattern from the bug without requiring the bingo extension. A #[pg_extern] function (spi_identity) runs Spi::run() internally. When used as a BM25 expression index, evaluating it during a multi-row INSERT fires a nested ExecutorRun_hook while the outer stack slot is still live — the same code path that bingo.cansmiles triggers via its internal C++ SPI calls. Without the depth-counter fix this panics with: entered unreachable code: get_insert_state 185 passed, 0 failed on pg16.
04efbc7 to
a55fde2
Compare
mdashti
left a comment
There was a problem hiding this comment.
A couple of thoughts from the review side, dropping them here since they're conversational rather than line-specific:
On the bingo CI question — the arm64 issue is real but I don't think it's actually a blocker. The matrix already has a PG18 amd64 entry; mirroring that for PG16 (the only version that needs this shim) gets you bingo coverage without touching the arm64 runs. A few seconds of curl + a one-time install in the workflow is the cost, and it scopes cleanly to the PG16 row.
Separately, @Pandaaaa906's manual confirmation against their bingo setup is the load-bearing evidence on this PR right now. Can we add a repro in the tests?
| execute_once: bool, | ||
| ) { | ||
| if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() { | ||
| entry.depth += 1; |
There was a problem hiding this comment.
How confident are we that nested ExecutorRun calls always pair 1:1 with ExecutorFinish? Cursors and some SPI shapes can fire Run multiple times against the same QueryDesc with a single trailing Finish. If a path like that goes through this branch, depth gets incremented but never fully drained, and the outer Finish skips cleanup. Worth tracing what happens with a SELECT inside a SECURITY DEFINER, or a cursor-bound query inside the index expression, before we land this. Would love to see a RAII mechanism here.
There was a problem hiding this comment.
The 1:1 assumption holds for this specific path because the depth counter only increments when the top of the stack is already Some(...), meaning aminsert has already pushed an InsertState. That only happens inside a DML executor cycle. A cursor firing ExecutorRun multiple times against the same QueryDesc would be a scan, not a DML - it would never call aminsert, never push an InsertState, and so the top of the stack would still be None when the nested ExecutorRun fires. The depth branch would not be taken. That said, a RAII guard would make this invariant self-enforcing rather than documented - happy to add one if you want the stronger guarantee.
There was a problem hiding this comment.
Cursor argument tracks. The case I'm still chewing on is nested DML into the same indexed table though. With the old code an inner ExecutorRun pushed a fresh None, so a nested aminsert into table T promoted it to a separate Some({T_idx: nested_state}) and the two states didn't collide. With the depth-branch shortcut, the inner aminsert lands in the same active map and hits Entry::Occupied for the same indexrelid, which trips the existing unreachable!(). I don't have a concrete repro in mind, but a re-entrant trigger on a bm25-indexed table feels close enough that it'd be worth either ruling it out explicitly or going for the RAII guard.
There was a problem hiding this comment.
Coming back to this with a more concrete trace because I think it's actually reachable. Picture a recursive trigger on a bm25-indexed table T, or a function used in an expression index on T that runs an SPI INSERT back into T.
Outer INSERT fires, outer aminsert pushes Some({T_idx: outer_state, depth:0}). Trigger or SPI INSERT fires its own ExecutorRun, the new depth-bump branch keeps the same Some on top. Inner aminsert hits init_insert_state with a fresh ii_AmCache (different IndexInfo than outer's), so it goes through push_insert_state. get_or_insert_default returns the existing Some, active.entry(T_idx) is Occupied, unreachable!() trips.
The old code handled this: inner Run pushed a fresh None, inner aminsert promoted it to a separate Some({T_idx: inner_state}), two independent frames cleaning up in LIFO order. The new code trades the bingo panic for a different panic on a narrower-but-real path.
Two ways out I can see: relax push_insert_state to handle Entry::Occupied gracefully (reuse outer's state, accept writer-sharing), or go RAII so each nesting level gets its own frame. Happy to defer if I'm misreading the call graph, but I don't think it lands safely as-is. Let me know if a walkthrough would help.
There was a problem hiding this comment.
The RAII rework in 467d3a5 lands it. Per-frame isolation means the inner aminsert always targets a clean HashMap regardless of what the outer frame is holding, so Entry::Occupied is genuinely unreachable now. Appreciate you going for the deeper refactor instead of patching the symptom.
| let entry = EXECUTOR_RUN_STACK | ||
| .pop() | ||
| .expect("should have an ExecutorRuntimeState entry")?; | ||
| let entry = EXECUTOR_RUN_STACK.pop()??; |
There was a problem hiding this comment.
Quick aside: pop()?? quietly no-ops if the stack is unexpectedly empty, where the previous expect(...) used to surface that as a panic. If the abort-callback clearing it out really is the only legitimate path that gets here, mind dropping a one-liner so the next person doesn't have to reverse-engineer the intent? Otherwise we lose a useful tripwire if some future hook re-ordering breaks this.
There was a problem hiding this comment.
Added a comment in the latest commit explaining this. The ?? is intentional, the abort callback may have already cleared the stack before executor_finish_hook fires, which is the only legitimate empty-stack path.
There was a problem hiding this comment.
Comment reads well, thanks.
| -- | ||
| -- This test exercises the multi-row + expression-index path. It does NOT | ||
| -- trigger the nested-executor bug (that requires the external bingo extension), | ||
| -- but it guards against regressions in the normal fake-aminsertcleanup flow. |
There was a problem hiding this comment.
The comment here admits this doesn't actually exercise the bug, which makes me uneasy about it living under the #4843 banner. It would pass on the broken code too, so it isn't really a regression guard for this issue. I'd either fold it into a generic expression-index test (where it does cover real ground), or drop it and let the bingo CI work be the regression coverage for this specific path. Mixed signal as-is.
There was a problem hiding this comment.
Removed the #4843 banner in the latest commit and folded it into a neutral Test 33 heading. The bingo CI step is the actual regression coverage for this path.
There was a problem hiding this comment.
Reads better with the neutral heading, thanks.
| -- Verify updated rows | ||
| SELECT COUNT(*) FROM test_multirow_expr_guard WHERE content @@@ 'updated'; | ||
|
|
||
| DROP TABLE test_multirow_expr_guard; No newline at end of file |
There was a problem hiding this comment.
Missing trailing newline on the file. Pre-commit usually catches this; might be worth a quick rerun before the next push.
There was a problem hiding this comment.
Thanks for pointing it out, Sorted.
There was a problem hiding this comment.
Good on this one. Heads up though, the new bingo_nested_executor_fix.sql is missing its trailing newline.
- Add comment explaining pop()?? empty-stack tolerance in aminsertcleanup_stack - Remove paradedb#4843 banner from verify_bm25_index.sql (test doesn't trigger the nested executor bug, fold into generic expression-index coverage) - Add trailing newline to verify_bm25_index.sql - Add PG16 amd64 matrix entry and bingo 1.40.0 install step for CI - Add bingo_nested_executor_fix.sql regression test (validated locally on x86_64 PG16 with real bingo — INSERT 0 5, no panic)
cf57b83 to
3d37005
Compare
|
@philippemnoel The PG16 amd64 job is failing with an AWS EC2 fleet capacity error - |
mdashti
left a comment
There was a problem hiding this comment.
@Nihal-Pandey-2302 please take a look.
| execute_once: bool, | ||
| ) { | ||
| if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() { | ||
| entry.depth += 1; |
There was a problem hiding this comment.
Cursor argument tracks. The case I'm still chewing on is nested DML into the same indexed table though. With the old code an inner ExecutorRun pushed a fresh None, so a nested aminsert into table T promoted it to a separate Some({T_idx: nested_state}) and the two states didn't collide. With the depth-branch shortcut, the inner aminsert lands in the same active map and hits Entry::Occupied for the same indexrelid, which trips the existing unreachable!(). I don't have a concrete repro in mind, but a re-entrant trigger on a bm25-indexed table feels close enough that it'd be worth either ruling it out explicitly or going for the RAII guard.
| let entry = EXECUTOR_RUN_STACK | ||
| .pop() | ||
| .expect("should have an ExecutorRuntimeState entry")?; | ||
| let entry = EXECUTOR_RUN_STACK.pop()??; |
There was a problem hiding this comment.
Comment reads well, thanks.
| -- | ||
| -- This test exercises the multi-row + expression-index path. It does NOT | ||
| -- trigger the nested-executor bug (that requires the external bingo extension), | ||
| -- but it guards against regressions in the normal fake-aminsertcleanup flow. |
There was a problem hiding this comment.
Reads better with the neutral heading, thanks.
| -- Verify updated rows | ||
| SELECT COUNT(*) FROM test_multirow_expr_guard WHERE content @@@ 'updated'; | ||
|
|
||
| DROP TABLE test_multirow_expr_guard; No newline at end of file |
There was a problem hiding this comment.
Good on this one. Heads up though, the new bingo_nested_executor_fix.sql is missing its trailing newline.
| bash bingo-pg-install.sh -y -pglibdir | ||
| sudo -u postgres psql -f bingo_install.sql | ||
| sudo -u postgres psql -d postgres \ | ||
| -f pg_search/tests/pg_regress/sql/bingo_nested_executor_fix.sql |
There was a problem hiding this comment.
Couple of issues stacking up here, even once the runner allocation is sorted out:
- After the cd into the bingo dir, the relative path pg_search/tests/pg_regress/sql/bingo_nested_executor_fix.sql resolves under the bingo extraction dir, not the repo root. Want ${{ github.workspace }}/pg_search/... or step out of the bingo dir first.
- sudo -u postgres psql connects via the system Postgres socket, but the workflow uses cargo pgrx start which spins up the pgrx-managed instance on port 288xx. They're different Postgres instances; bingo lands in one and the test runs against the other.
- bingo_nested_executor_fix.sql doesn't CREATE EXTENSION pg_search, so the bm25 index will fail unless the connecting db already has pg_search loaded.
- No checksum on the wget. Worth pinning a sha256 against the zip so a tampered upstream doesn't sneak past us.
Easiest path is probably to use psql -h localhost -p 288${{ matrix.pg_version }} -U $(whoami) -d <test_db> with the pgrx connection details, and add CREATE EXTENSION pg_search at the top of the SQL file.
Overall, I'm not sure if adding this as a new CI flow is a good idea. Maybe we can confirm and then remove it?
There was a problem hiding this comment.
@mdashti All valid points. The psql connection, missing extension, path issue, and no checksum are all real problems. Given the complexity and your suggestion to defer, I'll hold off on the bingo CI step until @philippemnoel weighs in on whether he wants it at all.
There was a problem hiding this comment.
No need to add it to the CI
There was a problem hiding this comment.
@philippemnoel Done, removed the bingo CI step and the PG16 amd64 matrix entry. The validation script stays in tests/pg_regress/sql/bingo_nested_executor_fix.sql as a manual reference for anyone who needs to reproduce the bug locally with bingo installed.
There was a problem hiding this comment.
Thanks for pulling it. Two leftover bits worth tidying though, since the CI step is gone:
- bingo_nested_executor_fix.sql is still under pg_search/tests/pg_regress/sql/ but has no .out file and isn't wired into anything.
- The Postgres-log fallback step and the AMD/PG18 comment cosmetic tweaks in this same file are leftover from the CI integration attempt. Not harmful, but they don't belong in a behavior-fix PR. Easier to revert them and keep the diff focused.
|
@mdashti Good point on the re-entrant trigger case. You're right that if a BEFORE/AFTER trigger on a BM25-indexed table fires a DML that inserts into the same table, the inner aminsert would try to push an InsertState for the same indexrelid into the already-active map and hit Entry::Occupied. The depth counter doesn't help there. I'll look at whether this path is actually reachable in practice - triggers on BM25-indexed tables doing DML back into the same table is an unusual pattern, but it's not impossible. A RAII guard would handle both cases cleanly. Should I go ahead and implement it? |
mdashti
left a comment
There was a problem hiding this comment.
@Nihal-Pandey-2302 Thanks.
| bash bingo-pg-install.sh -y -pglibdir | ||
| sudo -u postgres psql -f bingo_install.sql | ||
| sudo -u postgres psql -d postgres \ | ||
| -f pg_search/tests/pg_regress/sql/bingo_nested_executor_fix.sql |
There was a problem hiding this comment.
Thanks for pulling it. Two leftover bits worth tidying though, since the CI step is gone:
- bingo_nested_executor_fix.sql is still under pg_search/tests/pg_regress/sql/ but has no .out file and isn't wired into anything.
- The Postgres-log fallback step and the AMD/PG18 comment cosmetic tweaks in this same file are leftover from the CI integration attempt. Not harmful, but they don't belong in a behavior-fix PR. Easier to revert them and keep the diff focused.
|
Done, removed |
mdashti
left a comment
There was a problem hiding this comment.
Stepping back to flag a correctness regression I caught on a closer pass through the call graph. The depth-counter approach trades the bingo panic (#4843) for a different panic on the nested-same-index DML path: recursive AFTER INSERT triggers on a bm25-indexed table, or a function used in an expression index that runs an SPI INSERT back into the same table.
Trace: outer aminsert pushes Some({T_idx: outer_state, depth:0}). Inner DML's ExecutorRun hits the new depth-bump branch, keeping the same Some on top. Inner aminsert calls init_insert_state with a fresh ii_AmCache (different IndexInfo per executor cycle), so it goes through push_insert_state, which lands at active.entry(T_idx) → Occupied → unreachable!().
Old code handled this cleanly: inner Run pushed a fresh None, inner aminsert promoted it to a separate Some({T_idx: inner_state}), two independent frames cleaning up in LIFO order.
Two ways forward I would consider before merge:
- Make push_insert_state graceful on Entry::Occupied (reuse outer state, accept writer-sharing implications).
- Take the RAII offer so each nesting level gets its own frame.
Inclined to hold the merge until that is resolved, or until we have an explicit test that exercises a recursive trigger / SPI INSERT into a bm25-indexed table. The polyfill is load-bearing for PG15-17 users until PG17 EOL, and a rare-but-real trigger pattern panicking silently in production is not a great trade for fixing #4843.
| execute_once: bool, | ||
| ) { | ||
| if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() { | ||
| entry.depth += 1; |
There was a problem hiding this comment.
Coming back to this with a more concrete trace because I think it's actually reachable. Picture a recursive trigger on a bm25-indexed table T, or a function used in an expression index on T that runs an SPI INSERT back into T.
Outer INSERT fires, outer aminsert pushes Some({T_idx: outer_state, depth:0}). Trigger or SPI INSERT fires its own ExecutorRun, the new depth-bump branch keeps the same Some on top. Inner aminsert hits init_insert_state with a fresh ii_AmCache (different IndexInfo than outer's), so it goes through push_insert_state. get_or_insert_default returns the existing Some, active.entry(T_idx) is Occupied, unreachable!() trips.
The old code handled this: inner Run pushed a fresh None, inner aminsert promoted it to a separate Some({T_idx: inner_state}), two independent frames cleaning up in LIFO order. The new code trades the bingo panic for a different panic on a narrower-but-real path.
Two ways out I can see: relax push_insert_state to handle Entry::Occupied gracefully (reuse outer's state, accept writer-sharing), or go RAII so each nesting level gets its own frame. Happy to defer if I'm misreading the call graph, but I don't think it lands safely as-is. Let me know if a walkthrough would help.
There was a problem hiding this comment.
Separate-but-related doc nit on this same file: the comment up top says EXECUTOR_RUN_STACK must never have None on top of Some(...). That's not actually true with the broader hook surface, since process_utility_hook unconditionally pushes None and can sit on top of an active aminsert frame (e.g., a DO block invoked from inside DML expression evaluation). The invariant the depth counter actually enforces is narrower: it keeps a nested ExecutorRun from shadowing the active aminsert frame specifically. Worth tightening the wording so the next person doesn't reason from a stronger guarantee than we actually provide.
There was a problem hiding this comment.
The new "what is NOT guaranteed" callout is great. Thanks.
|
@mdashti The trace is correct and I can reproduce the logic failure on paper. The depth-counter approach does break the recursive trigger / SPI-INSERT-into-same-table path. I'll implement the RAII guard, it's the cleanest fix since it restores independent frames per nesting level without the writer-sharing implications of reusing |
…up (paradedb#4843) The depth-counter approach introduced a correctness regression on the nested same-table DML path: a recursive AFTER INSERT trigger or SPI INSERT into the same BM25-indexed table would hit Entry::Occupied in push_insert_state and panic via unreachable!(). Replace Option<ExecutorRunEntry> + depth counter with InsertFrame Vec and a RAII FrameGuard. Each ExecutorRun/ProcessUtility hook invocation gets its own independent InsertFrame via FrameGuard::new(). Nested DML always pushes a fresh frame so inner aminsert calls target an empty HashMap — no collision. FrameGuard::drop runs insertcleanup for all accumulated InsertStates and pops the frame automatically, including on unwind. A std::thread::panicking() guard prevents double-panic during error paths. Also corrects the stack invariant doc comment: empty frames from ProcessUtility can legally sit above live ExecutorRun frames.
|
@mdashti Implemented the RAII approach. Each |
mdashti
left a comment
There was a problem hiding this comment.
@Nihal-Pandey-2302 Thanks. LGTM and inlined some final comments, but let's wait for the confirmation of the repro not happening on top of this PR based on the issue comments in #4843
| execute_once: bool, | ||
| ) { | ||
| if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() { | ||
| entry.depth += 1; |
There was a problem hiding this comment.
The RAII rework in 467d3a5 lands it. Per-frame isolation means the inner aminsert always targets a clean HashMap regardless of what the outer frame is holding, so Entry::Occupied is genuinely unreachable now. Appreciate you going for the deeper refactor instead of patching the symptom.
There was a problem hiding this comment.
The new "what is NOT guaranteed" callout is great. Thanks.
| // ----------------------------------------------------------------------- | ||
| // ExecutorFinish hook | ||
| // | ||
| // This hook no longer has any cleanup responsibility. By the time Postgres |
There was a problem hiding this comment.
Sanity check before merge: cleanup used to run here at ExecutorFinish, now it runs in FrameGuard::drop at the end of each ExecutorRun. Same outcome for a normal INSERT (single Run, single Finish), but a portal doing INSERT...RETURNING via repeated FETCHes would have multiple Runs against one Finish. Old code accumulated InsertState across those Runs; new code commits per Run, then ii_AmCache stays set so the second FETCH skips push_insert_state and tries to read from an empty frame.
Not sure that shape is actually reachable for indexed DML — old code arguably had a similar bug via the None-on-top path, so this might be a no-op concern. Worth a quick trace though. If it does land in panic territory, the fix is probably as simple as clearing ii_AmCache alongside the cleanup loop in FrameGuard::drop.
There was a problem hiding this comment.
@mdashti Thanks for the LGTM and the detailed review. Agreed on waiting for @Pandaaaa906's confirmation before merge.
Regarding the INSERT ... RETURNING portal edge case: you are spot on. If ii_AmCache holds a stale pointer to the state cleared by FrameGuard::drop during a subsequent Run, it will inevitably panic on cache read.
I'll prep a tiny follow-up commit to clear ii_AmCache inside FrameGuard::drop just to be safe. I'll hold off on pushing it until we hear back from the issue reporter so we don't muddy the waters while they test.
|
Hi, @Nihal-Pandey-2302
|
@Pandaaaa906 Thanks for testing man!! |

Fixes #4843
Problem
executor_run_hookunconditionally pushed a newNoneentry ontoEXECUTOR_RUN_STACKfor everyExecutorRun. When an index expression (e.g.,bingo.cansmiles) triggered a nestedExecutorRunduring multi-row insert on PG16, the outerInsertStatewas shadowed by a newNoneat the top of the stack.get_insert_statethen sawNoneand hitunreachable!.This only affected PG16 (and PG15/PG17 which use
fake_aminsertcleanup). PG18+ uses nativeaminsertcleanupand was not affected.Root Cause
The fake
aminsertcleanupmechanism assumed a 1:1 mapping betweenExecutorRunandExecutorFinish. Nested executor calls — from SPI-executing functions in expression indexes — break this assumption by pushing a second stack slot while the first is still active.Fix
depth: u32counter toExecutorRunEntryexecutor_run_hook_pg15_17/pg18: if the top stack entry is alreadySome(...), incrementdepthinstead of pushing a new slotexecutor_finish_hook: ifdepth > 0, decrement and return without popping or cleaning updepth == 0(outermost finish) do we callaminsertcleanup_stack()and finalizeaminsertcleanup_stacktolerant of empty stack (abort callback may have already cleared it)Also
#[allow(clippy::needless_update)]tofinalize_clause_into_pathbecausepg_sys::CustomPathhas version-dependent fields (e.g.,custom_restrictinfo) that are safely filled by..Default::default().Testing
cargo pgrx test pg16: 185 passed, 0 failedcargo clippy -p pg_search --no-default-features --features pg16 -- -D warnings: cleanbingoextension which triggers nestedExecutorRunvia internal C++ APIs during expression evaluation. I attempted to reproduce with a minimal C extension usingSPI_execandSPI_cursor_open, but these do not trigger the sameExecutorRun_hooknesting path asbingo. The issue reporter (@Pandaaaa906) has the only known repro environment and has been asked to validate this fix.