Skip to content

fix: handle nested ExecutorRun in fake_aminsertcleanup (#4843)#4924

Open
Nihal-Pandey-2302 wants to merge 11 commits into
paradedb:mainfrom
Nihal-Pandey-2302:fix/insert-state-nested-executor-4843
Open

fix: handle nested ExecutorRun in fake_aminsertcleanup (#4843)#4924
Nihal-Pandey-2302 wants to merge 11 commits into
paradedb:mainfrom
Nihal-Pandey-2302:fix/insert-state-nested-executor-4843

Conversation

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor

Fixes #4843

Problem

executor_run_hook unconditionally pushed a new None entry onto EXECUTOR_RUN_STACK for every ExecutorRun. When an index expression (e.g., bingo.cansmiles) triggered a nested ExecutorRun during multi-row insert on PG16, the outer InsertState was shadowed by a new None at the top of the stack. get_insert_state then saw None and hit unreachable!.

This only affected PG16 (and PG15/PG17 which use fake_aminsertcleanup). PG18+ uses native aminsertcleanup and was not affected.

Root Cause

The fake aminsertcleanup mechanism assumed a 1:1 mapping between ExecutorRun and ExecutorFinish. Nested executor calls — from SPI-executing functions in expression indexes — break this assumption by pushing a second stack slot while the first is still active.

Fix

  • Add a depth: u32 counter to ExecutorRunEntry
  • In executor_run_hook_pg15_17 / pg18: if the top stack entry is already Some(...), increment depth instead of pushing a new slot
  • In executor_finish_hook: if depth > 0, decrement and return without popping or cleaning up
  • Only at depth == 0 (outermost finish) do we call aminsertcleanup_stack() and finalize
  • Make aminsertcleanup_stack tolerant of empty stack (abort callback may have already cleared it)

Also

  • Adds #[allow(clippy::needless_update)] to finalize_clause_into_path because pg_sys::CustomPath has version-dependent fields (e.g., custom_restrictinfo) that are safely filled by ..Default::default().

Testing

  • cargo pgrx test pg16: 185 passed, 0 failed
  • cargo clippy -p pg_search --no-default-features --features pg16 -- -D warnings: clean
  • Manual multi-row insert with expression index: passes
  • Note: The exact repro requires the bingo extension which triggers nested ExecutorRun via internal C++ APIs during expression evaluation. I attempted to reproduce with a minimal C extension using SPI_exec and SPI_cursor_open, but these do not trigger the same ExecutorRun_hook nesting path as bingo. The issue reporter (@Pandaaaa906) has the only known repro environment and has been asked to validate this fix.

Copy link
Copy Markdown
Member

@philippemnoel philippemnoel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write a test for this please?

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.06%. Comparing base (b09f4be) to head (484a082).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4924      +/-   ##
==========================================
+ Coverage   76.83%   77.06%   +0.23%     
==========================================
  Files         200      199       -1     
  Lines       54468    54293     -175     
==========================================
- Hits        41852    41843       -9     
+ Misses      12616    12450     -166     
Files with missing lines Coverage Δ
pg_search/src/postgres/customscan/joinscan/mod.rs 91.48% <ø> (+0.06%) ⬆️

... and 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Nihal-Pandey-2302 Nihal-Pandey-2302 force-pushed the fix/insert-state-nested-executor-4843 branch 2 times, most recently from b03aeaa to 4fbce68 Compare April 29, 2026 19:14
@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@philippemnoel I added the doc comment on the stack invariant and a regression guard in verify_bm25_index.sql to ensure the normal multi-row expression index path stays intact.

I tried to write a dedicated test to trigger the nested executor panic but ran into tooling walls. Standard PL/pgSQL or SPI_exec don't actually trigger the nested ExecutorRun_hook path—the nesting seems specific to how bingo.cansmiles evaluates expressions via its internal C++ APIs.

I also tried writing a Rust-level FFI test to mock PREV_EXECUTOR_RUN_HOOK and explicitly verify the depth logic. That failed on two fronts: the cargo pgrx test schema generator silently strips out test functions that mock unsafe static C-pointers, and falling back to a standard cargo test fails linking because the Postgres backend symbols (like ExceptionalCondition) aren't available in standalone targets.

Since we can't reliably mock the FFI pointers in the test harness, and standard SQL won't nest the executor, the regression guard is the safest bet. @Pandaaaa906 has the exact bingo setup required to trigger this, so they are in the best position to validate the fix on their end.

@philippemnoel
Copy link
Copy Markdown
Member

That does not sound ideal. Without a real test, how can we guarantee that this feature won't regress?

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@philippemnoel I completely agree it's not ideal. It is never comfortable flying blind on a regression.

To put the long-term risk into perspective: this bug specifically lives in fake_aminsertcleanup.rs, which is a polyfill exclusively for PG15–PG17. PG18+ uses the native aminsertcleanup API and is completely immune to this issue. Because of this, the code we are patching has a limited lifespan and will eventually be deprecated.

If we must have an automated guarantee for this PG15-17 shim, the only reliable path forward is to build a custom #[pg_extern] function (compiled only under #[cfg(test)]) that explicitly triggers a nested SPI_execute call. We would then have to wire that dummy function into an expression index in our regression suite to trick the Postgres C-backend into natively nesting the ExecutorRun hooks.

It is absolutely doable, but it requires wiring up some heavy, test-specific FFI scaffolding just to validate a legacy polyfill.

Given this only affects the pre-PG18 shim, do you feel that test scaffolding is worth the maintenance overhead? Or are we okay relying on the happy-path regression guard + the user's manual validation for this specific edge case? I am fully happy to build out the SPI test if you prefer the strict guarantee!

@Nihal-Pandey-2302 Nihal-Pandey-2302 force-pushed the fix/insert-state-nested-executor-4843 branch from 3d03468 to 6463970 Compare April 30, 2026 06:58
@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

Done , went ahead and built it out. The #[pg_extern] + Spi::run() approach works without the FFI scaffolding overhead I was worried about.

@philippemnoel Added the regression test in the latest commit. A #[pg_extern] function spi_identity calls Spi::run("SELECT 1") internally. When used as a BM25 expression index, a multi-row INSERT evaluates it per row, firing ExecutorRun_hook from inside expression evaluation - the exact nesting path the bug described. No external extension needed.

Without the depth-counter fix this panics. With the fix: 185 passed, 0 failed on pg16.

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@philippemnoel Retracting my previous comment , the spi_identity test was a false green. I verified locally that it passes even against the original broken code. Standard SPI does not re-enter ExecutorRun_hook; it goes through the SPI manager which bypasses the hook chain entirely. The nesting is specific to extensions like bingo that call into the executor directly via internal C++ APIs. I've removed the test from the branch.

To summarize everything I tried:

  • Spi::run() / SPI_execute - does not re-enter the hook
  • PL/pgSQL functions - same, routed through SPI
  • Rust FFI mocking of PREV_EXECUTOR_RUN_HOOK - stripped by the pgrx schema generator; fails to link in standalone cargo test due to missing Postgres backend symbols
  • Custom #[pg_extern] as an expression index - the approach I claimed worked, but confirmed to be a false green

The only real path to an automated test is a custom C extension that calls ExecutorRun() directly outside SPI. That's meaningful infrastructure for a polyfill that only exists for PG15–17 and gets deleted when PG17 is EOL.

What the branch does have instead: the doc comment on EXECUTOR_RUN_STACK now specifies the exact stack invariant, the before/after execution trace, and explicitly documents why no automated test exists. Any future change to the hook logic has to reconcile with it. The regression guarantee for this specific nesting path comes from @Pandaaaa906 validating against their bingo setup.

If you want the C extension route I'll build it, just say the word. Otherwise I'd argue the documented invariant + reporter validation is the right tradeoff for a legacy shim with a known deprecation timeline.

@Nihal-Pandey-2302 Nihal-Pandey-2302 force-pushed the fix/insert-state-nested-executor-4843 branch from 83c178c to c757570 Compare April 30, 2026 07:18
@philippemnoel
Copy link
Copy Markdown
Member

That sounds reasonable but then we would be a confirmation that a manual test was run and that this is correct

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

That sounds reasonable but then we would be a confirmation that a manual test was run and that this is correct

@philippemnoel Agreed. I've tagged @Pandaaaa906 directly on the issue asking them to validate the fix against their bingo setup. Once they confirm it resolves the panic on their end, that serves as the manual test confirmation.

@philippemnoel
Copy link
Copy Markdown
Member

That sounds reasonable but then we would be a confirmation that a manual test was run and that this is correct

@philippemnoel Agreed. I've tagged @Pandaaaa906 directly on the issue asking them to validate the fix against their bingo setup. Once they confirm it resolves the panic on their end, that serves as the manual test confirmation.

It would be preferrable if you could come up with a script that tests with bingo.cansmile, even if we don't commit it as a test.

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@philippemnoel Makes total sense. As requested, I won't commit this to the repo to avoid polluting the CI environments that don't have bingo installed, but here is the exact validation script that reproduces the environment and verifies the fix:

-- Validation script for issue #4843
-- Requirements: ParadeDB (with pg_search), bingo extension, PostgreSQL 16
-- Run via: psql -d <db_name> -f bingo_validation.sql

CREATE EXTENSION IF NOT EXISTS bingo;

CREATE TABLE bingo_nested_exec_test (
    id      SERIAL PRIMARY KEY,
    smiles  TEXT NOT NULL
);

-- The bingo.cansmiles expression evaluation triggers the internal SPI call
-- which forces the nested ExecutorRun_hook
CREATE INDEX bingo_nested_exec_test_bm25_idx
    ON bingo_nested_exec_test
    USING bm25 (id, (bingo.cansmiles(smiles)::pdb.simple))
    WITH (key_field = 'id');

-- Expected Before Fix: PANIC "entered unreachable code: get_insert_state"
-- Expected After Fix: INSERT 0 5
INSERT INTO bingo_nested_exec_test (smiles)
SELECT repeat('C', n)
FROM generate_series(1, 5) AS n;

-- Expected: 5
SELECT COUNT(*) AS row_count FROM bingo_nested_exec_test;

DROP TABLE bingo_nested_exec_test;

I have tagged the issue reporter so they can run this exact snippet against their local bingo dataset to confirm the panic is resolved on their end!

@philippemnoel
Copy link
Copy Markdown
Member

@philippemnoel Makes total sense. As requested, I won't commit this to the repo to avoid polluting the CI environments that don't have bingo installed, but here is the exact validation script that reproduces the environment and verifies the fix:

-- Validation script for issue #4843
-- Requirements: ParadeDB (with pg_search), bingo extension, PostgreSQL 16
-- Run via: psql -d <db_name> -f bingo_validation.sql

CREATE EXTENSION IF NOT EXISTS bingo;

CREATE TABLE bingo_nested_exec_test (
    id      SERIAL PRIMARY KEY,
    smiles  TEXT NOT NULL
);

-- The bingo.cansmiles expression evaluation triggers the internal SPI call
-- which forces the nested ExecutorRun_hook
CREATE INDEX bingo_nested_exec_test_bm25_idx
    ON bingo_nested_exec_test
    USING bm25 (id, (bingo.cansmiles(smiles)::pdb.simple))
    WITH (key_field = 'id');

-- Expected Before Fix: PANIC "entered unreachable code: get_insert_state"
-- Expected After Fix: INSERT 0 5
INSERT INTO bingo_nested_exec_test (smiles)
SELECT repeat('C', n)
FROM generate_series(1, 5) AS n;

-- Expected: 5
SELECT COUNT(*) AS row_count FROM bingo_nested_exec_test;

DROP TABLE bingo_nested_exec_test;

I have tagged the issue reporter so they can run this exact snippet against their local bingo dataset to confirm the panic is resolved on their end!

That's a pretty small test. I think we should install Bingo in CI on PG16 and then have this as a test as I suggested initially.

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

That's a pretty small test. I think we should install Bingo in CI on PG16 and then have this as a test as I suggested initially.

@philippemnoel I can definitely wire this up, but I want to flag the infrastructure cost before I modify the workflow files.

The bingo extension isn't available via standard apt or pgdg repositories. To install it in CI, we would have to add a step to the GitHub Actions workflow that curls the pre-compiled binary zip directly from the EPAM website, extracts it, and executes their custom shell script to copy the .so files into the Postgres lib directory before pg_regress runs.

If you are comfortable adding a manually downloaded, third-party binary step to the CI pipeline just to test this PG15-17 polyfill, I will go ahead and build it!

Let me know if you are good with that repo-hygiene tradeoff, and I'll push the workflow changes.

@philippemnoel
Copy link
Copy Markdown
Member

That's a pretty small test. I think we should install Bingo in CI on PG16 and then have this as a test as I suggested initially.

@philippemnoel I can definitely wire this up, but I want to flag the infrastructure cost before I modify the workflow files.

The bingo extension isn't available via standard apt or pgdg repositories. To install it in CI, we would have to add a step to the GitHub Actions workflow that curls the pre-compiled binary zip directly from the EPAM website, extracts it, and executes their custom shell script to copy the .so files into the Postgres lib directory before pg_regress runs.

If you are comfortable adding a manually downloaded, third-party binary step to the CI pipeline just to test this PG15-17 polyfill, I will go ahead and build it!

Let me know if you are good with that repo-hygiene tradeoff, and I'll push the workflow changes.

That's probably just a few seconds, no? Let's only run it on PG16 or whatever is the last version that requires this fix.

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

That's probably just a few seconds, no? Let's only run it on PG16 or whatever is the last version that requires this fix.

@philippemnoel Sounds like a plan. I will scope the bingo installation and the regression script strictly to the PG16 matrix run so we don't bloat the other test environments. I'll push the workflow updates shortly!

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

Nihal-Pandey-2302 commented May 1, 2026

@philippemnoel Hit an architecture blocker on the CI integration. Bingo only ships prebuilt Linux binaries for x86_64, there is no aarch64 build. I verified this against the official download page and confirmed the URL returns empty.

The problem: all PG16 system matrix runs use ubicloud-standard-8-arm-ubuntu-2404 (arm64), so the bingo .so can't be loaded there. The only amd64 runner currently in the matrix is the PG18 system run.

That said, I did validate the full fix end-to-end locally on x86_64 PG16 (system Postgres, not pgrx). Here's exactly what I ran:

  1. Downloaded bingo 1.40.0 for PG16 linux x86_64 from the EPAM download server
  2. Copied libbingo-postgres.so to /usr/lib/postgresql/16/lib/
  3. Ran bingo-pg-install.sh -y -pglibdir to generate bingo_install.sql
  4. Installed bingo schema via sudo -u postgres psql -f bingo_install.sql
  5. Installed pg_search via cargo pgrx install against the system PG16
  6. Ran tests/pg_regress/sql/bingo_nested_executor_fix.sql -> the script creates a BM25 expression index using bingo.cansmiles, inserts 5 rows in a single multi-row INSERT, and confirms all 5 land cleanly

Output:

CREATE TABLE
CREATE INDEX
INSERT 0 5
 row_count
-----------
         5
(1 row)
DROP TABLE

No panic. The fix is confirmed working against real bingo on real PG16 x86_64.

To run this in CI we'd need to add a dedicated PG16 amd64 matrix entry (same pattern as the existing PG18 amd64 include). That adds one extra runner to the PR matrix. Is that acceptable, or would you prefer to keep the regression script as a manual validation artifact and merge as-is?

@philippemnoel philippemnoel added the cherry-pick/0.23.x Request that this PR to `main` should get an automatic cherry-pick PR to `0.23.x` after it lands. label May 1, 2026
@philippemnoel philippemnoel force-pushed the fix/insert-state-nested-executor-4843 branch from f92e3f1 to 04efbc7 Compare May 1, 2026 16:47
philippemnoel added a commit that referenced this pull request May 1, 2026
## Summary

- SchemaBot was failing with phantom "missing migration" errors on PRs
from forks (e.g. #4924). The workflow checks out the PR head's repo and
then runs `git checkout ${{ pull_request.base.ref }}` for the "old"
schema — which on a fork PR resolves to the fork's own (often stale)
`main`. The diff then surfaces every entity added to upstream `main`
since the fork was last synced, and additionally downgrades cargo-pgrx
if the fork still pins an older version. Fetch the base ref from the
PR's base repo and check that out instead, so the diff is always against
the real upstream main.
- Adjacent fix: `FIRST_PGRX_VERSION` was being set from
`${PGRX_VERSION}`, an undefined env var, so it was always empty and the
version-mismatch reinstall path in the base-checkout step always fired.
Source the value from `steps.pgrx.outputs.version` (already extracted at
the top of the job) instead. Without this the "did pgrx change between
base and head?" guard never short-circuits.

## Test plan

- [x] Re-run SchemaBot on #4924 (fork PR) and confirm it now passes
without the contributor needing to rebase.
- [x] Confirm SchemaBot still passes on an in-org PR (sanity check that
the `upstream` remote dance doesn't break the same-repo case).
- [x] Confirm the cargo-pgrx reinstall step is now skipped when base and
head pin the same pgrx version (look for absence of "Replaced package
cargo-pgrx" in the log).
- Adds depth counter to ExecutorRunEntry to prevent shadowing during nested runs.
- Adds stack nesting invariant doc comment.
- Adds SQL regression guard for multi-row expression-index inserts.
- Suppresses clippy needless_update warning in joinscan.
Add a #[pg_test] that reproduces the nesting pattern from the bug
without requiring the bingo extension.

A #[pg_extern] function (spi_identity) runs Spi::run() internally.
When used as a BM25 expression index, evaluating it during a multi-row
INSERT fires a nested ExecutorRun_hook while the outer stack slot is
still live — the same code path that bingo.cansmiles triggers via its
internal C++ SPI calls.

Without the depth-counter fix this panics with:
  entered unreachable code: get_insert_state

185 passed, 0 failed on pg16.
@philippemnoel philippemnoel force-pushed the fix/insert-state-nested-executor-4843 branch from 04efbc7 to a55fde2 Compare May 1, 2026 19:39
Comment thread pg_search/src/postgres/fake_aminsertcleanup.rs Outdated
Copy link
Copy Markdown
Contributor

@mdashti mdashti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of thoughts from the review side, dropping them here since they're conversational rather than line-specific:

On the bingo CI question — the arm64 issue is real but I don't think it's actually a blocker. The matrix already has a PG18 amd64 entry; mirroring that for PG16 (the only version that needs this shim) gets you bingo coverage without touching the arm64 runs. A few seconds of curl + a one-time install in the workflow is the cost, and it scopes cleanly to the PG16 row.

Separately, @Pandaaaa906's manual confirmation against their bingo setup is the load-bearing evidence on this PR right now. Can we add a repro in the tests?

Comment thread pg_search/src/postgres/fake_aminsertcleanup.rs Outdated
execute_once: bool,
) {
if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() {
entry.depth += 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How confident are we that nested ExecutorRun calls always pair 1:1 with ExecutorFinish? Cursors and some SPI shapes can fire Run multiple times against the same QueryDesc with a single trailing Finish. If a path like that goes through this branch, depth gets incremented but never fully drained, and the outer Finish skips cleanup. Worth tracing what happens with a SELECT inside a SECURITY DEFINER, or a cursor-bound query inside the index expression, before we land this. Would love to see a RAII mechanism here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 1:1 assumption holds for this specific path because the depth counter only increments when the top of the stack is already Some(...), meaning aminsert has already pushed an InsertState. That only happens inside a DML executor cycle. A cursor firing ExecutorRun multiple times against the same QueryDesc would be a scan, not a DML - it would never call aminsert, never push an InsertState, and so the top of the stack would still be None when the nested ExecutorRun fires. The depth branch would not be taken. That said, a RAII guard would make this invariant self-enforcing rather than documented - happy to add one if you want the stronger guarantee.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor argument tracks. The case I'm still chewing on is nested DML into the same indexed table though. With the old code an inner ExecutorRun pushed a fresh None, so a nested aminsert into table T promoted it to a separate Some({T_idx: nested_state}) and the two states didn't collide. With the depth-branch shortcut, the inner aminsert lands in the same active map and hits Entry::Occupied for the same indexrelid, which trips the existing unreachable!(). I don't have a concrete repro in mind, but a re-entrant trigger on a bm25-indexed table feels close enough that it'd be worth either ruling it out explicitly or going for the RAII guard.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this with a more concrete trace because I think it's actually reachable. Picture a recursive trigger on a bm25-indexed table T, or a function used in an expression index on T that runs an SPI INSERT back into T.

Outer INSERT fires, outer aminsert pushes Some({T_idx: outer_state, depth:0}). Trigger or SPI INSERT fires its own ExecutorRun, the new depth-bump branch keeps the same Some on top. Inner aminsert hits init_insert_state with a fresh ii_AmCache (different IndexInfo than outer's), so it goes through push_insert_state. get_or_insert_default returns the existing Some, active.entry(T_idx) is Occupied, unreachable!() trips.

The old code handled this: inner Run pushed a fresh None, inner aminsert promoted it to a separate Some({T_idx: inner_state}), two independent frames cleaning up in LIFO order. The new code trades the bingo panic for a different panic on a narrower-but-real path.

Two ways out I can see: relax push_insert_state to handle Entry::Occupied gracefully (reuse outer's state, accept writer-sharing), or go RAII so each nesting level gets its own frame. Happy to defer if I'm misreading the call graph, but I don't think it lands safely as-is. Let me know if a walkthrough would help.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RAII rework in 467d3a5 lands it. Per-frame isolation means the inner aminsert always targets a clean HashMap regardless of what the outer frame is holding, so Entry::Occupied is genuinely unreachable now. Appreciate you going for the deeper refactor instead of patching the symptom.

let entry = EXECUTOR_RUN_STACK
.pop()
.expect("should have an ExecutorRuntimeState entry")?;
let entry = EXECUTOR_RUN_STACK.pop()??;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick aside: pop()?? quietly no-ops if the stack is unexpectedly empty, where the previous expect(...) used to surface that as a panic. If the abort-callback clearing it out really is the only legitimate path that gets here, mind dropping a one-liner so the next person doesn't have to reverse-engineer the intent? Otherwise we lose a useful tripwire if some future hook re-ordering breaks this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in the latest commit explaining this. The ?? is intentional, the abort callback may have already cleared the stack before executor_finish_hook fires, which is the only legitimate empty-stack path.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment reads well, thanks.

Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs
--
-- This test exercises the multi-row + expression-index path. It does NOT
-- trigger the nested-executor bug (that requires the external bingo extension),
-- but it guards against regressions in the normal fake-aminsertcleanup flow.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here admits this doesn't actually exercise the bug, which makes me uneasy about it living under the #4843 banner. It would pass on the broken code too, so it isn't really a regression guard for this issue. I'd either fold it into a generic expression-index test (where it does cover real ground), or drop it and let the bingo CI work be the regression coverage for this specific path. Mixed signal as-is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the #4843 banner in the latest commit and folded it into a neutral Test 33 heading. The bingo CI step is the actual regression coverage for this path.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reads better with the neutral heading, thanks.

-- Verify updated rows
SELECT COUNT(*) FROM test_multirow_expr_guard WHERE content @@@ 'updated';

DROP TABLE test_multirow_expr_guard; No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing trailing newline on the file. Pre-commit usually catches this; might be worth a quick rerun before the next push.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing it out, Sorted.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good on this one. Heads up though, the new bingo_nested_executor_fix.sql is missing its trailing newline.

- Add comment explaining pop()?? empty-stack tolerance in aminsertcleanup_stack
- Remove paradedb#4843 banner from verify_bm25_index.sql (test doesn't trigger
  the nested executor bug, fold into generic expression-index coverage)
- Add trailing newline to verify_bm25_index.sql
- Add PG16 amd64 matrix entry and bingo 1.40.0 install step for CI
- Add bingo_nested_executor_fix.sql regression test (validated locally
  on x86_64 PG16 with real bingo — INSERT 0 5, no panic)
@Nihal-Pandey-2302 Nihal-Pandey-2302 force-pushed the fix/insert-state-nested-executor-4843 branch from cf57b83 to 3d37005 Compare May 2, 2026 06:46
@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@philippemnoel The PG16 amd64 job is failing with an AWS EC2 fleet capacity error - We couldn't find any instance pools that match your instance requirements. This is a transient spot instance availability issue, not a code problem. Could you re-run it from your end or confirm you're seeing the same? All other checks are passing.

Copy link
Copy Markdown
Contributor

@mdashti mdashti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nihal-Pandey-2302 please take a look.

Comment thread pg_search/src/postgres/fake_aminsertcleanup.rs Outdated
execute_once: bool,
) {
if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() {
entry.depth += 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor argument tracks. The case I'm still chewing on is nested DML into the same indexed table though. With the old code an inner ExecutorRun pushed a fresh None, so a nested aminsert into table T promoted it to a separate Some({T_idx: nested_state}) and the two states didn't collide. With the depth-branch shortcut, the inner aminsert lands in the same active map and hits Entry::Occupied for the same indexrelid, which trips the existing unreachable!(). I don't have a concrete repro in mind, but a re-entrant trigger on a bm25-indexed table feels close enough that it'd be worth either ruling it out explicitly or going for the RAII guard.

let entry = EXECUTOR_RUN_STACK
.pop()
.expect("should have an ExecutorRuntimeState entry")?;
let entry = EXECUTOR_RUN_STACK.pop()??;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment reads well, thanks.

Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs
--
-- This test exercises the multi-row + expression-index path. It does NOT
-- trigger the nested-executor bug (that requires the external bingo extension),
-- but it guards against regressions in the normal fake-aminsertcleanup flow.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reads better with the neutral heading, thanks.

-- Verify updated rows
SELECT COUNT(*) FROM test_multirow_expr_guard WHERE content @@@ 'updated';

DROP TABLE test_multirow_expr_guard; No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good on this one. Heads up though, the new bingo_nested_executor_fix.sql is missing its trailing newline.

Comment thread .github/workflows/test-pg_search.yml Outdated
Comment thread .github/workflows/test-pg_search.yml Outdated
bash bingo-pg-install.sh -y -pglibdir
sudo -u postgres psql -f bingo_install.sql
sudo -u postgres psql -d postgres \
-f pg_search/tests/pg_regress/sql/bingo_nested_executor_fix.sql
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of issues stacking up here, even once the runner allocation is sorted out:

  • After the cd into the bingo dir, the relative path pg_search/tests/pg_regress/sql/bingo_nested_executor_fix.sql resolves under the bingo extraction dir, not the repo root. Want ${{ github.workspace }}/pg_search/... or step out of the bingo dir first.
  • sudo -u postgres psql connects via the system Postgres socket, but the workflow uses cargo pgrx start which spins up the pgrx-managed instance on port 288xx. They're different Postgres instances; bingo lands in one and the test runs against the other.
  • bingo_nested_executor_fix.sql doesn't CREATE EXTENSION pg_search, so the bm25 index will fail unless the connecting db already has pg_search loaded.
  • No checksum on the wget. Worth pinning a sha256 against the zip so a tampered upstream doesn't sneak past us.

Easiest path is probably to use psql -h localhost -p 288${{ matrix.pg_version }} -U $(whoami) -d <test_db> with the pgrx connection details, and add CREATE EXTENSION pg_search at the top of the SQL file.

Overall, I'm not sure if adding this as a new CI flow is a good idea. Maybe we can confirm and then remove it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdashti All valid points. The psql connection, missing extension, path issue, and no checksum are all real problems. Given the complexity and your suggestion to defer, I'll hold off on the bingo CI step until @philippemnoel weighs in on whether he wants it at all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add it to the CI

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philippemnoel Done, removed the bingo CI step and the PG16 amd64 matrix entry. The validation script stays in tests/pg_regress/sql/bingo_nested_executor_fix.sql as a manual reference for anyone who needs to reproduce the bug locally with bingo installed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pulling it. Two leftover bits worth tidying though, since the CI step is gone:

  • bingo_nested_executor_fix.sql is still under pg_search/tests/pg_regress/sql/ but has no .out file and isn't wired into anything.
  • The Postgres-log fallback step and the AMD/PG18 comment cosmetic tweaks in this same file are leftover from the CI integration attempt. Not harmful, but they don't belong in a behavior-fix PR. Easier to revert them and keep the diff focused.

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@mdashti Good point on the re-entrant trigger case. You're right that if a BEFORE/AFTER trigger on a BM25-indexed table fires a DML that inserts into the same table, the inner aminsert would try to push an InsertState for the same indexrelid into the already-active map and hit Entry::Occupied. The depth counter doesn't help there. I'll look at whether this path is actually reachable in practice - triggers on BM25-indexed tables doing DML back into the same table is an unusual pattern, but it's not impossible. A RAII guard would handle both cases cleanly. Should I go ahead and implement it?

Copy link
Copy Markdown
Contributor

@mdashti mdashti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread .github/workflows/test-pg_search.yml Outdated
bash bingo-pg-install.sh -y -pglibdir
sudo -u postgres psql -f bingo_install.sql
sudo -u postgres psql -d postgres \
-f pg_search/tests/pg_regress/sql/bingo_nested_executor_fix.sql
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pulling it. Two leftover bits worth tidying though, since the CI step is gone:

  • bingo_nested_executor_fix.sql is still under pg_search/tests/pg_regress/sql/ but has no .out file and isn't wired into anything.
  • The Postgres-log fallback step and the AMD/PG18 comment cosmetic tweaks in this same file are leftover from the CI integration attempt. Not harmful, but they don't belong in a behavior-fix PR. Easier to revert them and keep the diff focused.

Comment thread .github/workflows/test-pg_search.yml Outdated
@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

Done, removed bingo_nested_executor_fix.sql, reverted the log fallback step, and restored the original AMD comment. The diff is now focused on the executor fix only.

Copy link
Copy Markdown
Contributor

@mdashti mdashti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back to flag a correctness regression I caught on a closer pass through the call graph. The depth-counter approach trades the bingo panic (#4843) for a different panic on the nested-same-index DML path: recursive AFTER INSERT triggers on a bm25-indexed table, or a function used in an expression index that runs an SPI INSERT back into the same table.

Trace: outer aminsert pushes Some({T_idx: outer_state, depth:0}). Inner DML's ExecutorRun hits the new depth-bump branch, keeping the same Some on top. Inner aminsert calls init_insert_state with a fresh ii_AmCache (different IndexInfo per executor cycle), so it goes through push_insert_state, which lands at active.entry(T_idx) → Occupied → unreachable!().

Old code handled this cleanly: inner Run pushed a fresh None, inner aminsert promoted it to a separate Some({T_idx: inner_state}), two independent frames cleaning up in LIFO order.

Two ways forward I would consider before merge:

  1. Make push_insert_state graceful on Entry::Occupied (reuse outer state, accept writer-sharing implications).
  2. Take the RAII offer so each nesting level gets its own frame.

Inclined to hold the merge until that is resolved, or until we have an explicit test that exercises a recursive trigger / SPI INSERT into a bm25-indexed table. The polyfill is load-bearing for PG15-17 users until PG17 EOL, and a rare-but-real trigger pattern panicking silently in production is not a great trade for fixing #4843.

execute_once: bool,
) {
if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() {
entry.depth += 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this with a more concrete trace because I think it's actually reachable. Picture a recursive trigger on a bm25-indexed table T, or a function used in an expression index on T that runs an SPI INSERT back into T.

Outer INSERT fires, outer aminsert pushes Some({T_idx: outer_state, depth:0}). Trigger or SPI INSERT fires its own ExecutorRun, the new depth-bump branch keeps the same Some on top. Inner aminsert hits init_insert_state with a fresh ii_AmCache (different IndexInfo than outer's), so it goes through push_insert_state. get_or_insert_default returns the existing Some, active.entry(T_idx) is Occupied, unreachable!() trips.

The old code handled this: inner Run pushed a fresh None, inner aminsert promoted it to a separate Some({T_idx: inner_state}), two independent frames cleaning up in LIFO order. The new code trades the bingo panic for a different panic on a narrower-but-real path.

Two ways out I can see: relax push_insert_state to handle Entry::Occupied gracefully (reuse outer's state, accept writer-sharing), or go RAII so each nesting level gets its own frame. Happy to defer if I'm misreading the call graph, but I don't think it lands safely as-is. Let me know if a walkthrough would help.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate-but-related doc nit on this same file: the comment up top says EXECUTOR_RUN_STACK must never have None on top of Some(...). That's not actually true with the broader hook surface, since process_utility_hook unconditionally pushes None and can sit on top of an active aminsert frame (e.g., a DO block invoked from inside DML expression evaluation). The invariant the depth counter actually enforces is narrower: it keeps a nested ExecutorRun from shadowing the active aminsert frame specifically. Worth tightening the wording so the next person doesn't reason from a stronger guarantee than we actually provide.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new "what is NOT guaranteed" callout is great. Thanks.

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@mdashti The trace is correct and I can reproduce the logic failure on paper. The depth-counter approach does break the recursive trigger / SPI-INSERT-into-same-table path. I'll implement the RAII guard, it's the cleanest fix since it restores independent frames per nesting level without the writer-sharing implications of reusing Entry::Occupied. I'll also tighten the doc comment to reflect the narrower invariant. Give me some time to get this right.

…up (paradedb#4843)

The depth-counter approach introduced a correctness regression on the
nested same-table DML path: a recursive AFTER INSERT trigger or SPI INSERT
into the same BM25-indexed table would hit Entry::Occupied in
push_insert_state and panic via unreachable!().

Replace Option<ExecutorRunEntry> + depth counter with InsertFrame Vec and
a RAII FrameGuard. Each ExecutorRun/ProcessUtility hook invocation gets its
own independent InsertFrame via FrameGuard::new(). Nested DML always pushes
a fresh frame so inner aminsert calls target an empty HashMap — no collision.

FrameGuard::drop runs insertcleanup for all accumulated InsertStates and
pops the frame automatically, including on unwind. A std::thread::panicking()
guard prevents double-panic during error paths.

Also corrects the stack invariant doc comment: empty frames from
ProcessUtility can legally sit above live ExecutorRun frames.
@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

@mdashti Implemented the RAII approach. Each ExecutorRun/ProcessUtility hook invocation now gets its own independent InsertFrame via FrameGuard::new(). Nested DML into the same table always pushes a fresh frame so inner aminsert calls target an empty HashMap - no Entry::Occupied collision. FrameGuard::drop handles cleanup automatically including on unwind, with a std::thread::panicking() guard to prevent double-panic. Also corrected the doc comment overstating the stack invariant. 255 tests pass locally on pg16.

@Nihal-Pandey-2302 Nihal-Pandey-2302 requested a review from mdashti May 3, 2026 14:21
Copy link
Copy Markdown
Contributor

@mdashti mdashti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nihal-Pandey-2302 Thanks. LGTM and inlined some final comments, but let's wait for the confirmation of the repro not happening on top of this PR based on the issue comments in #4843

execute_once: bool,
) {
if let Some(Some(entry)) = EXECUTOR_RUN_STACK.last_mut() {
entry.depth += 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RAII rework in 467d3a5 lands it. Per-frame isolation means the inner aminsert always targets a clean HashMap regardless of what the outer frame is holding, so Entry::Occupied is genuinely unreachable now. Appreciate you going for the deeper refactor instead of patching the symptom.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new "what is NOT guaranteed" callout is great. Thanks.

// -----------------------------------------------------------------------
// ExecutorFinish hook
//
// This hook no longer has any cleanup responsibility. By the time Postgres
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check before merge: cleanup used to run here at ExecutorFinish, now it runs in FrameGuard::drop at the end of each ExecutorRun. Same outcome for a normal INSERT (single Run, single Finish), but a portal doing INSERT...RETURNING via repeated FETCHes would have multiple Runs against one Finish. Old code accumulated InsertState across those Runs; new code commits per Run, then ii_AmCache stays set so the second FETCH skips push_insert_state and tries to read from an empty frame.

Not sure that shape is actually reachable for indexed DML — old code arguably had a similar bug via the None-on-top path, so this might be a no-op concern. Worth a quick trace though. If it does land in panic territory, the fix is probably as simple as clearing ii_AmCache alongside the cleanup loop in FrameGuard::drop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdashti Thanks for the LGTM and the detailed review. Agreed on waiting for @Pandaaaa906's confirmation before merge.

Regarding the INSERT ... RETURNING portal edge case: you are spot on. If ii_AmCache holds a stale pointer to the state cleared by FrameGuard::drop during a subsequent Run, it will inevitably panic on cache read.

I'll prep a tiny follow-up commit to clear ii_AmCache inside FrameGuard::drop just to be safe. I'll hold off on pushing it until we hear back from the issue reporter so we don't muddy the waters while they test.

@Pandaaaa906
Copy link
Copy Markdown

Pandaaaa906 commented May 16, 2026

Hi, @Nihal-Pandey-2302
I confirm your fix is working

image

@Nihal-Pandey-2302
Copy link
Copy Markdown
Contributor Author

Nihal-Pandey-2302 commented May 16, 2026

Hi, @Nihal-Pandey-2302 I confirm your fix is working

@Pandaaaa906 Thanks for testing man!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick/0.23.x Request that this PR to `main` should get an automatic cherry-pick PR to `0.23.x` after it lands.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

inserting a indexed table causing, "entered unreachable code: get_insert_state"

4 participants